Research

I am a PhD candidate in the Information and Language Processing Systems group at the University of Amsterdam. I work on statistical machine translation (SMT) under the supervision of Christof Monz and the co-supervision of Arianna Bisazza. I am currently in the final stages of my PhD and plan to defend my thesis on December 8, 2017.

In 2016, I stepped away from SMT for a while to do a summer internship at Microsoft Health with Ryen White.

During my PhD, I've focused on the following topics:

  • Data selection for NMT:
    Following the recent rise of neural machine translation (NMT), I have explored whether an existing data selection approach carries over to NMT. Since this method performs substantially worse for NMT than for traditional phrase-based SMT, I have proposed a dynamic variant of data selection, in which the selected training data is varied between training iterations.
  • Conversational SMT:
    Dialogues differ substantially from formal text. While formal texts are typically written by a single writer with a clear intention (for instance informing or persuading) and editorially controlled language, dialogues by definition involve multiple speakers with different intentions and language use. While such aspects have been studied in dialogue research, we were the first to study their impact on SMT quality. In a study on fictional dialogues, we found that translation quality differences between speakers, male and female characters, dialogue acts, or different register levels are often much larger than would be expected by chance.
  • SMT for user-generated content such as that found on microblogs, weblogs, or in SMS/chat messages:
    It is widely accepted that translating user-generated (UG) text is a difficult task for modern SMT systems. To find out what aspects of SMT make translating UG text particularly challenging, I have performed a detailed analysis of five types of UG text for two language pairs. SMT errors for UG text not only differ substantially from SMT errors for a more formal genre such as news, but also differ among the various types of UG text. In future work I plan to move beyond error analysis and work towards improving SMT for UG text, in particular the kind found in SMS or chat messages.
  • Domain, genre and topic adaptation for SMT:
    Domain adaptation is an active field of research in SMT, but most work ignores the distinction between the topic and genre of documents. In my work, I have disentangled the concepts of topic and genre in the context of domain adaptation for SMT, and quantified the impact of both on translation quality using a phrase-based SMT system. The results show that while re-scoring existing translation candidates seems a profitable approach for topic adaptation, adaptation towards different genres might benefit more from improved model coverage. However, a multi-genre test set can also benefit from translation model adaptation by means of re-scoring existing translation candidates.
  • Bilingual resource acquisition for low-resourced language pairs:
    Research in SMT is typically evaluated on major language pairs such as French-English, Arabic-English, and Chinese-English. However, with the vast (and growing) amount of text available on the internet in many different languages, it is possible to create well-performing SMT systems for low-resourced language pairs from scratch. In my work, I have harvested parallel and comparable training and evaluation data for various genres and topics in low-resourced language pairs, such as Pashto-English and Bulgarian-English. In addition, I worked on the challenging task of translating Romanized Arabic, or Arabizi, into English.