I am a PhD candidate in the Information and Language Processing Systems group at the University of Amsterdam. I work on statistical machine translation (SMT) under the supervision of Christof Monz and the co-supervision of Arianna Bisazza.
In 2016, I stepped away from SMT for a while to do a summer internship at Microsoft with Ryen White.
To date, I've focused on the following topics:
- Conversational SMT:
Dialogues differ substantially from formal text. While formal texts are typically written by a single writer with a clear intention (for instance informing or persuading) and editorially controlled language, dialogues by definition involve multiple speakers who have different intentions and language use. While such aspects have been studied in dialogue research, we were the first to study their impact on SMT quality. In a study on fictional dialogues, we found that differences in translation quality between speakers, between male and female characters, between dialogue acts, and between register levels are often much larger than would be expected by chance.
A second aspect of dialogue translation is its sensitivity to context. To translate dialogues properly, an MT system has to be aware of the surrounding context rather than translate utterances in isolation.
- SMT for user-generated content, such as that found on microblogs, weblogs, or in SMS/chat messages:
It is widely accepted that translating user-generated (UG) text is a difficult task for modern SMT systems. To find out what aspects of SMT make translating UG text particularly challenging, I have performed a detailed analysis of five types of UG text for two language pairs. SMT errors for UG text differ substantially not only from SMT errors for a more formal genre such as news, but also across the various types of UG text. In future work I plan to move beyond error analysis and work towards improving SMT for UG text, in particular the kind found in SMS or chat messages.
- Domain, genre and topic adaptation for SMT:
Domain adaptation is an active field of research in statistical machine translation (SMT), but most work ignores the distinction between the topic and genre of documents. In my work, I have disentangled the concepts of topic and genre in the context of domain adaptation for SMT, and quantified the impact of both concepts on translation quality using a phrase-based SMT system. The results show that while re-scoring of existing translation candidates seems a profitable approach for topic adaptation, adaptation towards different genres might benefit more from improved model coverage. However, a multi-genre test set can also benefit from translation model adaptation by means of re-scoring existing translation candidates.
- Bilingual resource acquisition for low-resourced language pairs:
Research in SMT is typically evaluated on major language pairs such as French-English, Arabic-English, and Chinese-English. However, with the vast (and growing) amount of available text on the internet in many different languages, it is possible to create well-performing SMT systems for less-resourced language pairs from scratch. In my work I have harvested parallel and comparable training and evaluation data for various genres and topics in low-resourced language pairs, such as Pashto-English or Bulgarian-English. In addition, I worked on the challenging task of translating Romanized Arabic, or Arabizi, into English.