Software & Data
Together with Jörg Tiedemann (University of Helsinki), I worked on the release and maintenance of the OpenSubtitles dataset. The OpenSubtitles 2018 dataset contains over 3.7 million subtitles from movies and TV series covering 60 languages. The full dataset amounts to over 3.4 billion sentences (22.2 billion tokens).
The subtitles are aligned with one another across languages at the document and sentence levels, yielding a total of 1782 bitexts. The dataset is currently the world's largest open collection of parallel corpora and is used by hundreds of institutions across the globe for research on dialogue modelling, machine translation and cross-lingual NLP.
For my PhD thesis, I developed OpenDial, a Java-based, domain-independent toolkit for developing spoken dialogue systems, released under an MIT open-source license. The primary focus of OpenDial is on dialogue management, but it can also be used to build full-fledged, end-to-end dialogue systems that integrate, for example, speech recognition, language understanding, generation, speech synthesis, multimodal processing and situation awareness.
Dialogue domains are represented as probabilistic rules encoded in a simple XML format, which allows dialogue developers to combine the benefits of logical and statistical approaches to dialogue modelling within a single, unified framework. In addition to the download packages, the OpenDial website includes practical examples of dialogue domains and step-by-step documentation on how to use the toolkit.