Dr. Scott Mackie (UBC) will give a colloquium in person on Nov 8, 2024, 3:30-5:00pm.
Title: Exploring Amazon’s M.A.S.S.I.V.E dataset
Abstract: MASSIVE (FitzGerald et al. 2023) is a large corpus of 1M+ sentences, created by translating ~20K English sentences into 49 other languages. To date, MASSIVE has only been used for machine learning tasks, but it contains a number of features of interest to linguists. It is a parallel corpus, which facilitates cross-linguistic comparisons. Translations are natural and high-quality, collected from expert speakers with multiple rounds of human review. The dataset contains utterances directed at a voice assistant (e.g. “set an alarm”, “what’s the weather”), so it primarily consists of interrogatives and imperatives. This makes it a rich source of data for syntactic and pragmatic contexts that are rarely available in other corpora. Sentences are also semantically annotated, with 18 domains, 60 intents and 55 slots.
In this talk, I’ll discuss how MASSIVE was created, and the challenges of collecting such a large and diverse set. I’ll highlight some linguistically interesting aspects of the data, and suggest potential areas of research. I will also introduce a new tool called Proust, which provides an interface for searching MASSIVE.