Linguistics doctoral candidate Farhan Samir will give an LOC colloquium in Swing Space 221 on Friday 29th at 3:30.
Title: Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia
Abstract: Go to any page on Wikipedia about a prominent public figure, and you’ll find that the person is written about in several languages. Indeed, Wikipedia has several hundred language editions, each maintained by a different editor community. These communities vary in how they might portray a culturally significant public figure. However, this cross-linguistic variation in content has never been quantified at scale. We introduce the InfoGap method, an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level, across languages. We apply InfoGap to locate cross-linguistic information differences in biographies of LGBT-identifying public figures, where content differences had previously been identified. We find large discrepancies in factual coverage across the languages. Moreover, our analysis reveals that biographical facts carrying negative connotations are more likely to be highlighted in Russian Wikipedia. Crucially, InfoGap both facilitates large-scale analyses and pinpoints local document- and fact-level information gaps, laying a new foundation for targeted and nuanced comparative language analysis at scale.
The talk will be based on this article, published in the proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP): https://aclanthology.org/2024.emnlp-main.384.pdf