UBC Linguistics PhD candidate Ife Adebara successfully defended her dissertation in computational linguistics, entitled “Towards Afrocentric Natural Language Processing,” on January 23, 2024. Congratulations to Dr. Adebara! Great job!
Ife’s primary advisor was Muhammad Abdul-Mageed of the Linguistics Department and the School of Information at UBC.
Below is the abstract of Ife’s dissertation.
Abstract: This dissertation delves into the realm of Natural Language Processing (NLP) with a particular focus on African languages, endeavoring to unravel the progress, challenges, and future prospects within this unique linguistic context. The research encompasses various facets of NLP, ranging from language identification and Natural Language Understanding (NLU) to Natural Language Generation (NLG), culminating in a comprehensive case study on machine translation.
The first chapter introduces the problem statement, articulates the motivation for addressing the issue, and presents the innovative solutions developed throughout this research. Chapter two delves into the intricate details of African languages, offering insights into the genealogical classification, linguistic landscape, and the challenges of multilingual NLP. Building upon this foundation, the third chapter advocates for an Afrocentric approach to technology development, emphasizing the significance of aligning technology with the cultural values and linguistic diversity of African communities. It addresses challenges such as data scarcity and representation bias, spotlighting community-driven initiatives aimed at advancing NLP in the region.
The fourth chapter unveils AfroLID, an advanced neural language identification tool designed for 517 African languages and language varieties. Addressing the scarcity of language identification resources, AfroLID facilitates robust language identification without user training, establishing itself as the new state-of-the-art solution for African language identification.
Chapter five introduces Serengeti, a massively multilingual language model tailored to support 517 African languages and language varieties. Evaluation on AfroNLU, an extensive benchmark for African NLP, showcases Serengeti’s superior performance, thereby paving the way for transformative research and development across a diverse linguistic landscape.
The sixth chapter addresses NLG challenges in African languages, presenting Cheetah, an encoder-decoder language model designed for 517 African languages. Comprehensive evaluations underscore Cheetah’s capacity to generate coherent and contextually relevant text across various African languages, fostering linguistic diversity and bridging the digital divide.
The seventh chapter presents a case study on machine translation, focusing on Bare Nouns (BNs) translation from Yorùbá to English. This study illuminates the challenges posed by information asymmetry in machine translation and provides insights into the linguistic capabilities of Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) systems. Emphasizing the importance of fine-grained linguistic considerations, the study encourages further research in addressing translation challenges faced by languages with BNs, analytic languages, and low-resource languages.
In summary, this dissertation not only explores the intricacies of NLP in the African linguistic landscape but also contributes pioneering solutions, benchmarks, and models that transcend linguistic barriers, empowering African communities and fostering the inclusive advancement of NLP.