The Master of Data Science in Computational Linguistics is a unique degree tailored to those with a passion for language and data.
Program Overview
This 10-month intensive program is aimed at providing those with a degree in linguistics or another language-focused discipline the technical skills to pursue a career in language technology and natural language processing. This specialization is a joint program with the Computer Science and Statistics departments at UBC. Through this program, we equip our graduates with the skills to turn language-related data into applicable knowledge and to build Artificial Intelligence (AI) that can interpret human language.
Our courses are taught by faculty members who are experts in linguistics, computer science, and statistics. Students are also required to complete a capstone project, applying their acquired knowledge while working with their peers on real-life data sets.
Curriculum*
The program structure includes 24 one-credit courses offered in four-week segments. Courses are lab-oriented and delivered in-person.
After the six segments, students complete an eight-week, six-credit capstone project, applying their newly acquired knowledge while working alongside other students on real-life data sets. Please note that instructors are subject to change.
* subject to change at the discretion of the MDS Computational Linguistics program
Fall: September – December
Programming for Data Science | DSCI 511: Program design and data manipulation with Python. Overview of data structures, iteration, flow control, and program design relevant to data exploration and analysis. When and how to exploit pre-existing libraries.
Computing Platforms for Data Science | DSCI 521: How to install, maintain, and use the data science software stack. The Unix shell, version control, and problem-solving strategies. Literate programming documents.
Programming for Data Manipulation | DSCI 523: Program design and data manipulation with R. Organizing, filtering, sorting, grouping, reformatting, converting, and cleaning data to prepare it for further analysis.
Descriptive Statistics and Probability for Data Science | DSCI 551: Fundamental concepts in probability including conditional, joint, and marginal distributions. Statistical view of data coming from a probability distribution.
Algorithms & Data Structures | DSCI 512: How to choose and use appropriate algorithms and data structures to help solve data science problems. Key concepts such as recursion and algorithmic complexity (e.g., efficiency, scalability).
Data Visualization I | DSCI 531: Exploratory data analysis. Design of effective static visualizations. Plotting tools in R and Python.
Statistical Inference and Computation I | DSCI 552: The statistical and probabilistic foundations of inference. Large sample results. The frequentist paradigm.
Supervised Learning I | DSCI 571: Introduction to supervised machine learning. Basic machine learning concepts such as generalization error and overfitting. Various approaches such as k-NN, decision trees, and linear classifiers.
Corpus Linguistics | COLX 521: Basic processing of text corpora using Python. Includes string manipulation, corpus readers, linguistic comparison of corpora, structured text formats, and text preprocessing tools.
Databases & Data Retrieval | DSCI 513: How to work with data stored in relational database systems. Storage structures and schemas, data relationships, and ways to query and aggregate such data.
Regression I | DSCI 561: Linear models for a quantitative response variable, with multiple categorical and/or quantitative predictors. Matrix formulation of linear regression. Model assessment and prediction.
Feature and Model Selection | DSCI 573: How to evaluate and select features and models. Cross-validation, ROC curves, feature engineering, and regularization.
Winter: January – April
Parsing for Computational Linguistics | COLX 535: The identification of syntactic structure in natural language. Parsing algorithms for popular grammar formalisms, application of statistical information to parsing, parser evaluation, and extraction of parse features.
Computational Semantics | COLX 561: How meaning is represented by computers. An overview of popular semantic resources, and techniques for building new resources from unstructured text data.
Unsupervised Learning | DSCI 563: How to find groups and other structure in unlabeled, possibly high-dimensional data. Dimension reduction for visualization and data analysis. Clustering, association rules, model fitting via the EM algorithm, and Large Language Models (LLMs).
Supervised Learning II | DSCI 572: Introduction to optimization. Gradient descent and stochastic gradient descent. Roundoff error and finite differences. Neural networks and deep learning.
Advanced Corpus Linguistics | COLX 523: Text corpora collection and curation. How to pull representative datasets from internet sources. Techniques for efficient and reliable annotation.
Computational Morphology | COLX 525: Approaches to sub-word phenomena in language processing. Automatic morphological analysis of diverse languages, part-of-speech tagging, word segmentation, and character-level neural network models.
Machine Translation | COLX 531: Key methodologies for automatic translation between languages, with a focus on statistical and neural machine translation approaches. Applying Machine Translation (MT) architectures to analogous monolingual tasks. MT evaluation.
Sentiment Analysis | COLX 565: Identification and analysis of opinion, especially in social media. Text polarity and emotion classification, fine-grained (e.g., aspectual) opinion mining, argumentation mining, agentic workflows, and sentiment in social networks.
Advanced Computational Semantics | COLX 563: Application of machine learning to various semantic tasks. Likely topics include: information extraction, semantic role labelling, semantic parsing, discourse parsing, question answering, summarization, retrieval augmented generation (RAG), and natural language inference.
Natural Language Processing for Low-Resource Languages | COLX 58: Building automatic language tools when data is scarce. Rule-based and hybrid systems, semi-supervised learning, active learning. Knowledge transfer from other (related) languages.
Trends in Computational Linguistics | COLX 585: Cutting-edge techniques in natural language processing. This iteration covers the latest innovations in neural network architectures and artificial intelligence (AI).
Privacy, Ethics & Security | DSCI 541: The legal, ethical, and security issues concerning data, including aggregated data. Proactive compliance with rules and, in their absence, principles for the responsible management of sensitive data. Case studies.
Spring: May – June
Capstone Project | COLX 595: A mentored group project based on real data and questions from a partner within or outside the university. Students will formulate questions and design and execute a suitable analysis plan. The group will work collaboratively to produce a project report, presentation, and possibly other products, such as a web application.