Natural Language Processing with Transformers

This is a new master-level course that is being offered for the first time in the winter semester 2023/24. Parts of it originate from the course Text Analytics (ITA), which was offered in the winter semester 2020/21, primarily as a master-level course, and is no longer offered. ITA has been split into two courses: the bachelor-level course Data Science for Text Analytics (IDSTA) and the follow-up master-level course described here. Students who have taken the Text Analytics (ITA) course in the winter semester 2020/21 are therefore not eligible to take this new course.


This course is programming-intensive (well, it is about NLP, Data Science, and AI)! In addition to a final group project, there will be assignments that focus on conceptual (theoretical) aspects of NLP and transformer models, but also on mapping these concepts into programs using real-world text corpora. For the projects, we will exclusively use Python and frameworks such as Huggingface, LangChain, Google Colab, and OpenAI. We also assume that students are already familiar with frameworks such as spaCy, gensim, and Opensearch (an IR backend component for text data storage and retrieval), as well as with typical project tools such as Github and Docker. If you are uncomfortable with Python and with programming (in a team) in general, this class is very likely not the best fit for you.
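To give a rough idea of the expected programming level, here is a minimal sketch of how a pretrained transformer can be used via the Huggingface transformers library. The example question and context are illustrative assumptions, not course material.

    # Minimal sketch: running a pretrained question-answering transformer
    # with the Huggingface `transformers` library. The pipeline downloads
    # a default model on first use; the example text is illustrative only.
    from transformers import pipeline

    qa = pipeline("question-answering")
    result = qa(
        question="Which programming language is used in the course?",
        context="For the projects, we exclusively use Python and frameworks "
                "such as Huggingface, LangChain, Google Colab, and OpenAI.",
    )
    print(result["answer"], result["score"])

Being comfortable with this kind of code, and with adapting it in a shared Github repository, is the baseline we assume.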


In the following we give some information about this new course as it will appear in the module handbook.


Credit Points:  6 LP / CP (2h lecture + 2h exercise session)


Language of instruction: English


Workload: 180 h; of which 60 h of lectures and 120 h of self-study and work on assignments/projects (optionally in groups)


Time and Location: 

  • Lectures: TBA 
  • Exercise Sessions: TBA
  • First Lecture: Monday, October 16, 2023


Applicable to courses of study: Master Data and Computer Science


Course Objectives: Students

  • fully understand the principles and methods underlying word embedding approaches
  • are familiar with traditional sequence-to-sequence machine learning methods
  • can describe the key concepts and techniques underlying attention mechanisms and different transformer architectures
  • understand training and fine-tuning approaches to improve the performance of different transformer architectures for different downstream NLP tasks
  • know the key methods and architectural components for building QA and text summarization pipelines
  • can build and deploy QA and text summarization pipelines using common software frameworks
  • know key metrics in evaluating transformer architectures for different applications
  • can implement diverse transformer-based NLP applications using common Python frameworks and libraries
  • can deploy transformer-based NLP applications through Web interfaces

Course Content:

  • Word embeddings (review of simple neural network architectures and concepts)
  • Sequence-to-sequence models (Recurrent Neural Networks, LSTM, GRU)
  • Attention mechanism
  • Transformer components (encoder, decoder) and common transformer architectures (BERT, GPT, T5)
  • Training and fine-tuning transformers, including zero- and few-shot learning
  • Text summarization approaches
  • Question answering and building a QA pipeline
  • Transformer architectures for conversational AI
  • Programming and model frameworks such as Huggingface, LangChain, OpenAI, and (cloud-based) vector databases


Suggested Prerequisites: Recommended courses: Data Science for Text Analytics (IDSTA), Foundations of Machine Learning (IML)

Recommended background: solid knowledge of basic calculus, statistics, and linear algebra; good Python programming skills; familiarity with frameworks such as Huggingface, Google Colab, and cloud-based services, in particular vector databases  


Assessments: Assignments (40%) and Programming Project (60%). There will be about 4-5 assignments focusing on the material covered in class at a conceptual and formal level. In the group project, 3-4 students develop a prototypical transformer-based application, including its design and evaluation; the written project documentation as well as the code must be submitted at the end of classes, clearly indicating which student is responsible for which part of the project. Both the assignments and the project must be at least satisfactory (4,0) in order to pass the class.


Suggested Literature: The following textbooks are useful but not required. In addition, several research papers covering the topics discussed in class will be provided to students via the Moodle platform.


Instructors:

  • Prof. Dr. Michael Gertz