Data Science for Text Analytics
This is a new bachelor-level course, offered for the first time in the winter semester 2022/23. Parts of it originate from the course Text Analytics (ITA), which was offered in the winter semester 2020/21, primarily as a master-level course. ITA has now been split into two courses: the bachelor-level course described on this page and a follow-up master-level course that will be offered in an upcoming semester. Consequently, students who took the Text Analytics course in the winter semester 2020/21 are not eligible to take this new course.
This course is programming intensive (well, it is about Data Science)! In addition to a final group project, there will be assignments that cover conceptual (theoretical) aspects of text analytics models but often concentrate on mapping these concepts into programs and text analytics pipelines, using real-world text corpora. For the projects, we will exclusively use Python, text analytics frameworks such as spaCy, gensim, and scikit-learn, the OpenSearch framework as a backend component for text data storage and retrieval, and typical project tools such as GitHub and Docker. If you are uncomfortable with Python and with programming (in a team) in general, this class is very likely not the best fit for you.
In the following we give some information about this new course as it will appear in the module handbook.
Credit Points: 6 LP / CP (2h lecture + 2h exercise session)
Language of instruction: English
Workload: 180 h, of which 60 h lecture and 120 h self-study and work on assignments/projects (optionally in groups)
Time and Location:
- Lectures: Mondays 2-4pm, large lecture hall, Mathematikon
- Exercise Sessions: Thursdays 2-4pm, large lecture hall, Mathematikon
First Lecture: Monday, October 17
Applicable to courses of study: BSc Computer Science
Course Objectives: Students
- can implement and apply different text analytics methods using popular open source NLP and machine learning frameworks
- can describe different document and text representation models and can compute and analyze characteristic parameters of these models
- know the concepts and techniques underlying Information Retrieval (IR) systems and search engines
- know how to determine, apply, and interpret use-case specific document similarity measures and underlying ranking concepts
- know the concepts and techniques underlying different text classification and clustering approaches, such as Naïve Bayes and Logistic Regression
- understand the principles of evaluating results of text analytics components and tasks
- can implement a full stack text analytics pipeline, from backend IR component to frontend UI component
- are aware of ethical issues arising from applying text analytics in different domains
- are able to apply standard software engineering tools
Course Content:
- Text analytics in the context of data science
- Open source text analytics frameworks (e.g., spaCy, NLTK, gensim)
- Open source Information Retrieval (IR) systems and search engines (e.g., Elasticsearch, OpenSearch)
- Components of text analytics pipelines
- Document and text representation models (incl. TF-IDF, n-grams, and embeddings)
- Document and text similarity metrics (e.g., BM25)
- Text classification and clustering approaches (e.g., Naïve Bayes, logistic regression, kNN)
- Techniques for information extraction
- Approaches, techniques and corpora for benchmarking text analytics tasks
- Ethical and legal aspects of text analytics methods
- Text Analytics project management
Suggested Prerequisites: Solid knowledge of basic calculus, statistics, and linear algebra; good Python programming skills
Assessments: Assignments (40%) and Programming Project (60%). There will be about 4-5 assignments focusing on the material learned in class on a conceptual and formal level. In the group project, 3-4 students develop a prototypical text analytics framework using an open source search engine, including design and evaluation. Written project documentation and the code must be submitted at the end of classes, clearly indicating which student is responsible for which part of the project. Both assignments and project must be at least satisfactory (4,0) in order to pass the class.
Suggested Literature: The following textbook is useful but not required.
- Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft)
Furthermore, during the course of this lecture, several papers covering topics discussed in class will be provided.
Instructors:
- Prof. Dr. Michael Gertz
- Dennis Aumiller (Research and Teaching Assistant)