Text Analytics (ITA)

This is a new MSc-level class focusing on the methods, techniques, and tools for text analytics from a Data Science perspective. We start with the phases of text analytics projects, continue with text processing tasks and the various approaches to text representation, and conclude with text analytics models, both traditional ones and those based on more advanced techniques such as neural networks.

This course is programming intensive! In addition to a final group project, there will be assignments that cover conceptual (theoretical) aspects of text analytics models but more often concentrate on turning these concepts into programs and text analytics pipelines, using real-world text corpora. For the projects, we will exclusively use Python and frameworks such as NLTK, spaCy, gensim, scikit-learn, and Keras. If you are uncomfortable with Python and programming in general, this class is likely not the best fit for you.

Below you will find more information about this class, following the format of the module handbook (the course is not included there yet).

Time and Location:

Lectures: Monday 2-4pm and Thursday 9-11am, online using WebEx (link will be posted via Müsli)
Exercise Sessions: Thursday 2-4pm, online using WebEx

First Lecture: Monday, November 2nd (tentative)

Applicable to courses of study: BSc Applied Computer Science, MSc Applied Computer Science, MSc Scientific Computing

Class Web Site: in Moodle; Link and Key will be provided through Müsli mailing list prior to the first lecture

Course Objectives: Students

can implement and apply different text analytics methods using open source NLP and machine learning frameworks
can describe different document and text representation models and can compute and analyze characteristic parameters of these models
know how to determine, apply, and interpret use-case specific document similarity measures and underlying ranking concepts
know the concepts and techniques underlying different text classification and clustering approaches
know different models for phrase extraction and text summarization and are able to apply these models and concepts using NLP and machine learning frameworks
know the fundamental methods for the extraction of document outlines at different levels of granularity
are familiar with basic concepts of topic models and their application in different text analytics tasks
understand the principles of evaluating results of text analytics tasks
know the theoretical background of machine learning methods in sufficient depth to be able to choose parameters and adapt an algorithm to a given text analytics problem
are aware of ethical issues arising from applying text analytics in different domains

Course Content:

Text analytics in the context of Data Science
Open source text analytics, NLP, and machine learning frameworks
Fundamentals of NLP pipeline components
Text Analytics project management
Document and text representation models
Document and text similarity metrics
Approaches, techniques and corpora for benchmarking text analytics tasks
Traditional and recent text classification and clustering approaches
Information extraction and topic detection approaches
Fundamentals of keyword and phrase extraction
Text summarization techniques
Generating document and text outlines
Ethical and legal aspects of text analytics methods

Suggested Prerequisites: solid knowledge of basic calculus, statistics, and linear algebra; good Python programming skills

Assessments: Assignments (40%) and Programming Project (60%). There will be about 4-5 assignments focusing on the material learned in class on a conceptual and formal level. In the group project, 3-4 students develop a prototypical text analytics framework, including design and evaluation. Written project documentation and the code need to be submitted at the end of classes, clearly indicating which student is responsible for which part of the project. Both the assignments and the project must be at least satisfactory (4,0) in order to pass the class.

Instructors

Prof. Dr. Michael Gertz
Satya Almasian (Research and Teaching Assistant)
Dennis Aumiller (Research and Teaching Assistant)