Can we use machines to help us understand text? This class covers basic text analysis techniques, from word counting to topic modeling. The algorithms we discuss this week show up throughout journalism technology: in search engines, document set visualization, detecting when two different articles are about the same story, and finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how nearly every filtering and personalization system works.
Required
- Online Natural Language Processing Course, Stanford University
  - Week 7: Information Retrieval, Term-Document Incidence Matrix
  - Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
  - Week 7: Ranked Information Retrieval, Term Frequency Weighting
  - Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
  - Week 7: Ranked Information Retrieval, TF-IDF Weighting
Recommended
- A full-text visualization of the Iraq war logs, Jonathan Stray
- Introduction to Information Retrieval, Chapter 6: Scoring, Term Weighting, and the Vector Space Model, Manning, Raghavan, and Schütze
- Probabilistic Topic Models, David M. Blei
- General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary King
Examples
- Watchwords: Reading China Through its Party Vocabulary, Qian Gang
- Message Machine, ProPublica
Assignment: TF-IDF analysis of State of the Union speeches.
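As a starting point for the assignment, here is a minimal sketch of TF-IDF weighting and cosine similarity in the vector space model, using only the Python standard library. It is one common variant (raw term count times log(N / document frequency)), not necessarily the exact formula expected for the assignment. The function names and the three tiny "speeches" are made up for illustration and are not real State of the Union text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = raw term count * log(N / document frequency).
    This is one of several common TF-IDF variants.
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_terms(vector, k=3):
    """Return the k highest-weighted terms in a document vector."""
    return sorted(vector, key=vector.get, reverse=True)[:k]


# Hypothetical placeholder documents, not real speech text.
speeches = [
    "the economy and jobs jobs jobs",
    "the war and the troops",
    "the economy and the war",
]

vectors = tf_idf(speeches)
print(top_terms(vectors[0], 1))  # ['jobs']
print(cosine_similarity(vectors[0], vectors[2]))
```

Words that appear in every document, such as "the" here, get a weight of zero, which is why the distinctive word "jobs" rises to the top of the first document. For real speeches you would load each speech from a file and will probably want to think about stop words and very rare terms.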