Week 2: Text Analysis

Slides.

Can we use machines to help us understand text? In this class we will cover basic text analysis techniques, from word counting to topic modeling. The algorithms we will discuss this week underlie many everyday systems: search engines, document set visualization, detecting when two different articles are about the same story, and finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how most filtering and personalization systems work.
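
To make the vector space document model concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the course's reference implementation: the whitespace tokenizer, the raw-count term frequency, the unsmoothed log(N/df) inverse document frequency, and the three toy documents are all choices made for this example. Each document becomes a sparse vector of TF-IDF weights, and cosine similarity compares two vectors.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real systems do more.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        # Weight = term count * log(N / df); one of several common variants.
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two documents on one story, one on another.
docs = [
    "the senate passed the budget bill",
    "the budget bill passed in the senate",
    "the team won the championship game",
]
vectors = tf_idf_vectors(docs)
sim_same_story = cosine(vectors[0], vectors[1])
sim_other_story = cosine(vectors[0], vectors[2])
print(sim_same_story, sim_other_story)
```

In this toy corpus the word "the" appears in every document, so its weight is zero and it contributes nothing to similarity. The first two documents share their distinctive terms and score well above the third, which shares only "the" with them.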

Required

Online Natural Language Processing Course, Stanford University
- Week 7: Information Retrieval, Term-Document Incidence Matrix
- Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
- Week 7: Ranked Information Retrieval, Term Frequency Weighting
- Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
- Week 7: Ranked Information Retrieval, TF-IDF weighting

Recommended

A full-text visualization of the Iraq war logs, Jonathan Stray
Introduction to Information Retrieval, Chapter 6: Scoring, Term Weighting, and the Vector Space Model, Manning, Raghavan, and Schütze
Probabilistic Topic Models, David M. Blei
General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary King

Examples

Watchwords: Reading China Through its Party Vocabulary, Qian Gang
Message Machine, ProPublica

Assignment: TF-IDF analysis of State of the Union speeches.

Computational Journalism

At the Tow Center for Digital Journalism, Columbia University, as taught by Jonathan Stray
