Week 2: Clustering | Computational Journalism

A vector of numbers is a fundamental data representation that forms the basis of many algorithms in data mining, language processing, machine learning, and visualization. This week we looked in depth at two things: representing objects as vectors, and clustering them, which might be the most basic thing you can do with this sort of data. Clustering requires a distance metric and a clustering algorithm, both of which involve editorial choices. In journalism we can use clusters to find groups of similar documents, analyze how politicians vote together, or automatically detect groups of crimes.
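To make the two steps concrete, here is a minimal sketch of representing legislators as vote vectors and clustering them. It is not the analysis from class: the vote data is invented, and the choice of Euclidean distance, k-means, two clusters, and fixed starting centers are all assumptions made for illustration. Each of those is exactly the kind of editorial choice mentioned above.

```python
import math

# Hypothetical toy data: each legislator is a vector of votes on five bills
# (1 = aye, -1 = no, 0 = absent). These values are invented for illustration.
votes = {
    "A": [1, 1, -1, -1, 1],
    "B": [1, 1, -1, 0, 1],
    "C": [-1, -1, 1, 1, -1],
    "D": [-1, 0, 1, 1, -1],
    "E": [1, 1, -1, -1, 0],
}

def euclidean(u, v):
    """Distance metric: one possible choice among many."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(points, initial_centers, distance, iterations=10):
    """Plain k-means: assign points to the nearest center, then move centers."""
    centers = [list(c) for c in initial_centers]  # copy so inputs are untouched
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda k: distance(p, centers[k]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = mean(members)
    return labels

names = list(votes)
points = [votes[name] for name in names]

# Assumption: start from legislators A and C instead of random centers,
# so the run is repeatable.
labels = kmeans(points, [points[0], points[2]], euclidean)

# Group legislator names by cluster label.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)

print(clusters)
```

Swapping `euclidean` for a different distance function, or changing the number of starting centers, can change which legislators end up grouped together, so those settings deserve the same scrutiny as any other reporting decision.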

Slides for week 2 are here.

See also a discussion of (and files for reproducing) the UK House of Lords voting analysis we did in class.

Here are the main references on this material (from the syllabus):

Cluster Analysis, Wikipedia
The Challenges of Clustering High Dimensional Data, Steinbach, Ertöz, Kumar
General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary King
Survey of clustering data mining techniques, Pavel Berkhin

And here are some of the other things we looked at today:

Interactive K-means clustering demonstration
The NOMINATE analysis of U.S. congressional voting
A related analysis of voting, done with a completely different, Bayesian technique
An Introduction to Multidimensional Scaling

Other uses of clustering in journalism:

Message Machine, ProPublica
‘GOP 5’ make strange bedfellows in budget fight, Chase Davis, California Watch
Data mining in politics, Aleks Jakulin
A House Divided, Delaware Online