Assignment 2

For this assignment you will design a hybrid filtering algorithm. You will not implement it, but you will explain your design criteria and provide a filtering algorithm in sufficient technical detail to convince me that it might actually work — including psuedocode.

You may choose to filter:

  • Facebook status updates, like the Facebook news feed
  • Tweets, like Trending Topics or the many Tweet discovery tools
  • The whole web, like Prismatic
  • something else, but ask me first

Your filtering algorithm can draw on the content of the individual items, the user’s data, and other users’ data. The assignment goes like this:

1. List all available information that you have available during the debate. If you want to filter Facebook or Twitter, you may pretend that you are either of these companies and have access to all of their tweets etc. You also also assume you have a web crawler or a firehose of every RSS feed or whatever you like, but you must be specific and realistic about what data you are operating with.

2. Argue for the design factors that you would like to influence the filtering, in terms of what is desirable to the user, what is desirable to the publisher (e.g. Facebook or Prismatic), and what is desirable socially. Explain as concretely as possible how each of these (probably conflicting) might be achieved through in software. Since this is a hybrid filter, you can also design social software that asks the user for certain types of information (e.g. likes, votes, ratings) or encourages users to act in certain ways (e.g. following) that generate data for you.

3. Write psuedo-code for a function that produces a “top stories” list. This function will be called whenever the user loads your page or opens your app, so it must be fast and frequently updated. You can assume that there are background processes operating on your servers if you like. Your psuedo-code does not have to be executable, but it must be specific and unambiguous, such that a good programmer could actually go and implement it. You can assume that you have libraries for classic text analysis and machine learning algorithms.

4. Write up steps 1-3. The result should be no more than five pages. However, you must be specific and plausible. You must be clear about what you are trying to accomplish, what your algorithm is, and why you believe your algorithm meets your design goals (though of course it’s impossible to know for sure without testing; but I want something that looks good enough to be worth trying.)

The assignment is due before class on  Monday, October 29.


Week 6: Hybrid filters

In previous weeks we discussed filters that are purely algorithmic (such as NewsBlaster) and filters that are purely social (such as Twitter.) This week we discussed how to create a filtering system that uses both social interactions and algorithmic components.


Here are all the sources of information such an algorithm can draw on.

We looked at two concrete examples of hybrid filtering. First, the Reddit comment ranking algorithm, which takes the users’ upvotes and downvotes and sorts not just by the proporition of upvotes, but by how certain we are about proportion, given the number of people who have actually voted so far. Then we looked at item-based collaborative filtering, which is one of several classic techniques based on a matrix of users-item ratings. Such algorithms power everything from Amazon’s “users who bought A also bought B” to Netflix movie recommendations to Google News’ personalization system.

Evaluating the performance of such systems is a major challenge. We need some metric, but not all problems have an obvious way to measure whether we’re doing well. There are many options. Business goals — revenue, time on site, engagement — are generally much easier to measure than editorial goals.

Finally, we saw a presentation from Dr. Aria Haghighi, co-founder of the news personalization service Prismatic, on how his system crawls the web to find diverse articles that match user interests.

The readings for this week were:

This concludes our work on filtering systems — except for Assignment 2.

Week 4: Information overload and algorithmic filtering

This is the first of three weeks on “filtering.” We define that word by looking at a feedback model of journalism: a journalist observes something happening in the world, produces a story about it, the user consumes the story, and then they potentially act in some way that changes the world (such as voting, choosing one product over another, protesting, or many other possible outcomes.) This follows David Bornstein’s comment that “journalism is a feedback mechanism to help society self-correct.

This diagram is missing something obvious: there are lots and lots of topics in the world, hence many stories. Not every potential story is written, and not every written story is consumed by every user.

This is where “filtering” comes in, the arrow on the bottom right. Somehow, the user sees only a subset of all produced stories. The sheer, overwhelming logic of the amount of journalism produced versus hours in the day requires this (and we illustrated this with some numerical examples in the slides.)

(Incidentally, journalism as an industry has mostly been involved with story production, the upper-right arrow, and more recently has been very concerned about how fewer reporters result in more stories not covered, the upper left arrow. The profession has, historically, payed much less attention to the effects of its work, bottom left, and the filtering problem, bottom right.)

(There is another major thing missing from this diagram: users now often have access to the same sources as journalists, and in any case journalism is now deeply participatory. We’ll talk a lot more about this next week.)

This week we focussed on purely algorithmic filtering. As a concrete example, we examined the inner workings of the Columbia Newsblaster system, a predecessor of Google News which is conveniently well documented.

Slides for week 4.

The readings (from the syllabus) were mostly to get you thinking about the general problem of information overload and algorithmic filtering, but the Newsblaster paper is also in there.

Actually, much of the guts of Newsblaster is in this paper on their on-line clustering algorithm that groups together all stories which are about the same underlying event. Note the heavy reliance on our good friends from last week: TF-IDF and cosine distance. The graphs in this paper show that for this problem, you can do better than TF-IDF by adding features corresponding to extracted entities (people, places, dates) but really not by very much.

We wrapped up with a discussion about the problem of algorithmic filter design. We defined this problem on two levels. In terms of functional form,

and in terms of the much more abstract desirable attributes

 The great challenge is to connect these two levels of description: to express our design criteria in terms of an algorithm. Here are the notes from our brief discussion about how to do this.

On the right we have interest, effects, agency, my proposed three criteria for “when should a user see a story.” Somehow, these have to be expressed in terms of computational building blocks like TF-IDF and all of the various signals available to the algorithm.  That’s what the fuzzy arrow is… there’s a gap here, and it’s a huge gap.

On the left are some of the factors to consider in trying to assess whether a particular story is interesting, will effect, or can be acted on by a particular user: geo information (location of user and story), user’s industry and occupation, other user demographics, the people in the user’s social network, the “content” they’ve produced (everything they’ve ever tweeted, blogged, etc.), and the time or duration of the story event. We can also offload parts of the selection process to the user, by showing multiple stories or types of stories and having the user pick. Similarly we can offload parts of the problem to the story producer, who might use various techniques to try to target a particular story to a particular group of people. We’ll talk extensively about including humans in the filtering system in the next two weeks.

The bracket and 2^N notation just means that any combination of these factors might be relevant. E.g. location and occupation together might be a key criteria.

In the center of the board I recorded one important suggestion: we can use machine learning to teach the computer which are the right articles for each factor. For example, suppose we’re trying to have the algorithm decide which stories are about events that affect people in different occupations. For each occupation, a human can collect many stories that someone in that occupation would want to read, then we can take the average of the TF-IDF vectors of those stories to define a subject category. The computer can then compare each incoming story to the corresponding coordinate for each user’s occupation.

I don’t know whether this particular scheme will work, but having the humans teach the computers is an essential idea — and one that is very common in search engines and filtering systems of all kinds.


Assignment 1

I’m altering the structure of the assignment a little bit from the version in the original syllabus, in the hopes of making it more interesting. We may even be able to learn and document somethings that seem to be missing from the literature.

The assignment goes like this:

1) Get your data into a standard format. You have all now chosen a dataset that contains, at least in part, a significant text corpus. Your first task will be to scrape, format, or otherwise coerce this information into a convenient format for Python to read.

I recommend a very simple format: plain text, one document per line, all documents in one file.

It’s not completely trivial to get the documents in this format, but it shouldn’t be hard either. The first task is to extract plain text from your documents. If your source material is PDF, you can use the pdftotext command (pre-installed on Macs, available for Windows and Linux as part of xpdf). If it’s HTML, you may want to delete matches to the regex /<[^>]*>/ to remove all tags; you may also want to scrape the content of a particular div, as opposed to the whole page source. If you need to scrape your data from web pages, I heartily recommend writing a little script within the ScraperWiki framework.

Obviously this one-document-per-line format can’t represent the newlines in the original document, but that’s ok, because we’re going to throw them out during tokenization anyway. So you can just replace them with spaces.

2) Feed the data into gensim. Now you need to load the documents into Python and feed them into the gensim package to generate tf-idf weighted document vectors. Check out the gensim example code here. You will need to go through the file twice: once to generate the dictionary (the code snippet starting with “collect statistics about all tokens”) and then again to convert each document to what gensim calls the bag-of-words representation, which is un-normalized term frequency (the code snippet starting with “class MyCorpus(object)”

Note that there is implicitly another step here, which is to tokenize the document text into individual word features — not as straightforward in practice as it seems at first, but the example code just does the simplest, stupidest thing, which is to lowercase the string and split on spaces. You may want to use a better stopword list, such as this one.

Once you have your Corpus object, tell gensim to generate tf-idf scores for you like so.

3) Do topic modeling. Now you can apply Latent Semantic Indexing or Latent Dirichlet Allocation to the tf-idf vectors, like so. You will have to supply the number of dimensions to keep. Figuring out a good number is part of the assignment. Note that you dont have to do topic modeling — this is really a dimensionality reduction / de-noising step, and depending on your documents and application, may not be needed. If you want to try working with the original tf-idf vectors, that’s OK too. That’s what Overview does.

4) Analyze the vector space. So now you have a bunch of vectors in a space of some dimension, each of which represents a document, and we hope that similar documents are close in this space (as measured by cosine distance.) Have we gotten anywhere? There are several things we could do at this point:

  1. Choose particular document, then find the k closest documents. Are they related? How? (Read the text of the documents to find out) How big do you have to make k before you see documents that seem unrelated?
  2. Run a clustering algorithm, such as any of those in the python cluster package. Then look at the documents in each cluster, again reading their text and reporting on the results. Non-hierarchical clustering algorithms generally take the number of clusters as a parameter, while with hierarchical clusterings you have a choice of what level of the tree to examine. How do these choices affect what you see?
  3. Or, you could run multi-dimensional scaling to plot the entire space in 2D, perhaps with some other attribute (time? category?) as a color indicator variable. This is probably best done in R. To get your document vectors into R, write them out of gensim in the MatrixMarket format, then load them in R (remember you’ll need to do “library(Matrix)” first to make readMM available in R.) Then you’ll want to compute a distance matrix of the documents, run mdscale on it, and plot the result like we did in the House of Lords example.
  4. If you did LSI or LDA topic modeling, what do the extracted topics look like? Do they make any sort of human sense? Can you see examples of polysemy or synonymy? If you pull out the k docs with the highest score on a particular topic, what do you find? How many documents have no clear primary topic? What do the low-order topics (far down the dimension list) look like? How many dimensions until it just seems like noise?

Your assignment is to do one of these things, whichever you think will be most interesting. You may also discover that it is hard to interpret the results without trying some of the other techniques. Actually, figuring out how, exactly, to evaluate the clustering is part of this assignment. Hint: one useful idea is to ask, how might a human reporter organize your documents? Where did the computer go wrong?

You will of course need to implement cosine distance either in Python or R to make any of these go. This should be only a few lines…

5) Compare to a different algorithmic choice. Now do steps 3 and 4 again, with a different choice of algorithm or parameter. The point of this assignment is to learn how different types of clustering give qualitatively different results on your document set… or not. So repeat the analysis, using either:

  • a different topic modeling algorithm. If you used plain tf-idf before, try LSI or LDA. Or if you tried LSI, try LDA. Etc.
  • a different number of clusters, or a different level in the hierarchical clustering tree.
  • a different number of output dimensions to LSI or LDA.
  • a different distance function
  • etc.

I want to know which of your two cases gives “better” results. What does “better” mean? It depends very much on what the interesting questions are for your data set. Again, part of the assignment is coming up with criteria to evaluate these clusterings. Generally, more or easier insight is better. (If the computer isn’t making your reporting or filtering task significantly easier, why use it?)

6) Write up the whole thing. I will ask you to turn in your the code you wrote but that’s really only to confirm that you actually did all the steps. I am far more interested in what you have learned about the best way to use these algorithms on your data set. Or if you feel you’ve gained little or no insight into your data set using these techniques, explain why, and suggest other ways to explore it.

This assignment is due Monday, October 15 at 4:00 PM. You may email me the results. I am available for questions by email before then, or in person at office hours on Thursday afternoons 1-4.


Week 3: Document Topic Modeling

The text processing algorithms we discussed this week are used in just about everything: search engines, document set visualization, figuring out when two different articles are about the same story, finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how just about every filtering and personalization system works.


Here’s a slide I made showing the difference between TF and TF-IDF on the structure of the document vector space. I’m sure someone has tried this before, but It’s the first such comparison that I’ve seen, and matches the theoretical arguments put forth by Salton in 1975.

Here were the readings (from the syllabus):

  • Online Natural Language Processing Course, Stanford University
    • Week 7: Information Retrieval, Term-Document Incidence Matrix
    • Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
    • Week 7: Ranked Information Retrieval, Term Frequency Weighting
    • Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
    • Week 7: Ranked Information Retrieval, TF-IDF weighting
  • Probabilistic Topic Models, David M. Blei
  • A full-text visualization of the Iraq war logs, Jonathan Stray
  • Latent Semantic Analysis, Peter Wiemer-Hastings
  • Probabilistic Latent Semantic Indexing, Hofmann
  • Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector Space Model, Manning, Raghavan, and Schütze.

Jeff Larson of ProPublica came and spoke to us about his work on Message Machine, which uses tf-idf to cluster and “reverse engineer” the campaign emails from Obama and Romney.

I mentioned that it’s possible to do better than TF-IDF weighting but not hugely, and that using bigrams as features doesn’t help much. My personal favorite research on term-weighting uses a genetic algorithm to evolve optimal formulas. (All of these discussions of “best” are measured with respect to a set of documents manually tagged as relevant/not relevant to specific test queries, which matches search engine use but may not be what we want for specific journalistic inquiries.)

Finally, you may be interested to know that TF-IDF + LSA  + cosine similarity has a correlation of 0.6 with asking humans to rate the similarity of two documents on a linear scale, which is just as strongly as different human estimates correlate with each other. In a certain sense, it is not possible to create a better document similarity algorithm because there isn’t a consistent human definition of such a concept. However, these experiments were done on short news articles and it might be possible to do very much better over specialized domains.

This week we also started Assignment 1.

Using clustering to analyze the voting blocs in the UK House of Lords

In this week’s class, we discussed clustering algorithms and their application to journalism. As an example, we built a distance metric to measure the similarity of the voting history between two members of the UK House of Lords, and used it with multi-dimensional scaling to visualize the voting blocs.

The data comes from The Public Whip, an independent site that scrapes the British parliamentary proceedings (the “hansard“) and extracts the voting record into a database. The files for the House of Lords are here. They’re tab-delimited, which is not the easiest format for R to read, so I opened them in Excel and re-saved as CSV. I also removed the descriptive header information from votematrix-lords.txt (which, crucially, explains how the vote data is formatted.)

The converted data files plus the scripts I used in class are up on GitHub. To get them running, first you’ll need to install R for your system. Then you’ll need the “proxy” package, which I have conveniently included with the files. To install proxy: on WIndows, “R CMD INSTALL” from the command line. On Mac, “R CMD INSTALL proxy_0.4-9.tgz” from  Terminal.

Then start R, and enter source(“lords-votes.R”)

You should see this (click for larger):

And voila! There’s a lot of structure here. The blue cluster on the left are all the lords of the Labor party. The red dots in the cluster on the right are likewise members of the Conservative party, who are currently in a coalition government with Liberal Democrats — magenta, and mostly overlapping the conservative cluster. The green dots throughout the figure are the crossbenchers, and we can see that they really are indpendent. The scattered black dots are the Bishops, who don’t seem to vote together or, really, quite like anyone else, but are vaguely leftist.

Let’s break down how we made this plot — which will also illuminate how to interpret it, and how much editorial choice went into its construction. No visualization is gospel, though, as we shall see, there are reasons to believe that what this one is telling us is accurate.

The first section of lords-votes.R just loads in the data, and convers the weird encoding (2=aye, 4=nay, -9=not present, etc.) into a matrix of 1 for aye, -1 for nay, and 0 did not vote. The votes matrix ends up with 1043 rows, each of which contains the votes cast by a single Lord on 1630 occasions.

The next section truncates this to the 100 most recent votes. The 1630 votes in this file go back to 1999, and the voting patterns might have changed substantially over that time. Besides, over 200 Lords have retired since then. So we arbitrarily drop all but the last 100 votes, which gives a period of voting history stretching back to mid-November 2011. A more sophisticated analysis could look at how the voting blocs might have varied over time.

Then we get to the distance function, which is the heart of the code.  We have to write a function that takes two vectors of votes, representing the voting history of two Lords, and returns the “distance” between them. This function can take any form but it has to obey the definition of a distance metric. That definition says zero means “identical,” but doesn’t put any constraint on the maximum “distance” between Lords. For simplicity and numerical stability, we’ll say the distance varies from zero (identical voting) to one (no votes in common.)

For our distance function, we just count the fraction of votes where Lord 1 and Lord 2 voted the same (or rather, one minus this fraction, so identical = zero.)  Except that not every Lord voted on every issue. So, we first find the overlap of issues where they both voted, and then count their identical votes on that subset. (If there is no overlap, we return 1, i.e. as far apart as possible)

# distance function = 1 - fraction of votes where both voted, and both voted the same
votedist <- function(v1, v2) {
   overlap = v1!=0 & v2!=0
   numoverlap = sum(overlap)
   match = overlap & v1==v2
   nummatch = sum(match)
   if (!numoverlap) 
      dist = 1 
      dist = 1- (nummatch/numoverlap)

With this distance function in hand, we can create a distance matrix. In the final section of the script, we pass this distance function to R’s built-in multi-dimensional scaling function, mdscale. And then we plot, using the party as a color. That plot again:

And there we are, with our voting blocs: Labour (blue), Conservative (red) with Liberal Democrats (magenta), the crossbenchers (green), and the Bishops (black).

But now we have to step back and ask — is this “objective”? Is this “real”? The plot seems like it comes right from the data, but consider: we wrote a distance function. There’s no particular reason our little distance function is the “right” way to compare Lords, journalistically speaking. We also decided to use multidimensional scaling here, whereas there are other techniques for visualizing voting spatially, such as the well known NOMINATE algorithm.

Also, the chart is very abstract. There seems to be a clear left-right axis, but what does the vertical axis mean? We know that MDS tries to keep distant points far away, so something is driving the dots near the top of the chart away from the others… but what does this really mean? Consider that we’ve already added a layer of interpretation by coloring by party affiliation; here’s the same chart without the color:

A lot less clear, isn’t it? Yes, our clustering tricks made us a chart, but to understand what it means, we have to supply context from outside the data. In this case, we know that voting behavior likely has a lot to do with party affiliation, so we chose to use party as an indpendent variable. You can think of it like this: in the colored plot, we’re graphing party against clustered coordinate. The fact that the clusters have clear solid colors means that, as we suspected, party is strongly correlated with voting history. The fact that Labor and Conservative are on opposite ends also makes sense, because we already imagine them to be on different ends of the political spectrum.

But what about the vertical axis? It doesn’t seem to correlate particularly well with party. What does it mean? Well, we’d have to dig into the data to see what the distance function and MDS algorithm have done here, and try to come up with an explanation. (And actually, the second dimension may not mean much at all — there is good evidence, from the NOMINATE research, that at least American voting behavior is typically only one-dimensional.)

In other words, it’s the human that interprets the plot, not the computer. The description at the start of this post of what this visualization tells us — the computer didn’t write that.

Let’s try a different distance function to see the effect of this piece of the algorithm. Our current distance function ignores the total number of votes where both Lords were present. If two Lords only voted on the same issue once, but they voted the same way, it will still score them as identical. That doesn’t seem quite right.

So let’s modify the crucial line of the distance function to penalize voting histories that don’t overlap much. (You can try this yourself by uncommenting line 47 in lords-votes.r)

      dist = 1- ((nummatch/numoverlap) * log(numoverlap)/log(Nvotes))

I’ve multiplied the fraction of matching votes by the fraction of issues that both Lords voted on (out of the possible Nvotes=100 most recent issues that we’re looking at.) I’ve also put a logarithm in there to compress the scale, because I want going from 5 to 10 votes to count more than going from 95 to 100 votes. This is all just made up, right out of my head, on the intuition that I want the distances to get large when there aren’t really enough overlapping votes to really say if two Lords are voting similarly or not. The resulting plot looks like this:

A bunch of Lords have been pushed up near the top of the diagram — away from everything else. These must be the Lords who didn’t participate in many votes, so our distance function says they’re not that close to anyone, really. But the general structure is preserved. This is a good sign. If what the visualization tells us is similar even when we try it lots of different ways, the result is robust and probably real. And I certainly recommend looking at your data lots of different ways — if you just pick the first chart that shows what you wanted to say anyway, that’s not journalism, it’s confirmation bias.

I want to leave you with one final thought, something even more fundamental than the choice of distance function or visualization algorithm: is voting record really the best way to see how politicians operate? What about all the other things they do? We could equally well ask about legislation they’ve drafted or the committees they’ve participated in. (We could even imagine defining a distance function based on the topics of the legislation that each representative has contributed to shaping.) As always, the first editorial choice is where to look.

Week 2: Clustering

A vector of numbers is a fundamental data representation which forms the basis of very many algorithms in data mining, language processing, machine learning, and visualization. This week we will looked in depth at two things: representing objects as vectors, and clustering them, which might be the most basic thing you can do with this sort of data. This requires a distance metric and a clustering algorithm — both of which involve editorial choices. In journalism we can use clusters to find groups of similar documents, analyze how politicians vote together, or automatically detect groups of crimes.

Slides for week 2 are here.

See also a discussion of (and files for reproducing) the UK House of Lords voting analysis we did in class, this one:

Here are the main references on this material (from the syllabus):

And here are some of the other things we looked at today:

Other uses of clustering in journalism:

Week 1

Here are the slides for Week 1, class of 10 September.

You will need to choose a data set to work on by the 24th. Here are some of the projects and data sources we discussed in class, which might provide inspiration — but pretty much anything of public interest is fair game. Remember, there are many types of data: documents, tables of numbers, networks of relationships, images, and more.

Syllabus – Fall 2012

Aims of the course

The aim of the course is to familiarise students with current areas of research and development within computer science that have a direct relevance to the field of journalism, so that they are capable of participating in the design of future public information systems.

The course is built around a “design” frame that examines technology from the point of view of its possible applications and social context. It will familiarize the students with both the major unsolved problems of internet-era journalism, and the major areas of research within computer science that are being brought to bear on these problems. The scope is wide enough to include both relatively traditional journalistic work, such as computer-assisted investigative reporting, and the broader information systems that we all use every day to inform ourselves, such as search engines. The course will provide students with a thorough understanding of how particular fields of computational research relate to products being developed for journalism, and provoke ideas for their own research and projects.

Research-level computer science material will be discussed in class, but the emphasis will be on understanding the capabilities and limitations of this technology. Students with a CS background will have opportunity for algorithmic exploration and innovation, however the primary goal of the course is thoughtful, application-centered research and design.

Assignments will be completed in groups and involve experimentation with fundamental computational techniques. There will be some light coding, but the emphasis will be on thoughtful and critical analysis.

Format of the class, grading and assignments.

It is a fourteen week course for Masters’ students which has both a six point and a three point version. The six point version is designed for dual degree candidates in the journalism and computer science concentration, while the three point version is designed for those cross listing from other concentrations and schools.

The class is conducted in a seminar format. Assigned readings and computational techniques will form the basis of class discussion. Throughout the semester we will be inviting guest speakers with expertise in the relevant areas to talk about their related research and product development

The output of the course for a 6pt candidate will be one research assignment in the form of a 25-page research paper. The three point course will require a shorter research paper, and both versions of the course will also have approximately bi-weekly written assignmenst which will frequently involve experimentation with computational techniques. For those in the dual degree program or who have strong technical skills, there is an option to produce a prototype as part of the final assignment. The class is conducted on pass/fail basis for grading, in line with the journalism school’s grading system.

Week 1. – Basics
We set out the expectations of the course, and frame our work as the task of designing of public information production and distribution systems. Computer science techniques can help in four different areas: data-driven reporting, story presentation, information filtering, and effect tracking. The recommended readings are aiming to to give you an understanding of the landscape of technical disruption in the news industry, and the ways in which computer science techniques can help to build something better.



  • Newspapers and thinking the Unthinkable, Clay Shirky
  • Computational Journalism, Cohen, Turner, Hamilton,
  • Precision Journalism, Ch.1, Journalism and the Scientific Tradition, Philip Meyer

Viewed in class

Weeks 2-3: Technical fundamentals
We’ll spend the next couple weeks examining the techniques that will form the basis of much of the rest of our work in the course: clustering and the document vector space model.

Week 2: Clustering
A vector of numbers is a fundamental data representation which forms the basis of very many algorithms in data mining, language processing, machine learning, and visualization. This week we will explore two things: representing objects as vectors, and clustering them, which might be the most basic thing you can do with this sort of data. This requires a distance metric and a clustering algorithm — both of which involve editorial choices! In journalism we can use clusters to find groups of similar documents, analyze how politicians vote together, or automatically detect groups of crimes.



Viewed in class

Assignment: you must choose your groups of 2-3 students, and pick a data set to work with for the rest of the course. Due next week.

Week 3: Document topic modelling
The text processing algorithms we will discuss this week are used in just about everything: search engines, document set visualization, figuring out when two different articles are about the same story, finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how just about every filtering and personalization system works.


  • Online Natural Language Processing Course, Stanford University
    • Week 7: Information Retrieval, Term-Document Incidence Matrix
    • Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
    • Week 7: Ranked Information Retrieval, Term Frequency Weighting
    • Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
    • Week 7: Ranked Information Retrieval, TF-IDF weighting
  • Probabilistic Topic Models, David M. Blei


Assignment – due in three weeks
You will perform document clustering with the gensim Python library, and analyze the results.

  1. Choose a document set. You can use the Reuters corpus if you like but you are encouraged to try other sources.
  2. Import the documents and score them in TF-IDF form. Then query the document set by retrieving the top ten closest documents (as ranked by cosine distance) for a variety different queries. Choose three different queries that show interesting strengths and weaknesses of this approach, and write analysis of the results.
  3. Choose a topic modelling method (such as connected components, LSA, or LDA) and cluster your documents. Hand in the extracted topics and comment on the results.
  4. Choose a clustering method (such as k-means) and cluster the documents based on the extracted topics. How do the resulting clusters compare to how a human might categorize the documents

Weeks 4-5: Filtering
Over the next few weeks we will explore various types of collaborative filters: social, algorithmic, hybrid classic correlation-based filtering algorithms (“users who bought X also bought Y”, Netflix Prize) location- and context-based filtering. Our study will include the technical fundamentals of clustering and recommendation algorithms.

Week 4: Information overload and algorithmic filtering
This week we begin our study of filtering with some basic ideas about its role in journalism. Then we shift gears to pure algorithmic approaches to filtering, with a  look at how the Newsblaster system works (similar to Google News.)



Week 5: Social software and social filtering
We have now studied purely algorithmic modes of filtering, and this week we will bring in the social. First we’ll look at the entire concept of “social software,” which is a new interdisciplinary field with its own dynamics. We’ll use the metaphor of “architecture,” suggested by Joel Spolsky, to think about how software influences behaviour. Then we’ll study social media and its role in journalism, including its role in information distribution and collection, and emerging techniques to help find sources.



Week 6: Hybrid filters, recommendation, and conversation
We have now studied purely algorithmic and mostly social modes of filtering. This week we’re going to study systems that combine software and people. We’ll a look “recommendation” systems and the socially-driven algorithms behind them. Then we’ll turn to online discussions, and hybrid techniques for ensuring a “good conversation” — a social outcome with no single definition. We’ll finish by looking at an example of using human preferences to drive machine learning algorithms: Google Web search.



Assignment – due in two weeks:
Design a filtering algorithm for Facebook status updates. The filtering function will be of the form(status update, user data) => boolean. That is, given all previously collected user data and a new status update from a friend, you must decide whether or not to show the new update in the user’s news feed. Turn in a design document with the following items:

  1. List all available information that Facebook has about you. Include a description of how this information is collected or changes over time.
  2. Argue for the factors that you would like to influence the filtering, both in terms of properties that are desirable to the user and properties that are desirable socially. Specify as concretely as possible how each of these (probably conflicting) goals might be implemented in code.
  3. Write psuedo-code for the filter function. It does not need to be executable and may omit details, however it must be specific enough that a competent programmer can turn it into working code in an obvious way.

Weeks 7-9: Knowledge mining

Week 7: Visualization
An introduction into how visualisation helps people interpret  information. The difference between infographics and visualization, and between exploration and presentation. Design principles from user experience considerations, graphic design, and the study of the human visual system. Also, what is specific about visualization in journalism, as opposed to the many other fields that use it?



Week 8: Structured journalism and knowledge representation
Is journalism in the text/video/audio business, or is it in the knowledge business? This week we’ll look at this question in detail, which gets us deep into the issue of how knowledge is represented in a computer. The traditional relational database model is often inappropriate for journalistic work, so we’re going to concentrate on so-called “linked data” representations. Such representations are widely used and increasingly popular. For example Google recently released the Knowledge Graph. But generating this kind of data from unstructured text is still very tricky, as we’ll see when we look at th Reverb algorithm.



Assignment: Use Reverb to extract propositions from a subset of your data set (if applicable, otherwise the Reuters corpus). Analyze the results. What types of propositions are extracted? What types of propositions are not? Does it depend on the wording of the original text? What mistakes does Reverb make? What is the error rate? Are there different error rates for different types of statements, sources, or other categories?

Week 9: Network analysis
add intelligence examples?
Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, but not so much in journalism. We’ll look at basic techniques and algorithms and try to understand the promise — and the many practical problems.



Examples of journalistic network analysis

Week 10: Drawing conclusions from data
You’ve loaded up all the data. You’ve run the algorithms. You’ve completed your analysis. But how do you know that you are right? It’s incredibly easy to fool yourself, but fortunately, there is a long history of fields grappling with the problem of determining truth in the face of uncertainty, from statistics to intelligence analysis.


  • basic stats concepts?
  • Correlation and causation, Business Insider
  • The Psychology of Intelligence Analysis, chapters 1,2,3 and 8. Richards J. Heuer


Week 11: Security, Surveillance, and Censorship
intro to crypto?
‘On the internet everyone knows you are a dog’. Both in commercial and editorial terms the issues of online privacy, identity and surveillance and important for journalism. Who is watching our online works? How do you protect a source in the 21st Century? Who gets to access to all of this mass intelligence, and what does the ability to survey everything all the time mean both practically and ethically for journalism?



Cryptographic security


Assignment: Come up with situation in which a source and a journalist need to collaborate to keep a secret. Describe in detail:

  1. The threat model. What are the risks?
  2. The adversary model. Who must the information be kept secret from? What are their capabilities, interests, and costs?
  3. A plan to keep the information secure, including tools and practices
  4. An evaluation of the costs, possible sources of failure, and remaining risks

Week 12: Tracking flow and impact
How does information flow in the online ecosystem? What happens to a story after it’s published? How do items spread through social networks? We’re just beginning to be able to track ideas as they move through the network, by combining techniques from social network analysis and bioinformatics.



Week 13 – Project review
We will spend this week discussing your final projects and figuring out the best approaches to your data and/or topic.

Week 14
Review of course. Invited guest panel of computer scientists working both within journalism and in related fields concerned with public information, discuss their priorities and answer questions about what their priorities.