Syllabus Fall 2015 | Computational Journalism

The course is a hands-on introduction to the areas of computer science that have a direct relevance to journalism, and the broader project of producing an informed and engaged public. We will touch on many different technical and social topics: information recommendation systems but also filter bubbles, principles of statistical analysis but also the human processes which generate data, network analysis and its role in investigative journalism, visualization techniques and the cognitive effects involved in viewing a visualization. Assignments will require programming in Python, but the emphasis will be on clearly articulating the connection between the algorithmic and the editorial.

Our scope is wide enough to include both relatively traditional journalistic work, such as computer-assisted investigative reporting, and the broader information systems that we all use every day to inform ourselves, such as search engines and social media. The course will provide students with a thorough understanding of how particular fields of computational research relate to journalism practice, and provoke ideas for their own research and projects.

Research-level computer science material will be discussed in class, but the emphasis will be on understanding the capabilities and limitations of this technology. Students with a CS background will have the opportunity for algorithmic exploration and innovation; however, the primary goal of the course is thoughtful, application-oriented research and design.

Assignments will be completed in groups (except dual degree students, who will work individually) and involve experimentation with fundamental computational techniques. Some assignments will require intermediate-level coding in Python, but the emphasis will be on thoughtful and critical analysis. As this is a journalism course, you will be expected to write clearly.

Format of the class, grading and assignments.
This is a fourteen-week course for Master’s students, offered in both a six-point and a three-point version. The six-point version is designed for CS & journalism dual degree students, while the three-point version is designed for those cross-listing from other schools. The class is conducted in a seminar format. Assigned readings and computational techniques will form the basis of class discussion. The course will be graded as follows:

Assignments: 80%. There will be a homework assignment after most classes.
Class participation: 20%

Dual degree students will also have a final project. This will be either a research paper, a computationally-driven story, or a software project. The class is conducted on a pass/fail basis for journalism students, in line with the journalism school’s grading system. Students from other departments will receive a letter grade.

Week 1: Basics – 9/11
Slides.
First we ask: where do computer science and journalism intersect? CS techniques can help journalism in four different areas: data-driven reporting, story presentation, information filtering, and effect tracking. Then we jump right in with the concept of data. Specifically, we study the quantification process, leading to feature vectors which are a fundamental data representation for many techniques.

Required

What should the digital public sphere do?, Jonathan Stray
Computational Journalism, Cohen, Turner, Hamilton

Recommended

Precision Journalism, Ch.1, Journalism and the Scientific Tradition, Philip Meyer

Viewed in class

The Jobless rate for People Like You, New York Times
Dollars for Docs, ProPublica
What did private security contractors do in Iraq and document mining methodology, Jonathan Stray
Message Machine, ProPublica

Week 2: Clustering – 9/18
Slides.
A vector of numbers is a fundamental data representation which forms the basis of very many algorithms in data mining, language processing, machine learning, and visualization. This week we will explore two things: representing objects as vectors, and clustering them, which might be the most basic thing you can do with this sort of data. This requires a distance metric and a clustering algorithm — both of which involve editorial choices! In journalism we can use clusters to find groups of similar documents, analyze how politicians vote together, or automatically detect groups of crimes.
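The two steps named above, choosing a distance metric and running a clustering algorithm, can be sketched in a few lines of Python. This is an illustrative sketch only, not course material: the legislators and their votes are invented, `euclidean` and `kmeans` are hypothetical helper names, and k-means with Euclidean distance is just one of many possible combinations.

```python
import math

# Hypothetical voting records: 1 = yes, 0 = no, one entry per bill.
votes = {
    "Adams": [1, 1, 0, 0, 1],
    "Baker": [1, 1, 0, 1, 1],
    "Chen":  [0, 0, 1, 1, 0],
    "Diaz":  [0, 0, 1, 1, 1],
}

def euclidean(a, b):
    """Distance metric: straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=10):
    """Very small k-means: returns a dict mapping each name to a cluster index."""
    names = list(vectors)
    # Assumption: start from the first k vectors so the result is repeatable.
    centroids = [list(vectors[n]) for n in names[:k]]
    assignment = {}
    for _ in range(iterations):
        # Step 1: assign every vector to its nearest centroid.
        for n in names:
            assignment[n] = min(
                range(k), key=lambda c: euclidean(vectors[n], centroids[c])
            )
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

clusters = kmeans(votes, k=2)
print(clusters)
```

Swapping the distance function or changing `k` can change which legislators end up grouped together, which is where the editorial choice enters.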

Required

Cluster Analysis, Wikipedia

Recommended

General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary King
‘GOP 5′ make strange bedfellows in budget fight, Chase Davis, California Watch
The Challenges of Clustering High Dimensional Data, Steinbach, Ertöz, Kumar
Survey of clustering data mining techniques, Pavel Berkhin

Viewed in class

Week 3: Text Analysis – 9/25
Slides.
Can we use machines to help us understand text? In this class we will cover basic text analysis techniques, from word counting to topic modeling. The algorithms we will discuss this week are used in just about everything: search engines, document set visualization, figuring out when two different articles are about the same story, finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how just about every filtering and personalization system works.
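As a rough sketch of how word counting becomes a vector space document model, the following Python computes one common variant of TF-IDF. The three "speeches" are invented, `tf_idf` is a hypothetical helper name, and the weighting used here (term frequency times the log of inverse document frequency) is an assumption; real systems use many variations.

```python
import math
from collections import Counter

# Hypothetical miniature corpus, invented for illustration.
docs = {
    "speech_a": "the economy is strong and the economy is growing",
    "speech_b": "the war is over and the troops are home",
    "speech_c": "the economy and the war shaped the year",
}

def tf_idf(docs):
    """Return {document name: {word: tf-idf weight}}."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        scores[name] = {
            # Term frequency (share of the document) times inverse document frequency.
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return scores

scores = tf_idf(docs)
print(scores["speech_a"])
```

Words that occur in every document, such as "the", receive a weight of zero, while words concentrated in a few documents keep a positive weight. Each document's dictionary of weights is its (sparse) vector.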

Required

TF-IDF is about what matters, Aaron Schumacher
Topic modeling by hand, Shawn Graham
How ProPublica’s Message Machine reverse engineers political microtargeting, Jeff Larson

Recommended

A full-text visualization of the Iraq war logs, Jonathan Stray
Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector Space Model, Manning, Raghavan, and Schütze.
Online Natural Language Processing Course, Stanford University
Probabilistic Topic Models, David M. Blei

Examples

Watchwords: Reading China Through its Party Vocabulary, Qian Gang

Assignment: TF-IDF analysis of State of the Union speeches.

Week 4: Information overload and algorithmic filtering – 10/2
Slides.
This week we begin our study of filtering with some basic ideas about its role in journalism. Then we shift gears to purely algorithmic approaches to filtering, with a look at how the Newsblaster system works (similar to Google News).

Required

Who should see what when? Three design principles for personalized news Jonathan Stray
Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown et al

Recommended

Guess what? Automated news doesn’t quite work, Gabe Rivera
The Hermeneutics of Screwing Around, or What You Do With a Million Books, Stephen Ramsay
Can an algorithm be wrong?, Tarleton Gillespie
The Netflix Prize, Wikipedia

Week 5: Social filtering – 10/9
Slides.
We have now studied purely algorithmic modes of filtering, and this week we will bring in the social. The distinction we will draw is not so much the complexity of the software involved, but whether the user can understand and predict the filter’s choices. We’ll look at Twitter as a prototypical social filter and see how news spreads on this network, and tools to help journalists find sources. Finally, we’ll introduce the idea of “social software” and use the metaphor of “architecture” to think about how software influences behaviour.

Required

What Happens to #Ferguson Affects Ferguson: Net Neutrality, Algorithmic Filtering and Ferguson, Zeynep Tufekci
Finding and Assessing Social Information Sources in the Context of Journalism, Nick Diakopoulos et al.

Recommended

A Group is its own worst enemy, Clay Shirky
Learning from Stackoverflow, first fifteen minutes, Joel Spolsky
Norms, Laws, and Code, Jonathan Stray
What is Twitter, a Social Network or a News Media?, Haewoon Kwak et al.
International reporting in the age of participatory media, Ethan Zuckerman
Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray

Week 6: Hybrid filters, recommendation, and conversation – 10/16
Slides.
We have now studied purely algorithmic and mostly social modes of filtering. This week we’re going to study systems that combine software and people. We’ll look at comment voting, recommendation systems, and how Google search optimizes based on user preference. We’ll dig into the operation of the New York Times’ new recommendation engine which includes both content and collaborative filtering.
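To make the collaborative filtering idea concrete, here is a simplified sketch of item-based recommendation in Python. It is not the New York Times engine or the exact method from the readings: the reading histories are invented, `item_similarity` and `recommend` are hypothetical names, and the sketch assumes binary "read / did not read" data with cosine similarity between items.

```python
import math

# Hypothetical reading history: user -> set of article ids they have read.
reads = {
    "ana":  {"budget", "election", "schools"},
    "ben":  {"budget", "election"},
    "cara": {"election", "schools"},
    "dev":  {"sports"},
}

def item_similarity(reads, a, b):
    """Cosine similarity between two articles, based on who read them."""
    readers_a = {u for u, items in reads.items() if a in items}
    readers_b = {u for u, items in reads.items() if b in items}
    if not readers_a or not readers_b:
        return 0.0
    overlap = len(readers_a & readers_b)
    return overlap / math.sqrt(len(readers_a) * len(readers_b))

def recommend(reads, user):
    """Rank unread articles by total similarity to what the user has read."""
    all_items = set().union(*reads.values())
    seen = reads[user]
    scores = {
        candidate: sum(item_similarity(reads, candidate, s) for s in seen)
        for candidate in all_items - seen
    }
    # Highest score first; ties are broken alphabetically for repeatability.
    return sorted(scores, key=lambda item: (-scores[item], item))

print(recommend(reads, "ben"))
```

Here "schools" ranks above "sports" for the user "ben" because people who read the same articles as ben also read "schools". A content-based filter would instead compare the text of the articles themselves.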

Required

Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al.
Building the Next New York Times Recommendation Engine, Alexander Spangher

Recommended

Israel, Gaza, War & Data: social networks and the art of personalizing propaganda, Gilad Lotan
Google News Personalization: Scalable Online Collaborative Filtering, Das et al
How Reddit Ranking Algorithms Work, Amir Salihefendic
Slashdot Moderation, Rob Malda
Pay attention to what Nick Denton is doing with comments, Clay Shirky
How does Google use human raters in web search?, Matt Cutts

Assignment: Design a filtering algorithm for status updates.

Week 7: Visualization – 10/23
Slides.
An introduction to how visualization helps people interpret information. Design principles from user experience considerations, graphic design, and the study of the human visual system. The Overview document visualization system used in investigative journalism.

Required

Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists, Brehmer et al.
Computational Information Design chapters 1 and 2, Ben Fry

Recommended

Journalism in an age of data, Geoff McGhee
Visualization Rhetoric: Framing Effects in Narrative Visualization, Hullman and Diakopoulos
Visualization, Tamara Munzner

No class 10/30

Week 8: Structured journalism and knowledge representation – 11/6
Slides.
Is journalism in the text/video/audio business, or is it in the knowledge business? In this class we’ll look at this question in detail, which gets us deep into the issue of how knowledge is represented in a computer. The traditional relational database model is often inappropriate for journalistic work, so we’re going to concentrate on so-called “linked data” representations. Such representations are widely used and increasingly popular. For example, Google recently released the Knowledge Graph. But generating this kind of data from unstructured text is still very tricky, as we’ll see when we look at the Reverb algorithm.

Required

A fundamental way newspaper websites need to change, Adrian Holovaty
The next web of open, linked data – Tim Berners-Lee TED talk
Relation extraction and scoring in DeepQA – Wang et al, IBM

Recommended

Standards-based journalism in a semantic economy, Xark
What the semantic web can represent – Tim Berners-Lee
Building Watson: an overview of the DeepQA project
Can an algorithm write a better story than a reporter? Wired 2012.

Assignment: Text enrichment experiments using OpenCalais entity extraction. Due 11/20

Week 9: Algorithmic Accountability – 11/13
Slides.
Our society is woven together by algorithms. From high frequency trading to predictive policing, they regulate an increasing portion of our lives. But these algorithms are mostly secret, black boxes from our point of view. We’re at their mercy, unless we learn how to interrogate and critique algorithms.

Required

Algorithms, journalistic investigations and holding digital power accountable, Stefanie Knoll
How Uber surge pricing really works, Nick Diakopoulos
How Big Data is Unfair, Moritz Hardt

Recommended

Algorithmic Accountability primer, Data and Society Research Institute
How the Journal Tested Prices and Deals Online, Jeremy Singer-Vine, Ashkan Soltani and Jennifer Valentino-DeVries
Algorithmic Accountability: On the investigation of black boxes, Nick Diakopoulos
How Algorithms Shape our World, Kevin Slavin

Week 10: Network analysis – 11/20
Slides.
Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, but not so much in journalism. We’ll look at basic techniques and algorithms and try to understand the promise — and the many practical problems.

Required

Analyzing the Data Behind Skin and Bone, ICIJ
Simmelian Backbones: Amplifying Hidden Homophily in Facebook Networks. A sophisticated and sociologically-aware network analysis method.

Recommended

Visualizing Communities, Jonathan Stray
Identifying the Community Power Structure, an old handbook for community development workers about figuring out who is influential by very manual processes.
The network of global corporate control, Vitali et al.
The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
Sections I and II of Community Detection in Graphs, Fortunato
Centrality and Network Flow, Borgatti
Exploring Enron, Jeffrey Heer

Examples:

Galleon’s Web, Wall Street Journal
Muckety
Theyrule.net
Who Runs Hong Kong?, South China Morning Post

Assignment: Compare different centrality metrics in Gephi. Due 12/4.
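To show why different centrality metrics can disagree, here is a small sketch in plain Python. It is not Gephi's implementation: the network is invented, the function names are hypothetical, and only two metrics (degree and closeness) are shown, each in a common normalized form.

```python
from collections import deque

# Hypothetical undirected network: A is a local hub, D links it to a chain.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E"), ("E", "F"), ("F", "G")]

def build_graph(edges):
    """Build an adjacency map {node: set of neighbours}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path distances to them."""
    result = {}
    for source in graph:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result

graph = build_graph(edges)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(max(degree, key=degree.get), max(closeness, key=closeness.get))
```

In this toy network, A has the most direct connections, but D is on average closer to everyone else, so the two metrics name different "most central" nodes. Which one matters depends on the question being asked of the network.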

No class 11/27

Week 11: Drawing conclusions from data – 12/4
Slides.
You’ve loaded up all the data. You’ve run the algorithms. You’ve completed your analysis. But how do you know that you are right? It’s incredibly easy to fool yourself, but fortunately, there is a long history of fields grappling with the problem of determining truth in the face of uncertainty, from statistics to intelligence analysis.

Required

Correlation and causation, Business Insider
The Psychology of Intelligence Analysis, chapters 1, 2, 3 and 8, Richards J. Heuer

Recommended

If correlation doesn’t imply causation, then what does?, Michael Nielsen
Graphical Inference for Infovis, Hadley Wickham et al.
Why most published research findings are false, John P. A. Ioannidis
The Introductory Statistics Course: a Ptolemaic Curriculum, George W. Cobb

Assignment: Write a story on the status of women in science. Due 12/18.

Week 12: Privacy, Security, and Censorship – 12/11
Slides.
Who is watching our online activities? How do you protect a source in the 21st Century? Who gets access to all of this mass intelligence, and what does the ability to survey everything all the time mean both practically and ethically for journalism? In this lecture we will talk about who is watching and how, and how to create a security plan using threat modeling.

Required

Digital Security for Journalists, Part 1 and Part 2, Jonathan Stray
Hearst New Media Lecture 2012, Rebecca MacKinnon

Recommended

CPJ journalist security guide section 3, Information Security
Global Internet Filtering Map, Open Net Initiative
Unplugged: The Show part 9: Public Key Cryptography
Diffie-Hellman key exchange, ArtOfTheProblem
Tor Project Overview
Who is harmed by a real-names policy, Geek Feminism

Assignment: Use threat modeling to come up with a security plan for a given scenario.

Week 13: Tracking flow and impact – 12/18

How does information flow in the online ecosystem? What happens to a story after it’s published? How do items spread through social networks? We’re just beginning to be able to track ideas as they move through the network, by combining techniques from social network analysis and bioinformatics.

Required

Metrics, Metrics everywhere: Can we measure the impact of journalism?, Jonathan Stray
Meme-tracking and the Dynamics of the News Cycle, Leskovec et al.
How promotion affects pageviews on the New York Times website, Brian Abelson

Recommended

NewsLynx: A Tool for Newsroom Impact Measurement, Michael Keller, Brian Abelson
The role of social networks in information diffusion, Eytan Bakshy et al.
Defining Moments in Risk Communication Research: 1996–2005, Katherine McComas
Chain Letters and Evolutionary Histories, Charles H. Bennett, Ming Li and Bin Ma
Competition among memes in a world with limited attention, Weng et al.
In the news cycle, memes spread more like a heartbeat than a virus, Zach Seward
How hidden networks orchestrated Stop Kony 2012, Gilad Lotan

Final projects due 12/31 (dual degree Journalism/CS students only)