The course is a hands-on, research-level introduction to the areas of computer science that have a direct relevance to journalism, and the broader project of producing an informed and engaged public. We study two big ideas: the application of computation to produce journalism (such as data science for investigative reporting), and journalism about areas that involve computation (such as the analysis of credit scoring algorithms).
Along the way we will touch on many topics: information recommendation systems but also filter bubbles, principles of statistical analysis but also the human processes that generate data, network analysis and its role in investigative journalism, and visualization techniques and the cognitive effects involved in viewing a visualization.
Assignments will require programming in Python, but the emphasis will be on clearly articulating the connection between the algorithmic and the editorial. Research-level computer science material will be discussed in class, but the emphasis will be on understanding the capabilities and limitations of this technology.
Format of the class, grading, and assignments
This is a fourteen-week, six-point course for CS & journalism dual degree students. (It is a three-point course for cross-listed students, who also do not have to complete the final project.) The class is conducted in a seminar format. Assigned readings and computational techniques will form the basis of class discussion. The course will be graded as follows:
- Assignments: 40%. There will be five homework assignments.
- Final project: 40%. Dual degree students will complete a medium-sized final project (for cross-listed students, this 40% is allocated to assignments instead).
- Class participation: 20%
Assignments will involve experimentation with fundamental computational techniques. Some assignments will require intermediate-level coding in Python, but the emphasis will be on thoughtful and critical analysis. As this is a journalism course, you will be expected to write clearly. The final project can be either a piece of software (especially a plugin or extension to an existing tool), a data-driven story, or a research paper on a relevant technique.
The class is conducted on a pass/fail basis for journalism students, in line with the journalism school’s grading system. Students from other departments will receive a letter grade.
Week 1: High-dimensional data – 9/12
CS techniques can help journalism in two main ways: using computation to do journalism, and doing journalism about computation. Either way, we’ll be working a lot with the abstraction of high-dimensional vectors. We’ll start with an overview of interpreting high-dimensional data, then jump right into clustering and the document vector space model, which we’ll need to study natural language processing and recommendation engines.
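To make the vector space model concrete, here is a minimal sketch using scikit-learn (an assumption: the course does not prescribe a particular library). Each document becomes a high-dimensional TF-IDF vector, and cosine similarity measures how close two documents sit in that space.

```python
# A minimal sketch of the document vector space model with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the mayor announced a new budget for city schools",
    "city schools face budget cuts, the mayor said",
    "the team won the championship game last night",
]

# Each document becomes a vector of TF-IDF weights, one dimension per term.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cosine similarity between document vectors: the two budget stories
# should score much closer to each other than to the sports story.
print(cosine_similarity(X))
```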
References
- Computational Journalism, Cohen, Turner, Hamilton
- TF-IDF is about what matters, Aaron Schumacher
- Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector Space Model, Manning, Raghavan, and Schütze.
- How ProPublica’s Message Machine reverse engineers political microtargeting, Jeff Larson
Week 2: Text analysis – 9/19
We’ll start by picking up the story of text analysis in journalism, including the development of the Overview document mining system. Then probabilistic topic modeling (à la LDA), matrix factorization, more general plate-notation graphical models, and word embedding approaches based on deep learning. Then on to fundamental recommendation approaches such as collaborative filtering. To bring this into practice, we will look at Columbia Newsblaster (a precursor to Google News) and the New York Times recommendation engine.
Required
- Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown et al
- Topic modeling by hand, Shawn Graham
References
- A full-text visualization of the Iraq war logs, Jonathan Stray
- Document mining with the Overview prototype, includes visualizations of TF-IDF space
- Word2Vec tutorial: The Skip-gram Model, Chris McCormick
- Generating News Headlines with Recurrent Neural Networks, Konstantin Lopyrev
Discussed in class
- More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked, Jeff Kao. Use of word embeddings (word2vec) in the public interest.
- What do journalists do with documents? Jonathan Stray
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, Bolukbasi et al.
- Making Downton more traditional, Ben Schmidt
Assignment: LDA analysis of State of the Union speeches.
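A starter sketch for the assignment, assuming scikit-learn’s LDA implementation (gensim would work just as well); the speech texts are placeholders to be replaced with the real State of the Union corpus.

```python
# Starter sketch for the LDA assignment using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

speeches = [
    "placeholder: text of one State of the Union speech",
    "placeholder: text of another State of the Union speech",
]

# LDA works on raw term counts, not TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(speeches)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(counts)

# Show the top words for each inferred topic.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:10]]
    print(f"Topic {k}: {', '.join(top)}")
```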
Week 3: Filter Design
We’ve studied filtering algorithms, but how are they used in practice — and how should they be? We will study the details of several algorithmic filtering approaches used by social networks, and effects such as polarization and filter bubbles.
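As one concrete example of a filtering algorithm, here is a sketch of Reddit’s “hot” ranking as described in the Salihefendic reference below: the vote score counts logarithmically, while a linear time bonus steadily pushes newer posts above older ones.

```python
# A sketch of Reddit's "hot" ranking, following Amir Salihefendic's
# write-up (see references below).
import math
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hot(ups: int, downs: int, posted: datetime) -> float:
    score = ups - downs
    # Votes matter logarithmically: the first 10 count as much as the next 90.
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    # Time bonus, anchored to Reddit's 2005 launch epoch in its published code.
    seconds = (posted - EPOCH).total_seconds() - 1134028003
    return round(sign * order + seconds / 45000, 7)

print(hot(ups=120, downs=20, posted=datetime.now(timezone.utc)))
```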
Readings
- Who should see what when? Three design principles for personalized news Jonathan Stray
- How Facebook’s Foray into Automated News Went from Messy to Disastrous, Will Oremus
References
- How Reddit Ranking Algorithms Work, Amir Salihefendic
- Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al.
- Matrix Factorization Techniques for Recommender Systems, Koren et al
- Building the Next New York Times Recommendation Engine, Alexander Spangher
- Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter, Liu et al.
- How does Google use human raters in web search?, Matt Cutts
Viewed in class
- Recommending items to more than a billion people, Facebook
- Israel, Gaza, War & Data: social networks and the art of personalizing propaganda, Gilad Lotan
- Exposure to Diverse Information on Facebook, Bakshy et al
Assignment: Design a filtering algorithm for an information source of your choosing
Week 4: Quantification and Statistical Inference
We’ll begin with the most neglected topic in statistics: measurement. We’ll take a detailed look at the question of what to count, and how to “interview the data” to check for data quality. Then we’ll move on to the risk ratio, one of the simplest statistical models and a key idea in accountability reporting. We’ll continue with a look at the uses of multivariable regression in journalism, and study graphical causal models to help untangle the whole correlation/causation thing.
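A minimal sketch of the risk ratio computation, with invented numbers: how much more likely is the outcome in one group than in another?

```python
# A risk ratio compares the rate of an outcome between two groups.
def risk_ratio(exposed_events, exposed_total, control_events, control_total):
    risk_exposed = exposed_events / exposed_total
    risk_control = control_events / control_total
    return risk_exposed / risk_control

# Hypothetical traffic stops: 30 of 200 Black drivers searched vs.
# 20 of 400 white drivers searched. 0.15 / 0.05 = a risk ratio of 3.0,
# i.e. Black drivers were three times as likely to be searched.
print(risk_ratio(30, 200, 20, 400))
```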
Required:
- The Quartz Guide to Bad Data, Christopher Groskopf
- The Curious Journalist’s Guide to Data: Quantification, Jonathan Stray
Recommended
- Operationalizing, or the function of measurement in modern literary theory, Franco Moretti. A great discussion of quantification of complex concepts.
- What data can’t tell us about buying politicians, Stray. Why we need risk ratios to talk about corruption.
- If correlation doesn’t imply causation, then what does?, Michael Nielsen
Viewed in class
- Speed Trap: Who gets a ticket, who gets a break? Boston Globe. A classic use of regression.
- How we measured surgical complications, ProPublica. A more recent, extremely sophisticated use of regression.
Week 5: Algorithmic Accountability and Discrimination
Algorithmic accountability is the study of the algorithms that regulate society, from high frequency trading to predictive policing. We’re at their mercy, unless we learn how to investigate them. We’ll review previous work in this area, then start our study of algorithmic discrimination. Analyzing discrimination data is more subtle and complex than it might seem.
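To preview the COMPAS discussion, here is a sketch of the kind of error-rate comparison at the heart of ProPublica’s analysis. The counts are invented, chosen only to mirror the qualitative pattern the story reported (higher false positive rates for Black defendants, higher false negative rates for white defendants).

```python
# Error rates by group, ProPublica-style: among people who did NOT
# reoffend, how often was each group labeled high risk (false positives)?
# Among those who DID, how often were they labeled low risk (false negatives)?
counts = {
    # group: (labeled_high_risk, reoffended) -> number of defendants (invented)
    "black": {(True, True): 300, (True, False): 200,
              (False, True): 100, (False, False): 400},
    "white": {(True, True): 200, (True, False): 100,
              (False, True): 200, (False, False): 500},
}

for group, c in counts.items():
    fpr = c[(True, False)] / (c[(True, False)] + c[(False, False)])
    fnr = c[(False, True)] / (c[(False, True)] + c[(True, True)])
    print(f"{group}: false positive rate {fpr:.0%}, false negative rate {fnr:.0%}")
```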
Required
- How We Analyzed the COMPAS Recidivism Algorithm, Larson et al.
- Bias In, Bias Out, Sandra Mayson
- Our course notebook: Machine Bias
References
- How Algorithms Shape our World, Kevin Slavin
- Big Data’s Disparate Impact, Barocas and Selbst
- Testing for Racial Discrimination in Police Searches of Motor Vehicles, Simoiu et al
Viewed in class
- Should Prison Sentences Be Based On Crimes That Haven’t Been Committed Yet?, FiveThirtyEight. Nice interactive visualization of risk scores.
- How the Journal Tested Prices and Deals Online, Jeremy Singer-Vine, Ashkan Soltani and Jennifer Valentino-DeVries
- How Uber surge pricing really works, Nick Diakopoulos
Week 6: Quantitative Fairness
Most algorithmic accountability and AI fairness work so far has been concerned with “bias,” but what is that? The answer is more complex than it might seem. In this class we’ll discuss the many definitions of fairness and show that they mostly boil down to three different formulations. We’ll also discuss everything around the algorithm, including how the results are used and what the training data means.
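A toy computation of the three formulations on the same set of predictions, with invented data: independence (equal rates of positive predictions across groups), separation (equal error rates given the true outcome), and sufficiency (equal outcome rates given the prediction).

```python
# Each record is (group, predicted_positive, actually_positive); invented data.
data = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 0, 0), ("B", 0, 0), ("B", 0, 1),
]

def rate(rows, condition, event):
    """Estimate P(event | condition) from rows."""
    selected = [r for r in rows if condition(r)]
    return sum(event(r) for r in selected) / len(selected) if selected else float("nan")

for g in ("A", "B"):
    rows = [r for r in data if r[0] == g]
    # 1. Independence / demographic parity: P(predicted positive)
    parity = rate(rows, lambda r: True, lambda r: r[1])
    # 2. Separation / equalized odds: P(predicted positive | actually negative)
    fpr = rate(rows, lambda r: r[2] == 0, lambda r: r[1])
    # 3. Sufficiency / calibration: P(actually positive | predicted positive)
    ppv = rate(rows, lambda r: r[1] == 1, lambda r: r[2])
    print(f"group {g}: parity={parity:.2f}, FPR={fpr:.2f}, PPV={ppv:.2f}")
```

Even on this tiny example the three criteria disagree about which group is favored, which is why, in general, they cannot all be satisfied at once.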
Required:
- Fairness in Machine Learning, NIPS 2017 Tutorial by Solon Barocas and Moritz Hardt
- A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions, Chouldechova et al.
- A Child Abuse Prediction Model Fails Poor Families, Wired
References
- How Big Data is Unfair, Moritz Hardt
- Testing for Racial Discrimination in Police Searches of Motor Vehicles, Simoiu et al
- Sex Bias in Graduate Admissions: Data from Berkeley, P. J. Bickel, E. A. Hammel, J. W. O’Connell. The classic paper on Simpson’s Paradox.
Week 7: Randomness and Significance
The notion of randomness is crucial to the idea of statistical significance. We’ll talk about determining causality, p-hacking and reproducibility, and the more qualitative, closer-to-real-world method of triangulation.
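To make randomization concrete, here is a minimal permutation test in the spirit of Stray’s “one weird trick” lightning talk: shuffle the group labels many times and ask how often chance alone produces a difference as large as the one observed. All numbers are invented.

```python
# A one-sided permutation test on a difference in means.
import random

random.seed(0)
treatment = [12.1, 9.8, 11.5, 13.2, 10.9]
control = [9.2, 10.1, 8.7, 9.9, 10.4]

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

pooled = treatment + control
hits = 0
trials = 10_000
for _ in range(trials):
    # Shuffling destroys any real group effect, simulating pure chance.
    random.shuffle(pooled)
    fake_t = pooled[:len(treatment)]
    fake_c = pooled[len(treatment):]
    diff = sum(fake_t) / len(fake_t) - sum(fake_c) / len(fake_c)
    if diff >= observed:
        hits += 1

# Fraction of shuffles that beat the observed difference by luck alone.
print("p-value ~", hits / trials)
```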
Required
- The Curious Journalist’s Guide to Data: Analysis, Jonathan Stray
- I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How, John Bohannon
Recommended
- Why most published research findings are false, John P. A. Ioannidis
- Why Betting Data Alone Can’t Identify Match Fixers In Tennis, FiveThirtyEight
- Solve Every Statistics Problem with One Weird Trick, Jonathan Stray. A lightning talk on randomization.
- The Introductory Statistics Course: a Ptolemaic Curriculum, George W. Cobb. The argument for randomization instead of classic analytical statistics.
Viewed in class
- How not to be misled by the jobs report, Neil Irwin and Kevin Quealy. Noise in official figures.
- Science Isn’t Broken, Christie Aschwanden. P-Values and the replication crisis.
- The Psychology of Intelligence Analysis, chapter 8. Richards J. Heuer
Week 8: Visualization, Network Analysis
Visualization helps people interpret information. We’ll look at design principles from user experience considerations, graphic design, and the study of the human visual system. Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, and increasingly in journalism.
Readings
- Visualization, Tamara Munzner
- Network Analysis in Journalism: Practices and Possibilities, Stray
References
- 39 Studies about Human Perception in 30 minutes, Kennedy Elliot
- Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists, Brehmer et al.
- Visualization Rhetoric: Framing Effects in Narrative Visualization, Hullman and Diakopoulos
- Analyzing the Data Behind Skin and Bone, ICIJ
- Identifying the Community Power Structure, an old handbook for community development workers about figuring out who is influential by very manual processes.
- The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
- Simmelian Backbones: Amplifying Hidden Homophily in Facebook Networks. A sophisticated and sociologically-aware network analysis method.
Examples:
- The network of global corporate control, Vitali et al.
- Galleon’s Web, Wall Street Journal
- Muckety
- Theyrule.net
Assignment: Compare different centrality metrics in Gephi.
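For a quick cross-check of what Gephi reports, here is a sketch using the networkx library (an assumption; the assignment itself is done in Gephi) on a classic small social network:

```python
# Comparing centrality metrics with networkx on Zachary's karate club graph.
import networkx as nx

G = nx.karate_club_graph()

metrics = {
    "degree": nx.degree_centrality(G),          # who has the most ties
    "betweenness": nx.betweenness_centrality(G),  # who bridges groups
    "eigenvector": nx.eigenvector_centrality(G),  # who knows well-connected people
}

# The three metrics often disagree about who "matters" in a network.
for name, metric in metrics.items():
    top = max(metric, key=metric.get)
    print(f"most central by {name}: node {top}")
```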
Week 9: Knowledge representation
How can journalism benefit from encoding knowledge in some formal system? Is journalism in the media business or the data business? And could we use knowledge bases and inference engines to do journalism better? This gets us deep into the issue of how knowledge is represented in a computer. We’ll look at traditional databases vs. linked data and graph databases, entity and relation detection from unstructured text, and formalisms both propositional and probabilistic. Plus: NLP in investigative journalism, automated fact checking, and more.
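As a toy illustration of the linked-data idea, here is a sketch that stores invented facts as (subject, predicate, object) triples and walks them with a simple pattern query, the core move behind graph databases:

```python
# Knowledge as (subject, predicate, object) triples; all facts invented.
triples = [
    ("Acme Corp", "subsidiary_of", "Globex Holdings"),
    ("Globex Holdings", "registered_in", "Panama"),
    ("Jane Doe", "director_of", "Acme Corp"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the given pattern (None = wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Chain queries to follow a lead: directors -> companies -> parent companies.
for person, _, company in query(predicate="director_of"):
    for _, _, parent in query(subject=company, predicate="subsidiary_of"):
        print(person, "directs", company, "owned by", parent,
              query(subject=parent, predicate="registered_in"))
```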
Readings
- Identifying civilians killed by police with distantly supervised entity-event extraction, Keith et al.
- Extracting References from Political Speech Auto-Transcripts, Brandon Roberts
References
- A fundamental way newspaper websites need to change, Adrian Holovaty
- Relation extraction and scoring in DeepQA – Wang et al, IBM
- The State of Automated Fact Checking, Full Fact
- Storylines as Data in BBC News, Jeremy Tarling
- Building Watson: an overview of the DeepQA project
Viewed in class
- The next web of open, linked data – Tim Berners-Lee TED talk
- https://schema.org/NewsArticle
- Connected China, Reuters/Fathom
Assignment: Text enrichment experiments using OpenCalais entity extraction.
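A small warm-up in the same spirit, using spaCy’s named-entity recognizer as a local stand-in for the OpenCalais API (spaCy is an assumption, not a course requirement):

```python
# Entity extraction with spaCy as a local stand-in for OpenCalais.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Senator Jane Doe met with executives from Acme Corp "
        "in Albany on Tuesday to discuss the state budget.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g. PERSON, ORG, GPE, DATE
```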
Week 10: Truth and Trust
Credibility indicators and schema. Information operations. Fake news detection and automated fact checking. Tracking information flows.
Readings
- Automated fact-checking has come a long way. But it still faces significant challenges, Poynter
- Information Operations and Facebook, Stamos
- Defense Against the Dark Arts: Networked Propaganda and Counter-Propaganda, Stray
References
- The Credibility Coalition is working to establish the common elements of trustworthy articles, journalism.co.uk
- Checking in with the Facebook fact-checking partnership, Ananny
Week 11: Privacy, Security, and Censorship
Who is watching our online activities? Who gets access to all of this mass intelligence, and what does the ability to surveil everything all the time mean both practically and ethically for journalism? In this lecture we cover both the basics of digital security, and methods to deal with specific journalistic situations — anonymous sources, handling leaks, border crossings, and so on.
References
- CPJ journalist security guide section 3, Information Security
- Global Internet Filtering Map, Open Net Initiative
- Tor Project Overview
- Who is harmed by a real-names policy, Geek Feminism
Viewed in Class
- A World Without Wizards: On Facebook and Cambridge Analytica, Dave Karpf
- Allen Dulles’ 73 Rules of Spycraft
- The Mysterious Printer Code That Could Have Led the FBI to Reality Winner, The Atlantic
Week 12: Final Project Presentations