Week 10: Security, Surveillance, and Privacy

Posted on November 25, 2013 by jstray

Who is watching our online activities? How do you protect a source in the 21st Century? Who gets access to all of this mass intelligence, and what does the ability to survey everything all the time mean both practically and ethically for journalism? In this lecture we will talk about who is watching and how, and how to create a security plan using threat modeling.

Required

Chris Soghoian, Why secrets aren’t safe with journalists, New York Times, 2011
Hearst New Media Lecture 2012, Rebecca MacKinnon

Recommended

CPJ journalist security guide section 3, Information Security
Global Internet Filtering Map, Open Net Initiative
NSA Files Decoded, The Guardian

Cryptographic security

Unplugged: The Show part 9: Public Key Cryptography
Diffie-Hellman key exchange, ArtOfTheProblem

Anonymity

Tor Project Overview
Who is harmed by a real-names policy, Geek Feminism

Assignment: Use threat modeling to come up with a security plan for a given scenario.

Assignment 6: Threat Modeling

Posted on November 25, 2013 by jstray

For this assignment, each of you will pick one of the four reporting scenarios below and design a security plan. More specifically, you will flesh out the scenario, create a threat model, come up with a plausible security plan, and analyze the weaknesses of your plan.

Start by creating a threat model, which must consider:

What must be kept private? Specify all of the information that must be secret, including notes, documents, files, locations, and identities — and possibly even the fact that someone is working on a story.
Who is the adversary and what do they want to know? It may be a single person, or an entire organization or state, or multiple entities. They may be very interested in certain types of information, e.g. identities, and uninterested in others. List each adversary and their interests.
What can they do to find out? List every way they could try to find out what you want secret, including technical, legal, and social methods.
What is the risk? Explain what happens if an adversary succeeds in breaking your security. What are the consequences, and to whom? Which of these is it absolutely necessary to avoid?

Once you have specified your threat model, you are ready to design your security plan. The threat model describes the risk, and the goal of the security plan is to reduce that risk as much as possible.

Your plan must specify appropriate software tools, plus how these tools must be used. Pay particular attention to necessary habits: specify who must do what, and in what way, to keep the system secure. Explain how you will educate your sources and collaborators in the proper use of your chosen tools, and how hard you think it will be to make sure everyone does exactly the right thing.

Also document the weaknesses of your plan. What can still go wrong? What are the critical assumptions that will cause failure if it turns out you have guessed wrong? What is going to be difficult or expensive about this plan?

The scenarios you can choose from are:

1. You are a photojournalist in Syria with digital images you want to get out of the country. Limited internet access is available at a cafe. Some of the images may identify people working with the rebels who could be targeted by the government if their identity is revealed. In addition you would like to remain anonymous until the photographs are published, so that you can continue to work inside the country for a little longer, and leave without difficulty.

2. You are working on an investigative story about the CIA conducting operations in the U.S., in possible violation of the law. You have sources inside the CIA who would like to remain anonymous. You will occasionally meet with these sources in person but mostly communicate electronically. You would like to keep the story secret until it is published, to avoid pre-emptive legal challenges to publication.

3. You are reporting on insider trading at a large bank, and talking secretly to two whistleblowers. If these sources are identified before the story comes out, at the very least you will lose your sources, but there might also be more serious repercussions — they could lose their jobs, or the bank could attempt to sue. This story involves a large volume of proprietary data and documents which must be analyzed.

4. You are working in Europe, assisting a Chinese human rights activist. The activist is working inside China with other activists, but so far the Chinese government does not know they are an activist and they would like to keep it this way. You have met the activist once before, in person, and have a phone number for them, but need to set up a secure communications channel.

These scenario descriptions are incomplete. Please feel free to expand them, making any reasonable assumptions about the environment or the story — though you must document your assumptions, and you can’t assume that you have unrealistic resources or that your adversary is incompetent.

Assignment 5: Statistical Inference

Posted on November 6, 2013 by jstray

For this assignment you will analyze global data on the number of homicides versus the number of guns in each country. I’m giving you the data — your job is to tell me what it means. You will interpret a few different plots, and then implement the visual randomization procedure from the paper we discussed in class to examine a tricky case more closely.

The data is from The Guardian Data Blog. I simplified the header names, dropped a few unnecessary columns, and added an OECD column.

1. I’ve written most of the code you will need for this assignment, available from this github repo. (You can git clone if you like, otherwise just click here to download all files as a zip archive).

2. We are going to use the R language for this assignment. This is mostly because it has really nice built in charts (doing this in Python is a real pain), but also because you are likely to encounter R out in the real world of data journalism. Download and install it. To start R, enter R on the command line. To run a program, enter source('filename.R') at the R command prompt. A full language manual is here. You will only need to use a few basic concepts, such as random number generation and for loops.

3. Plot the data for all countries’ homicide rate (per 100,000) versus number of privately-owned firearms (per 100) by running source('plot-all-countries.R') at the R prompt. What do you see? Please report on the general patterns here, the outliers, and what this all might mean.

4. Now take a look at only the OECD countries, by uncommenting the indicated line in the source. Re-run the file. What does the chart show now?

5. Now plot only the non-OECD countries, by uncommenting the indicated line in the source (be sure to re-comment the line that selects only OECD countries). What does the chart show now?

6. It looks like there might be a pattern among the OECD countries, but the United States is such an outlier that it’s hard to tell. Is this pattern still significant without the US? To find out, you’re going to apply a randomization test. (We’ll also remove Mexico since it’s not a developed country and thus not really comparable to the other OECD countries.)

Start with the file randomization-test.R. You need to write the code that performs the actual randomization, filling eight of the columns of charts with random permutations of the original y values (homicide rates), but putting the original data in the realchart column. To prevent sneak peeks, the code is currently set up to use testing data. When your permutations are working right, you should see something like this when you run the file:

After pressing Enter, the program will tell you which chart has the real (un-permuted) data. Here, with fake data, it’s obvious. It won’t always be.
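The assignment itself is written in R, but the shuffling logic is easy to sketch in any language. Here is a minimal Python 3 version, assuming nine charts in total (eight permuted plus the real one, as described above). The function name make_lineup and the testing data are made up for illustration and are not taken from randomization-test.R.

```python
import random

def make_lineup(x, y, n_charts=9, seed=None):
    """Build n_charts (x, y) datasets: one keeps the real y values,
    the others get a random permutation of y.
    Returns (charts, real_index)."""
    rng = random.Random(seed)
    real_index = rng.randrange(n_charts)  # hidden position of the real data
    charts = []
    for i in range(n_charts):
        ys = list(y)
        if i != real_index:
            rng.shuffle(ys)  # break any link between x and y
        charts.append((list(x), ys))
    return charts, real_index

# Made-up testing data with an obvious trend (hypothetical, not the real data)
x = list(range(20))
y = [2 * v + 1 for v in x]
charts, real_index = make_lineup(x, y, seed=1)
```

You would then plot each entry of charts in a grid, make your guess, and only afterwards print real_index. Permuting y keeps the distribution of homicide rates identical in every chart, so only the relationship with x differs.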

7. Now that your program works, try it on the real data by commenting out the two lines that generate the fake data. Re-run, and look at the plots carefully. Which one do you think is the real data? Write down the number of the chart. Then hit enter, and see if you got it right.

8. This isn’t quite fair, because you were already looking at the data in step 4. So get someone else to look at it fresh. Explain to them that you are charting firearms versus homicides and that one of the charts is real but the rest are fakes, and ask them to spot the real chart.

9. Did you guess right? Did your fresh observer guess right? Did you and your observer guess differently? If so, why do you think that is? Was it difficult for you to choose? Based on all of this, do you think there is a correlation between gun ownership and homicide rate for the OECD countries? If so, how strong is it (effect size) and how strong is the evidence (statistical significance)?
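If you want a number to go with your visual impression, one option is a numerical permutation test. The Python 3 sketch below uses Pearson's r as the effect size and the share of shuffles that match or beat the observed |r| as the p-value. This is an assumption on my part about what to measure, not the procedure from the paper, and the sample data is invented.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient (assumes neither list is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often does a random shuffle of y give a
    correlation at least as strong as the one observed?"""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    ys = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ys)
        if abs(pearson_r(x, ys)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Invented example data with a perfect linear relationship
x = list(range(12))
y = [3 * v + 2 for v in x]
r = pearson_r(x, y)
p = permutation_p_value(x, y, n_perm=500)
```

Here r answers "how strong is it" and p answers "how strong is the evidence". With only a couple dozen countries, a large r can still come with a weak p, which is exactly the distinction step 9 asks about.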

10. What does all this mean? Please write a short journalistic analysis of the global relationship between firearms ownership and homicide rate, for a general audience. Your editor has asked you to do this analysis and is very interested in whether there is a causal relationship — whether more guns cause more crime — so you will have to include something about that.

Turn in: answers to questions in steps 3, 4, 5, 7, 8, and 9, your code, and your final short analysis article.

Week 9: Drawing Conclusions from Data

Posted on November 6, 2013 by jstray

Slides.

You’ve loaded up all the data. You’ve run the algorithms. You’ve completed your analysis. But how do you know that you are right? It’s incredibly easy to fool yourself, but fortunately, there is a long history of fields grappling with the problem of determining truth in the face of uncertainty, from statistics to intelligence analysis.

Required

Correlation and causation, Business Insider
The Psychology of Intelligence Analysis, chapters 1,2,3 and 8. Richards J. Heuer

Recommended

If correlation doesn’t imply causation, then what does?, Michael Nielsen
Graphical Inference for Infovis, Hadley Wickham et al.
Why most published research findings are false, John P. A. Ioannidis

Discussed in class:

Drawing conclusions from data, another version of this lecture, plus post and links for learning stats
Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology, a classic on the difficulties of statistical significance and quantization in psychology

Week 8: Social Network Analysis

Posted on November 2, 2013 by jstray

Slides.

Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, but not so much in journalism. We’ll look at basic techniques and algorithms and try to understand the promise — and the many practical problems.

Readings

Analyzing the Data Behind Skin and Bone, ICIJ
Identifying the Community Power Structure, an old handbook for community development workers about figuring out who is influential by very manual processes.
Simmelian Backbones: Amplifying Hidden Homophily in Facebook Networks. A sophisticated and sociologically-aware network analysis method.

Recommended

Visualizing Communities, Jonathan Stray
The network of global corporate control, Vitali et al.
The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
Sections I and II of Community Detection in Graphs, Fortunato
Centrality and Network Flow, Borgatti
Exploring Enron, Jeffrey Heer

Examples:

Galleon’s Web, Wall Street Journal
Muckety
Theyrule.net
Connected China, Reuters

Assignment 4: Social Network Analysis

Posted on October 30, 2013 by jstray

For this assignment you will analyze a social network using three different centrality algorithms, and compare the results.

1. Download and install Gephi, a free graph analysis package. It is open source and runs on any OS.

2. Download the data file lesmis.gml from the UCI Network Data Repository. This is a network extracted from the famous French novel Les Miserables — you may also be familiar with the musical and the recent movie. Each node is a character, and there is an edge between two characters if they appear in the same chapter. Les Miserables is written in over 300 short chapters, so two characters that appear in the same chapter are very likely to meet or talk in the plot of the book. Actually, the edges are weighted, and the weight is the number of chapters those characters appear together in.
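To make the edge definition concrete, here is a small Python 3 sketch of how a weighted co-occurrence network like this could be built from chapter casts. The chapter lists are invented for illustration and are not taken from lesmis.gml, and cooccurrence_edges is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(chapters):
    """chapters: iterable of lists of character names, one list per chapter.
    Returns {(a, b): weight}, where weight is the number of chapters in
    which both characters appear (a sorts before b)."""
    weights = defaultdict(int)
    for cast in chapters:
        # set() so a character listed twice in a chapter counts once
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical chapter casts, for illustration only
chapters = [
    ["Valjean", "Javert"],
    ["Valjean", "Cosette", "Javert"],
    ["Cosette", "Marius"],
]
edges = cooccurrence_edges(chapters)
```

In this toy example Javert and Valjean share two chapters, so their edge has weight 2, while every other pair has weight 1.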

3. Open this file in Gephi, by choosing File->Open. When the dialog box comes up, set the “Graph Type” to “Undirected.” The graph will be plotted. What do you see? Can you discern any patterns?

4. Now arrange the nodes in a nicer way, by choosing the “Force Atlas 2” layout algorithm from the Layout menu at left and pressing the “Run” button. When things settle down, hit the “Stop” button. The graph will be arranged nicely, but it will be quite small. You can zoom in using the mouse wheel (or two fingers on the trackpad on a Mac) and pan using the right mouse button.

5. Select the “Edit” tool from the bottom of the toolbar on the left. It looks like a mouse pointer with a question mark next to it:

6. Now you can click on any node to see its label, which is the name of the character it represents. This information will appear in the “Edit” menu in the upper left. Here’s the information for the character Gavroche.

Click around the various nodes in the graph. Which characters have been given the most central locations? If you are familiar with the story of Les Miserables, how does this correspond to the plot? Are the most central nodes the most important characters?

7. Make Gephi color nodes by degree. Choose the “Ranking” tab from the panel at the upper left, then select the “Nodes” tab, then “Degree” from the drop-down menu. Press the “Apply” button.

Now the nodes with the highest degree will be darker. Do these high degree nodes correspond to the nodes that the layout algorithm put in the center? Are they the main characters in the story?
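Degree is the simplest of the three measures: it is just the number of edges touching a node. The Python 3 sketch below computes it from a weighted edge list, along with the sum of edge weights (sometimes called weighted degree or strength). The edge list is a made-up toy example, not the real Les Miserables data, and degree_table is a hypothetical name.

```python
from collections import defaultdict

def degree_table(weighted_edges):
    """weighted_edges: {(a, b): weight} for an undirected graph.
    Returns (degree, strength): number of neighbours and total edge
    weight for each node."""
    degree = defaultdict(int)
    strength = defaultdict(int)
    for (a, b), w in weighted_edges.items():
        for node in (a, b):
            degree[node] += 1    # one more neighbour
            strength[node] += w  # total chapters shared with anyone
    return dict(degree), dict(strength)

# Toy edges, for illustration only
toy_edges = {
    ("Javert", "Valjean"): 2,
    ("Cosette", "Javert"): 1,
    ("Cosette", "Valjean"): 1,
    ("Cosette", "Marius"): 1,
}
degree, strength = degree_table(toy_edges)
```

Note that the two can disagree: in the toy data Cosette has the highest degree, but Javert and Valjean tie her on total weight because they share two chapters.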

8. Now make Gephi compute betweenness and closeness centrality by pressing the “Run” button for the Network Diameter option under “Network Overview” on the right of the screen.

You will get a report with some graphs. Just click “Close”. Now betweenness and closeness centrality will appear in the drop-down under “Ranking,” in the same place where you selected degree centrality earlier, and you can assign colors based on either one by clicking the “Apply” button.

Also, the numerical values for betweenness centrality and closeness centrality will now appear in the “Edit” window for each node.

Select “Betweenness Centrality” from the drop-down menu and hit “Apply.” What do you see? Which characters are marked as important? How does it differ from the characters which are marked as important by degree?

Now select “Closeness Centrality” and hit “Apply.” (Note that this metric uses a scale which is the reverse of the others: closeness measures average distance to all other nodes, so small values indicate more central nodes. You may want to swap the black and white endpoints of the color scale to get something which is comparable to the other visualizations.) How does closeness centrality differ from betweenness centrality and degree? Which characters differ between closeness and the other metrics?
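To see what "average distance to all other nodes" means in practice, here is a minimal Python 3 sketch that computes it with a breadth-first search. It assumes an unweighted, connected graph and ignores edge weights, which may not match every detail of what Gephi computes. The adjacency list is a made-up toy example.

```python
from collections import deque

def average_distance(adjacency, source):
    """Mean shortest-path length (in hops) from source to every other
    node. Assumes the graph is connected and has at least two nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:  # first visit is the shortest path
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others)

# Toy undirected graph, for illustration only
adjacency = {
    "Cosette": ["Javert", "Valjean", "Marius"],
    "Javert": ["Valjean", "Cosette"],
    "Valjean": ["Javert", "Cosette"],
    "Marius": ["Cosette"],
}
closeness = {name: average_distance(adjacency, name) for name in adjacency}
```

In this toy graph Cosette is one hop from everyone, so her value is 1.0, the smallest and therefore the most central. Marius has to go through Cosette to reach anyone else, so his value is the largest.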

9. Turn in: your answers to the questions in steps 3, 6, 7 and 8, plus screenshots for the graph plotted with degree, betweenness centrality, and closeness centrality. (To take a screenshot: on Windows, use the Snipping Tool. On Mac, press ⌘ Cmd + ⇧ Shift + 4. If you’re on Linux, you get to tell me.)

What I am interested in here is how the values computed by the different algorithms correspond to the plot of Les Miserables (if you are familiar with it), and how they compare to each other. Telling me that “Jean Valjean has a closeness centrality of X” is not a high-enough level interpretation. You couldn’t publish that in a finished story, because your readers won’t know what that means.

Week 7: Knowledge Representation

Posted on October 30, 2013 by jstray

Slides.

Is journalism in the text/video/audio business, or is it in the knowledge business? This class we’ll look at this question in detail, which gets us deep into the issue of how knowledge is represented in a computer. The traditional relational database model is often inappropriate for journalistic work, so we’re going to concentrate on so-called “linked data” representations. Such representations are widely used and increasingly popular. For example, Google recently released the Knowledge Graph. But generating this kind of data from unstructured text is still very tricky, as we’ll see when we look at the Reverb algorithm.

Required

A fundamental way newspaper websites need to change, Adrian Holovaty
The next web of open, linked data – Tim Berners-Lee TED talk
Identifying Relations for Open Information Extraction, Fader, Soderland, and Etzioni (Reverb algorithm)

Recommended

Standards-based journalism in a semantic economy, Xark
What the semantic web can represent – Tim Berners-Lee
Building Watson: an overview of the DeepQA project
Can an algorithm write a better story than a reporter? Wired, 2012.

Assignment 3: Text enrichment experiments using OpenCalais entity extraction.

Week 6: Visualization

Posted on October 30, 2013 by jstray

Slides.

An introduction to how visualization helps people interpret information. The difference between infographics and visualization, and between exploration and presentation. Design principles from user experience considerations, graphic design, and the study of the human visual system. Also, what is specific about visualization in journalism, as opposed to the many other fields that use it?

Readings.

Designing Data Visualizations, Noah Iliinsky and Julie Steele, O’Reilly
Computational Information Design chapters 1 and 2, Ben Fry

Recommended

Journalism in an age of data, Geoff McGhee
Visualization Rhetoric: Framing Effects in Narrative Visualization, Hullman and Diakopoulos
Visualization, Tamara Munzner

Assignment 3: Entity Extraction

Posted on October 24, 2013 by jstray

For this assignment you will evaluate the performance of OpenCalais, a commercial entity extraction service. You’ll do this by building a text enrichment program, which takes plain text and outputs HTML with links to the detected entities. Then you will take five random articles from your data set, enrich them, and manually count how many entities OpenCalais missed or got wrong.

1. Get an OpenCalais API key, from this page.

2. Install the python-calais module. This will allow you to call OpenCalais from Python easily. First, download the latest version of python-calais. To install it, you just need calais.py in your working directory. You will probably also need to install the simplejson Python module. Download it, then run “python setup.py install.” You may need to execute this as super-user.

3. Call OpenCalais from Python. Make sure you can successfully submit text and get the results back, following these steps. The output you want to look at is in the entities array, which would be accessed as “result.entities” using the variable names in the sample code. In particular you want the list of occurrences for each entity, in the “instances” field.

>>> result.entities[0]['instances']
[{u'suffix': u' is the new President of the United States', u'prefix': u'of the United States of America until 2009.  ', u'detection': u'[of the United States of America until 2009.  ]Barack Obama[ is the new President of the United States]', u'length': 12, u'offset': 75, u'exact': u'Barack Obama'}]
>>> result.entities[0]['instances'][0]['offset']
75
>>>

Each instance has “offset” and “length” fields that indicate where in the input text the entity was referenced. You can use these to determine where to place links in the output HTML.
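As a rough illustration of how the offsets can be used, here is a minimal Python 3 sketch that wraps each detected span in a link. It assumes the offsets index into exactly the same string you submitted, and that you have already decided on a URL for each span. The function name link_instances and the (offset, length, url) tuples are my own convention, not part of python-calais.

```python
import html

def link_instances(text, spans):
    """text: the original plain text.
    spans: list of (offset, length, url) tuples.
    Returns HTML with each span wrapped in <a href="...">...</a>."""
    pieces = []
    pos = 0
    for offset, length, url in sorted(spans):
        if offset < pos:
            continue  # skip a span that overlaps one already linked
        # Copy the untouched text before this entity
        pieces.append(html.escape(text[pos:offset], quote=False))
        label = html.escape(text[offset:offset + length], quote=False)
        pieces.append('<a href="%s">%s</a>' % (html.escape(url), label))
        pos = offset + length
    pieces.append(html.escape(text[pos:], quote=False))  # remaining tail
    return "".join(pieces)

sample = "Barack Obama is the new President."
html_out = link_instances(
    sample, [(0, 12, "http://en.wikipedia.org/wiki/Barack_Obama")]
)
```

Working left to right with a running position means the offsets never need adjusting as links are added, because the original text is only read, never modified.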

4. Read a text file, create hyperlinks, and write it out. Your Python program should read text from stdin and write HTML with links on all detected entities to stdout. There are two cases to handle, depending on how much information OpenCalais gives back.

In many cases, like the example in step 3, OpenCalais will not be able to give you any information other than the string corresponding to the entity, result.entities[x]['name']. In this case you should construct a Wikipedia link by simply appending the name to a Wikipedia URL, converting spaces to underscores, e.g.

http://en.wikipedia.org/wiki/Barack_Obama

In other cases, especially companies and places, OpenCalais will supply a link to an RDF document that contains more information about the entity. For example:

>>> result.entities[0]
{u'_typeReference': u'http://s.opencalais.com/1/type/em/e/Company', u'_type': u'Company', u'name': u'Starbucks', '__reference': u'http://d.opencalais.com/comphash-1/6b2d9108-7924-3b86-bdba-7410d77d7a79', u'instances': [{u'suffix': u' in Paris.', u'prefix': u'of the United States now and likes to drink at ', u'detection': u'[of the United States now and likes to drink at ]Starbucks[ in Paris.]', u'length': 9, u'offset': 156, u'exact': u'Starbucks'}], u'relevance': 0.314, u'nationality': u'N/A', u'resolutions': [{u'name': u'Starbucks Corporation', u'symbol': u'SBUX.OQ', u'score': 1, u'shortname': u'Starbucks', u'ticker': u'SBUX', u'id': u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'}]}
>>> result.entities[0]['resolutions'][0]['id']
u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'
>>>

In this case the resolutions array will contain a hyperlink for each resolved entity, and this is where your link should go. The linked page will contain a series of triples (assertions) about the entity, which you can obtain in machine-readable form by changing the .html at the end of the link to .json. The sameAs: links are particularly important because they tell you that this entity is equivalent to others in dbPedia and elsewhere.
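The two cases can be folded into one small helper. This Python 3 sketch works on plain dictionaries shaped like the entities shown above. Using only the first resolution is an assumption on my part, and entity_link is a hypothetical name.

```python
WIKIPEDIA_BASE = "http://en.wikipedia.org/wiki/"

def entity_link(entity):
    """Return the URL to link for one entity dictionary.
    Prefer the OpenCalais resolution id when present; otherwise fall
    back to a Wikipedia URL built from the entity name."""
    resolutions = entity.get("resolutions") or []
    if resolutions and resolutions[0].get("id"):
        return resolutions[0]["id"]  # assumption: first resolution wins
    return WIKIPEDIA_BASE + entity["name"].replace(" ", "_")

# Trimmed-down versions of the two entities shown in this assignment
company = {
    "name": "Starbucks",
    "resolutions": [{
        "id": "http://d.opencalais.com/er/company/ralg-tr1r/"
              "f8512d2d-f016-3ad0-8084-a405e59139b3",
    }],
}
person = {"name": "Barack Obama"}

company_url = entity_link(company)
person_url = entity_link(person)
```

The fallback will happily produce links to Wikipedia pages that do not exist, which is why broken links count as incorrect references when you evaluate the output.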

Here is more on OpenCalais’ entity disambiguation and use of linked data.

The final result should look something like below. Note that some links go to OpenCalais entity pages with RDF links on them (“London”), some go to Wikipedia (“politician”) and some are broken links when Wikipedia doesn’t have the topic (“Aarthi Ramachandran”). And of course Mr Gandhi is an entity that was not detected, three times.

The latest effort to “decode” Mr Gandhi comes in the form of a limited yet rather well written biography by a political journalist, Aarthi Ramachandran. Her task is a thankless one. Mr Gandhi is an applicant for a big job: ultimately, to lead India. But whereas any other job applicant will at least offer minimal information about his qualifications, work experience, reasons for wanting a post, Mr Gandhi is so secretive and defensive that he won’t respond to the most basic queries about his studies abroad, his time working for a management consultancy in London, or what he hopes to do as a politician.

Don’t worry about producing a fully valid HTML document with headers and a <body> tag, just wrap each entity with <a href=”…”> and </a>. Your browser will load it fine.

5. Pick five random news stories and enrich them. First pick a news site with many stories on the home page. Then generate five random numbers from 1 to the number of stories on the page. Cut and paste the text of each article into a separate file, and save as plain text (no HTML, no formatting.)
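If you would rather not pick the numbers by hand, a few lines of Python 3 will do it. This sketch assumes you want five different stories (sampling without replacement), and pick_stories is a hypothetical name.

```python
import random

def pick_stories(n_stories, k=5, seed=None):
    """Return k distinct story positions between 1 and n_stories."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_stories + 1), k))

# Example: a home page with 40 stories (made-up number)
picks = pick_stories(40, seed=2013)
```

Fixing the seed makes the selection reproducible, which is handy if you want to show in your report that you did not cherry-pick the articles.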

6. Read the enriched documents and count to see how well OpenCalais did. You need to read each output document very carefully and count three things:

Entity references. Count each time the name of a person, place, or organization appears, as well as other references to these things (e.g. “the president.”)
Detected references. How many of these references did OpenCalais find?
Correct references. How many of the links go to the right page? Did our hyperlinking strategy (OpenCalais RDF pages where possible, Wikipedia when not) fail to correctly disambiguate any of the references, or, even worse, disambiguate any to the wrong object? Also, a broken link counts as an incorrect reference.

7. Turn in your work. Please turn in:

Your code
The enriched output from your documents
A brief report describing your results.

The report should include a table of the three numbers (references, detected, correct) for each document, plus the totals of these three numbers across all documents. Also report on any patterns in the failures that you see. Where is OpenCalais most accurate? Where is it least accurate? Are there predictable patterns to the errors?
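Totalling the table is simple enough to do by hand, but here is a short Python 3 sketch in case you want to script it. The two ratios it reports are my own suggestion for summarizing the counts, not a required part of the report, and the counts below are placeholders rather than real results.

```python
def summarize(counts):
    """counts: list of (references, detected, correct), one per document.
    Returns the column totals plus two summary ratios."""
    references, detected, correct = (sum(col) for col in zip(*counts))
    return {
        "totals": (references, detected, correct),
        # Share of all entity references that OpenCalais found
        "detection_rate": detected / references if references else 0.0,
        # Share of detected references whose link went to the right page
        "correct_rate": correct / detected if detected else 0.0,
    }

# Placeholder counts for two documents (hypothetical, not real results)
summary = summarize([(10, 8, 6), (20, 15, 12)])
```

Keeping the two ratios separate helps you tell apart "OpenCalais missed the entity" from "we found it but linked it to the wrong thing".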

This assignment is due before class on Wednesday, October 30.

Week 5: Hybrid Filtering

Posted on October 4, 2013 by jstray

In previous weeks we discussed filters that are purely algorithmic (such as NewsBlaster) and filters that are purely social (such as Twitter.) This week we discussed how to create a filtering system that uses both social interactions and algorithmic components.

Slides.

Here are all the sources of information such an algorithm can draw on.

We looked at two concrete examples of hybrid filtering. First, the Reddit comment ranking algorithm, which takes the users’ upvotes and downvotes and sorts not just by the proportion of upvotes, but by how certain we are about that proportion, given the number of people who have actually voted so far. Then we looked at item-based collaborative filtering, which is one of several classic techniques based on a matrix of user-item ratings. Such algorithms power everything from Amazon’s “users who bought A also bought B” to Netflix movie recommendations to Google News’ personalization system.
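For the first example, the usual way to combine "proportion of upvotes" with "how certain we are" is to sort by the lower bound of the Wilson score confidence interval, which as I understand it is the approach described in the Salihefendic reading. Here is a minimal Python 3 sketch. The default z of 1.96 (roughly a 95% interval) is my assumption and may not be the value Reddit uses.

```python
import math

def wilson_lower_bound(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote
    proportion. z controls the confidence level (assumed default)."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0  # no votes yet, so no evidence either way
    p = upvotes / n
    denominator = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / denominator

# Made-up vote counts: (upvotes, downvotes)
comments = {"one lucky vote": (1, 0), "well tested": (90, 10)}
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*comments[c]),
                reverse=True)
```

The comment with a single upvote has a perfect 100% approval, but so little evidence behind it that its lower bound is far below that of the comment with 90 upvotes out of 100, so the well tested comment ranks first.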

Evaluating the performance of such systems is a major challenge. We need some metric, but not all problems have an obvious way to measure whether we’re doing well. There are many options. Business goals — revenue, time on site, engagement — are generally much easier to measure than editorial goals.

The readings for this week were:

Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al.
How Reddit Ranking Algorithms Work, Amir Salihefendic
Google News Personalization: Scalable Online Collaborative Filtering, Das et al
Slashdot Moderation, Rob Malda
Pay attention to what Nick Denton is doing with comments, Clay Shirky
How does Google use human raters in web search?, Matt Cutts

This concludes our work on filtering systems — except for Assignment 2.

Computational Journalism

At the Tow Center for Digital Journalism, Columbia University, as taught by Jonathan Stray

Category Archives: Fall 2013
