Assignment 3: Entity Extraction

Posted on October 24, 2013 by jstray

For this assignment you will evaluate the performance of OpenCalais, a commercial entity extraction service. You’ll do this by building a text enrichment program, which takes plain text and outputs HTML with links to the detected entities. Then you will take five random articles from your data set, enrich them, and manually count how many entities OpenCalais missed or got wrong.

1. Get an OpenCalais API key, from this page.

2. Install the python-calais module. This will allow you to call OpenCalais from Python easily. First, download the latest version of python-calais. To install it, you just need calais.py in your working directory. You will probably also need to install the simplejson Python module. Download it, then run “python setup.py install.” You may need to execute this as super-user.

3. Call OpenCalais from Python. Make sure you can successfully submit text and get the results back, following these steps. The output you want to look at is in the entities array, which would be accessed as “results.entities” using the variable names in the sample code. In particular you want the list of occurrences for each entity, in the “instances” field.

>>> result.entities[0]['instances']
[{u'suffix': u' is the new President of the United States', u'prefix': u'of the United States of America until 2009.  ', u'detection': u'[of the United States of America until 2009.  ]Barack Obama[ is the new President of the United States]', u'length': 12, u'offset': 75, u'exact': u'Barack Obama'}]
>>> result.entities[0]['instances'][0]['offset']
75
>>>

Each instance has “offset” and “length” fields that indicate where in the input text the entity was referenced. You can use these to determine where to place links in the output HTML.

4. Read a text file, create hyperlinks, and write it out. Your Python program should read text from stdin and write HTML with links on all detected entities to stdout. There are two cases to handle, depending on how much information OpenCalais gives back.

In many cases, like the example in step 3, OpenCalais will not be able to give you any information other than the string corresponding to the entity, result.entities[x][‘name’]. In this case you should construct a Wikipedia link by simply appending to the name to a Wikipedia URL, converting spaces to underscores, e.g.

http://en.wikipedia.org/wiki/Barack_Obama

In other cases, especially companies and places, OpenCalias will supply a link to an RDF document that contains more information about the entity. For example.

>>> result.entities[0]{u'_typeReference': u'http://s.opencalais.com/1/type/em/e/Company', u'_type': u'Company', u'name': u'Starbucks', '__reference': u'http://d.opencalais.com/comphash-1/6b2d9108-7924-3b86-bdba-7410d77d7a79', u'instances': [{u'suffix': u' in Paris.', u'prefix': u'of the United States now and likes to drink at ', u'detection': u'[of the United States now and likes to drink at ]Starbucks[ in Paris.]', u'length': 9, u'offset': 156, u'exact': u'Starbucks'}], u'relevance': 0.314, u'nationality': u'N/A', u'resolutions': [{u'name': u'Starbucks Corporation', u'symbol': u'SBUX.OQ', u'score': 1, u'shortname': u'Starbucks', u'ticker': u'SBUX', u'id': u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'}]}
>>> result.entities[0]['resolutions'][0]['id']
u'http://d.opencalais.com/er/company/ralg-tr1r/f8512d2d-f016-3ad0-8084-a405e59139b3'
>>>

In this case the resolutions array will contain a hyperlink for each resolved entity, and this is where your link should go. The linked page will contain a series of triples (assertions) about the entity, which you can obtain in machine-readable from by changing the .html at the end of the link to .json. The sameAs: links are particularly important because they tell you that this entity is equivalent to others in dbPedia and elsewhere.

Here is more on OpenCalias’ entity disambiguation and use of linked data.

The final result should look something like below. Note that some links go to OpenCalais entity pages with RDF links on them (“London”), some go to Wikipedia (“politician”) and some are broken links when Wikipedia doesn’t have the topic (“Aarthi Ramachandran”) And of course Mr Gandhi is an entity that was not detected, three times.

The latest effort to “decode” Mr Gandhi comes in the form of a limited yet rather well written biography by a political journalist, Aarthi Ramachandran. Her task is a thankless one. Mr Gandhi is an applicant for a big job: ultimately, to lead India. But whereas any other job applicant will at least offer minimal information about his qualifications, work experience, reasons for wanting a post, Mr Gandhi is so secretive and defensive that he won’t respond to the most basic queries about his studies abroad, his time working for a management consultancy in London, or what he hopes to do as a politician.

Don’t worry about producing a fully valid HTML document with headers and a <body> tag, just wrap each entity with <a href=”…”> and </a>. Your browser will load it fine.

5. Pick five random news stories and enrich them. First pick a news site with many stories on the home page. Then generate five random numbers from 1 to the number of stories on the page. Cut and paste the text of each article into a separate file, and save as plain text (no HTML, no formatting.)

6. Read the enriched documents and count to see how well OpenCalais did. You need to read each output document very carefully and count three things:

Entity references. Count each time there is a name of a person, place, or organization appears, or other references to these things (e.g. “the president.”)
Detected references. How many of these references did OpenCalais find?
Correct references. How many of the links go to the right page? Did our hyperlinking strategy (OpenCalais RDF pages where possible, Wikipedia when not) fail to correctly disambiguate any of the references, or, even worse, disambiguate any to the wrong object? Also, a broken link counts as an incorrect reference.

7. Turn in your work. Please turn in:

Your code
The enriched output from your documents
A brief report describing your results.

The report should include a table of the three numbers — references, detected, correct — for each document, plus the totals of these three numbers across all documents. Also report on any patterns in the failures that your see. Where is OpenCalais most accurate? Where is it least accurate? Are there predictable patterns to the errors?

This assignment is due before class on Wednesday, October 30.

Week 5: Hybrid Filtering

Posted on October 4, 2013 by jstray

In previous weeks we discussed filters that are purely algorithmic (such as NewsBlaster) and filters that are purely social (such as Twitter.) This week we discussed how to create a filtering system that uses both social interactions and algorithmic components.

Slides.

Here are all the sources of information such an algorithm can draw on.

We looked at two concrete examples of hybrid filtering. First, the Reddit comment ranking algorithm, which takes the users’ upvotes and downvotes and sorts not just by the proporition of upvotes, but by how certain we are about proportion, given the number of people who have actually voted so far. Then we looked at item-based collaborative filtering, which is one of several classic techniques based on a matrix of users-item ratings. Such algorithms power everything from Amazon’s “users who bought A also bought B” to Netflix movie recommendations to Google News’ personalization system.

Evaluating the performance of such systems is a major challenge. We need some metric, but not all problems have an obvious way to measure whether we’re doing well. There are many options. Business goals — revenue, time on site, engagement — are generally much easier to measure than editorial goals.

The readings for this week were:

Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et. al
How Reddit Ranking Algorithms Work, Amir Salihefendic
Google News Personalization: Scalable Online Collaborative Filtering, Das et al
Slashdot Moderation, Rob Malda
Pay attention to what Nick Denton is doing with comments, Clay Shirky
How does Google use human raters in web search?, Matt Cutts

This concludes our work on filtering systems — except for Assignment 2.

Assignment 2: Filter Design

Posted on October 2, 2013 by jstray

For this assignment you will design a hybrid filtering algorithm. You will not implement it, but you will explain your design criteria and provide a filtering algorithm in sufficient technical detail to convince me that it might actually work — including psuedocode.

1. Decide who your users are. Journalists? Professionals? General consumers? Someone else?

2. Decide what you will filter. You can choose:

Facebook status updates, like the Facebook news feed
Tweets, like Trending Topics or the many Tweet discovery tools
The whole web, like Prismatic
something else, but ask me first

3. List all available information that you have available as input to your algorithm. If you want to filter Facebook or Twitter, you may pretend that you are the company running the service, and have access to all posts and user data — from every user. You also also assume you have a web crawler or a firehose of every RSS feed or whatever you like, but you must be specific and realistic about what data you are operating with.

4. Argue for the design factors that you would like to influence the filtering, in terms of what is desirable to the user, what is desirable to the publisher (e.g. Facebook or Prismatic), and what is desirable socially. Explain as concretely as possible how each of these (probably conflicting) goals might be achieved through in software. Since this is a hybrid filter, you can also design social software that asks the user for certain types of information (e.g. likes, votes, ratings) or encourages users to act in certain ways (e.g. following) that generate data for you.

5. Write psuedo-code for a function that produces a “top stories” list. This function will be called whenever the user loads your page or opens your app, so it must be fast and frequently updated. You can assume that there are background processes operating on your servers if you like. Your psuedo-code does not have to be executable, but it must be specific and unambiguous, such that a good programmer could actually go and implement it. You can assume that you have libraries for classic text analysis and machine learning algorithms. So, you don’t have to spell out algorithms like TF-IDF or item-based collaborative filtering, or anything else you can dig up in the research literature, but simply say how you’re going to use such building blocks. If you use an algorithm we haven’t discussed in class, be sure to provide a reference to it.

6. Write up steps 1-5. The result should be no more than three pages. However, you must be specific and plausible. You must be clear about what you are trying to accomplish, what your algorithm is, and why you believe your algorithm meets your design goals (though of course it’s impossible to know for sure without testing; but I want something that looks good enough to be worth trying.)

Due before class, October 9

Week 4: Social Filtering

Posted on October 1, 2013 by jstray

This week we looked how groups of people can act as information filters.

Slides.

First we studied Diakopolous’ SRSR (“seriously rapid source review”) system for finding sources on Twitter. There were a few clever bits of machine learning in there, for classifying source types (journalist/blogger, organization, or ordinary individual) and for identifying eyewitnesses. But mostly the system is useful because it presents many different “cues” to the journalist to help them determine whether a source is interesting and/or trustworthy. Useful, but when we look at how this fits into the broader process of social media sourcing — in particular how it fits into the Associated Press’ verification process — it’s clear that current software only adresses part of this complex process. This isn’t a machine learning problem, it’s a user interface and workflow design issue. (For more on social media verification practices, see for example the BBC’s “UGC hub“)

More broadly, journalism now involves users informing each other, and institutions or other authorities communicating directly. The model of journalism we looked at last week, which put reporters at the center of the loop, is simply wrong. A more complete picture includes users and institutions as publishers.

That horizontal arrow of institutions producing their own broadcast media is such a new part of the journalism ecosystem, and so disruptive, that the phenomenon has its own name: “sources go direct,” which seems to have been originally coined by blogging pioneer Dave Winer.

But this picture does not include filtering. There are thousands — no, millions — of sources we could tune into now, but we only direct attention to a narrow set of them, maybe including some journalists or news publications, but probably mostly other types of source, including some primary sources.

This is social filtering. By choosing who we follow, we determine what information reaches us. Twitter in particular does this very well, and we looked at how the Twitter network topology doesn’t look like a human social network, but is more tuned for news distribution.

There are no algorithms involved here… except of course for the code that lets people publish and share things. But the effect here isn’t primarily algorithmic. Instead, it’s about how people operate in groups. This gets us into the concept of “social software,” which is a new interdisciplinary field with its own dynamics. We used the metaphor of “software as architecture,” suggested by Joel Spolsky, to think about how software influences behavior.

As an example of how environment influences behaviour, we watched this video which shows how to get people to take the stairs.

I argued that there are three forces which we can use to shape behavior in social software: norms, laws, and code. This implies that we have to write the code to be “anthropologically correct,” as Spolsky put it, but it also means that the code alone is not enough. This is something Spolsky observed as StackOverflow has become a network of Q&A sites on everything from statistics to cooking: each site has its own community and its own culture.

Previously we phrased the filter design problem in two ways: as a relevance function, and as a set of design criteria. When we use social filtering, there’s no relevance function deciding what we see. But we still have our design criteria, which tell us what type of filter we would like, and we can try to build systems that help people work together to produce this filtering. And along with this, we can imagine norms — habits, best practices, etiquette — that help this process along, an idea more thoroughly explored by Dan Gilmour in We The Media.

The readings from the syllabus were:

Required

A Group is its own worst enemy, Clay Shirky
What’s the point of social news?, Jonathan Stray
Finding and Assessing Social Information Sources in the Context of Journalism, Nick Diakopolous et al.

Recommended

Learning from Stackoverflow, first fifteen minutes, Joel Spolsky
Norms, Laws, and Code, Jonathan Stray
What is Twitter, a Social Network or a News Media?, Haewoon Kwak, et al,
International reporting in the age of participatory media, Ethan Zuckerman
We The Media. Introduction and Chapter 1, Dan Gillmor,
Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray

Week 3: Algorithmic Filtering

Posted on September 18, 2013 by jstray

Slides.

Assignment 1: TF-IDF

Posted on September 11, 2013 by jstray

In this assignment you will implement the TF-IDF formula and use it to study the topics in State of the Union speeches given every year by the U.S. president.

1. Download the source data file state-of-the-union.csv. This is a standard CSV file with one speech per row. There are two columns: the year of the speech, and the text of the speech. You will write a Python program that reads this file and turns it into TF-IDF document vectors, then prints out some information. Here is how to read a CSV in Python.

2. Tokenize the text each speech, to turn it into a list of words. As we discussed in class, we’re going to tokenize using a simple scheme:

convert all characters to lowercase
remove all punctuation characters
split the string on spaces

3. Compute a TF (term frequency) vector for each document. This is simply how many times each word appears in that document. You should end up with a Python dictionary from terms (strings) to term counts (numbers) for each document.

4. Count how many documents each word appears in. This can be done after computing how the TF vector by each document, by incrementing the document count of each word that appears in the TF vector. After reading all documents you should now have a dictionary from each term to the number of documents that term appears in.

5. Turn the final document counts into IDF (inverse document frequency) weights by applying the formula IDF(term) = log(total number of documents / number of documents that term appears in.)

6. Now multiply the TF vectors for each document by the IDF weights for each term, to produce TF-IDF vectors for each document. Then normalize each vector, so the sum of squared weights is 1.

7. Congratulations! You have a set of TF-IDF vectors for this corpus. Now it’s time to see what they say. Take the speech you were assigned in class, and print out the highest weighted 20 terms, along with their weights. What do you think this particular speech is about? Write your answer in at most 200 words.

8. Your task now is to see if you can understand how the topics changed since 1900. For each decade since 1900, do the following:

sum all of the TF-IDF vectors for all speeches in that decade
print out the top 20 terms in the summed vector, and their weights

Now take a look at the terms for each decade. What patterns do you see? Can you connect the terms to major historical events? (wars, the great depression, assassinations, the civil rights movement, Watergate…) Write up what you see in narrative form, no more than 500 words, referring to the terms for each decade.

9. Hand in by email, before class next week:

your code
the printout and analysis from step 7
the printout and narrative from step 8.

Week 2: Text Analysis

Posted on September 11, 2013 by jstray

Slides.

Can we use machines to help us understand text? In this class we will cover basic text analysis techniques, from word counting to topic modeling. The algorithms we will discuss this week are used in just about everything: search engines, document set visualization, figuring out when two different articles are about the same story, finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how just about every filtering and personalization system works.

Required

Online Natural Language Processing Course, Stanford University
- Week 7: Information Retrieval, Term-Document Incidence Matrix
- Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
- Week 7: Ranked Information Retrieval, Term Frequency Weighting
- Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
- Week 7: Ranked Information Retrieval, TF-IDF weighting

Recommended

A full-text visualization of the Iraq war logs, Jonathan Stray
Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector Space Model, Manning, Raghavan, and Schütze.
Probabilistic Topic Models, David M. Blei
General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary King

Examples

Watchwords: Reading China Through its Party Vocabulary, Qian Gang
Message Machine, ProPublica

Assignment: TF-IDF analysis of State of the Union speeches.

Week 1: Basics

Posted on September 11, 2013 by jstray

Slides.

In this first week we ask: where do computer science and journalism intersect? CS techniques can help journalism in four different areas: data-driven reporting, story presentation, information filtering, and effect tracking.

Then we jumped right in with the concept of data. Specifically, we study feature vectors which are a fundamental data representation for many algorithms in data mining, language processing, machine learning, and visualization. This week we will explore two things: representing objects as vectors, and visualizing high dimensional spaces.

We also explored a principal components analysis of voting data from the UK House of Lords. The R file we ran to produce the output is here. A more sophisticated analysis, using custom distance metrics and multi-dimensional scaling, is here.

Readings:

What should the digital public sphere do?, Jonathan Stray
Computational Journalism, Cohen, Turner, Hamilton
Precision Journalism, Ch.1, Journalism and the Scientific Tradition, Philip Meyer

Viewed in class

The Jobless rate for People Like You, New York Times
Dollars for Docs, ProPublica
What did private security contractors do in Iraq and document mining methodology, Jonathan Stray

Syllabus – Fall 2013

Posted on August 20, 2013 by jstray

Aims of the course
The aim of the course is to familiarise students with current areas of research and development within computer science that have a direct relevance to the field of journalism. We are interested in producing both stories and software: we will study advanced techniques which can be used for individual acts of journalism, but we will also be studying the design of the software systems which inform us all.

Our scope is wide enough to include both relatively traditional journalistic work, such as computer-assisted investigative reporting, and the broader information systems that we all use every day to inform ourselves, such as search engines. The course will provide students with a thorough understanding of how particular fields of computational research relate to products being developed for journalism, and provoke ideas for their own research and projects.

Research-level computer science material will be discussed in class, but the emphasis will be on understanding the capabilities and limitations of this technology. Students with a CS background will have opportunity for algorithmic exploration and innovation, however the primary goal of the course is thoughtful, application-oriented research and design.

Assignments will be completed in groups and involve experimentation with fundamental computational techniques. There will be some coding, but the emphasis will be on thoughtful and critical analysis. As this is a journalism course, you will be expected to write clearly.

Format of the class, grading and assignments.
This is a fourteen week course for Masters’ students which has both a six point and a three point version. The six point version is designed for dual degree candidates in the journalism and computer science concentration, while the three point version is designed for those cross listing from other concentrations and schools.

The class is conducted in a seminar format. Assigned readings and computational techniques will form the basis of class discussion. Throughout the semester we will invite guest speakers with expertise in the relevant areas to talk about their related journalism, research, and product development

The course will be a graded as follows:

Assignments: 60%. There will be a homework assignment after most classes.
Class participation: 10%
Final project (for dual degree students only): 30%. This will be either a research paper, a computationally-driven story, or a software project.

The class is conducted on pass/fail basis for journalism students, in line with the journalism school’s grading system. Students from other departments will receive a letter grade.

Week 1. – Basics
First we ask: where do computer science and journalism intersect? CS techniques can help journalism in four different areas: data-driven reporting, story presentation, information filtering, and effect tracking. Then we jump right in with the concept of data. Specifically, we study feature vectors which are a fundamental data representation for many algorithms in data mining, language processing, machine learning, and visualization. This week we will explore two things: representing objects as vectors, and visualizing high dimensional spaces.

Required

What should the digital public sphere do?, Jonathan Stray
Computational Journalism, Cohen, Turner, Hamilton

Recommended

Precision Journalism, Ch.1, Journalism and the Scientific Tradition, Philip Meyer

Viewed in class

The Jobless rate for People Like You, New York Times
Dollars for Docs, ProPublica
What did private security contractors do in Iraq and document mining methodology, Jonathan Stray
The network of global corporate control, Vitali et. al.
World Government Data, The Guardian UK
NICAR journalism database library

Lecture 2: Text Analysis
Can we use machines to help us understand text? In this class we will cover basic text analysis techniques, from word counting to topic modeling. The algorithms we will discuss this week are used in just about everything: search engines, document set visualization, figuring out when two different articles are about the same story, finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how just about every filtering and personalization system works.

Required

Online Natural Language Processing Course, Stanford University
- Week 7: Information Retrieval, Term-Document Incidence Matrix
- Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
- Week 7: Ranked Information Retrieval, Term Frequency Weighting
- Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
- Week 7: Ranked Information Retrieval, TF-IDF weighting

Recommended

A full-text visualization of the Iraq war logs, Jonathan Stray
Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector Space Model, Manning, Raghavan, and Schütze.
Probabilistic Topic Models, David M. Blei
General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary King

Examples

Watchwords: Reading China Through its Party Vocabulary, Qian Gang
Message Machine, ProPublica

Assignment: TF-IDF analysis of State of the Union speeches.

Week 3: Information overload and algorithmic filtering
This week we begin our study of filtering with some basic ideas about its role in journalism. Then we shift gears to pure algorithmic approaches to filtering, with a look at how the Newsblaster system works (similar to Google News.)

Required

Who should see what when? Three design principles for personalized news Jonathan Stray
Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown et al

Recommended

Guess what? Automated news doesn’t quite work, Gabe Rivera
The Hermeneutics of Screwing Around, or What You Do With a Million Books, Stephen Ramsay
Can an algorithm be wrong?, Tarleton Gillespie
The Netflix Prize, Wikipedia

Week 4: Social software and social filtering
We have now studied purely algorithmic modes of filtering, and this week we will bring in the social. First we’ll look at the entire concept of “social software,” which is a new interdisciplinary field with its own dynamics. We’ll use the metaphor of “architecture,” suggested by Joel Spolsky, to think about how software influences behaviour. Then we’ll study social media and its role in journalism, including its role in information distribution and collection, and emerging techniques to help find sources.

Required

A Group is its own worst enemy, Clay Shirky
What’s the point of social news?, Jonathan Stray
Finding and Assessing Social Information Sources in the Context of Journalism, Nick Diakopolous et al.

Recommended

Learning from Stackoverflow, first fifteen minutes, Joel Spolsky
Norms, Laws, and Code, Jonathan Stray
What is Twitter, a Social Network or a News Media?, Haewoon Kwak, et al,
International reporting in the age of participatory media, Ethan Zuckerman
We The Media. Introduction and Chapter 1, Dan Gillmor,
Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray

Week 5: Hybrid filters, recommendation, and conversation
We have now studied purely algorithmic and mostly social modes of filtering. This week we’re going to study systems that combine software and people. We’ll a look “recommendation” systems and the socially-driven algorithms behind them. Then we’ll turn to online discussions, and hybrid techniques for ensuring a “good conversation” — a social outcome with no single definition. We’ll finish by looking at an example of using human preferences to drive machine learning algorithms: Google Web search.

Required

Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et. al
How Reddit Ranking Algorithms Work, Amir Salihefendic

Recommended

Google News Personalization: Scalable Online Collaborative Filtering, Das et al
Slashdot Moderation, Rob Malda
Pay attention to what Nick Denton is doing with comments, Clay Shirky
How does Google use human raters in web search?, Matt Cutts

Assignment – Design a filtering algorithm for status updates.

Week 6: Visualization
An introduction into how visualisation helps people interpret information. The difference between infographics and visualization, and between exploration and presentation. Design principles from user experience considerations, graphic design, and the study of the human visual system. Also, what is specific about visualization in journalism, as opposed to the many other fields that use it?

Required

Designing Data Visualizations, Noah Illinsky and Julie Steele, OReilly
Computational Information Design chapters 1 and 2, Ben Fry

Recommended

Journalism in an age of data, Geoff McGhee
Visualization Rhetoric: Framing Effects in Narrative Visualization, Hullman and Diakopolous
Visualization, Tamara Munzner

Week 7: Structured journalism and knowledge representation
Is journalism in the text/video/audio business, or is it in the knowledge business? This class we’ll look at this question in detail, which gets us deep into the issue of how knowledge is represented in a computer. The traditional relational database model is often inappropriate for journalistic work, so we’re going to concentrate on so-called “linked data” representations. Such representations are widely used and increasingly popular. For example Google recently released the Knowledge Graph. But generating this kind of data from unstructured text is still very tricky, as we’ll see when we look at th Reverb algorithm.

Required

A fundamental way newspaper websites need to change, Adrian Holovaty
The next web of open, linked data – Tim Berners-Lee TED talk
Identifying Relations for Open Information Extraction, Fader, Soderland, and Etzioni (Reverb algorithm)

Recommended

Standards-based journalism in a semantic economy, Xark
What the semantic web can represent – Tim Berners-Lee
Building Watson: an overview of the DeepQA project
Can an algorithm write a better story than a reporter? Wired/ 2012.

Assignment: Text enrichment experiments using OpenCalais entity extraction.

Week 8: Network analysis
Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, but not so much in journalism. We’ll look at basic techniques and algorithms and try to understand the promise — and the many practical problems.

Required

Analyzing the Data Behind Skin and Bone, ICIJ
Identifying the Community Power Structure, an old handbook for community development workers about figuring out who is influential by very manual processes.
Simmelian Backbones: Amplifying Hidden Homophily in Facebook Networks. A soophisticated and sociologically-aware network analysis method.

Recommended

Visualizing Communities, Jonathan Stray
The network of global corporate control, Vitali et. al.
The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
Sections I and II of Community Detection in Graphs, Fortunato
Centrality and Network Flow, Borgatti
Exploring Enron, Jeffrey Heer

Examples:

Galleon’s Web, Wall Street Journal
Muckety
Theyrule.net,
Who Runs Hong Kong?, South China Morning Post

Assignment: Compare different centrality metrics in Gephi.

Week 9: Drawing conclusions from data
You’ve loaded up all the data. You’ve run the algorithms. You’ve completed your analysis. But how do you know that you are right? It’s incredibly easy to fool yourself, but fortunately, there is a long history of fields grappling with the problem of determining truth in the face of uncertainty, from statistics to intelligence analysis.

Required

Correlation and causation, Business Insider
The Psychology of Intelligence Analysis, chapters 1,2,3 and 8. Richards J. Heuer

Recommended

If correlation doesn’t imply causation, then what does?, Michael Nielsen
Graphical Inference for Infovis, Hadley Wickham et al.
Why most published research findings are false, John P. A. Ioannidis

Assignment: analyze gun ownership vs. gun violence data.

Week 10: Security, Surveillance, and Censorship
Who is watching our online activities? How do you protect a source in the 21st Century? Who gets to access to all of this mass intelligence, and what does the ability to survey everything all the time mean both practically and ethically for journalism? In this lecture we will talk about who is watching and how, and how to create a security plan using threat modeling.

Required

Chris Soghoian, Why secrets aren’t safe with journalists, New York times 2011
Hearst New Media Lecture 2012, Rebecca MacKinnon

Recommended

CPJ journalist security guide section 3, Information Security
Global Internet Filtering Map, Open Net Initiative
The NSA is building the country’s biggest spy center, James Banford, Wired

Cryptographic security

Unplugged: The Show part 9: Public Key Cryptography
Diffe-Hellman key exchange, ArtOfTheProblem

Anonymity

Tor Project Overview
Who is harmed by a real-names policy, Geek Feminism

Assignment: Use threat modeling to come up with a security plan for a given scenario.

Week 11: Tracking flow and impact
How does information flow in the online ecosystem? What happens to a story after it’s published? How do items spread through social networks? We’re just beginning to be able to track ideas as they move through the network, by combining techniques from social network analysis and bioinformatics.

Required

Metrics, Metrics everywhere: Can we measure the impact of journalism?, Jonathan Stray
Meme-tracking and the Dynamics of the News Cycle, Leskovec et al.
The role of social networks in information diffusion, Eytan Bakshy et al.

Recommended

Defining Moments in Risk Communication Research: 1996–2005, Katherine McComas
Chain Letters and Evolutionary Histories, Charles H. Bennett, Ming Li and Bin Ma
Competition among memes in a world with limited attention, Weng et al.
Zach Seward, In the news cycle, memes spread more like a heartbeat than a virus
How hidden networks orchestrated Stop Kony 2012, Gilad Lotan

Week 12 – Project review
We will spend this week discussing your final projects and figuring out the best approaches to your data and/or topic.

Assignment 4

Posted on November 27, 2012 by jstray

For this assignment, each group will take one of the following four scenarios and design a security plan. More specifically, you will flesh out the scenario, create a threat model, come up with a plausible security plan, and analyze the weaknesses of your plan.

1. You are a photojournalist in Syria with digital images you wants to get out of the country. Limited internet access is available at a cafe. Some of the images may identify people working with the rebels who could be targeted by the government if their identity is revealed. In addition you would like to remain anonymous until the photographs are published, so that you can continue to work inside the country for a little longer, and leave without difficulty.

2. You are working on an investigative story about the CIA conducting operations in the U.S., in possible violation the law. You have sources inside the CIA who would like to remain anonymous. You will occasionally meet with these sources in but mostly communicate electronically. You would like to keep the story secret until it is published, to avoid pre-emptive legal challenges to publication.

3. You are reporting on insider trading at a large bank, and talking secretly to two whistleblowers. If these sources are identified before the story comes out, at the very least you will lose your sources, but there might also be more serious repercussions — they could lose their jobs, or the bank could attempt to sue. This story involves a large volume of proprietary data and documents which must be analyzed.

4. You are working in Europe, assisting a Chinese human rights activist. The activist is working inside China with other activists, but so far the Chinese government does not know they are an activist and they would like to keep it this way. You have met the activist once before, in person, and have a phone number for them, but need to set up a secure communications channel.

These scenario descriptions are incomplete. Please feel free to expand them, making any reasonable assumptions about the environment or the story — though you must document your assumptions, and you can’t assume that you have unrealistic resources or that your adversary is incompetent.

Start by creating a threat model, which must consider:

What must be kept private? Specify all of the information that must be secret, including notes, documents, files, locations, and identities — and possibly even the fact that someone is working on a story.
Who is the adversary and what do they want to know? It may be a single person, or an entire organization or state, or multiple entities. They may be very interested in certain types of information, e.g. identities, and uninterested in others. List each adversary and their interests.
What can they do to find out? List every way they could try to find out what you want secret, including technical, legal, and social methods.
What is the risk? Explain what happens if an adversary succeeds in breaking your security. What are the consequences, and to whom? Which of these is it absolutely necessary to avoid?

Once you have specified your your threat model, you are ready to design your security plan. The threat model describes the risk, and the goal of the security plan is to reduce that risk as much as possible.

Your plan must specify appropriate software tools, plus how these tools must be used. Pay particular attention to necessary habits: specify who must do what, and in what way, to keep the system secure. Explain how you will educate your sources and collaborators in the proper use of your chosen tools, and how hard you think it will be to make sure everyone does exactly the right thing.

Also document the weaknesses of your plan. What can still go wrong? What are the critical assumptions that will cause failure if it turns out you have guessed wrong? What is going to be difficult or expensive about this plan?

Include in your writeup (5 pages max):

A more detailed scenario, including all the assumptions you have made to flesh out the situation.
A threat model answering the four questions above.
A security plan including tools, procedures, necessary habits.
A training plan, explaining how you are going to teach everyone involved to execute the security plan.
An analysis of the vulnerability of your plan. What can still go wrong?

Due last class, Dec 10.