Assignment 4: Entity Extraction

For this assignment you will evaluate the performance of OpenCalais, a commercial entity extraction service, against your hand annotations.


1. Pick five random news stories and hand-annotate them. Pick an English-language news site with many stories on the home page, or a section of such a site (business, sports, etc.). Then generate five random numbers from 1 to the number of stories on the page, and use them to select stories by their position on the page. Copy and paste the text of each article into a separate file, and save it as plain text (no HTML, no formatting).
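
If you would rather not pick the numbers by hand, the selection can be sketched in a few lines of Python. This is only one possible approach: the function name pick_story_numbers and its parameters are made up for illustration, and it assumes you want five different stories, so it samples without repeats.

```python
import random


def pick_story_numbers(num_stories, k=5, seed=None):
    """Return k distinct story positions between 1 and num_stories, inclusive."""
    if num_stories < k:
        raise ValueError("the page must list at least k stories")
    # A private Random instance lets a seed reproduce the same picks later.
    rng = random.Random(seed)
    # sample() draws without replacement, so no story is chosen twice.
    return sorted(rng.sample(range(1, num_stories + 1), k))
```

For a page listing 40 stories, pick_story_numbers(40) returns five positions; count down the page to find each story.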

2. Detect entities by hand in each article. Paste the text of each article into an RTF or Word document and go through it, underlining every entity. Count every mention of a person, place, or organization, including alternate names (“the president”) and pronoun references. Count how many entities appear in each document.

3. Run each document through OpenCalais. You can paste the text into the demo page here. Then count:

  • Correct entities: entities found by OC and marked by you
  • Incorrect entities: entities found by OC but not marked by you
  • Missed entities: entities marked by you but not found by OC
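
The three counts above can be tallied by hand, but a short script may help you check your arithmetic. The sketch below assumes you have typed your own mentions and the OpenCalais results into two plain Python lists of strings; it does not call OpenCalais. It matches mentions by their text only (ignoring case), so it cannot tell apart two different entities with the same name, and it ignores entity type. The function name score_document and the dictionary keys are illustrative.

```python
from collections import Counter


def score_document(hand_mentions, oc_mentions):
    """Compare hand-marked mentions with OpenCalais mentions for one document."""
    # Counters keep repeated mentions, since every mention is counted.
    hand = Counter(m.strip().lower() for m in hand_mentions)
    oc = Counter(m.strip().lower() for m in oc_mentions)
    return {
        "hand_labelled": sum(hand.values()),
        # Found by OC and marked by you.
        "correct": sum((hand & oc).values()),
        # Found by OC but not marked by you.
        "incorrect": sum((oc - hand).values()),
        # Marked by you but not found by OC.
        "missed": sum((hand - oc).values()),
    }
```

As a sanity check, correct plus missed should always equal the number of mentions you marked by hand.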

4. Turn in:

  • Your hand-marked documents.
  • A spreadsheet with one row per document and four columns: hand-labelled, correct, incorrect, missed. Include totals of these four columns across all documents.
  • Your analysis. Report on any patterns in the failures that you see. Where is OpenCalais most accurate? Where is it least accurate? Are there predictable patterns to the errors? Are there ambiguities as to what is really an entity?
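
Any spreadsheet program is fine for the table. If you prefer to generate it, here is a minimal sketch that builds one row per document plus a totals row and renders it as CSV text, which spreadsheet programs can open. The function names, the column keys, and the "TOTAL" label are assumptions made for this example, not a required format.

```python
import csv
import io

COLUMNS = ["hand_labelled", "correct", "incorrect", "missed"]


def rows_with_totals(counts_by_document):
    """Build one row per document, followed by a row of column totals."""
    rows = []
    totals = dict.fromkeys(COLUMNS, 0)
    for name, counts in counts_by_document.items():
        rows.append([name] + [counts[c] for c in COLUMNS])
        for c in COLUMNS:
            totals[c] += counts[c]
    rows.append(["TOTAL"] + [totals[c] for c in COLUMNS])
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document"] + COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()
```

Pass rows_with_totals a dictionary that maps each file name to its four counts, then write the string returned by to_csv to a .csv file.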

This assignment is due before class on Friday, November 18.

