For this assignment you will evaluate the performance of OpenCalais, a commercial entity extraction service, against your hand annotations.
1. Pick five random news stories and hand-annotate them. Choose an English-language news site with many stories on the home page, or a section of such a site (business, sports, etc.). Then generate five random numbers from 1 to the number of stories on the page and select the corresponding stories. Copy the text of each article into a separate file and save it as plain text (no HTML, no formatting).
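One way to generate the five random story numbers, as a minimal sketch (the value of num_stories is a placeholder; replace it with the actual number of stories on the page you chose):

```python
import random

# Placeholder: replace with the actual count of stories on the page.
num_stories = 30

# Five distinct story numbers, each between 1 and num_stories.
picks = random.sample(range(1, num_stories + 1), 5)
print(sorted(picks))
```

Using random.sample (rather than five separate calls to random.randint) guarantees you never pick the same story twice.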
2. Detect entities by hand in each article. Paste the text of each article into an RTF or Word document and go through it, underlining every entity. Count every mention of a person, place, or organization, including alternate names (“the president”) and pronoun references. Record how many entity references appear in each document (multiple mentions of the same entity all count).
3. Run each document through OpenCalais; you can paste the text into the demo page. Then compare the results to your hand annotations to produce a confusion matrix:
- True Positives: entities marked by you and found by OC
- False Negatives: entities marked by you and not found by OC
- False Positives: entities found by OC but not marked by you
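From these three cells you can compute the standard summary scores, precision and recall (there is no meaningful true-negative count for entity extraction, which is why the matrix has only three cells). A small sketch, with made-up counts purely for illustration:

```python
def scores(tp, fn, fp):
    """Precision, recall, and F1 from the three confusion-matrix cells.

    precision = fraction of OpenCalais's entities that you also marked
    recall    = fraction of your entities that OpenCalais found
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts only, not real data:
p, r, f = scores(tp=40, fn=10, fp=5)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.889 0.8 0.842
```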
4. Turn in:
- Your hand-marked documents.
- A spreadsheet with one row per document and four columns: total entities marked by you, true positives, false negatives, false positives.
- Make sure your spreadsheet includes totals of each of these four columns across all documents.
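The column totals are just sums down each column, and each row should satisfy a simple consistency check: the entities you marked split exactly into those OpenCalais found (true positives) and those it missed (false negatives). A sketch with invented per-document counts:

```python
# One tuple per document: (marked, true pos, false neg, false pos).
# These numbers are made up to illustrate the totals row.
rows = [
    (50, 40, 10, 5),
    (32, 25, 7, 3),
]

# Sanity check: marked = true positives + false negatives on every row.
for marked, tp, fn, fp in rows:
    assert marked == tp + fn, "marked count should equal TP + FN"

# Totals row: sum each column across all documents.
totals = [sum(col) for col in zip(*rows)]
print(totals)  # [82, 65, 17, 8]
```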
- Your analysis. Report on any patterns in the failures that you see. Where is OpenCalais most accurate? Where is it least accurate? Are there predictable patterns to the errors? Are there ambiguities as to what is really an entity?