In this assignment you will implement the TF-IDF formula and use it to study the topics in State of the Union speeches given every year by the U.S. president.
1. Download the source data file state-of-the-union.csv. This is a standard CSV file with one speech per row. There are two columns: the year of the speech, and the text of the speech. You will write a Python program that reads this file and turns it into TF-IDF document vectors, then prints out some information. Here is how to read a CSV in Python.
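As a rough sketch of the reading step, the snippet below uses Python's built-in csv module. The function name read_speeches is made up for illustration, and the code assumes the year is in the first column, the text is in the second, and the file is UTF-8 encoded; it also skips any row whose first column is not a number, in case the file has a header row.

```python
import csv
import io

# Speeches can be long, so raise the per-field size limit
# (the csv module's default is 131072 characters).
csv.field_size_limit(10_000_000)

def read_speeches(f):
    """Return a list of (year, text) pairs from an open CSV file object."""
    rows = []
    for row in csv.reader(f):
        # Skip blank lines and a possible header row (an assumption about the file).
        if len(row) < 2 or not row[0].strip().isdigit():
            continue
        rows.append((int(row[0]), row[1]))
    return rows

# With the real file you would write something like:
#   with open("state-of-the-union.csv", newline="", encoding="utf-8") as f:
#       speeches = read_speeches(f)
# Here a tiny in-memory sample stands in for the file.
sample = io.StringIO('1900,"To the Senate, and House."\n1901,"Second speech."\n')
speeches = read_speeches(sample)
print(speeches[0])
```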
2. Tokenize the text of each speech to turn it into a list of words. As we discussed in class, we’re going to tokenize using a simple scheme:
- convert all characters to lowercase
- remove all punctuation characters
- split the string on spaces
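One possible way to write the three tokenization rules above is sketched here. The function name tokenize is illustrative. Two assumptions are worth noting: "punctuation" is taken to mean the ASCII characters in string.punctuation (so curly quotes would survive), and the split is on any run of whitespace rather than on single spaces, which avoids empty tokens.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_REMOVE_PUNCT = str.maketrans("", "", string.punctuation)

def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = text.translate(_REMOVE_PUNCT)
    # split() with no argument splits on runs of whitespace (spaces, newlines, tabs).
    return text.split()

print(tokenize("Hello, World!  It's 1900."))
```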
3. Compute a TF (term frequency) vector for each document. This is simply how many times each word appears in that document. You should end up with a Python dictionary from terms (strings) to term counts (numbers) for each document.
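A minimal sketch of the term-count step, assuming the document has already been tokenized into a list of strings. The name term_frequencies is made up for this example; collections.Counter does the counting.

```python
from collections import Counter

def term_frequencies(tokens):
    """Map each term to the number of times it appears in one document."""
    return dict(Counter(tokens))

tf = term_frequencies(["the", "union", "the"])
print(tf)
```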
4. Count how many documents each word appears in. This can be done after computing the TF vector for each document, by incrementing the document count of each word that appears in that TF vector. After reading all documents you should have a dictionary from each term to the number of documents that term appears in.
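The document-count step might look like the sketch below. It assumes each TF vector is a dictionary as described in step 3; the name document_frequencies is illustrative.

```python
def document_frequencies(tf_vectors):
    """Map each term to the number of documents it appears in."""
    df = {}
    for tf in tf_vectors:
        # Each term is a key at most once per TF vector,
        # so this adds exactly 1 per document containing the term.
        for term in tf:
            df[term] = df.get(term, 0) + 1
    return df

df = document_frequencies([{"the": 3, "war": 1}, {"the": 5, "peace": 2}])
print(df)
```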
5. Turn the final document counts into IDF (inverse document frequency) weights by applying the formula IDF(term) = log(total number of documents / number of documents that term appears in).
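The formula translates fairly directly into code. This sketch assumes the natural logarithm (math.log), since the formula does not name a base; a different base only rescales every weight by the same constant. The name idf_weights is illustrative.

```python
import math

def idf_weights(df, num_docs):
    """IDF(term) = log(num_docs / number of documents containing term)."""
    return {term: math.log(num_docs / count) for term, count in df.items()}

# A term that appears in every document gets weight log(1) = 0.
idf = idf_weights({"the": 2, "war": 1}, num_docs=2)
print(idf)
```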
6. Now multiply the TF vectors for each document by the IDF weights for each term, to produce TF-IDF vectors for each document. Then normalize each vector, so the sum of squared weights is 1.
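One way to sketch the multiply-then-normalize step for a single document is shown below. The name tfidf_vector is made up, and the handling of an all-zero vector (returned unchanged to avoid dividing by zero) is an assumption, not something the assignment specifies.

```python
import math

def tfidf_vector(tf, idf):
    """Multiply term counts by IDF weights, then scale to unit length."""
    weights = {term: count * idf[term] for term, count in tf.items()}
    # Euclidean length: square root of the sum of squared weights.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        # Every term had zero weight; nothing to normalize (assumed behavior).
        return weights
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector({"the": 3, "war": 2, "peace": 1},
                   {"the": 0.0, "war": 1.0, "peace": 2.0})
print(vec)
print(sum(w * w for w in vec.values()))  # should be very close to 1
```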
7. Congratulations! You have a set of TF-IDF vectors for this corpus. Now it’s time to see what they say. Take the speech you were assigned in class, and print out the 20 highest-weighted terms, along with their weights. What do you think this particular speech is about? Write your answer in at most 200 words.
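Picking out the highest-weighted terms comes down to sorting the dictionary items by weight. A small sketch, with the illustrative name top_terms and a toy vector in place of a real speech:

```python
def top_terms(vector, n=20):
    """Return the n (term, weight) pairs with the largest weights, highest first."""
    return sorted(vector.items(), key=lambda item: item[1], reverse=True)[:n]

toy_vector = {"a": 0.1, "b": 0.9, "c": 0.5}
for term, weight in top_terms(toy_vector, n=2):
    print(f"{term}\t{weight:.4f}")
```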
8. Your task now is to see if you can understand how the topics have changed since 1900. For each decade since 1900, do the following:
- sum all of the TF-IDF vectors for all speeches in that decade
- print out the top 20 terms in the summed vector, and their weights
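The per-decade sum above could be sketched as follows. This assumes a decade means 1900-1909, 1910-1919, and so on, and that the input is a list of (year, TF-IDF dictionary) pairs; the name sum_by_decade is illustrative. Summing two vectors here means adding the weights term by term, with missing terms counted as zero.

```python
from collections import defaultdict

def sum_by_decade(year_vectors, first_year=1900):
    """Add up TF-IDF vectors term by term, grouped by decade start year."""
    totals = defaultdict(lambda: defaultdict(float))
    for year, vector in year_vectors:
        if year < first_year:
            continue  # speeches before 1900 are not part of this step
        decade = (year // 10) * 10  # e.g. 1905 -> 1900
        for term, weight in vector.items():
            totals[decade][term] += weight
    return {decade: dict(terms) for decade, terms in totals.items()}

toy = [
    (1899, {"old": 1.0}),
    (1901, {"war": 0.5, "gold": 0.5}),
    (1905, {"war": 0.25}),
    (1912, {"tariff": 1.0}),
]
by_decade = sum_by_decade(toy)
for decade in sorted(by_decade):
    ranked = sorted(by_decade[decade].items(), key=lambda item: item[1], reverse=True)
    print(decade, ranked[:20])
```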
9. Hand in by email, before class next week:
- your code
- the printout and analysis from step 7
- the printout and narrative from step 8.