For this assignment you will analyze global data on the number of homicides versus the number of guns in each country. I’m giving you the data — your job is to tell me what it means. You will interpret a few different plots, and then implement the visual randomization procedure from the paper we discussed in class to examine a tricky case more closely.
The data is from The Guardian Data Blog. I simplified the header names, dropped a few unnecessary columns, and added an OECD column.
1. I’ve written most of the code you will need for this assignment, available from this GitHub repo. (You can git clone it if you like; otherwise just click here to download all files as a zip archive.)
2. We are going to use the R language for this assignment. This is mostly because it has really nice built-in charts (doing this in Python is a real pain), but also because you are likely to encounter R out in the real world of data journalism. Download and install it. To start R, enter R on the command line. To run a program, enter source('filename.R') at the R command prompt. A full language manual is here. You will only need to use a few basic concepts, such as random number generation and for loops.
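If you have not used R before, the following is a minimal sketch of the two concepts mentioned above: random number generation and for loops. The variable names (values, shuffled, pick, total) are made up for illustration and do not come from the assignment files.

```r
# Fix the seed so the "random" results are the same on every run.
set.seed(42)

# sample(x) returns the elements of x in a random order (a permutation).
values <- c(10, 20, 30, 40, 50)
shuffled <- sample(values)
print(shuffled)

# sample(1:9, 1) picks a single integer between 1 and 9 at random.
pick <- sample(1:9, 1)
print(pick)

# A for loop runs its body once for each element of a vector.
total <- 0
for (v in values) {
  total <- total + v
}
print(total)  # 150
```

You can paste these lines at the R prompt one at a time, or save them to a file and run it with source().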
3. Plot the data for all countries’ homicide rate (per 100,000) versus number of privately-owned firearms (per 100) by running source('plot-all-countries.R') at the R prompt. What do you see? Please report on the general patterns here, the outliers, and what this all might mean.
4. Now take a look at only the OECD countries, by uncommenting the indicated line in the source. Re-run the file. What does the chart show now?
5. Now plot only the non-OECD countries, by uncommenting the indicated line in the source (be sure to re-comment the line that selects only OECD countries). What does the chart show now?
6. It looks like there might be a pattern among the OECD countries, but the United States is such an outlier that it’s hard to tell. Is this pattern still significant without the US? To find out, you’re going to apply a randomization test. (We’ll also remove Mexico since it’s not a developed country and thus not really comparable to the other OECD countries.)
Start with the file randomization-test.R. You need to write the code that performs the actual randomization, filling eight of the chart columns with random permutations of the original y values (homicide rates), but putting the original data in the realchart column. To prevent sneak peeks, the code is currently set up to use testing data. When your permutations are working right, you should see something like this when you run the file:
After pressing Enter, the program will tell you which chart has the real (un-permuted) data. Here, with fake data, it’s obvious. It won’t always be.
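To make the idea concrete, here is a rough sketch of a visual randomization lineup on made-up data. It is not the code in randomization-test.R, and apart from realchart every name in it (guns, homicides, make_lineup, n_charts, lineup) is an assumption for illustration. It also assumes nine charts in total: eight permuted and one real.

```r
set.seed(1)

# Made-up stand-in data: 20 "countries" with a built-in positive trend.
guns <- runif(20, min = 0, max = 50)
homicides <- 0.1 * guns + rnorm(20, sd = 1)

# Build one column of y values per chart. Column `realchart` keeps the
# original order; every other column is a random permutation of y.
make_lineup <- function(y, n_charts, realchart) {
  lineup <- matrix(NA_real_, nrow = length(y), ncol = n_charts)
  for (i in 1:n_charts) {
    if (i == realchart) {
      lineup[, i] <- y
    } else {
      lineup[, i] <- sample(y)
    }
  }
  lineup
}

n_charts <- 9
realchart <- sample(1:n_charts, 1)
lineup <- make_lineup(homicides, n_charts, realchart)

# Draw the charts in a 3 x 3 grid, numbered so a viewer can name a guess.
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (i in 1:n_charts) {
  plot(guns, lineup[, i], main = i, xlab = "", ylab = "")
}
par(old_par)
```

Shuffling only the y values keeps the same set of homicide rates and the same set of gun counts in every chart, but breaks any link between them. If the real chart does not stand out from the shuffled ones, the data look like what you would get with no relationship at all.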
7. Now that your program works, try it on the real data by commenting out the two lines that generate the fake data. Re-run, and look at the plots carefully. Which one do you think is the real data? Write down the number of the chart. Then press Enter, and see if you got it right.
8. This isn’t quite fair, because you were already looking at the data in step 4. So get someone else to look at it fresh. Explain to them that you are charting firearms versus homicides and that one of the charts is real but the rest are fakes, and ask them to spot the real chart.
9. Did you guess right? Did your fresh observer guess right? Did you and your observer guess differently? If so, why do you think that is? Was it difficult for you to choose? Based on all of this, do you think there is a correlation between gun ownership and homicide rate for the OECD countries? If so, how strong is it (effect size) and how strong is the evidence (statistical significance)?
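For the last question, it may help to compare the visual test with a numerical one. The sketch below is a plain permutation test on the Pearson correlation, which is a different technique from the visual lineup, and it runs on made-up data rather than the assignment's data set. All variable names are assumptions for illustration. The correlation coefficient stands in for effect size and the permutation p-value for strength of evidence. For comparison, if there are nine charts in the lineup and no real relationship, a fresh observer picks the real chart by luck with probability 1/9, or about 0.11.

```r
set.seed(2)

# Made-up stand-in data, not the assignment's data set.
guns <- runif(25, min = 0, max = 50)
homicides <- 0.05 * guns + rnorm(25, sd = 1)

# Effect size: the observed Pearson correlation.
observed <- cor(guns, homicides)

# Null distribution: correlations after shuffling the y values.
n_perm <- 2000
perm_cor <- numeric(n_perm)
for (i in 1:n_perm) {
  perm_cor[i] <- cor(guns, sample(homicides))
}

# Two-sided p-value: the share of shuffles at least as extreme as observed.
# Adding 1 to both counts treats the observed data as one of the permutations.
p_value <- (sum(abs(perm_cor) >= abs(observed)) + 1) / (n_perm + 1)

cat("correlation:", round(observed, 2), "\n")
cat("p-value:", signif(p_value, 2), "\n")
```

A large correlation with a large p-value means a strong-looking pattern that could easily be chance; a small correlation with a small p-value means a weak pattern that is nonetheless unlikely to be chance. The two numbers answer different questions.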