Data: Collect, Clean and Manipulate ONA 2012 San Francisco Jennifer LaFleur, ProPublica j [email protected] @j_la28
Transcript
1. Data: Collect, Clean and Manipulate. ONA 2012, San Francisco. Jennifer LaFleur, [email protected] @j_la28

2. Why data? It takes you beyond the anecdote. It's easier than counting sheets of paper.
3. Why data? Contrasts are in the data.
4. Caution: This slide contains extreme nerdiness
5. Why Computer-Assisted Reporting? Contrasts are in the data. Your most powerful figures are in the data.
6. Source: California Health Dept. data, Medicare billing data. Findings: Some hospitals had alarming rates of a Third World nutritional disorder among their Medicare patients.
7. Why data? Contrasts are in the data. Your most powerful figures are in the data. You can make connections you might not be able to make otherwise.
8. Data: Youth prison workers, criminal convictions and grievance data. Findings: Employees with criminal backgrounds were more likely to be accused of abusing inmates.
9. Data: Federal bridge inspections and stimulus funding. Findings: Some of the nation's worst bridges did not get stimulus funds.
10. Why data? Contrasts are in the data. Your most powerful figures are in the data. You can make connections you might not be able to make otherwise. You can test assumptions.
11. Source: NHTSA complaint data. Findings: Unintended acceleration has been a problem across the auto industry.
12. HT/Florencia Coelho
13. Collecting the data
14. Where's the data? If something is inspected, licensed, enforced or purchased, there probably is a database.
15. Where's the data? If there is a report or a form, there probably is a database.
16. Where's the data? Sometimes data is readily available online for download.
17. Source: Census. Findings: Fueled by the dismal economy and high unemployment, more Americans are doubling up.
18. Source: Medicaid nursing home survey data and finance data, housing data. Findings: A shortage of places for the disabled to live outside a nursing home and regulations that critics say make it hard to qualify for home services mean many who want out continue to receive expensive nursing care.
19. Where's the data? Sometimes you have to scrape it. That usually involves programs that automate searching tasks on Web sites.
20. Where's the data? More often you need to go to an agency to get the data. This can be tricky if an agency doesn't want to release it. (Stay tuned for more on that.)
21. Source: School district credit card purchases. Findings: District cardholders made questionable purchases with their cards.
22. Sometimes, there is no data. But it's okay because there are techniques for sampling and building a database.
23. ProPublica pulled a random sample of 500 names from a list of individuals who had been granted or denied pardons (around 2,000). We created a database from months of researching individuals: their crime, age, sentence. We found that even after controlling for other factors, whites were more likely to get a pardon.
24. Source: Loan details, foreclosure information and bankruptcy filings. Findings: Loans leading to foreclosure didn't always follow conventional wisdom.
25. When you have to ask for the data. Before filing a request: Ask for it. If they require a formal request, find out who it should go to and what you should ask for. Letter should describe what you're asking for. Note that you're willing to negotiate. Ask for a cost estimate.
26. Dear Records Administrator:
I'm writing to request under the Texas Public Information Act an electronic copy of the current health-related services registry database for the state of Texas. I also am requesting electronic copies or a database of all complaints filed against health-related service registry members since Jan. 1, 2000.
I frequently deal with large raw databases, so I would be able to accept information in several formats including ASCII, dbf, xls, etc. and can accept the data on a variety of media (computer tape, CD-ROM, FTP, email attachment, etc.). Please include record layouts, code sheets or any other documentation necessary to interpret the data.
I am requesting all data fields.
If there are any fields that you must withhold by law, please let me know what those fields are, so I can amend my request.
In the interest of expediency, and to minimize the research and/or duplication burden on your staff, I would be happy to speak with your database administrator to figure out a method that is easiest for you.
If you have questions or need more information, please contact me by telephone or email. My telephone number is: 214-977-8509. My email address is [email protected].
If you will be charging processing fees, please send me an itemized estimate explaining how the costs were calculated.
27. Getting electronic information. Know the law. Know how your state treats (or doesn't) the records you need. Know what information you want. Do your homework. Know what the appropriate cost should be. Know who does the data entry. Get to know Leon. When something may not clearly be public, use your sourcing.
28. Just another way of saying no: Huge costs. Delay tactics. Oh you silly little journalist. Sending you the wrong thing. Your request was unclear. HIPAA. Privacy. Privatization.
29. Negotiating: Some examples
30. Our database is on a mainframe and it's very complicated, Missy
31. We don't have the authority to do that
32. That will cost $25,000.
33. We have processed your request. The labor cost for the request is as follows.
    Item              # of hours
    RESEARCH          20
    CREATING FILES    6
    CODING            24
    TESTING           4
    Total (54 X $72) = $3,888.00
34. From Texas Public Information Act: 111.67. Estimates and Waivers of Public Information Charges. (a) A governmental body is required to provide a requestor with an itemized statement of estimated charges if charges for copies of public information will exceed $40, or if a charge in accordance with 111.65 of this title (relating to Access to Information Where Copies Are Not Requested) will exceed $40 for making public information available for inspection. A governmental body that fails to provide the required statement may not collect more than $40.
The itemized statement must be provided free of charge and must contain the following information:
35. We only keep the information for 7 days
36. Check retention schedules
37. That uses proprietary software.
38. We don't keep that on computer
39. Okay, we do, but it's a lot of files
40. That information is protected by law
41. Cleaning data
42. Remember that data are not perfect
43. It doesn't mean you can't use it. Do integrity checks to find the flaws. Add caveats where necessary. Do your own analysis rather than relying on an agency's analysis of bad data.
44. Integrity checks for every data set. Read the documentation. Understand the contents of every field. Know how many records you should have. Check counts and totals against reports. Are all possibilities included? All states, all counties, correct ranges?
45. Integrity checks for every data set. Internal data checks: Is there more money going to sub-contractors than went to the prime contractor? Are there more teachers than students? Do people have birth dates in the future or so long ago they would be long gone?
46. If your data is in Excel, use the filter function to see what the values are in individual fields.
47. Integrity checks for every data set. Check for missing data, misplaced data or blank fields. Use a standard naming convention for files and tables (I wouldn't recommend "final"). Check for duplicates. Take margins of error into account if necessary (important if you're using Census data).
48. 2010 Census ACS: Median HH Income by Metro Area
49. Be creative when you look for duplicates
50. Beyond the basics. Keep a notes file. Don't work off your original database. Know the source. Check against summary reports. Use the right tool. Check for outliers when it comes to ups and downs.
51. Truck accidents by year and agency
52. Beyond the basics. Check with experts. Are there standards? (ex: a drop by more than 10 percentage points is a red flag). Find out what others have done. Gut check. Go physically see a record or spot check against documents.
53. Voter Fraud. Dozens of St. Louis voters are being wrongly accused of casting ballots from fraudulent addresses in last year's Nov. 7 election. They are among thousands of registered voters who, based on city property records, appear to live on vacant lots.
54. Texas test score data: official results versus district. Duncanville district reported 4th grade writing. Official report for Duncanville 4th grade writing. Courtesy Holly Hacker, The Dallas Morning News
55. Three rounds of analysis after bouncing off subjects and experts: Demographically based. Voir dire. Socioeconomics.
56. Checks when you're matching data. A name is not enough. Lots of people have the same name. Get dates of birth and other information to make sure you have the correct person.
57. Source: Illinois health data, police data. Findings: Dangerous systemic failed to protect elderly patients in Illinois nursing homes that also house mentally ill younger residents, including murderers, sex offenders, and armed robbers.
58. Even people with seemingly unique names aren't so unique
59. Evaluating outside studies. Get the questionnaire and methodology. Beware of nonscientific methods: Web surveys, man on the street. Know the sample size and the sampling error. Account for margin of error and non-response when drawing conclusions. Run statistical tests on the data if possible.
60. Reporting data. Consider reporting rates, not raw numbers. Avoid false precision: "53.14 percent said" in a poll with a 5 percentage point margin of error. Avoid number overload: "about half" is usually just as useful as "51 percent" in most cases. Adjust money for inflation. When analyzing income, use median rather than average (Bill Gates factor).
61. When the data is the problem you might still have a story. Erroneous government databases can often be a story themselves.
62. Manipulating data for stories and apps
63. Know which tool to use: Reporting individual records. Counting/summing. Mapping. Statistics.
64. Source: Medicaid outcomes data for dialysis facilities. Findings: A CMS online tool did not tell the whole story about facilities. In some counties the gaps in measures, such as survival rate, were vast.
65. Source: Washington Health Department data. Findings: MRSA has been quietly killing in hospitals for decades. But no one had tracked it until this story.
66. Source: Dept. of Ed data and surveys of campus crisis clinics. Findings: Many campuses had lax enforcement, and reporting loopholes mean problems go unchecked.
67. Source: EPA and state data on hazardous chemical locations. Findings: Dallas County has 900+ sites that store hazardous chemicals.
68. Source: Dam inspection data from Texas and federal government. Findings: Dam records had not been updated to account for population growth.
69. Source: 311 calls for downed trees. Findings: After a tornado swept across New York City, 311 calls for downed trees help trace its path.
70. Source: City Budget. Findings: Some neighborhoods suffer more than others as mayor cuts budgets.
71. Disparities in water usage. Water use highest in poor areas of the city. Mapping and statistical analysis.
72. Presenting the data. Include a methodology explaining what you did and what you don't know. For really complicated analyses consider a super nerdy white paper explaining all of your findings. If you make data downloadable include field descriptions and anything users should watch for.
73. For more information: www.ire.org www.propublica.org jennifer.lafleur@propublica.org
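As an illustration of the integrity checks listed in slides 44 to 47 (record counts against a report, blank fields, duplicates, impossible birth dates, values outside the allowed set), here is a minimal Python sketch. It is not the method used in the talk: the sample records, the field names (`name`, `dob`, `state`), the expected count and the list of valid states are all made up for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical records; in practice these would be loaded from the agency's file.
records = [
    {"name": "Ann Lee", "dob": "1950-03-02", "state": "TX"},
    {"name": "Ann Lee", "dob": "1950-03-02", "state": "TX"},  # exact duplicate
    {"name": "Bob Ray", "dob": "2150-01-01", "state": "TX"},  # birth date in the future
    {"name": "Cy Dodd", "dob": "", "state": "XX"},            # blank field, unknown state
]

EXPECTED_COUNT = 5           # assumed figure taken from a summary report
VALID_STATES = {"TX", "CA"}  # assumed list of allowed codes


def run_checks(rows, expected_count, valid_states, today):
    """Run a few basic integrity checks on a list of dict rows."""
    # Blank fields: how many rows have an empty value in each field.
    blanks = Counter(
        field for row in rows for field, value in row.items() if not value.strip()
    )
    # Duplicates: rows whose full set of values appears more than once.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicate_rows = sum(n - 1 for n in seen.values() if n > 1)
    # Range checks: birth dates after "today" and codes outside the valid set.
    future_dobs = [
        row["name"]
        for row in rows
        if row["dob"] and date.fromisoformat(row["dob"]) > today
    ]
    bad_states = sorted({row["state"] for row in rows} - valid_states)
    return {
        "count_matches_report": len(rows) == expected_count,
        "blank_fields": dict(blanks),
        "duplicate_rows": duplicate_rows,
        "future_birth_dates": future_dobs,
        "unexpected_states": bad_states,
    }


# A fixed date keeps the example repeatable; normally this would be date.today().
report = run_checks(records, EXPECTED_COUNT, VALID_STATES, today=date(2012, 1, 1))
for check, result in report.items():
    print(f"{check}: {result}")
```

Exact-match duplicate detection is only the starting point. As slide 49 notes, real duplicates often differ in spelling or spacing, so a fuller check would normalize names before comparing.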