“OK, but where did that data come from?” Data validation in the Digital Age
Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA, tom@jtjohnson.com
Cheryl Phillips, Data Enterprise Editor, Seattle Times, Seattle, Washington USA, cphillips@seattletimes.com

Data validation in the Digital Age

Aug 23, 2014


The data sets you are about to analyze are only as good and valid as the methodology used to gather the data and create the data set. The presentation by Tom Johnson and Cheryl Phillips was made at the 2012 meeting of the National Institute for Computer-Assisted Reporting, Feb. 2012, in St. Louis.
Transcript
Page 1: Data validation in the Digital Age

“OK, but where did that data come from?”

Data validation in the Digital Age

Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com

Cheryl Phillips
Data Enterprise Editor
Seattle Times
Seattle, Washington USA
cphillips@seattletimes.com

Page 2: Data validation in the Digital Age

Data validation in the Digital Age

Presentation by Cheryl Phillips and Tom Johnson at the National Institute for Computer-Assisted Reporting Conference
Date/Time: Friday, Feb. 24 at 11 a.m.
Location: Frisco/Burlington Room, St. Louis, Missouri USA

This PowerPoint deck and Tipsheets posted at:

http://sdrv.ms/wNtiM7

Page 3: Data validation in the Digital Age

Important point 1: A data base (or report) is only as good as the methodology used to create it.

The methodology = the value of the data set and your story.

Page 4: Data validation in the Digital Age

Data sets are living things; they have pedigree and genealogy

Important point 2:

• Most [all?] data sets are living things.
• And they have a pedigree, a genealogy.
• Data sets live in a dynamic environment.
• Understand the DB ecology.

Page 5: Data validation in the Digital Age

How bad data can do you wrong

Illinois and Missouri sex-offender DB

• St. Louis Post-Dispatch, 2 May 1999, A11: “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST,” by Reese Dunklin; data analysis by David Heath and Julie Luca
• The Dallas Morning News, Sun., 3 Oct. 2004, Page 1A: “Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say,” by Diane Jennings and Darlean Spangenberger
• See stories here

Page 6: Data validation in the Digital Age

How bad data can do you wrong

2011: New Mexico Sec. of State’s “questionable voters” data set, “The Big Bundle”

• ~1.1m voters
• Previous SoS didn’t clean rolls
• Matched name, address, DoB and SS#
– SSA data base; NM driver’s licenses
– 2 variables “mismatch” = Questionable?
– Asked State Police (not AG’s office) to investigate

Page 7: Data validation in the Digital Age

Problems with Sec. of State methodology

• What’s the error rate of original DB?
• Definition of “error”? (Gonzales or Gonzalez)
• Sample(s) by county and state total?
• Error rates of comparative DBs?
• Aggregation of error problem
• 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State: https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
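The "aggregation of error" problem can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the errors in each data base are independent of one another and using made-up error rates (the slides give no actual rates). If a record must be correct in every source for a comparison to be trustworthy, the chance that at least one source is wrong grows with each data base added to the match.

```python
from math import prod

def combined_error_rate(error_rates):
    """Chance that at least one source is wrong for a given record,
    assuming each source's errors are independent of the others."""
    return 1 - prod(1 - rate for rate in error_rates)

# Hypothetical error rates, for illustration only:
# voter roll, SSA data base, driver's license file.
rates = [0.03, 0.02, 0.04]
combined = combined_error_rate(rates)

# Applied to the ~1.1m voters cited on the earlier slide.
voters = 1_100_000
expected_flagged = round(voters * combined)
print(f"Combined error rate: {combined:.4f}")
print(f"Records expected to mismatch from error alone: {expected_flagged:,}")
```

Even modest per-source error rates compound into a combined rate larger than any single one, which is why the error rate of each comparative data base matters before a "mismatch" is treated as suspicious.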

Page 8: Data validation in the Digital Age

Source: https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html

Page 9: Data validation in the Digital Age

A most wonderful story!!!

Data base rich with potential

There be dragons!

Page 10: Data validation in the Digital Age

Building genealogy for target DB

1. Pre-plan
   • 2nd monitor
   • “Logbook” apps
2. Lit. review/interview peers
3. Do data fit theoretical models?
4. Do a “critical biography” of the data
5. Does biography raise critical warnings?
6. Have others run analysis of this data?
7. Acquire latest data and related docs
8. Do tables conform to record layout?
9. Do docs specify expected ranges & frequencies?
10. Are data values missing or out of range?
11. Review major checklist

Source: Palmer, Griff. “Flowchart/decision tree for data base analysis.” pgs. 136-146. Ver 1.0 Proceedings, IAJ Press (Santa Fe, NM), April 2006. http://www.lulu.com/product/paperback/ver-10-workshop-proceedings/546459
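Steps 8 through 10 of the checklist lend themselves to an automated first pass. The sketch below is one possible approach, not the method from Palmer's flowchart: the `LAYOUT` dictionary, its field names and its ranges are all hypothetical stand-ins for what an agency's record layout and documentation would actually specify.

```python
# Hypothetical record layout: field name -> (type, min, max).
# A real layout would come from the agency's documentation.
LAYOUT = {
    "age": (int, 0, 120),
    "year": (int, 1990, 2012),
    "county": (str, None, None),
}

def check_record(record, layout=LAYOUT):
    """Return a list of problems found in one record (a dict)."""
    problems = []
    # Step 8: does the record conform to the layout?
    for field in layout:
        if field not in record:
            problems.append(f"{field}: field absent")
    for field in record:
        if field not in layout:
            problems.append(f"{field}: not in record layout")
    # Steps 9 and 10: missing or out-of-range values.
    for field, (kind, low, high) in layout.items():
        if field not in record:
            continue
        value = record[field]
        if value is None or value == "":
            problems.append(f"{field}: missing value")
        elif not isinstance(value, kind):
            problems.append(f"{field}: expected {kind.__name__}")
        elif low is not None and not (low <= value <= high):
            problems.append(f"{field}: {value} outside {low}-{high}")
    return problems

# Made-up rows for illustration.
rows = [
    {"age": 34, "year": 2005, "county": "King"},
    {"age": 212, "year": 2005, "county": ""},
    {"age": 51, "county": "Pierce", "notes": "x"},
]
for number, row in enumerate(rows, start=1):
    print(number, check_record(row))
```

A check like this only flags candidates for follow-up; whether an out-of-range value is an error or a real outlier is still a reporting question.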

Page 11: Data validation in the Digital Age

Building genealogy for target DB


• Changes in definitions?
– By administrators?
– Formal or informal?
– By statute?
• Changes in collection methods, data entry, vetting, updating, file type/format?
• Changes in users and usage
• Data cleaning

Page 12: Data validation in the Digital Age

Data Quality checkpoints

• Constancy of definitions and coding categories?
– All at same time and location?
• Completeness: How many records have unfilled cells? Are the tendencies of “nulls” consistent in all records, variable types?
• Precision: Are the numbers rounded or?
– Hope for fine-grained, not summaries or aggregates
– Can be especially important with temporal and geographic data, i.e. What is the range(s) of the time scales?
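The completeness checkpoint can be approximated by counting empty cells per field. This is a minimal sketch with made-up records; the field names are hypothetical, and what counts as "empty" (here `None` or a blank string) is an assumption that should be checked against how the source actually codes missing values.

```python
def null_counts(records, fields):
    """Count empty cells per field across a list of dict records."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                counts[field] += 1
    return counts

# Hypothetical records for illustration.
records = [
    {"name": "A. Smith", "dob": "1970-01-02", "county": "Santa Fe"},
    {"name": "B. Jones", "dob": "", "county": "Taos"},
    {"name": "C. Lee", "dob": None, "county": " "},
]
counts = null_counts(records, ["name", "dob", "county"])
for field, empty in counts.items():
    share = empty / len(records)
    print(f"{field}: {empty} empty ({share:.0%})")
```

Comparing these shares across fields, or across years and counties, is one way to see whether nulls are spread evenly or concentrated in particular parts of the data.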

Page 13: Data validation in the Digital Age

Cheryl on Quant methods for measuring data quality


Page 15: Data validation in the Digital Age

Newsroom methods for measuring data quality

• Test frequencies on key fields
Seattle bicycle-accident data included a time field, but almost every accident was recorded as occurring at noon.

Caveat: Don’t over-reach with your conclusions or analysis
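A frequency test of this kind takes only a few lines. The sketch below uses invented time values, not the actual Seattle data, and the 50 percent threshold in `dominant_value` is an arbitrary cutoff chosen for illustration.

```python
from collections import Counter

def dominant_value(values, threshold=0.5):
    """Return (value, share) if one value exceeds the threshold share,
    otherwise None. Expects a non-empty list."""
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    share = count / len(values)
    return (value, share) if share > threshold else None

# Hypothetical time field, not the actual Seattle data.
times = ["12:00", "12:00", "08:15", "12:00", "12:00", "17:40", "12:00", "12:00"]
print(Counter(times).most_common(3))
print(dominant_value(times))
```

A single value that dominates a field is often a default entered when the real value was unknown, which is worth confirming with whoever maintains the data before drawing any conclusion from that field.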

Page 16: Data validation in the Digital Age

Don’t over-reach with your analysis

– Rates are good – IF you have the data to calculate them.
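As a small illustration of the point about rates, the sketch below computes a rate only when a usable denominator exists. The areas, counts and populations are invented, and `rate_per` is a hypothetical helper name.

```python
def rate_per(count, population, per=10_000):
    """Return events per `per` people, or None when no usable
    denominator exists."""
    if population is None or population <= 0:
        return None
    return count / population * per

# Hypothetical counts and populations for illustration.
areas = {
    "Area A": (30, 60_000),
    "Area B": (12, 8_000),
    "Area C": (5, None),   # population unknown: no rate possible
}
for name, (count, population) in areas.items():
    print(name, count, rate_per(count, population))
```

In this made-up example the area with the larger raw count has the lower rate, and the area with no population figure gets no rate at all rather than a guess.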

Page 17: Data validation in the Digital Age

Outliers are important

Explore the reasons behind anomalies or unexpected trends in the data.

From the state of WA: After going back and forth with our analyst on this, we decided it would be easiest for her to just pull the data. You would have been able to get most of the way there through that fiscal.wa.gov site, but there was some stimulus money you wouldn’t have captured and we included the changes so far to the current biennium (based on the supplemental the legislature approved in December).

Page 18: Data validation in the Digital Age

Other Key Data Checks

– When you update the data, make sure nothing has changed. Check definitions for expansion or reduction and talk to the creator of the data.

– Be ready to nix a story.
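The update check above can be given a mechanical first pass by comparing the old and new versions of a data set. This is a minimal sketch with invented rows; `compare_versions` and the `type` field are hypothetical, and a changed code list only signals that the definitions need to be discussed with the data's creator.

```python
def compare_versions(old_rows, new_rows, category_field):
    """Report differences in fields and category codes between two
    versions of a data set (lists of dicts)."""
    def fields(rows):
        return set().union(*(row.keys() for row in rows))

    def codes(rows):
        return {row[category_field] for row in rows
                if row.get(category_field) is not None}

    return {
        "fields_added": sorted(fields(new_rows) - fields(old_rows)),
        "fields_dropped": sorted(fields(old_rows) - fields(new_rows)),
        "codes_added": sorted(codes(new_rows) - codes(old_rows)),
        "codes_dropped": sorted(codes(old_rows) - codes(new_rows)),
    }

# Hypothetical old and new versions for illustration.
old = [{"id": 1, "type": "assault"}, {"id": 2, "type": "theft"}]
new = [
    {"id": 1, "type": "assault", "source": "web"},
    {"id": 2, "type": "larceny", "source": "web"},
]
report = compare_versions(old, new, "type")
for key, value in report.items():
    print(key, value)
```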

Page 19: Data validation in the Digital Age

Other Key Data Checks

– Do the math: run sums, percent change, other calculations. Test that math against the results in the database – do they match?

– Look for unexpected nulls
– Run a group by query and sort alphabetically by major fields to test for misspellings or other categorization errors.

– If your data should include every city, or every county in the state, does it? Are you missing data?
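Several of the checks on this slide can be sketched together. The rows, the reported total and the list of expected counties below are all invented for illustration; in practice the reported total would come from the agency and the expected list from an authoritative source.

```python
from collections import Counter

# Hypothetical rows and reported total, for illustration only.
rows = [
    {"county": "King", "amount": 120},
    {"county": "Pierce", "amount": 80},
    {"county": "Peirce", "amount": 15},   # likely misspelling
    {"county": "Spokane", "amount": 60},
]
reported_total = 290
expected_counties = {"King", "Pierce", "Spokane", "Yakima"}

# Do the math: does our sum match the total the agency reports?
computed_total = sum(row["amount"] for row in rows)
print("Totals match:", computed_total == reported_total)

# Group by a major field and sort alphabetically so near-duplicate
# spellings land next to each other.
grouped = sorted(Counter(row["county"] for row in rows).items())
for county, count in grouped:
    print(county, count)

# Coverage: which expected counties have no rows at all, and which
# values in the data are not on the expected list?
present = {row["county"] for row in rows}
missing = sorted(expected_counties - present)
unexpected = sorted(present - expected_counties)
print("Missing:", missing)
print("Not in expected list:", unexpected)
```

The alphabetical sort puts "Peirce" directly beside "Pierce", which is the kind of categorization error the group-by check is meant to surface.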

Page 20: Data validation in the Digital Age

Other Key Data Checks

– Check with experts and have them test your analysis. Research the methodology used with the kind of data you are working with.

– There is version control for Web frameworks – use some kind of version control for your database, even if it’s in an Excel spreadsheet. Any time you change it, log what you did and when and why.
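The "log what you did and when and why" habit can be as simple as appending rows to a CSV file kept beside the data. This is a minimal sketch of one way to do it, not a substitute for a real version control system; `log_change`, the column names and the sample entries are hypothetical.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

def log_change(log_path, what, why, who="unknown"):
    """Append one timestamped entry to a CSV change log."""
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["when", "who", "what", "why"])
        when = datetime.now().isoformat(timespec="seconds")
        writer.writerow([when, who, what, why])

# Example entries, written to a temporary folder for demonstration.
log_file = Path(tempfile.mkdtemp()) / "change_log.csv"
log_change(log_file, "Merged Peirce into Pierce", "Misspelling found in group-by check")
log_change(log_file, "Dropped 3 duplicate rows", "Same ID and date", who="editor")
print(log_file.read_text(encoding="utf-8"))
```

In real use the log would live in the project folder alongside the spreadsheet or data base rather than in a temporary directory.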

Page 21: Data validation in the Digital Age

Other Key Data Checks

– Test the data against source documents.

Page 22: Data validation in the Digital Age

Other Key Data Checks

• How we did it

Page 23: Data validation in the Digital Age

Building genealogy for target DB

•Pre-plan
– 2nd monitor
– “Logbook” apps

•Lit. review/ interview peers

•Do data fit theoretical models?

•Do a “critical biography” of the data

•Does biography raise critical warnings?

•Acquire latest data and related docs

•Do tables conform to record layout?

•Do docs specify expected ranges & frequencies?

•Are data values missing or out of range?

•Have others run analysis of this data?

•Review major checklist

Analysis

NOW you are ready to write a story based on a data base!!

Page 24: Data validation in the Digital Age

Summing Up

• Databases are constantly dynamic, “living” things. Look for and measure their energy and change.

• Beware of rounding error
– Always try to get the most fine-grained data possible in its ORIGINAL data form or application, i.e. avoid PDFs with SUMMARY data
• Beware of changing definitions
• Beware of changing data collectors, data entry personnel, changing norms of editing and usage.
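The rounding warning is easy to demonstrate. In this small sketch the values are invented: each one rounds down on its own, so adding the rounded figures from a summary table gives a different total than adding the fine-grained originals and rounding once.

```python
# Hypothetical fine-grained values, for illustration only.
values = [2.4, 2.4, 2.4, 2.4, 2.4]

sum_of_rounded = sum(round(v) for v in values)   # round each, then add
rounded_sum = round(sum(values))                 # add, then round once

print("Sum of rounded values:", sum_of_rounded)
print("Rounded sum of original values:", rounded_sum)
```

The gap between the two totals grows with the number of rows, which is one reason to ask for the data in its original form rather than a summary.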

Page 25: Data validation in the Digital Age

“OK, but where did that data come from?”

Data validation in the Digital Age

Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com

Cheryl Phillips
Data Enterprise Editor
Seattle Times
Seattle, Washington USA
cphillips@seattletimes.com

Many Thanks

This PowerPoint deck and Tipsheets posted at:

http://sdrv.ms/wNtiM7