1 Wedtech - 24 Aug. 05 What is the Proof necessary for Truth (whatever that is) Tom Johnson Managing Director Institute for Analytic Journalism Santa Fe, New Mexico Presentation to FRIAMGroup's Applied Complexity Lecture Series Santa Fe, NM USA 24 August 2005
1Wedtech - 24 Aug. 05
What is the Proof necessary for Truth (whatever that is)
Tom Johnson
Managing Director
Institute for Analytic Journalism
Santa Fe, New Mexico
Presentation to FRIAMGroup's Applied Complexity Lecture Series
Santa Fe, NM USA 24 August 2005
2Wedtech - 24 Aug. 05
What is the IAJ
Analysis using a variety of tools and methods from multiple disciplines
Understand multiple phenomena
Communicate results to multiple audiences in a variety of ways.
3Wedtech - 24 Aug. 05
Cornerstones of IAJ
General Systems Theory
Statistics
Visual statistics/infographics
Simulation modeling
4Wedtech - 24 Aug. 05
Problem of the day
So what’s the problem of the day for analytic journalists?
5Wedtech - 24 Aug. 05
So what’s the problem?
An ever-increasing -- beyond estimate -- number of public-records databases
DBs are increasingly used for a broad spectrum of decision-making
The assumption is that data, as given, is correct. Anecdotal evidence suggests that’s not so.
6Wedtech - 24 Aug. 05
Examples of bad data
St. Louis Post-Dispatch 1997-98: 350 S. Ill. sex offenders
“…found that hundreds of convicted sex offenders don't actually live at the addresses listed on the sex offender registries for St. Louis, St. Louis County and the Metro East area.”
Every record carried a 30-50% probability of error
1999 - City of St. Louis: “About 700 Sex Offenders Do Not Appear To Live At The Addresses Listed On A St. Louis Registry.”
Boston 2000: BPD - 6 detectives assigned to cleaning up sex offenders DB
7Wedtech - 24 Aug. 05
Examples of bad data
2000 - Florida voter registration rolls
State hires DBT Online/Choicepoint to “purge rolls.”
“Some [counties] found the list too unreliable and didn't use it at all. … Counties that did their best to vet the file discovered a high level of errors, with as many as 15 percent of names incorrectly identified as felons.”
2004 - Dallas Morning News
“…The state criminal convictions database is so riddled with holes that law enforcement officials say public safety is at risk.”
“… the state has only 69 percent of the complete criminal histories records for 2002. In 2001, the state had only 60 percent. Hundreds of thousands of records are missing.”
9Wedtech - 24 Aug. 05
Surely there is a simple solution….
Is there a methodology to measure, to know -- or to anticipate -- the quality, i.e. veracity, of a given database?
What are the best -- and most objective -- ways to “X-ray” a DB to note internal problems or potential problems?
Hoping for answers from statisticians, data miners, forensic accountants, and people in bioinformatics, genomics, physics, etc., ‘cause journalists don’t have much of a clue.
11Wedtech - 24 Aug. 05
Approaches to database analysis
Theoretical/statistical
What can we know about a database -- and its potential validity -- based only on its size and whether a record’s field/cell is occupied?
Are there cheap, fast and good templates/tools to X-ray the DB?
Contextual/statistical
How would knowing the context/meaning of data -- or lack of data -- in cells change our answers to the previous questions?
Are there methodologies to help us weigh the importance of a variable relative to the veracity of a record? e.g. is “name” more important than SS#?
Both/all approaches vary with the question(s) being asked
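One cheap way to start "X-raying" a DB under the theoretical/statistical approach is to profile how often each field is occupied, without knowing what the field means. The sketch below is only an illustration: the field names, the sample records, and the rule that None or a blank string counts as an empty cell are all assumptions, not part of the talk.

```python
from typing import Optional, Sequence


def is_occupied(cell: Optional[str]) -> bool:
    # Assumed convention: None or a blank/whitespace string is an empty cell.
    return cell is not None and str(cell).strip() != ""


def field_fill_rates(records: Sequence[dict], fields: Sequence[str]) -> dict:
    """Share of records in which each field is occupied (0.0 to 1.0)."""
    total = len(records)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {
        f: sum(is_occupied(r.get(f)) for r in records) / total
        for f in fields
    }


# Made-up records, for illustration only.
records = [
    {"id": "1", "name": "A. Smith", "address": "12 Elm St", "ssn": ""},
    {"id": "2", "name": "B. Jones", "address": "", "ssn": None},
    {"id": "3", "name": "", "address": "9 Oak Ave", "ssn": "000-00-0000"},
    {"id": "4", "name": "C. Diaz", "address": "4 Pine Rd", "ssn": ""},
]
fields = ["id", "name", "address", "ssn"]

rates = field_fill_rates(records, fields)
for f in fields:
    print(f"{f:8s} {rates[f]:.0%}")
```

A profile like this says nothing about whether an occupied cell is correct; it only shows where the holes are, which is the limit of what size and occupancy alone can reveal.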
12Wedtech - 24 Aug. 05
Theoretical database structure
DB =
Metadata
Coding sheet
Fields/elements
Field tag (name)
Character limited/open field
Numeric/alpha
End-of-Record character
Number of records
14Wedtech - 24 Aug. 05
Theoretical database
Assume matrix - 100 records, 10 fields
Assume a given -- and occupied -- index field (serial record number)
Does a record's LCI (Loaded Cell Index), from 10% to 100%, constitute "proof" of anything?
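To make the LCI concrete, here is a minimal sketch that assumes the LCI is simply occupied cells divided by total cells in a record. That reading is inferred from the 10%-to-100% range on a 10-field record whose index field is always occupied; the slide does not give a formula. The toy matrix and its fill pattern are invented.

```python
from typing import Optional, Sequence


def lci(record: Sequence[Optional[str]]) -> float:
    """Loaded Cell Index (assumed definition): occupied cells / total cells."""
    if not record:
        return 0.0
    occupied = sum(
        1 for cell in record if cell is not None and str(cell).strip() != ""
    )
    return occupied / len(record)


# Toy matrix: 100 records, 10 fields. Field 0 is the serial record number
# and is always occupied. Record i gets (i % 10) extra occupied cells,
# purely to produce a spread of LCI values.
matrix = []
for i in range(100):
    extra = i % 10
    row = [str(i + 1)] + ["x"] * extra + [None] * (9 - extra)
    matrix.append(row)

scores = [lci(row) for row in matrix]
print(min(scores), max(scores))
```

Under this definition the lowest possible score is 10% (only the index field is filled) and the highest is 100%, matching the range on the slide.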
15Wedtech - 24 Aug. 05
Theoretical database
LAs (logical adjacencies) are not necessarily physically adjacent in the record layout.
As in a genome, data present -- or not present -- in one field can trigger the presence or lack of data in another.
The greater a record’s LCI, the greater the potential (probability?) that the record has enough “Proof” to achieve “True Data” status. Do we think this is true?
Probably, even when we have no idea what the data is/means. Still, “proof” seems to occupy a density-of-data continuum reaching for some critical mass. How do we measure that criticality?
When software achieves critical mass, it can never be fixed; it can only be discarded and rewritten.
Same for DBs?
How do programmers measure that critical mass?
17Wedtech - 24 Aug. 05
Assumptions???
Probably, even when we have no idea what the data is/means. Still, “proof” seems to occupy a continuum reaching for some critical mass. How do we measure that criticality?
When the focus is on an individual record, we must have context/meaning/definition for the variables/elements; otherwise the record is a nonsensical array of possibly random numbers.
Without that context, there is no opportunity for Proof of anything, much less Truth.
18Wedtech - 24 Aug. 05
Search for patterns (in 100+k records)
Are there patterns? How can I quickly identify them?
Are there consistencies?
Do populated cells suggest anything about hierarchy of importance?
Are there “Logical Adjacencies” (LAs)?
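One possible way to look for such patterns quickly is to reduce each record to an occupancy mask (which cells are filled), count the most common masks, and then estimate how often one field is filled given that another is. The sketch below assumes that reading; the field names and records are hypothetical, and a pair with high co-occupancy is only a candidate LA, not a confirmed one.

```python
from collections import Counter
from itertools import permutations


def occupancy_mask(record, fields):
    # True where the cell is filled; None or blank strings count as empty.
    return tuple(
        record.get(f) is not None and str(record.get(f)).strip() != ""
        for f in fields
    )


def pattern_counts(records, fields):
    """Count how many records share each occupancy pattern."""
    return Counter(occupancy_mask(r, fields) for r in records)


def conditional_occupancy(records, fields):
    """Estimate P(field b occupied | field a occupied) for each ordered pair."""
    masks = [dict(zip(fields, occupancy_mask(r, fields))) for r in records]
    result = {}
    for a, b in permutations(fields, 2):
        with_a = [m for m in masks if m[a]]
        if with_a:
            result[(a, b)] = sum(m[b] for m in with_a) / len(with_a)
    return result


# Made-up records, for illustration only.
records = [
    {"id": "1", "arrest_date": "2001-03-02", "charge": "theft", "court": ""},
    {"id": "2", "arrest_date": "2001-04-11", "charge": "fraud", "court": "Dist. 4"},
    {"id": "3", "arrest_date": "", "charge": "", "court": ""},
    {"id": "4", "arrest_date": "2002-01-19", "charge": "theft", "court": ""},
]
fields = ["id", "arrest_date", "charge", "court"]

patterns = pattern_counts(records, fields)
pairs = conditional_occupancy(records, fields)
for mask, n in patterns.most_common(3):
    print(n, mask)
print(pairs[("arrest_date", "charge")], pairs[("id", "court")])
```

In this toy data the charge field is filled whenever the arrest date is, and vice versa, so the two would be flagged as a candidate LA. Whether that reflects the meaning of the data or just the data-entry form is a question the counts cannot answer.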
19Wedtech - 24 Aug. 05
Demographics of a database
Logical Adjacencies
Patterns in LAs?
Is there a hierarchy of import/value of LAs?
Are there various thresholds of LAs present, i.e. is it better Proof to have four LAs than three?
Maybe, maybe not. So how do we create rules to weigh (a) a cell and (b) LAs?
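The talk leaves the weighting rules open. Purely to show the shape such rules could take, the sketch below scores a record two ways: a weighted version of the LCI, and a sum of weights for each LA whose member fields are all occupied. Every weight, field name, and LA grouping here is invented for illustration.

```python
def weighted_lci(mask: dict, weights: dict) -> float:
    """Occupied weight / total weight. Equals the plain LCI if all weights match."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w for f, w in weights.items() if mask.get(f)) / total


def la_score(mask: dict, adjacencies: dict, la_weights: dict) -> float:
    """Sum the weights of the LAs whose member fields are all occupied."""
    return sum(
        la_weights[name]
        for name, members in adjacencies.items()
        if all(mask.get(f) for f in members)
    )


# Hypothetical weights: "name" is treated as more important than "ssn".
weights = {"id": 1, "name": 3, "ssn": 2, "address": 2}
adjacencies = {"identity": ("name", "ssn"), "location": ("name", "address")}
la_weights = {"identity": 2.0, "location": 1.0}

# Occupancy mask for one record: everything filled except the SS#.
mask = {"id": True, "name": True, "ssn": False, "address": True}

cell_score = weighted_lci(mask, weights)
adjacency_score = la_score(mask, adjacencies, la_weights)
print(cell_score, adjacency_score)
```

The arithmetic is trivial; the hard part is justifying the weights, which depends on the question being asked of the database.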
20Wedtech - 24 Aug. 05
Demographics of a database
Logical Adjacencies
If a record does not meet some standard of LA-ness, do we discard it from the analysis because it lacks the potential for Proof? (Discarded outlier problem)
Do patterns of populated cells suggest anything about a hierarchy of importance, or only about the data-input process?