validating external data sets: what social scholars and data journalists can learn from each other
Hille van der Kaa @Hillevanderkaa
missing data, no value stored “I need to solve this”
missing data, no value stored “I need to write a story about this”
forreporters.com/andrew-lehren/
“Trustworthiness and data management are vital to the success of
qualitative studies … There is a lack of scientific literature regarding the
structures and processes for managing large qualitative data sets.”
(White, Oelken, Friesen, 2012)
“A simple answer to objective reporting is the kind of reporting that uses relevant and reliable sources which is not bias or
slanted to a certain party.”
Ibrahim, Pawanteh, Kee (2011)
can I trust and use this dataset?
check the data source
what are his/her/its intentions?
what is the citation index of the data owner?
do other journalists cite the data owner?
benefit do I really need this?
do I really need to use it?
check data gathering? is this correct?
clarification of the data? do I understand?
missing data what is wrong? I need to solve
what is the story?
I need to write
internal validation TEST!
CALL!
I need more sources! (do I?) give me data check consistency
give me humans
check my story
scientists
check the source
(citation)
check the data
check benefit
check data gathering
TEST!
more data sources
data journalists
check the source
(citation)
check the data
check benefit
check clarification
CALL!
more human sources
scientist to journalist: “You twist everything”
“Dear datajournalist,
Please take a look at the research method yourself and act a bit more like a
scientist.”
journalist to scientist: “Your articles are useless”
“Dear scientist,
Try to avoid intellectual arrogance. There are
other people who are just as smart.”
journalistic data mining
The process of finding correlations or
patterns in large relational databases.
It is the process of analyzing data from different perspectives and summarizing it
into useful and reliable information.
Gross Time Ranking versus Net Time Ranking
‘The net time is the measured time from starting line to finish line and the gross time is the measured time
from the starting shot until the finish line.
In photo's of the starting line of marathons one can see thousands of runners who are eager to start.
However, when one stands in the last starting pen, one can not directly run at full speed.
A kind of human traffic jam arises when the
marathon starts. On the internet people complain about this difference in time results, because the
ranking is based on gross times.’
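The gap between the two rankings can be sketched in a few lines of Python. The runner names, start delays, and times below are made up for illustration and are not taken from the marathon data; the helper names are hypothetical as well.

```python
# Hypothetical runners: (name, seconds between starting shot and crossing
# the starting line, net running time in seconds). Illustrative values only.
runners = [
    ("A", 5, 10800),     # front pen: almost no delay
    ("B", 600, 10500),   # middle pen
    ("C", 1260, 10700),  # last pen: stuck in the "human traffic jam"
]

def gross_time(row):
    """Time from starting shot to finish line."""
    _, delay, net = row
    return delay + net

def net_time(row):
    """Time from starting line to finish line."""
    return row[2]

def ranking(rows, key):
    """Return runner names ordered fastest-first by the given time."""
    return [name for name, *_ in sorted(rows, key=key)]

gross_ranking = ranking(runners, gross_time)
net_ranking = ranking(runners, net_time)

print("gross:", gross_ranking)
print("net:  ", net_ranking)
```

In this toy example, runner A wins on gross time only because of the short start delay; on net time A is the slowest of the three.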
missing values - solve
‘We discovered that the data of 100 runners lacked. Apparently one scraped page was added double. We removed
the 100 duplicates.’
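A minimal sketch of this kind of clean-up, assuming each scraped row can be identified by a (start number, name) pair. That identifying key and the sample rows are assumptions for illustration; the slides do not say how the duplicates were detected.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical scrape in which one page was added twice.
scraped = [(1, "A"), (2, "B"), (1, "A"), (3, "C"), (2, "B")]
cleaned = drop_duplicates(scraped)

print(len(scraped) - len(cleaned), "duplicates removed")
```

Comparing the row count before and after is a quick internal validation: the number removed should match the size of the page that was scraped twice.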
missing values - story
‘Still, nineteen runners were missing in the Amsterdam data set.
Perchance these are runners that have been disqualified.’
Or…
‘To calculate the average position changes, caused by net ranking, we converted the difference scores to
absolute figures.
The average position change in the Amsterdam Marathon was 281.6
places.’
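Why absolute figures are needed can be shown with a small sketch. The positions below are invented; only the calculation mirrors the description above. Because both rankings order the same runners, every place gained by one runner is lost by another, so the signed differences always average to zero.

```python
from statistics import mean

# Hypothetical finishing positions for the same four runners.
gross_position = {"A": 1, "B": 2, "C": 3, "D": 4}
net_position = {"A": 3, "B": 1, "C": 2, "D": 4}

# Signed change per runner: positive means the runner drops in the net ranking.
changes = {name: net_position[name] - gross_position[name]
           for name in gross_position}

mean_signed = mean(changes.values())                     # gains and losses cancel
mean_absolute = mean(abs(c) for c in changes.values())   # average places moved

print("mean signed change:  ", mean_signed)
print("mean absolute change:", mean_absolute)
```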
scientific outcome
‘We calculated the Kendalls Tau rank correlation coefficient for the net and
gross ranking of the Amsterdam Marathon.
This coefficient shows that despite of the average differences between the
rankings, the net and gross time rankings are almost equal to each other.’
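As a rough illustration of what this coefficient measures, here is a pure-Python sketch of the simplest variant (tau-a), assuming no tied positions. The slides do not state which variant or software was used, and the two rankings below are invented.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a for two equally long rankings without ties.

    Counts pairs of runners ordered the same way in both rankings
    (concordant) against pairs ordered oppositely (discordant).
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical positions: only the first two runners swap places.
gross = [1, 2, 3, 4, 5]
net = [2, 1, 3, 4, 5]
tau = kendall_tau(gross, net)

print(tau)
```

A value close to 1 means the two rankings order almost every pair of runners the same way, which is how individual runners can move many places while the rankings as a whole stay almost equal. For real data with ties, a library routine such as `scipy.stats.kendalltau` would be the more usual choice.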
journalistic outcome
‘We spoke Patrick Schuerman from Tilburg on the phone. Patrick had starting number 11797 in the Amsterdam Marathon of 2013
and had a gross time versus net time difference of over 21 minutes.
In his opinion, the ranking of the marathon should happen after net times since these
are the ‘real’ times people ran.’
we are both right
we can learn from each other
current topic:
a citizen view on the credibility of machine
written news
http://tinyurl.com/research-uvt
Part of PhD research
Human Component in Machine Written Narratives
Hille van der Kaa @Hillevanderkaa