
Post on 25-Jan-2015










validating external data sets

what social scholars and data journalists can learn from each other

    Hille van der Kaa @Hillevanderkaa  

missing data, no value stored “I need to solve this”


missing data, no value stored “I need to write a story about this”

“Trustworthiness and data management are vital to the success of

qualitative studies … There is a lack of scientific literature regarding the

structures and processes for managing large qualitative data sets.”

(White, Oelke, Friesen, 2012)


“A simple answer to objective reporting is the kind of reporting that uses relevant and reliable sources which is not bias or

slanted to a certain party.”

Ibrahim, Pawanteh, Kee (2011)

can I trust and use this dataset?

check the data source

what are his/her/its intentions?

what is the citation index of the data owner?

do other journalists cite the data owner?


benefit do I really need this?

do I really need to use it?


check data gathering? is this correct?

clarification of the data? do I understand?


missing data what is wrong? I need to solve

what is the story?

I need to write  

internal validation TEST!


I need more sources! (do I?) give me data check consistency

give me humans

check my story  


check the source


check the data

check benefit

check data gathering


more data sources

data journalists

check the source


check the data

check benefit

check clarification


more human sources

scientist to journalist: “You twist everything”

“Dear datajournalist,

Please take a look at the research method yourself and act a bit more like a


journalist to scientist: “Your articles are useless”

“Dear scientist,

Try to avoid intellectual arrogance. There are

other people who are just as smart.”


journalistic data mining

The process of finding correlations or

patterns in large relational databases.

It is the process of analyzing data from different perspectives and summarizing it

into useful and reliable information.  

Gross Time Ranking versus Net Time Ranking

 ‘The net time is the measured time from starting line to finish line and the gross time is the measured time

from the starting shot until the finish line.

In photos of the starting line of marathons one can see thousands of runners who are eager to start.

However, when one stands in the last starting pen, one cannot immediately run at full speed.

A kind of human traffic jam arises when the

marathon starts. On the internet people complain about this difference in time results, because the

ranking is based on gross times.’
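The two definitions can be made concrete with a small sketch. The runner records below are invented for illustration, and the field names (`line_cross`, `finish`) are assumptions rather than the columns of the actual marathon data. Both fields are seconds measured from the starting shot.

```python
# Hypothetical runners: seconds after the starting shot at which each
# runner crossed the starting line and the finish line.
runners = [
    {"bib": 1, "line_cross": 5, "finish": 9000},     # front pen
    {"bib": 2, "line_cross": 600, "finish": 9300},   # middle pen
    {"bib": 3, "line_cross": 1200, "finish": 9800},  # last pen
]

def gross_time(runner):
    # Starting shot to finish line.
    return runner["finish"]

def net_time(runner):
    # Starting line to finish line.
    return runner["finish"] - runner["line_cross"]

def ranking(records, time_fn):
    # Bib numbers ordered from fastest to slowest under the given timing.
    return [r["bib"] for r in sorted(records, key=time_fn)]

gross_order = ranking(runners, gross_time)
net_order = ranking(runners, net_time)
print(gross_order, net_order)
```

In this made-up example the runner from the last pen finishes last on gross time but first on net time, which is the kind of difference the complaints are about.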

missing values - solve

‘We discovered that the data of 100 runners were missing. Apparently one scraped page had been added twice. We removed

the 100 duplicates.’
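A page that was scraped twice can be detected by keying on a field that should be unique per runner. The following is a minimal sketch, assuming the start number (called `bib` here) is that field; the rows and page contents are invented for illustration.

```python
def drop_duplicates(rows, key="bib"):
    """Keep the first row seen for each key value, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Hypothetical scraped pages of results.
page_1 = [{"bib": 101, "net": 12000}, {"bib": 102, "net": 12500}]
page_2 = [{"bib": 103, "net": 13000}]

scraped = page_1 + page_2 + page_2  # page_2 was accidentally added twice
cleaned = drop_duplicates(scraped)
removed = len(scraped) - len(cleaned)
print(f"removed {removed} duplicate row(s), {len(cleaned)} remain")
```

Comparing the number of removed rows with the expected page size is a quick internal check that the duplicates really came from one repeated page.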

missing values - story

‘Still, nineteen runners were missing from the Amsterdam data set.

Perhaps these are runners who were disqualified.’


‘To calculate the average position changes, caused by net ranking, we converted the difference scores to

absolute figures.

The average position change in the Amsterdam Marathon was 281.6
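The conversion to absolute figures matters because both rankings contain the same runners, so the signed position differences always sum to zero. A minimal sketch of the calculation, using invented runner labels rather than the marathon data:

```python
def positions(order):
    # Map each runner to a 1-based position in the given ranking.
    return {runner: pos for pos, runner in enumerate(order, start=1)}

def position_changes(gross_order, net_order):
    # Signed change per runner: gross position minus net position.
    gross_pos = positions(gross_order)
    net_pos = positions(net_order)
    return [gross_pos[r] - net_pos[r] for r in gross_order]

# Hypothetical rankings of four runners.
gross_order = ["A", "B", "C", "D"]
net_order = ["B", "A", "C", "D"]

changes = position_changes(gross_order, net_order)
signed_mean = sum(changes) / len(changes)
absolute_mean = sum(abs(c) for c in changes) / len(changes)
print(signed_mean, absolute_mean)
```

Here two runners swap places, so the signed mean is zero while the mean absolute change is half a position per runner.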


scientific outcome

‘We calculated the Kendall's tau rank correlation coefficient for the net and

gross ranking of the Amsterdam Marathon.

This coefficient shows that, despite the average differences between the

rankings, the net and gross time rankings are almost equal to each other.’
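For readers unfamiliar with the coefficient: Kendall's tau compares every pair of runners and counts how often the two rankings agree on who is ahead. The sketch below implements the simple variant without ties (often called tau-a) on invented ranks. It is an illustration, not the computation used in the study; in practice a library routine such as `scipy.stats.kendalltau` would be a reasonable choice.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two equal-length rankings without ties."""
    pairs = list(combinations(range(len(rank_a)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        # Positive product: both rankings order runners i and j the same way.
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)

# Hypothetical positions of five runners in each ranking.
gross_rank = [1, 2, 3, 4, 5]
net_rank = [2, 1, 3, 4, 5]  # only the first two runners swap places

tau = kendall_tau(gross_rank, net_rank)
print(round(tau, 2))
```

A value near 1 means the rankings are almost identical overall, even when individual runners shift by many positions.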

journalistic outcome

‘We spoke with Patrick Schuerman from Tilburg on the phone. Patrick had starting number 11797 in the Amsterdam Marathon of 2013

and had a gross time versus net time difference of over 21 minutes.

In his opinion, the ranking of the marathon should be based on net times since these

are the ‘real’ times people ran.’

we are both right

we can learn from each other

current topic:

a citizen view on the credibility of machine

written news

Part of PhD research

Human Component in Machine Written Narratives

    Hille van der Kaa @Hillevanderkaa  
