Visualizing Relationships: Journalistic Problems in a Digital Age
Jan 13, 2015
Summary
1. Introduction
2. The Problem we are solving
3. Involved issues
4. Problems we found
5. The Challenge
© Copyright 2014. 3Pillar | All rights reserved Strictly Confidential
WHO ARE WE?
• Mariano Blejman is a technology editor and youth editor at the Argentine newspaper Página/12, and a co-founder of Hacks/Hackers Buenos Aires. @blejmanevel
• Marcos Vanetta is a biomedical engineer. Software developer at 3PillarGlobal and hacker at Hacks/Hackers Buenos Aires. @malev
HACKS/HACKERS BUENOS AIRES
THE PROBLEM
• 1976: A dictatorship started in Argentina.
• 30,000 people were kidnapped and disappeared.
• 1985: The first trials took place in Argentina. The bad guys were judged, but the trials had to stop.
• 2003: The courts started judging the bad guys again.
• 2012: There is a large amount of judicial documents.
No one can read all of them.
INVOLVED ISSUES
• Semantic Analytics
• Ontology
• Data Mining
• Social Network Analysis
• Visualizations
Who was already dealing with documents?
DocumentCloud, Overview, OpenCalais, NLTK, GATE
FIRST APPROACH
Read all the documents
A software solution based on regular expressions, built with Ruby, Padrino and a MySQL database.
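require 'tmpdir'   # provides Dir.tmpdir
require 'docsplit'

# Extract the plain text of a document (without OCR) and clean it.
def self.extract_plain_text(path)
  # File name without its last extension, e.g. "a.b.pdf" -> "a.b"
  basename = File.basename(path).split('.')[0..-2].join('.')
  tmp_dir = Dir.tmpdir
  # Docsplit writes "<basename>.txt" into the output directory
  Docsplit.extract_text(path, :output => tmp_dir, :ocr => false)
  text = File.read(File.join(tmp_dir, "#{basename}.txt"))
  self.clean_text(text)
end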
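To illustrate what a regular-expression approach can look like once the plain text is available, here is a minimal sketch that pulls dates out of a text. It is not taken from the project: the helper name extract_dates, the constants, and the assumption that dates appear in the Spanish form "24 de marzo de 1976" are all hypothetical.

```ruby
require 'date'

# Hypothetical lookup table: Spanish month names to month numbers.
SPANISH_MONTHS = {
  'enero' => 1, 'febrero' => 2, 'marzo' => 3, 'abril' => 4,
  'mayo' => 5, 'junio' => 6, 'julio' => 7, 'agosto' => 8,
  'septiembre' => 9, 'setiembre' => 9, 'octubre' => 10,
  'noviembre' => 11, 'diciembre' => 12
}.freeze

# Assumed date format: "<day> de <month> de <year>", case-insensitive.
DATE_PATTERN =
  /\b(\d{1,2})\s+de\s+(#{SPANISH_MONTHS.keys.join('|')})\s+de\s+(\d{4})\b/i

# Return every valid calendar date found in the text, in order of appearance.
def extract_dates(text)
  dates = text.scan(DATE_PATTERN).map do |day, month, year|
    begin
      Date.new(year.to_i, SPANISH_MONTHS[month.downcase], day.to_i)
    rescue ArgumentError
      nil # impossible dates such as "31 de febrero" are skipped
    end
  end
  dates.compact
end
```

A pattern like this only catches the formats it was written for, which is one reason parsing dates shows up later as a problem of its own.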
THE PROBLEMS WE FOUND
• Converting PDF files to text
• Extracting entities from documents
• Parsing dates and addresses
• Resolving name co-references
• How to store relations
• The documents' contextual information
• Confidence in data on a crowdsourcing platform
Visualizing relationships over time
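One possible way to think about storing relations so they can be shown over time is to attach a time span, a confidence score and a source document to each one. The sketch below is only an illustration under those assumptions: the Relation record, its fields, the active_on helper and the placeholder data are hypothetical and do not describe how mapa76 or analice.me store their data.

```ruby
require 'date'

# Hypothetical record: who is related to what, during which period,
# how confident we are, and which document the claim comes from.
Relation = Struct.new(:subject, :predicate, :object,
                      :from, :to, :confidence, :source)

# Relations that hold on a given date and meet a minimum confidence.
def active_on(relations, date, min_confidence = 0.5)
  relations.select do |r|
    r.confidence >= min_confidence && r.from <= date && date <= r.to
  end
end

# Placeholder data, not real names, places or documents.
relations = [
  Relation.new('Person A', 'detained_at', 'Place X',
               Date.new(1976, 5, 1), Date.new(1976, 9, 30), 0.9, 'doc-1'),
  Relation.new('Person B', 'detained_at', 'Place X',
               Date.new(1976, 8, 1), Date.new(1977, 1, 15), 0.4, 'doc-2'),
  Relation.new('Person C', 'detained_at', 'Place X',
               Date.new(1977, 2, 1), Date.new(1977, 3, 1), 0.8, 'doc-3')
]

snapshot = active_on(relations, Date.new(1976, 8, 15))
puts snapshot.map(&:subject).inspect
```

Querying one date at a time like this gives the snapshots a timeline visualization could step through, and the confidence threshold lets crowdsourced claims be filtered before they are drawn.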
WHAT DO WE HAVE NOW?
Prototype for a single (and local) use case: mapa76
Platform for different use cases: analice.me
THE VISUALIZATIONS THAT WE IMAGINED
THE VISUALIZATIONS THAT WE FOUND
THE #MOZFEST CHALLENGE
Find a big journalistic issue that involves:
• Lots of documents with unstructured data
• Lots of data to find inside them
• What relationships do you want to find?