PaleoDeepDive

Hazy Research Group, led by Christopher Ré (Zifei Shan, Mikhail Sushkov, Feiran Wang, Ce Zhang)

We gratefully acknowledge the support from Toshiba, Google, DARPA DEFT and XDATA, ONR, and NSF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of Toshiba, Google, DARPA, ONR, NSF, or the US government.

Executive Summary

DeepDive enables macroscopic science by building a "dark data" extraction system.

Takeaways
1. Developers should think about features, not algorithms.
2. It is possible to abstract probabilistic inference for use by domain scientists using standard SQL and Python.
3. We can achieve comparable (or better) quality to human volunteers.
We demonstrate these three takeaways!

DeepDive is the underlying framework. DeepDive has three features!

[Panel titles: Rigorous Probabilistic Framework; Simpler Feature Engineering]

The Architecture of DeepDive

[Figure: DeepDive architecture. Labels: Unstructured text; Structured Resources; Feature Extraction; Relational Input Tables; Text spans; Freebase, Bing results…; Factor Graph (variables, factors, and connections between variables and factors); Statistical Inference & Learning; Relational Output Tables (e.g., Appear(Taxon, Formation) relation).]

Feature extraction with Python & SQL

DeepDive supports different high-level languages to specify a factor graph, e.g., Markov logic network, WinBUGS, etc. PaleoDeepDive is built with a combination of Python and SQL.

Input relations (e.g., Coref task)

    Phrase
    PID  DOC  TEXT
    P1   D1   Columbia Fm.
    P2   D1   Columbia

Rule: phrase A is coreferent to phrase B if the edit distance between A and B is smaller than 5 and they appear in the same document.

We write an SQL query to generate all phrase pairs that appear in the same document, and pair it with a Python function.

    SELECT t0.PID, t0.TEXT, t1.PID, t1.TEXT
    FROM Phrase t0, Phrase t1
    WHERE t0.DOC = t1.DOC
    USEPYTHON pyfunc

We write a Python function to process all phrase pairs and make predictions:
    def pyfunc(p1, t1, p2, t2):
        # p1, p2: phrase IDs; t1, t2: phrase texts, as selected by the SQL query
        if edit_dist(t1, t2) < 5:
            emit("Coref", p1, p2)

DeepDive will learn the weight automatically.

DeepDive supports an "E3 loop" for feature engineering.

[Figure: E3 loop. Labels: Extract; Error analysis; Extractor Application; Write/Improve extractors.]

Different Sources of Signals

DeepDive is able to integrate a diverse set of signals & feedback.

[Figure: signal sources. Labels: Domain Experts; Crowd; Unstructured Data; Structured Knowledge Base; Training labels; HTML documents; Scanned Articles; Maps, photos, images; Freebase; Macrostrat; Dictionaries; Training labels; Heuristics & Rules; Hard Constraints; More Structured Signal; More Supervised Signal.]

The more signals we use, the better quality we can expect!

Our DimmWitted system is able to run Gibbs sampling at a speed of 100 million variables/sec! (http://arxiv.org/abs/1403.7550)

Geoscience Research Group, led by Shanan Peters (Jackson Borchardt, Tim Foltz)

PaleoDeepDive: A DeepDive Application

[Panel titles: Statistical Inference using Familiar Data-Processing Languages; Example Extractor]

Try out DeepDive! http://deepdive.stanford.edu

How does climate change impact biodiversity?

380 volunteer scientists have manually read 11K journal articles since 1994!
Task: extract biodiversity-related relations from journal articles.

The DeepDive Approach

The central task of PaleoDeepDive is to extract relations from unstructured text automatically. For example, the sentence "T. Rex are found dating to the upper Cretaceous." yields ("T. Rex", "Cretaceous").

PaleoDeepDive achieves comparable (or better) quality to human volunteers, at lower cost.

[Figure: Total Diversity vs. Geological Time (M.A.); series: Human, PaleoDeepDive.]

DeepDive uses a joint probability model that enables rigorous probabilistic interpretation.

[Figure: Accuracy, Actual vs. Ideal; both axes from 0 to 1.]
[Figure: # Extractions by Output Probability; annotations: Goal; Candidates for improvement; Output to users.]

We expect that 8 of 10 extractions with probability 0.8 will be correct.
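
The expectation that 8 of 10 extractions at probability 0.8 will be correct can be checked by grouping predictions by output probability and comparing each group's mean probability with its observed accuracy. The following is a minimal sketch of that check, not DeepDive's own code: the function name `calibration_table`, the bucket count, and the sample data are assumptions made up for illustration.

```python
from collections import defaultdict


def calibration_table(predictions, num_buckets=10):
    """Group (probability, is_correct) pairs into equal-width buckets.

    Returns a list of (bucket_low, bucket_high, count, mean_probability,
    observed_accuracy) tuples, one per non-empty bucket, in ascending order.
    """
    buckets = defaultdict(list)
    for prob, is_correct in predictions:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range: {prob}")
        # Probability 1.0 goes into the last bucket rather than a new one.
        index = min(int(prob * num_buckets), num_buckets - 1)
        buckets[index].append((prob, is_correct))

    rows = []
    for index in sorted(buckets):
        items = buckets[index]
        count = len(items)
        mean_prob = sum(p for p, _ in items) / count
        accuracy = sum(1 for _, ok in items if ok) / count
        rows.append((index / num_buckets, (index + 1) / num_buckets,
                     count, mean_prob, accuracy))
    return rows


# Hypothetical data: ten extractions output at probability 0.8, eight correct.
sample = [(0.8, True)] * 8 + [(0.8, False)] * 2

for low, high, count, mean_prob, accuracy in calibration_table(sample):
    print(f"[{low:.1f}, {high:.1f}): n={count}, "
          f"mean probability={mean_prob:.2f}, accuracy={accuracy:.2f}")
```

In a well-calibrated system the observed accuracy stays close to the mean probability in every bucket. Buckets where the two diverge are one plausible place to start error analysis.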