
Extracting Linked Data from statistic spreadsheets

Tien-Duc Cao tien-duc.cao@inria.fr
Ioana Manolescu ioana.manolescu@inria.fr
Xavier Tannier xtannier@limsi.fr

Semantic Big Data workshop, Chicago, May 19th, 2017

Agenda

1. Context: data journalism and journalistic fact-checking

2. Research problem: extracting linked open data from spreadsheets

3. Approach

4. Results

5. Futurework

Tien-Duc CAO, Ioana Manolescu, Xavier Tannier, "Extracting linked data from statistic spreadsheets", 19/05/2017

1. Fact-checking is a content management problem

[Diagram: fact-checking workflow]
• Claim to be checked (text or data)
• Media content
• Media context
• Reference information sources 1, 2, …, n
• Human actors (journalists, experts, crowd workers)
• Verification tool (query, match, source search…)
• Analysis result: « True / rather true / rather false / false. See sources: http://dataref.com… »


[Diagram, continued: additional elements]
• Claim extraction
• Social network analysis
• Reconciliation, reputation
• Reference information source n+1
• Source search / source selection
• Reference source construction, refinement, integration

1. Context

• Which data source can help us fact-check a statistical claim from the media?

• E.g.: “The unemployment rate in France last year was 50%?”

• This work is part of the ContentCheck¹ project

¹ https://team.inria.fr/cedar/contentcheck/

2. Research problem: high-quality reference data

• National statistic institutes such as INSEE¹, France’s economic and societal statistics institute, are often valuable data providers

¹ https://insee.fr/

http://abonnes.lemonde.fr/les-decodeurs/portfolio/2017/04/18/les-fractures-francaises-1-5-le-logement-les-raisons-de-la-crise_5112859_4355770.html

Series shown: existing house price index; available revenue per head; rent index; consumer price index

2. The road to high quality data…

Unfortunately, most of the data published by INSEE looks like this (our text coloring):

2. The road to high quality data…

Sometimes there is more than 1 table per sheet

3. Extraction approach

Image sources:
https://www.iconfinder.com/icons/7661/excel_microsoft_word_xls_icon#size=128
https://www.w3.org/RDF/icons/rdf_w3c.svg


3. Approach: finding table boundaries


3. Approach: table extractor

• Header cells mostly contain text

• Their positions are at:
  • the top of the table (header rows)
  • the left of the table (header columns)

• Having more than 1 header row/column indicates data aggregation

• Data cells mostly contain numeric values
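The observations above can be turned into a very rough heuristic. The sketch below is only an illustration, not the extractor's actual code: it assumes a table is already isolated as a rectangular list of rows, that a row or column counts as "header" when more than half of its non-empty cells are text, and the toy table is made up.

```python
from typing import List, Optional, Sequence, Union

Value = Optional[Union[str, int, float]]


def cell_kind(value: Value) -> str:
    """Classify one cell value as 'empty', 'number' or 'text'."""
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return "empty"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    return "text"


def is_mostly(cells: Sequence[Value], kind: str, threshold: float = 0.5) -> bool:
    """True if more than `threshold` of the non-empty cells have this kind.

    The 0.5 threshold is an assumption made for this sketch.
    """
    non_empty = [c for c in cells if cell_kind(c) != "empty"]
    if not non_empty:
        return False
    hits = sum(1 for c in non_empty if cell_kind(c) == kind)
    return hits / len(non_empty) > threshold


def count_header_rows(table: List[List[Value]]) -> int:
    """Count the leading rows whose cells are mostly text."""
    count = 0
    for row in table:
        if not is_mostly(row, "text"):
            break
        count += 1
    return count


def count_header_columns(table: List[List[Value]], header_rows: int) -> int:
    """Count the leading mostly-text columns, looking below the header rows."""
    body = table[header_rows:]
    count = 0
    for column in zip(*body):  # assumes all rows have the same length
        if not is_mostly(column, "text"):
            break
        count += 1
    return count


# Made-up toy table: two header rows (an aggregation) and one header column.
toy_table = [
    [None, "Group A", "Group A", "Group B", "Group B"],
    [None, "first", "second", "first", "second"],
    ["row 1", 10.0, 11.0, 12.0, 13.0],
    ["row 2", 20.0, 21.0, 22.0, 23.0],
]
header_rows = count_header_rows(toy_table)
header_columns = count_header_columns(toy_table, header_rows)
```

Here `header_rows` is 2, which under the slide's observation would hint at an aggregation ("Group A" spanning two sub-columns), and `header_columns` is 1.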

3. Approach: table extractor

1. We distinguish header/data rows/columns using:
  • the data type of their cells (text, number, special value indicating a missing value, null for an empty cell)
  • the formatting information of their cells: cell borders, cells belonging to a merged cell
  • the types of the neighboring rows/columns

2. Based on these, we identify the exact structure of each table
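A minimal sketch of how the three signals listed in step 1 could be combined for rows. The concrete rules are assumptions chosen for illustration (the slides do not give the actual decision rules): the `MISSING_MARKERS` set is hypothetical, a merged cell is taken as a header hint, cell borders are ignored, and an ambiguous row simply inherits the label of a neighboring row.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# Hypothetical markers for "missing value"; the real markers are not given in the slides.
MISSING_MARKERS = {"nd", "n.d.", "-", "..."}


@dataclass
class Cell:
    value: Optional[Union[str, int, float]] = None
    merged: bool = False  # True if the cell belongs to a merged cell


def cell_type(cell: Cell) -> str:
    """Return 'null', 'number', 'missing' or 'text' for one cell."""
    value = cell.value
    if value is None:
        return "null"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    text = str(value).strip()
    if text == "":
        return "null"
    if text.lower() in MISSING_MARKERS:
        return "missing"
    return "text"


def first_guess(row: List[Cell]) -> str:
    """Label a row from its own cells only: 'header', 'data' or 'unknown'."""
    filled = [t for t in (cell_type(c) for c in row) if t != "null"]
    if not filled:
        return "unknown"
    if any(c.merged for c in row):  # assumed rule: merged cells hint at a header
        return "header"
    texts = filled.count("text")
    values = filled.count("number") + filled.count("missing")
    if texts > values:
        return "header"
    if values > texts:
        return "data"
    return "unknown"


def classify_rows(rows: List[List[Cell]]) -> List[str]:
    """Label every row, resolving ambiguous rows from their neighbors."""
    labels = [first_guess(row) for row in rows]
    for i, label in enumerate(labels):
        if label != "unknown":
            continue
        previous = labels[i - 1] if i > 0 else None  # already resolved
        following = next((l for l in labels[i + 1:] if l != "unknown"), None)
        labels[i] = previous or following or "data"
    return labels


# Made-up example rows.
example_rows = [
    [Cell(), Cell("Group A", merged=True), Cell("Group A", merged=True)],
    [Cell(), Cell("first"), Cell("second")],
    [Cell("row 1"), Cell(1.0), Cell("nd")],
    [Cell("row 2"), Cell(2.0), Cell()],  # one text, one number: ambiguous
]
example_labels = classify_rows(example_rows)
```

The last row is a tie between text and numeric cells, so it takes the label of the row above it; the same idea would apply to columns by transposing the table.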

3. Conceptual data model

4. Results

• Collected 16011 Excel spreadsheets, extracted 74117 tables.

• Accuracy evaluation:
  • We randomly selected 100 Excel files → 2432 tables
  • We visually identified the header cells, data cells and header hierarchy, and then compared them with those obtained from our system.

4. Sample extracted RDF
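As a rough illustration of the kind of output involved, the sketch below serializes one data cell and its header labels as N-Triples. Everything specific in it is hypothetical: the `http://example.org/insee/` namespace, the `value`, `rowHeader` and `columnHeader` properties, and the cell identifier are placeholders, not the vocabulary or data model used by the system.

```python
from typing import List

BASE = "http://example.org/insee/"  # hypothetical namespace
XSD_DOUBLE = "<http://www.w3.org/2001/XMLSchema#double>"


def literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def cell_to_ntriples(
    cell_id: str,
    value: float,
    row_headers: List[str],
    column_headers: List[str],
) -> List[str]:
    """Build N-Triples lines for one data cell.

    `cell_id` is assumed to be safe to embed in an IRI (e.g. "t1-r3-c2").
    """
    subject = f"<{BASE}cell/{cell_id}>"
    lines = [f'{subject} <{BASE}value> "{value}"^^{XSD_DOUBLE} .']
    for header in row_headers:
        lines.append(f"{subject} <{BASE}rowHeader> {literal(header)} .")
    for header in column_headers:
        lines.append(f"{subject} <{BASE}columnHeader> {literal(header)} .")
    return lines


# Made-up cell with one row header and a two-level column header.
triples = cell_to_ntriples("t1-r3-c2", 9.8, ["row 1"], ["Group A", "second"])
```

Each data cell becomes a subject carrying its numeric value plus one triple per header label, so a query can select cells by their row and column headers.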

5. Future work

[Diagram: parts of the fact-checking workflow]
• Reference information sources 1, 2, …, n
• Verification tool (query, match, source search…)
• Source search / source selection
• Reference source construction, refinement, integration

Thanks / questions?

Excel files and extracted RDF files (10.5 GB, will expire on May 29th, 2017):
https://goo.gl/4Y5Dtv

Source code (no expiration date :)):
https://gitlab.inria.fr/cedar/insee-crawler
https://gitlab.inria.fr/cedar/excel-extractor
