Data Federation and Search Courtesy: Dean Allemang Working Ontologist, LLC dallemang@workingontologist.com http://workingontologist.com/events

Dec 24, 2015


Deirdre Goodman
Page 1:

Data Federation and Search

Courtesy: Dean Allemang

Working Ontologist, LLC
dallemang@workingontologist.com

http://workingontologist.com/events

Page 2:

Problems

[Figure: separate data sources (RDB, spreadsheet, relational database, XML, two more RDBs) and email, marked with a question mark]

Page 3:

Challenges

• Syntactic challenges
– Different formats
– Character encodings
– Upper/lower case

• Structural challenges
– Grouping
– References

• Semantic challenges
– Identity (when are we talking about the same thing?)
– Mapping (zip code -> post code)
– Conversions (e.g., $ -> ₩)
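A few of these challenges can be illustrated with a small Python sketch. It is only an illustration and not part of the tutorial materials: the alias table and the function names are assumptions chosen for this example. It normalizes case, whitespace and Unicode form (syntactic), maps a field name such as "zip code" to "post code" (mapping), and compares two names after normalization (a very rough identity test).

```python
import unicodedata

# Hypothetical alias table: field names that mean the same thing in different sources.
FIELD_ALIASES = {"zip code": "post code", "zipcode": "post code", "postcode": "post code"}


def normalize_text(value):
    """Apply Unicode NFKC normalization, collapse whitespace and fold case."""
    return " ".join(unicodedata.normalize("NFKC", value).split()).casefold()


def normalize_record(record):
    """Return a copy of the record with canonical field names and trimmed values."""
    normalized = {}
    for field, value in record.items():
        key = normalize_text(field)
        normalized[FIELD_ALIASES.get(key, key)] = value.strip()
    return normalized


def same_entity(name_a, name_b):
    """Rough identity check: two names match if their normalized forms are equal."""
    return normalize_text(name_a) == normalize_text(name_b)


print(normalize_record({"Zip Code": " 150-706 "}))
print(same_entity("HYUNDAI  Capital Services, Inc.", "Hyundai Capital Services, Inc."))
```

Real identity resolution usually needs more than string normalization (for example, a shared identifier such as an LEI), which is why the exercise joins the datasets through RDF rather than ad hoc string matching.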

Page 4:

The Approach

[Figure: the same data sources (RDB, spreadsheet, relational database, XML, two more RDBs, email), with a question mark for how to bring them together]

Page 5:

Scenario

• Data about companies, their sectors, and their last sale prices is stored in different formats.

• How can we search for information across the different datasets?
– For Mizuho Financial Group, Inc.: what is its Legal Entity Identifier, which country is it located in, and what are its last sale, marketCap, and sector?
– How about Guangshen Railway Company Limited?

Page 6:

Scenario

• LEIAsia.xml: Legal Entity Identifiers (LEI) for organizations or individuals that could manage money (the example is a small set of entities registered in Asia)
– Uses country codes rather than full country names

• ISO3166.xml: maps country codes to full country names

• companylistnyse.csv and companylistnasdaq.csv: NYSE and NASDAQ company listings, including marketCap, IPOyear, Sector, Industry, Summary Quote, LastSale, and so on.

Page 7:

Solutions

• Integrate the three datasets above using RDF, then search them using SPARQL

Page 8:

Software

• All software used in this tutorial is open source, and all data sets are in the public domain. The tutorial materials include:

• xsltproc, a processor for XSLT from XMLSoft

• xml2rdf3.xsl, an XML to RDF translator in XSLT from AstroGrid

• tab2n3.py, a spreadsheet (CSV) to RDF converter from MindSWAP. This runs in Python.

• arq, an RDF/SPARQL processor based on Jena

Page 9:

Exercise Architecture

[Figure: four data sources (NASDAQ listings, NYSE listings, ISO Country Codes, Legal Entity Identifier (LEI) Asia) and three tools (XML2RDF3, TAB2N3, ARQ)]

Page 10:

Setup

• Download and unzip jist2013.allemang.org.zip
• Set up environment variables

@echo off
rem change this path to work dir
set testdir=C:\Ding@IU\Teaching\Fall2013\Z636\GuestLecture\jist2013\jist2013.allemang.org

rem do not change the code below
set JENA_HOME=%testdir%\apache-jena-2.11.0
set PATH=%PATH%;%testdir%\bin;%testdir%\apache-jena-2.11.0\bin
set PATH=%PATH%;%testdir%\apache-jena-2.11.0\lib;%testdir%\apache-jena-2.11.0\bat
@echo on

Page 11:

Setup

Run Path.bat

echo %path%

Page 12:

Converting XML to RDF

<LegalEntity>
  <LegalName>Hyundai Capital Services, Inc.</LegalName>
  <OtherNames>
    <OtherName>Hyundai Auto Finance Co., Ltd.</OtherName>
  </OtherNames>
  <RegisteredAddress>
    <AddressLineOne>10th Floor</AddressLineOne>
    <AddressLineTwo>Hyundai Capital Building</AddressLineTwo>
    <AddressLineThree>15-21, Youido-dong</AddressLineThree>
    <City>Youngdungpo-Ku</City>
    <State>Seoul</State>
    <Country>KR</Country>
    <PostCode>150-706</PostCode>
  </RegisteredAddress>
</LegalEntity>

Page 13:

Converting XML to RDF

[Figure: RDF graph of the LegalEntity record. Generated nodes S0-0, S0-0-2, S0-0-3, S0-0-3-0, S0-0-4 and S0-0-4-0 through S0-0-4-6 are connected by edges named after the XML elements (LegalName, OtherNames, OtherName, RegisteredAddress, AddressLineOne, AddressLineTwo, AddressLineThree, City, State, Country, PostCode). "value" edges attach the literals: Hyundai Capital Services, Inc.; Hyundai Auto Finance Co., Ltd.; 10th Floor; Hyundai Capital Building; 15-21, Youido-dong; Youngdungpo-Ku; Seoul; KR; 150-706.]

Page 14:

Converting XML to RDF

• Converting LEIAsia.xml to RDF

xsltproc -stringparam BaseURI "http://jist2013.org/LEIAsia" xml2rdf3.xsl leiasia.xml > LEIAsia.rdf

arq --data LEIAsia.rdf -query queries/properties.rq

arq --data LEIAsia.rdf -query queries/name.rq

arq --data LEIAsia.rdf -query queries/name1.rq

arq --data LEIAsia.rdf -query queries/addresses.rq

Page 15:

Converting XML to RDF

• Converting ISO3166.xml to RDF

xsltproc -stringparam BaseURI "http://jist2013.org/ISO3166" xml2rdf3.xsl ISO3166.xml > iso3166.rdf

arq --data iso3166.rdf -query queries/properties.rq

arq --data iso3166.rdf -query queries/iso.rq

Page 16:

Converting CSV to RDF

[Figure: node <http://jist2013.org/nasdaq#FUBC> with edges name -> "1st United Bancorp, Inc. (FL)", Market Cap -> "260938080.01", Sector -> "Finance", Industry -> "Major Banks"]

Page 17:

Same data in Turtle

@prefix : <http://jist2013.org/nasdaq#> .

:nFUBC a :Item ;
    :symbol "FUBC" ;
    :name "1st United Bancorp, Inc. (FL)" ;
    :marketCap "260938080.01" ;
    :sector "Finance" ;
    :industry "Major Banks" .

:nABMD a :Item ;
    :symbol "ABMD" ;
    :name "ABIOMED, Inc." ;
    :marketCap "976161502.05" ;
    :sector "Health Care" ;
    :industry "Medical/Dental Instruments" .

:nARAY a :Item ;
    :symbol "ARAY" ;
    :name "Accuray Incorporated" ;
    :marketCap "512861883.8" ;
    :sector "Health Care" ;
    :industry "Medical/Dental Instruments" .

:nACFN a :Item ;
    :symbol "ACFN" ;
    :name "Acorn Energy, Inc." ;
    :marketCap "86289929.7" ;
    :sector "Consumer Services" ;
    :industry "Military/Government/Technical" .
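To make the CSV-to-Turtle mapping concrete, here is a minimal Python sketch that produces statements of the same shape. It is not tab2n3.py and does not reproduce its options; the column headers in the sample CSV and the function names are assumptions made for this illustration. Each row becomes one :Item, the Symbol column (prefixed with "n") gives the subject, and every column becomes a property with a plain string literal.

```python
import csv
import io

# Sample rows taken from the slide; the header names are assumed for this sketch.
CSV_TEXT = (
    "Symbol,Name,MarketCap,Sector,Industry\n"
    'FUBC,"1st United Bancorp, Inc. (FL)",260938080.01,Finance,Major Banks\n'
    'ABMD,"ABIOMED, Inc.",976161502.05,Health Care,Medical/Dental Instruments\n'
)


def escape(value):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def property_name(header):
    """Lower-case the first letter of a column header: MarketCap -> marketCap."""
    return header[0].lower() + header[1:]


def rows_to_turtle(csv_text, namespace):
    """Convert CSV text to Turtle, one :Item per row (symbols are used unchecked)."""
    lines = [f"@prefix : <{namespace}#> ."]
    for row in csv.DictReader(io.StringIO(csv_text)):
        pairs = [f':{property_name(h)} "{escape(v)}"' for h, v in row.items()]
        lines.append(f":n{row['Symbol']} a :Item ; " + " ; ".join(pairs) + " .")
    return "\n".join(lines)


print(rows_to_turtle(CSV_TEXT, "http://jist2013.org/nasdaq"))
```

Note that all values, including the market cap, come out as strings; any numeric comparison or sum in a query would need a cast.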

Page 18:

Converting CSV to RDF

• Using tab2n3.py to convert CSV to RDF

python tab2n3.py -comma -schema -type -idfield -namespace http://jist2013.org/nasdaq <companylistnasdaq.csv >companylistnasdaq.ttl

python tab2n3.py -comma -schema -type -idfield -namespace http://jist2013.org/nyse <companylistnyse.csv >companylistnyse.ttl

arq --data companylistnasdaq.ttl -query queries/properties.rq

arq --data companylistnasdaq.ttl -query queries/company.rq

• Set up the correct path for Python (see python.bat)

Page 19:

Federated Search 1

• Looking for companies mentioned in both NYSE and LEIAsia

arq --data companylistnyse.ttl --data LEIAsia.rdf -query queries/fed2.rq

prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix lei: <http://www.leiutility.org#>
prefix iso: <http://iso.org/3166#>
prefix leia: <http://jist2013.org/LEIAsia#>
prefix nasdaq: <http://jist2013.org/nasdaq#>
prefix nyse: <http://jist2013.org/nyse#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name
WHERE {
  ?lei lei:LegalName ?lname .    # from name.rq
  ?lname rdf:value ?name .       # from name.rq
  ?stock nyse:name ?name         # from company.rq
}
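The join that this query performs can be sketched in plain Python over triples held as tuples. This is only an illustration of the matching logic, not how ARQ evaluates the query: the subject identifiers and the "Example Holdings" row are made up for the example, and the function name is hypothetical. The two sources are merged into one list (the federation step), and a name is returned when it is both the rdf:value of some lei:LegalName node and the nyse:name of some listing.

```python
# Illustrative triples; subject identifiers are invented for this sketch.
lei_triples = [
    ("leia:e1", "lei:LegalName", "leia:e1-name"),
    ("leia:e1-name", "rdf:value", "Mizuho Financial Group, Inc."),
    ("leia:e2", "lei:LegalName", "leia:e2-name"),
    ("leia:e2-name", "rdf:value", "Hyundai Capital Services, Inc."),
]
nyse_triples = [
    ("nyse:n1", "nyse:name", "Mizuho Financial Group, Inc."),
    ("nyse:n2", "nyse:name", "Example Holdings"),
]


def shared_names(lei, listing):
    """Return names that appear as an LEI legal name and as a listing name."""
    merged = lei + listing  # loading both sources yields one graph
    # ?lei lei:LegalName ?lname
    legal_name_nodes = {o for s, p, o in merged if p == "lei:LegalName"}
    # ?lname rdf:value ?name
    lei_names = {o for s, p, o in merged if p == "rdf:value" and s in legal_name_nodes}
    # ?stock nyse:name ?name
    listed_names = {o for s, p, o in merged if p == "nyse:name"}
    return sorted(lei_names & listed_names)


print(shared_names(lei_triples, nyse_triples))
```

The match is on exact string equality of the name, so differences in spelling, case or punctuation between the two sources would prevent a join.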

Page 20:

Federated Search 2

Show all companies that are listed on the NYSE or NASDAQ, showing their market cap and the name of the country they are registered in.

arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed3.rq

Show the legal forms of all companies, sorted by country (which legal forms are used in which country?)

arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed4.rq

Show the legal forms of all publicly traded companies.

arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed5.rq

Sum up the market caps of all listed companies in each country.

arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed6.rq
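The last question chains three datasets: listing name to LEI record, LEI country code to ISO country name, then a sum per country. The contents of fed6.rq are not shown in the slides, so the following Python sketch only illustrates that join-then-aggregate logic. All company names and figures are placeholders, and the dictionaries stand in for lookups that the real query would express as graph patterns.

```python
from collections import defaultdict

# Placeholder data. Market caps are strings, as in the Turtle produced from the CSV.
market_cap_by_name = {"Company A": "100.0", "Company B": "50.0", "Company C": "25.0"}
country_code_by_name = {"Company A": "JP", "Company B": "JP", "Company C": "KR", "Company D": "CN"}
country_name_by_code = {"JP": "Japan", "KR": "Korea", "CN": "China"}


def market_cap_per_country(caps, codes, names):
    """Sum market caps per country for companies present in both sources."""
    totals = defaultdict(float)
    for company, cap in caps.items():
        code = codes.get(company)
        if code is None:
            continue  # listed company with no LEI record: no country known
        # Fall back to the raw code when the ISO table has no entry for it.
        totals[names.get(code, code)] += float(cap)
    return dict(totals)


print(market_cap_per_country(market_cap_by_name, country_code_by_name, country_name_by_code))
```

Company D has an LEI record but no listing, so it contributes nothing; only companies found in both sources are counted, which mirrors an inner join.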

Page 21:

Tutorial: Answer Questions

• For Mizuho Financial Group, Inc.: what is its Legal Entity Identifier, which country is it located in, and what are its last sale, marketCap, and sector?
– arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/mizuho.rq

• How about Guangshen Railway Company Limited?

Page 22:

For more info about Dean’s Tutorial

The hands-on exercise software is Windows-only.

1. Visit http://workingontologist.com/events
2. Click the link for the hands-on exercise materials.
3. Download jist2013.allemang.org.zip
4. Unzip it to your desktop
5. Open tutorial.html, and follow the directions there.