Data Federation and Search Courtesy: Dean Allemang Working Ontologist, LLC [email protected] http://workingontologist.com/events
Dec 24, 2015
Problems
[Figure: separate data sources that need to be combined: relational databases (RDB), a spreadsheet, XML, and possibly email]
Challenges
• Syntactic challenges
  – Different formats
  – Character encodings
  – Upper/lower case
• Structural challenges
  – Grouping
  – References
• Semantic challenges
  – Identity (when are we talking about the same thing?)
  – Mapping (zip code -> post code)
  – Conversions (e.g., $ -> ₩)
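To make the syntactic and semantic challenges concrete, here is a minimal Python sketch. It assumes that names are compared after Unicode normalization, whitespace collapsing, and case folding, and that field names are reconciled through a lookup table. The helper names `normalize_name` and `map_field` and the `FIELD_MAP` table are hypothetical and are not part of the tutorial materials.

```python
import unicodedata

def normalize_name(raw: str) -> str:
    """Normalize a string for comparison: NFKC form, collapsed whitespace, case-folded."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.split()).casefold()

# Hypothetical mapping between field names used by two sources (semantic mapping).
FIELD_MAP = {"zip code": "post code"}

def map_field(field: str) -> str:
    """Translate a field name to the shared vocabulary, if a mapping is known."""
    key = normalize_name(field)
    return FIELD_MAP.get(key, key)

# Two spellings of the same company name compare equal after normalization.
same = normalize_name("  MIZUHO Financial  Group, Inc. ") == normalize_name("Mizuho Financial Group, Inc.")
print(same)                   # True
print(map_field("Zip Code"))  # post code
```

Normalization of this kind only addresses the syntactic layer. Deciding that two records really describe the same entity (identity) usually needs more evidence than a matching name.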
The Approach
[Figure: the same data sources (relational databases, a spreadsheet, XML), with "?" marking the missing integration layer]
Scenario
• Data about companies, their sectors, and their last sale are stored in different formats
• How can we search for information across the different datasets?
  – For Mizuho Financial Group, Inc.: its Legal Entity Identifier, the country where it is located, its last sale, its marketCap, and its sector
  – How about Guangshen Railway Company Limited?
Scenario
• LEIAsia.xml: Legal Entity Identifiers (LEI) for organizations or individuals that could manage money (the example is a small set of entities registered in Asia)
  – Uses country codes rather than full country names
• ISO3166.xml: maps country codes to full country names
• companylistnyse.csv and companylistnasdaq.csv: company listings from NYSE and NASDAQ, including marketCap, IPOyear, Sector, Industry, Summary Quote, LastSale, and so on
Solutions
• Integrate the datasets above using RDF, then search them using SPARQL
Software
• All software used in this tutorial is open source, and all data sets are in the public domain. The tutorial materials include:
• xsltproc, an XSLT processor from XMLSoft
• xml2rdf3.xsl, an XML to RDF translator in XSLT from AstroGrid
• tab2n3.py, a spreadsheet (CSV) to RDF converter from MindSWAP. This runs in Python.
• arq, an RDF/SPARQL processor based on Jena
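Exercise Architecture
[Figure: the NASDAQ and NYSE listings (CSV) are converted with TAB2N3; the ISO Country Codes and Legal Entity Identifier (LEI) Asia data (XML) are converted with XML2RDF3; the resulting RDF is queried with ARQ]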
Setup
• Download and unzip jist2013.allemang.org.zip
• Set up environment variables
@echo off
rem change this path to work dir
set testdir=C:\Ding@IU\Teaching\Fall2013\Z636\GuestLecture\jist2013\jist2013.allemang.org
rem do not change below code
set JENA_HOME=%testdir%\apache-jena-2.11.0
set PATH=%PATH%;%testdir%\bin;%testdir%\apache-jena-2.11.0\bin
set PATH=%PATH%;%testdir%\apache-jena-2.11.0\lib;%testdir%\apache-jena-2.11.0\bat
@echo on
Setup
• Run Path.bat
• Check the result: echo %path%
Converting XML to RDF
<LegalEntity>
  <LegalName>Hyundai Capital Services, Inc.</LegalName>
  <OtherNames>
    <OtherName>Hyundai Auto Finance Co., Ltd.</OtherName>
  </OtherNames>
  <RegisteredAddress>
    <AddressLineOne>10th Floor</AddressLineOne>
    <AddressLineTwo>Hyundai Capital Building</AddressLineTwo>
    <AddressLineThree>15-21, Youido-dong</AddressLineThree>
    <City>Youngdungpo-Ku</City>
    <State>Seoul</State>
    <Country>KR</Country>
    <PostCode>150-706</PostCode>
  </RegisteredAddress>
</LegalEntity>
Converting XML to RDF
[Figure: RDF graph of the LegalEntity record. Generated nodes (S0-0, S0-0-2, S0-0-3, S0-0-3-0, S0-0-4, S0-0-4-0 through S0-0-4-6) are linked by properties named after the XML elements (LegalName, OtherNames, OtherName, RegisteredAddress, AddressLineOne, AddressLineTwo, AddressLineThree, City, State, Country, PostCode). Each leaf node carries its text through a value property, e.g. "Hyundai Capital Services, Inc.", "Hyundai Auto Finance Co., Ltd.", "10th Floor", "Hyundai Capital Building", "15-21, Youido-dong", "Youngdungpo-Ku", "Seoul", "KR", "150-706".]
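The XML-to-RDF pattern on this slide (each element becomes a node named after nothing but its position, linked by a property named after the element, with leaf text attached through a value property) can be sketched in Python. This is only an illustration under assumptions: the node numbering below is a made-up scheme and does not reproduce the identifiers that xml2rdf3.xsl generates, and the function `element_to_triples` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened version of the LegalEntity record from the slide.
XML = """<LegalEntity>
  <LegalName>Hyundai Capital Services, Inc.</LegalName>
  <RegisteredAddress>
    <City>Youngdungpo-Ku</City>
    <Country>KR</Country>
  </RegisteredAddress>
</LegalEntity>"""

def element_to_triples(elem, node_id, triples):
    """Walk the tree: each child element becomes a new node linked by its tag name;
    the text of a leaf element is attached to its node through a 'value' property."""
    for index, child in enumerate(elem):
        child_id = f"{node_id}-{index}"  # illustrative numbering, not the XSLT's scheme
        triples.append((node_id, child.tag, child_id))
        text = (child.text or "").strip()
        if len(child) == 0 and text:
            triples.append((child_id, "value", text))
        element_to_triples(child, child_id, triples)
    return triples

root = ET.fromstring(XML)
triples = element_to_triples(root, "S0", [])
for triple in triples:
    print(triple)
```

Because every element gets its own node, a query has to take two steps to reach a text value: first follow the element property, then follow value.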
Converting XML to RDF
• Converting LEIAsia.xml to RDF
xsltproc -stringparam BaseURI "http://jist2013.org/LEIAsia" xml2rdf3.xsl leiasia.xml > LEIAsia.rdf
arq --data LEIAsia.rdf -query queries/properties.rq
arq --data LEIAsia.rdf -query queries/name.rq
arq --data LEIAsia.rdf -query queries/name1.rq
arq --data LEIAsia.rdf -query queries/addresses.rq
Converting XML to RDF
• Converting ISO3166.xml to RDF
xsltproc -stringparam BaseURI "http://jist2013.org/ISO3166" xml2rdf3.xsl ISO3166.xml > iso3166.rdf
arq --data iso3166.rdf -query queries/properties.rq
arq --data iso3166.rdf -query queries/iso.rq
Converting CSV to RDF
[Figure: one CSV row as an RDF graph. The node <http://jist2013.org/nasdaq#FUBC> has the properties name ("1st United Bancorp, Inc. (FL)"), Market Cap ("260938080.01"), Sector ("Finance"), and Industry ("Major Banks").]
Same data in Turtle
@prefix : <http://jist2013.org/nasdaq#> .
:nFUBC a :Item ;
    :symbol "FUBC" ;
    :name "1st United Bancorp, Inc. (FL)" ;
    :marketCap "260938080.01" ;
    :sector "Finance" ;
    :industry "Major Banks" .
:nABMD a :Item ;
    :symbol "ABMD" ;
    :name "ABIOMED, Inc." ;
    :marketCap "976161502.05" ;
    :sector "Health Care" ;
    :industry "Medical/Dental Instruments" .
:nARAY a :Item ;
    :symbol "ARAY" ;
    :name "Accuray Incorporated" ;
    :marketCap "512861883.8" ;
    :sector "Health Care" ;
    :industry "Medical/Dental Instruments" .
:nACFN a :Item ;
    :symbol "ACFN" ;
    :name "Acorn Energy, Inc." ;
    :marketCap "86289929.7" ;
    :sector "Consumer Services" ;
    :industry "Military/Government/Technical" .
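As a rough illustration of how CSV rows can become Turtle of this shape, here is a minimal Python sketch. It is not the tab2n3.py implementation. The column headers (`Symbol`, `Name`, `MarketCap`, `Sector`, `Industry`), the helper names, and the choice of `Symbol` as the identifier column are assumptions made for the example.

```python
import csv
import io

# Two rows from the slide; the header row is an assumption.
CSV_TEXT = '''Symbol,Name,MarketCap,Sector,Industry
FUBC,"1st United Bancorp, Inc. (FL)",260938080.01,Finance,Major Banks
ABMD,"ABIOMED, Inc.",976161502.05,Health Care,Medical/Dental Instruments
'''

def camel(header):
    """Turn a column header such as 'MarketCap' into a property name such as 'marketCap'."""
    return header[:1].lower() + header[1:]

def escape(value):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def rows_to_turtle(text, namespace, id_field="Symbol"):
    """Emit one ':n<ID> a :Item' resource per CSV row, with one property per column."""
    lines = [f"@prefix : <{namespace}#> ."]
    for row in csv.DictReader(io.StringIO(text)):
        props = [f':{camel(key)} "{escape(val)}"' for key, val in row.items()]
        lines.append(f":n{row[id_field]} a :Item ; " + " ; ".join(props) + " .")
    return "\n".join(lines)

turtle = rows_to_turtle(CSV_TEXT, "http://jist2013.org/nasdaq")
print(turtle)
```

Note that every value, including the market cap, is written as a plain string literal, as in the slide. Numeric comparisons or sums later on therefore need an explicit conversion.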
Converting CSV to RDF
• Using tab2n3.py to convert CSV to RDF
python tab2n3.py -comma -schema -type -idfield -namespace http://jist2013.org/nasdaq <companylistnasdaq.csv >companylistnasdaq.ttl
python tab2n3.py -comma -schema -type -idfield -namespace http://jist2013.org/nyse <companylistnyse.csv >companylistnyse.ttl
arq --data companylistnasdaq.ttl -query queries/properties.rq
arq --data companylistnasdaq.ttl -query queries/company.rq
• Set up the correct path for Python first (see python.bat)
Federated Search 1
• Find companies that appear in both the NYSE listings and LEIAsia
arq --data companylistnyse.ttl --data LEIAsia.rdf -query queries/fed2.rq
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix lei: <http://www.leiutility.org#>
prefix iso: <http://iso.org/3166#>
prefix leia: <http://jist2013.org/LEIAsia#>
prefix nasdaq: <http://jist2013.org/nasdaq#>
prefix nyse: <http://jist2013.org/nyse#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name
WHERE {
  ?lei lei:LegalName ?lname .   # from name.rq
  ?lname rdf:value ?name .      # from name.rq
  ?stock nyse:name ?name        # from company.rq
}
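The query joins the two datasets on a shared ?name value. The following Python sketch mimics that join over plain (subject, predicate, object) tuples, to show what "federation" means here: both files are loaded into one graph, and the patterns match across them. The sample triples and node identifiers are hypothetical, and the function `federated_names` is not part of ARQ or the tutorial materials.

```python
# Hypothetical sample triples; node identifiers are made up for illustration.
lei_triples = [
    ("leia:e1", "lei:LegalName", "leia:e1-name"),
    ("leia:e1-name", "rdf:value", "Mizuho Financial Group, Inc."),
    ("leia:e2", "lei:LegalName", "leia:e2-name"),
    ("leia:e2-name", "rdf:value", "Hyundai Capital Services, Inc."),
]
nyse_triples = [
    ("nyse:item1", "nyse:name", "Mizuho Financial Group, Inc."),
    ("nyse:item2", "nyse:name", "Guangshen Railway Company Limited"),
]

def federated_names(lei, listing):
    """Return names that are both an LEI legal name and a listed company name."""
    merged = lei + listing  # like passing two --data files: one combined graph
    # ?lei lei:LegalName ?lname
    name_nodes = {o for s, p, o in merged if p == "lei:LegalName"}
    # ?lname rdf:value ?name
    lei_names = {o for s, p, o in merged if p == "rdf:value" and s in name_nodes}
    # ?stock nyse:name ?name
    listed_names = {o for s, p, o in merged if p == "nyse:name"}
    return sorted(lei_names & listed_names)

matches = federated_names(lei_triples, nyse_triples)
print(matches)  # ['Mizuho Financial Group, Inc.']
```

The join succeeds only when the name strings are exactly equal, which is why the syntactic challenges (case, encoding, spelling variants) matter for federated queries.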
Federated Search 2
• Show all companies that are listed on the NYSE or NASDAQ, showing their market cap and the name of the country they are registered in.
  arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed3.rq
• Show the legal forms of all companies, sorted by country (which legal forms are used in which country?)
  arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed4.rq
• Show the legal forms of all publicly traded companies.
  arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed5.rq
• Sum up the market caps of all listed companies in each country.
  arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/fed6.rq
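The last question combines three datasets: listings (market cap), LEI records (country code), and ISO 3166 (country name). The contents of fed6.rq are not shown in the slides, so the Python sketch below only illustrates the logic of such an aggregation. All company names and figures are hypothetical placeholders, and `market_cap_by_country` is a made-up helper.

```python
from collections import defaultdict

# Hypothetical lookup tables standing in for the three converted datasets.
country_names = {"JP": "Japan", "CN": "China"}                           # ISO 3166
lei_country = {"Company A": "JP", "Company B": "JP", "Company C": "CN"}  # LEI records
market_caps = {                                                          # listings
    "Company A": "100.5",
    "Company B": "200.0",
    "Company C": "50.25",
    "Company D": "10.0",  # listed, but has no LEI record
}

def market_cap_by_country(caps, entity_country, names):
    """Sum market caps per country, joining listings to LEI records by company name."""
    totals = defaultdict(float)
    for company, cap in caps.items():
        code = entity_country.get(company)
        if code is None:
            continue  # no LEI record, so the join produces no row for this company
        # Market caps are stored as strings in the converted data, so convert first.
        totals[names.get(code, code)] += float(cap)
    return dict(totals)

totals = market_cap_by_country(market_caps, lei_country, country_names)
print(totals)  # {'Japan': 300.5, 'China': 50.25}
```

In SPARQL the same idea would typically be expressed with GROUP BY and SUM over a numeric cast of the market cap, since the converted values are string literals.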
Tutorial: Answer Questions
• For Mizuho Financial Group, Inc.: its Legal Entity Identifier, the country where it is located, its last sale, its marketCap, and its sector
  – arq --data companylistnyse.ttl --data companylistnasdaq.ttl --data iso3166.rdf --data LEIAsia.rdf -query queries/mizuho.rq
• How about Guangshen Railway Company Limited?
For more info about Dean’s Tutorial
The hands-on exercise software is Windows-only.
1. Visit http://workingontologist.com/events
2. Click the link for the hands-on exercise materials.
3. Download jist2013.allemang.org.zip
4. Unzip it to your desktop
5. Open tutorial.html, and follow the directions there.