GEDDM: Comparisons of OGSA-DAI and GridFTP for access to and conversion of remote unstructured data in legal data mining
Karen Loughran
The Queen’s University of Belfast
Jan 19, 2016
Introduction
Grid Enabled Distributed Data Mining (GEDDM)
Industrial partner
Overview of GEDDM
GEDDM Common Semantic Model (CSM) objectives
Grid enabled solution
Industrial Partner - Datactics
Northern Ireland based (formed 1999)
Specialising in grid enabled “data-centric” matching across multiple sectors
Datactics technology is fully parallelised
Computationally intensive - need to compare every record with every other record
Improve data quality by applying fuzzy matching techniques
Data mining software being used in the real world
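The cost of comparing every record with every other record can be sketched as follows. This is a minimal illustration only: it uses Python's standard `difflib` similarity ratio as a stand-in for fuzzy matching, not the Datactics technology, and the threshold value and sample names are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n):
    """Number of comparisons needed to compare n records pairwise."""
    return n * (n - 1) // 2


def fuzzy_pairs(records, threshold=0.85):
    """Return (i, j, score) for each pair of records whose similarity
    ratio is at or above the threshold (threshold is an assumed value)."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            matches.append((i, j, score))
    return matches


# Illustrative records, including two near-duplicate variants.
records = ["Dede H Smith", "Dede H. Smith", "Annie Saunders", "Anne Saunders"]
matches = fuzzy_pairs(records)
```

Because the number of pairs grows roughly with the square of the record count, the comparisons are a natural candidate for splitting across parallel nodes.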
GEDDM Business Driver
Data sources: numerous structures, formats, locations, administrative domains…
Client: US County Court, insider trading litigation case
45Tb
Variety of formats: email, pdf, weblogs, DBMS, report text dumps…
How to interface to large volumes of data in a common, structured, parallel approach?
Common Semantic Model (CSM) Objectives
Representation of unstructured data such as email, weblog, report dumps.
Conversion to structured format.
Evaluation of Grid technologies for access and conversion.
Secure, reliable and scalable.
Exploit high bandwidth.
CSM Grid Enabled Solution
Two stages:
1. Represent and convert unstructured Flat File Formats (FFF) to structured Common Output Format File (COFF).
2. Investigate Grid technologies for the remote access and conversion of unstructured data.
CSM Representation & Conversion
Data Description Language (DDL) - XSD
Data Description File (DDF)
Parser
Sample FFF data source & DDF
App      Account       Address                  Balance
IMP      343818        Dede H Smith             8600.76
                       181 Glen Rd
                       Earls Court, London
IMP      565777        Annie Saunders           9905.50
                       60 Newhaven St
                       Edinburgh, Scotland
___________________________________________________________________
<datasource>
  <database>
    <header><headertext>App Account Address Balance</headertext></header>
    <rectype eorecord="\n">
      <pfield name="App" pos="1" length="3"/>
      <pfield name="Account" pos="10" length="6"/>
      <pfield name="Address" pos="24" length="23" multiline="yes"/>
      <pfield name="Balance" pos="49" length="8"/>
    </rectype>
  </database>
</datasource>
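How the `pos` and `length` attributes of the sample DDF might drive extraction from the flat file can be sketched as below. This is not the project's parser. It assumes that `pos` is a 1-based column, that a line with an empty App column continues the previous record, and that continuation lines of a multiline field are joined with a comma; `PField`, `parse_records` and `make_line` are hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass


@dataclass
class PField:
    name: str
    pos: int            # assumed to be a 1-based start column
    length: int
    multiline: bool = False

    def slice(self, line):
        """Cut this field's column range out of one line of the flat file."""
        start = self.pos - 1
        return line[start:start + self.length].strip()


# Field layout copied from the sample DDF.
FIELDS = [
    PField("App", 1, 3),
    PField("Account", 10, 6),
    PField("Address", 24, 23, multiline=True),
    PField("Balance", 49, 8),
]


def parse_records(lines, fields, key="App"):
    """A line with a non-empty key field starts a record; any other
    non-blank line extends the multiline fields of the current record."""
    key_field = next(f for f in fields if f.name == key)
    records = []
    for line in lines:
        if not line.strip():
            continue
        if key_field.slice(line):
            records.append({f.name: f.slice(line) for f in fields})
        elif records:
            for f in fields:
                extra = f.slice(line)
                if f.multiline and extra:
                    records[-1][f.name] += ", " + extra
    return records


def make_line(app="", account="", address="", balance=""):
    """Build a sample line with columns at positions 1, 10, 24 and 49."""
    return (app.ljust(9) + account.ljust(14) + address.ljust(25) + balance).rstrip()


SAMPLE = [
    make_line("IMP", "343818", "Dede H Smith", "8600.76"),
    make_line(address="181 Glen Rd"),
    make_line(address="Earls Court, London"),
    make_line("IMP", "565777", "Annie Saunders", "9905.50"),
    make_line(address="60 Newhaven St"),
    make_line(address="Edinburgh, Scotland"),
]
records = parse_records(SAMPLE, FIELDS)
```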
Parser Design
Object oriented component hierarchy
Each object represents an XML element
Encapsulates data relating to the flat file component it describes
Encapsulates the all-important “parse” operation
SAX parse performed on DDF to build up internal OO representation of FFF
Parse called on top level object.
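The design above, a SAX pass over the DDF that builds one object per XML element and a parse call on the top level object, might look roughly like this. It is a sketch using Python's standard `xml.sax` module and is not the project's implementation: `Component` and `DDFHandler` are hypothetical names, the DDF is a trimmed copy of the sample with attribute values quoted so that it is well-formed, and multiline handling is left out.

```python
import xml.sax

# Trimmed copy of the sample DDF (attribute values quoted).
DDF = r"""<datasource>
  <database>
    <rectype eorecord="\n">
      <pfield name="App" pos="1" length="3"/>
      <pfield name="Account" pos="10" length="6"/>
      <pfield name="Address" pos="24" length="23" multiline="yes"/>
      <pfield name="Balance" pos="49" length="8"/>
    </rectype>
  </database>
</datasource>"""


class Component:
    """One object per XML element of the DDF."""

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        self.children = []

    def parse(self, line):
        """A pfield slices its own columns; other elements delegate to children."""
        if self.tag == "pfield":
            start = int(self.attrs["pos"]) - 1   # pos assumed 1-based
            end = start + int(self.attrs["length"])
            return {self.attrs["name"]: line[start:end].strip()}
        result = {}
        for child in self.children:
            result.update(child.parse(line))
        return result


class DDFHandler(xml.sax.ContentHandler):
    """Builds the Component tree while SAX events arrive."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []

    def startElement(self, name, attrs):
        node = Component(name, dict(attrs.items()))
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()


handler = DDFHandler()
xml.sax.parseString(DDF.encode("utf-8"), handler)
root = handler.root
```

Calling `root.parse(line)` on the top level object walks down to the `pfield` objects, each of which extracts its own field from the line.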
CSM Grid technologies
Transfer & conversion tools
OGSA-DAI (Version 4)
GridFTP (GT4.0.0)
GUI interfacing to both of these technologies.
GUI interface – access & conversion
[Figure: GUI workflow diagram. Labels: Unstructured FFF Data; View Sample; Describe (DDF); Convert; Data Conversion Services; Conversion Module; Structured Data (COFF); View Results (COFF); OK? (Yes / No); Complete.]
GUI Interface to sample remote FFF, DDF creation and conversion.
Implementation under OGSA-DAI
OGSA-DAI 4.0.0
Globus Toolkit 3.2.1
New conversion activity designed & implemented
Calls out to Python scripts to perform conversion
Implementation under GridFTP
Globus Toolkit 4.0.0
Data Storage Interface (DSI) creation to perform conversion processing at server
Instead of original unstructured FFF, send the COFF file back to client
Set up striped server architecture: multiple nodes working together in parallel.
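The idea of several nodes each serving part of one file can be illustrated with a simple round-robin block layout. This is only a sketch of the general striping concept; the block size, the round-robin order and the `stripe_layout` helper are assumptions, not the GridFTP server's actual partitioning logic.

```python
def stripe_layout(file_size, n_nodes, block_size):
    """Assign consecutive fixed-size blocks of a file to nodes in
    round-robin order. Returns {node_index: [(offset, length), ...]}."""
    if n_nodes < 1 or block_size < 1:
        raise ValueError("n_nodes and block_size must be positive")
    layout = {node: [] for node in range(n_nodes)}
    offset = 0
    block = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        layout[block % n_nodes].append((offset, length))
        offset += length
        block += 1
    return layout


# Tiny illustrative numbers: a 10-byte file, 3 nodes, 4-byte blocks.
layout = stripe_layout(file_size=10, n_nodes=3, block_size=4)
```

Each node can then read and send its own byte ranges independently, which is where the parallel speed-up comes from.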
GridFTP Striped Architecture
[Figure: striped GridFTP architecture. Hosts A, B and C and Hosts X, Y and Z, each with RAID storage; sites labelled London and Belfast.]
GridFTP Machine Specifications
BELFAST: AMD4400 dual processor; 4 GB RAM; 1 TB hard disk, serial ATA2; 1 Gigabit Ethernet
LONDON: Dual Opteron processor; 4 GB RAM; 1 TB hard disk; 1 Gigabit Ethernet
GridFTP Evaluation Tests
Attempted conversion and access to large files across the network.
File sizes: 13Mb, 26Mb, 52Mb, 103Mb, 205Mb, 409Mb, 817Mb, 1634Mb
Buffer sizes: Default, 4915, 409150, 785408
MTU: 1400 - 8000
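One common way to reason about the buffer sizes tested is the bandwidth-delay product: a single TCP stream cannot exceed roughly one buffer of data per round trip. The sketch below assumes the buffer sizes are in bytes and uses a hypothetical 10 ms round-trip time; neither is stated on the slides.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return bandwidth_bits_per_s * rtt_ms // 8000


def max_throughput_bits_per_s(buffer_bytes, rtt_ms):
    """Upper bound on single-stream throughput for a given buffer size."""
    return buffer_bytes * 8 * 1000 // rtt_ms


LINK = 10**9          # 1 Gbit/s link, as in the machine specifications
RTT_MS = 10           # hypothetical round-trip time, not a measured value

needed = bdp_bytes(LINK, RTT_MS)
ceiling = max_throughput_bits_per_s(785408, RTT_MS)
```

Under these assumed values the largest tested buffer would cap a single stream below the link rate, which is one reason parallel streams and striping are of interest.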
OGSA-DAI Benchmark Results
Currently no results available:
Socket Timeout Error and Engine receives a terminate signal when Activity takes longer than approximately 10 minutes to run.
DeliverToGridFTP activity would not work in version 4. Patches required. So far, unable to get it working with these patches.
Security setup issues.
GridFTP Network Topology
[Figure: network topology. Nodes: Queens, Queens BESC Router, BBC Router, BBC NI, BBC London, Janet Bar. Link speeds shown: 1GBit (four links) and 100MBit (one link).]
Results – GridFTP transfer
Throughput hindered by:
Physical Infrastructure/Service Provider - 80 Mb/s
Router/switches/NIC
808 Mb/s CPU to CPU (London to Belfast)
688 Mb/s Disk to Disk (BBC NI)
Striping with 2 BE servers - 60% improvement
Local 100 Mb/s switch: Disc to disc - 82 Mb/s
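To put these rates in terms of the test files, the time to move the largest file at each measured throughput can be estimated as size × 8 / rate. The sketch assumes the 1634Mb file size means megabytes and that the rates are megabits per second; both are readings of the slide notation rather than stated facts.

```python
def transfer_seconds(size_megabytes, rate_megabits_per_s):
    """Idealised transfer time, ignoring protocol overhead and start-up."""
    return size_megabytes * 8 / rate_megabits_per_s


LARGEST_FILE_MB = 1634   # assumed to be megabytes

# Rates taken from the results slide (assumed megabits per second).
times = {
    "CPU to CPU (London to Belfast)": transfer_seconds(LARGEST_FILE_MB, 808),
    "Disk to Disk (BBC NI)": transfer_seconds(LARGEST_FILE_MB, 688),
    "Local 100 Mb/s switch": transfer_seconds(LARGEST_FILE_MB, 82),
}

for label, seconds in times.items():
    print(f"{label}: {seconds:.1f} s")
```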
OGSA-DAI Evaluation ….
DeliverToGridFTP not working in 4.0.0
Configuring GridFTP not possible (buffer sizes, no. of streams, striped transfer etc.)
Some way to go in efficient transfer of large files.
Installation/runtime overheads
Design/code conversion activity & design perform documents for access/conversion
Timeouts converting large files. Threads may be a solution.
Clear documentation
GridFTP Evaluation
Secure, reliable, fast and scalable
Lightweight installation
Optimum use of high bandwidth networks
Extra ERET/ESTO processing allows tighter integration of conversion operations through the definition of a DSI
Striping for much improved efficiency
GridFTP Evaluation
Extensive tuning required
No clear documentation for writing a DSI. [email protected] useful source of info
Poor performance on NFS. PVFS like filesystem recommended for striping.
1Gbit bandwidth in practice difficult to achieve due to problems with: router, NIC, physical infrastructure
Conclusions
Investigated grid technologies for remote access & conversion
OGSA-DAI disappointing due to lack of support for large file transfer
GridFTP involved extensive configuration and, due to network infrastructure problems, it was difficult to get optimum performance in remote transfer
Future work
Tighter integration of conversion services within GridFTP DSI server module.
Extend the services under GridFTP to cope with Distributed Query Processing.
COFF produced as XML, ready for XPATH queries.
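What querying an XML COFF with XPath could look like is sketched below. The COFF schema is not shown in these slides, so the `<coff>` and `<record>` element names and the layout are hypothetical; the example uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical COFF layout, populated from the sample flat file records.
COFF = """<coff>
  <record>
    <App>IMP</App>
    <Account>343818</Account>
    <Address>Dede H Smith, 181 Glen Rd, Earls Court, London</Address>
    <Balance>8600.76</Balance>
  </record>
  <record>
    <App>IMP</App>
    <Account>565777</Account>
    <Address>Annie Saunders, 60 Newhaven St, Edinburgh, Scotland</Address>
    <Balance>9905.50</Balance>
  </record>
</coff>"""

root = ET.fromstring(COFF)

# All account numbers for records whose App field is "IMP".
accounts = [e.text for e in root.findall("./record[App='IMP']/Account")]

# The address belonging to one specific account.
address = root.find("./record[Account='565777']/Address").text
```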
Questions ?