Grab some coffee and enjoy the pre-show banter before the top of the hour!
Jul 16, 2015
Episode 2: Back to Normal Tech Lab Webcast | September 24, 2014
Sponsored by
What Is the Tech Lab?
• Real-world proving ground for enterprise software
• Designed to showcase the process of creating solutions
• Completely independent of sponsor influence
• Run by Master Scientist, Dr. Geoffrey Malafsky
• Projects span 3-6 months
What Is Data Normalization?
• Data Normalization is a process by which disparate data sets, terms, models and ontologies can be reconciled for the purpose of providing certifiably accurate enterprise data.
Why Is Normalization Necessary?
• Disparate Data Systems
• Disparate File Structures
• Disparate Data Models
• Variable Business Logic
• Conflicting Data Values
• Serious Semantic Issues
How Hadoop Can Help
• Robust platform for data persistence
• Relatively easy to connect to enterprise apps
• Enables 'future-proofing' by avoiding lock-in
• Growing array of parallel processing functions
• New standard for data management
• No need to delete data, enabling roll-back
Questions?
Thank you!
FIND THE ARCHIVE AT InsideAnalysis.com
DATA SCIENCE AND HADOOP TO NORMALIZE CORPORATE DATA
• Normalizing data is more sophisticated than what is commonly done in integration
• It combines subject matter knowledge, governance, business rules, and raw data.
• Small Data is "corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence."
The State of Corporate Data
multiple instances of source data
multiple definitions for reporting
multiple copies of data
variable structures
different data values
hidden conflicts in data definitions
which source to use
different model types & standards
more storage, esp. when multiplied by environments
more data flows to develop and maintain
more than 100 DW or data marts downstream
different methods for ETL
complex dependencies, difficult for impact assessment
conflicting business logic & views
global analyses & aggregations restricted by inconsistencies
Copyright PSIKORS Institute 2013
Copyright PSIKORS Institute 2014
Data Normalization Showcase
• FPDS is an open source of Federal Procurement data that has poor quality and consistency.
  – Approx. 10M+ records, each with 306 columns = 25 GB raw text
  – Structured data except for some free text fields
• We are normalizing it for analysis of IT expenditures for a real client
• Queries are used by analysts supported by Hadoop environment via Data Normalization platform
Normalization Begins with Understanding Data
• Databases are supposed to have official information on formal acquisition of IT assets.
  – Contracts DB not aligned with Procurement DB
    • Example: FA330012Dxxx in one but not the other
• Differing data sets and values
  – FA330012F0005: Same in both
  – FA330012P0020: Contracts DB: 10 items; FPDS: 1 item; same description, same total dollars
  – HQ042312*: Contracts 6 = $278.4K, FPDS 1 = $48K
    • $48K is one of 6 records in Contracts
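The kinds of mismatch listed above (key present in only one source, same total but different record counts, different totals) can be sketched as a small reconciliation pass. This is only an illustration, not the Tech Lab's code: the function names are hypothetical, the contract numbers `FA330012D0001` and `HQ042312X0001` are stand-ins for the elided `FA330012Dxxx` and `HQ042312*`, and the per-record dollar amounts are made up so that the counts and totals match the slide.

```python
from collections import defaultdict


def summarize(records):
    """Group (contract_number, dollars) pairs into {key: (count, total)}."""
    summary = defaultdict(lambda: [0, 0])
    for contract_number, dollars in records:
        summary[contract_number][0] += 1
        summary[contract_number][1] += dollars
    return {key: (count, total) for key, (count, total) in summary.items()}


def reconcile(contracts, fpds):
    """Classify each contract number by how the two sources disagree."""
    a, b = summarize(contracts), summarize(fpds)
    report = {}
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b:
            report[key] = "missing in " + ("FPDS" if key in a else "Contracts DB")
        elif a[key] == b[key]:
            report[key] = "same in both"
        elif a[key][1] == b[key][1]:
            report[key] = "same total, different record counts"
        else:
            report[key] = "different totals"
    return report


# Hypothetical sample rows (whole dollars); only counts and totals follow the slide.
contracts = (
    [("FA330012F0005", 1000)]
    + [("FA330012P0020", 50)] * 10
    + [("HQ042312X0001", 48000)] + [("HQ042312X0001", 46080)] * 5
    + [("FA330012D0001", 75)]
)
fpds = [("FA330012F0005", 1000), ("FA330012P0020", 500), ("HQ042312X0001", 48000)]

report = reconcile(contracts, fpds)
for key, status in report.items():
    print(key, "->", status)
```

Comparing on (count, total) per key is a deliberately coarse check; it flags where to look, not which source is right.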
Converting supposedly same primary keys into normalized values that can be compared: contract number
• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
• If (x1='0') v_modification_number = '0' else v_modification_number = x2
  – where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
  – where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
  – where x2: if (x4=NULL) x2 = '0' else x2 = x4
  – where x4: x4 = LTRIM(x5)
  – where x5: x5 = x1
  – essentially this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD, and sets = '0' if these are NULL
• If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
  – where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
key business logic as buried in a database stored procedure (condensed)
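Pulled out of the stored procedure, the same rules fit in one small function that can be read, tested, and changed. The sketch below is a hedged transcription of the condensed logic on this slide, not the actual procedure: the function and type names are hypothetical, `None` stands in for SQL NULL, and treating an empty string after LTRIM as NULL is an assumption about the source database.

```python
from typing import NamedTuple, Optional


class NormalizedKeys(NamedTuple):
    piid: str
    modification_number: str
    idv_piid: Optional[str]


def normalize_keys(contract: str,
                   delivery_order: Optional[str],
                   aco_mod: Optional[str],
                   pco_mod: Optional[str],
                   ref_proc_instrument: Optional[str]) -> NormalizedKeys:
    """Apply the slide's key rules; None plays the role of SQL NULL."""
    # v_piid: prefer the delivery order, fall back to the contract number.
    piid = contract if delivery_order is None else delivery_order

    # x3, x1: try ACO_MOD first, then PCO_MOD, then '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod

    # x5 = x1, x4 = LTRIM(x5); an empty result is treated as NULL (assumption).
    x4 = x1.lstrip(" ")
    x2 = x4 if x4 != "" else "0"
    modification_number = "0" if x1 == "0" else x2

    # v_idv_piid: REF_PROC_INSTRUMENT without '-' when there is no delivery order.
    if delivery_order is None:
        idv_piid = (None if ref_proc_instrument is None
                    else ref_proc_instrument.replace("-", ""))
    else:
        idv_piid = contract
    return NormalizedKeys(piid, modification_number, idv_piid)


# Hypothetical inputs for illustration only.
print(normalize_keys("FA330012D0001", None, None, None, "FA3300-12-D-0001"))
print(normalize_keys("FA330012D0001", "0005", "  P00002", "P00001", None))
```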
SQL Queries via Hue: Impala
SQL Queries via Hue: Hive
Querying Impala From Data Normalization System
Simplifying Queries and Tying to Authoritative Management
Storing Term Rules in Master Codes
Note wildcard character (*) in middle as well as front and back
SELECT recordid,contractingagencyid,contractingagencyname,orgcode,orgid,modificationnumber,piid,piidagencyid,
solicitationid,effectivedate,fiscalyear,fundingagencyid,fundingagencyname,typeofcontract,consolidatedcontractdesc,
descofreq,naicscode,naicsdesc,productorservicecode,productorservicedesc,globaldunsnumber,dunsnumber,
globalvendorname,vendorname,datesigned,referencedidvpiid,referencedidvagencyid,referencedidvmodnumber,
contractingdepartmentid,contractingdepartmentname,contractingofficeid,contractingofficename,contractingofficeregion,
funcdimenddate,funcdimstartdate,function1,function1value,function2,function2value,function3,function3value,
majorcommandcode,majorcommandid,majorcommandname,parentmacomcode,primarydimensionid,primarydimensionvalueid,
secondarydimensionid,secondarydimensionvalueid,subcommand1code,subcommand1id,subcommand1name,subcommand2code,
subcommand2id,subcommand2name,subcommand3code,subcommand3id,subcommand3name,subcommand4code,subcommand4id,
subcommand4name,tertiarydimensionid,tertiarydimensionvalueid,transactionnumber,lastdatetoorder,completiondate,
estultimatecompletiondate,signeddate,fundingofficeid,fundingofficename,isfundedforeignentitycode,
isfundedforeignentitydesc,reasoninteragencycontracting,feeforuseofservice,fixed,lowervalue,maximumorderlimit,
orderingprocedure,uppervalue,websiteurl,whocanuse,feepaidforuseofidv,programacronym,typeofidc,a76actioncode,
a76actiondesc,contingencyhumanitarianpeaceop,contractfinancing,costacctstdclausecode,costacctstdclausedesc,
costorpricingdata,emailaddress,gfegfpcode,gfegfpdesc,inherentlygovernmentaldesc,inherentlygovernmentalfunction,
lettercontractundefactioncode,lettercontractundefactiondesc,majorprogram,multipleorsingleawardidv,
multiyearcontractcode,multiyearcontractdesc,nationalinterestaction,nationalinterestdesc,numberofactions,
performancebasedserviceacqcode,performancebasedserviceacqdesc,purchasecardpaymethodcode,purchasecardpaymethoddesc,
seatransportation,subcontractplan,treasuryacctsymbolagencyid,treasuryacctsymbolinitiative,treasuryacctsymbolmaincode,
treasuryacctsymbolsubcode,clingercohenactcode,clingercohenactdesc,davisbaconactcode,davisbaconactdesc,economyact,
interagencycontractingauthcode,interagencycontractingauthdesc,otherstatutoryauthdesc,servicecontractactdesc,
servicecontractactcode,walshhealeyactcode,walshhealeyactdesc,bundledreqs,claimantprogramcode,
consolidatedcontractcode,domesticorforeignentitycode,domesticorforeignentitydesc,infotechcommercialitemcategory,
recoveredmaterialssustain,recoveredmaterialssustaindesc,systemequipmentcode,useofepadesignatedproducts,
congrdistrictplaceofperf,placeofperfzipcode,princplaceofperfcityname,princplaceofperfcountrycode,
princplaceofperfcountryname,princplaceofperfcountycode,princplaceofperfcountyname,princplaceofperflocationcode,
princplaceofperfstatecode,countryprodserviceorigincode,placeofmanufacture,placeofmanufacturedesc,
alternativeadvertising,commercialitemacqperoccode,commercialitemacqperocdesc,commercialitemtestprogram,
commercialitemtestprogramdesc,evaluatedpreference,extentcompeted,fairopportunitylimitedsources,fedbizoppscode,
fedbizoppsdesc,localareasetasidecode,localareasetasidedesc,numberofoffersreceived,otherthanfullopencompetition,
preawardyosynopsis,priceevaluationpercentdiff,sbaorofppsynopsiswaiverpilot,sbirsttr,smallbuscompdemoprog,
solicitationperoc,typeofsetaside,awardoridvtype,createdvia,lastmodifiedby,lastmodifieddate,part8orpart13,
preparedby,prepareddate,reasonformodificationcode,reasonformodificationdesc,congrdistrictcontractor,contractorname,
doingbusasname,samexception,street,street2,vendorcity,vendorcountry,vendorphonenumber,vendorstate,zip,
is1862landgrantcollege,is1890landgrantcollege,is1994landgrantcollege,isairportauth,isalaskannativecorpownedfirm,
isalaskannativeservicinginst,isamericanindianowned,isasianpacificamericanowned,isblackamericanowned,
isbothcontractsandgrants,iscity,iscommdevelopedcorpownedfirm,iscommdevelopmentcorp,iscontracts,
iscorporateentitynottaxexempt,iscorporateentitytaxexempt,iscouncilofgovernments,iscountryofincorporation,iscounty,
isdomesticshelter,isdotcertdisbusent,iseducationalinst,isemergingsmallbus,isfederalagency,isfedfundedresanddevcorp,
isforprofitorg,isforeigngovernment,isforeignownedandlocated,isfoundation,isgrants,ishispanicamericanowned,
ishispanicservicinginst,isvendorhbcu,ishospital,ishousingauthpublictribal,isindiantribe,isintermunicipal,
isinternationalorg,isinterstateentity,islaborsurplusareafirm,islimitedliabilitycorp,islocalgovernmentowned,
ismanufacturerofgoods,isminorityinsts,isminorityownedbus,ismunicipality,isnativeamericanowned,
isnativehawaiianorgownedfirm,isnativehawaiianservicinginst,isnonprofitorg,isotherminorityowned,
isothernotforprofitorg,ispartnershipllp,isplanningcommission,isportauth,isprivateuniversityorcollege,
issbacert8ajointventure,issbacert8aprogparticipant,issbacerthubzonefirm,issbacertsmalldisbus,isschooldistrict,
isschoolofforestry,isselfcertifedsmalldisbus,isservicedisabledvetownedbus,issmallagriculturalcooperative,
issoleproprietorship,isstatecontrinsthigherlearn,isstateofincorporation,issubchapterscorp,
issubcontasianindianamerowned,istheabilityoneprog,istownship,istransitauth,istribalcollege,istriballyowned,
isusfederalgovernment,isusgovernmententity,isuslocalgovernment,isusstategovernment,isveteranownedbus,
isveterinarycollege,isveterinaryhospital,iswomanownedbus,istypeecondiswosb,istypejventecondiswosb,istypejventwosb,
istypewosb,contractingo{ussizeselection,reasonnotawardedtosmallbus,reasonnotawardedtosmalldisbus,idvbundledreqs,
idvcontractingagencyid,idvcontractingagencyname,idvcontractingo{ussizesel,idvdepartmentid,idvdepartmentname,
idvmajorprogcode,idvmultipleorsingleawardidv,idvnaicscode,idvnaicsdesc,idvpart8orpart13,idvprogacronym,
idvreferencedidvagencycode,idvreferencedidvpiid,idvsubcontractplan,idvsubcontractplandesc,idvtypeofcontractpricing,
idvtypeofcontractpricingdesc,idvtypeofidc,idvtypeofidcdesc,idvwhocanuse,idvwhocanusedesc,missing301,
currentcontractvalue,actionobligation,ultimatecontractvalue
FROM fpdsrawrecords.records
WHERE ( ( ( LOWER(fundingagencyid) = '97as' ) ) AND ( ( LOWER(fiscalyear) = '2013' ) ) AND ( ( LOWER(productorservicecode) LIKE '70%' OR LOWER(productorservicecode) LIKE 'd3%' ) ) )
LIMIT 1000
Complicated Queries are Often Needed
Looking for a combination of keywords with wildcards along with structured values
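One way to picture how stored term rules with `*` wildcards could become the `LIKE` clauses seen in the query: translate `*` to SQL's `%` and OR the terms together. This is a minimal sketch under assumptions, not the platform's implementation. The function names are hypothetical, and terms containing quotes or SQL pattern characters are rejected rather than escaped, because escaping rules differ between engines.

```python
FORBIDDEN = set("'\"\\%_")


def to_like_pattern(term: str) -> str:
    """Translate a '*' wildcard term into a lower-cased SQL LIKE pattern."""
    if FORBIDDEN & set(term):
        raise ValueError(f"unsupported character in term: {term!r}")
    return term.lower().replace("*", "%")


def like_predicate(column: str, terms: list) -> str:
    """OR together one LOWER(column) LIKE '...' clause per term."""
    clauses = [f"LOWER({column}) LIKE '{to_like_pattern(t)}'" for t in terms]
    return "( " + " OR ".join(clauses) + " )"


# Reproduces the product/service code filter from the query above.
print(like_predicate("productorservicecode", ["70*", "D3*"]))
# A wildcard at the front, in the middle, and at the back of a term.
print(to_like_pattern("*soft*ware*"))
```

Building SQL by string concatenation is acceptable here only because the terms come from a governed code list and are validated first; user-supplied values would call for parameterized queries.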
Query Timing
• Looking for combinations of text tokens (with wildcards) to known field values
• Queries are done both in Data Normalization platform and by command line interface on Hadoop server for Impala and Hive. Time differences are negligible but all times reported here are by CLI
  – Tables made for: text, Parquet, Parquet partitioned by 'fiscalyear' (6 values) and 'fundingagencyid' (approx. 25 values)
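Why partitioning matters for these timings can be seen with a little arithmetic: 6 fiscal years times roughly 25 funding agencies gives about 150 partitions, and a query that filters on both keys only has to read one of them. The sketch below models that with Hive-style `key=value` directory names. The year and agency values are placeholders, and the nesting order of the two keys is an assumption.

```python
from itertools import product


def partition_path(fiscalyear: str, fundingagencyid: str) -> str:
    """Hive-style partition directory for one (year, agency) combination."""
    return f"fiscalyear={fiscalyear}/fundingagencyid={fundingagencyid}"


def prune(partitions, **filters):
    """Keep only the partitions whose key values match every filter."""
    return [p for p in partitions
            if all(p[key] == value for key, value in filters.items())]


# Placeholder key values: 6 years and 25 agency ids, as sized on the slide.
years = [str(year) for year in range(2009, 2015)]
agencies = [f"agency{i:02d}" for i in range(25)]
partitions = [{"fiscalyear": y, "fundingagencyid": a}
              for y, a in product(years, agencies)]

print(len(partitions))                                  # all partitions
print(len(prune(partitions, fiscalyear="2013")))        # one year
print(len(prune(partitions, fiscalyear="2013", fundingagencyid="agency07")))
print(partition_path("2013", "97as"))
```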
[Figure: FPDS Hadoop Query Times, Text Field (secs). Bar chart, y-axis 0 to 400 secs; engines: Hive, Impala, SQLServer; formats: Text, Parquet, Parquet Partitioned.]
Evaluating query performance in Hadoop relative to format and comparing to RDBMS
[Figure: FPDS Text Queries per Limit (secs). Bar chart, y-axis 0 to 250 secs; x-axis: 100 LIMIT, 1000 LIMIT, NO LIMIT; series: Hive Text, Impala Text, Hive Parquet, Impala Parquet, Hive Parquet Part, Impala Parquet Part.]
QUERY PERFORMANCE IMPROVEMENT WITH IMPALA
Justin Erickson | Director, Product Management, Cloudera
Impala's Benefits
• Unlocks BI/analytics on Hadoop
  – Interactive SQL in seconds
  – Highly concurrent to handle 100s of users
• Native Hadoop flexibility
  – No data migration, conversion, or duplication required
  – Query existing Hadoop data
  – Run multiple frameworks on the same data at the same time
  – Supports Parquet for best-of-breed columnar performance
• Native MPP query engine designed into Hadoop:
  – Unified Hadoop storage
  – Unified Hadoop metadata (uses Hive and HCatalog)
  – Unified Hadoop security
  – Fine-grained role-based access controls with Sentry
• Apache-licensed open source
• Deployed across customers today
©2014 Cloudera, Inc. All Rights Reserved. 27
Impala Architecture
• MPP query engine built natively into Hadoop
[Diagram: SQL App (ODBC) issuing a SQL request; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase.]
Impala's Multi-User over 9.5x Faster
Multi-user hardware utilization
Performance Takeaways
• Impala's advantage expands with just 10 users to >9.5x over the nearest competitor
  – Predominantly attributable to CPU efficiency
• Does not particularly matter which DAG is run for Hive
  – Shark (with Spark) and Tez produce very similar results
  – Both incrementally faster batch processing but not comparable to MPP databases
  – Difference is Spark is already proven with broad community and vendor adoption
• Mid-term trends will further favor Impala's design approach
  – More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  – CPU efficiency will increase in importance
  – Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)
  – The Intel joint roadmap helps support these opportunities
• Upcoming benchmarks on the latest releases demonstrate this gap widening
NORMALIZING THE DATA
Capture Business Rules and Make Visible, Changeable, and Useful
Custom Multi-Use Normalization Methods Ready for Hadoop Parallel Execution
Data Normalization Library Enables Rapid Build, Deploy, Change Cycles
Special Programming for Hadoop
• Which Hadoop libraries? Intertwined so reference all.
• Otherwise: not much
  – HDFS filesystem
  – YARN containers
Parallel Jobs
• Three ways to run parallel jobs
  – Launch multiple Java sessions from command line
    • Same as in Windows, Linux
  – Use Cloudera Hue Job Designer
    • Easy and has management web pages
  – Data Normalization system
    • Coordinates governance, architecture, data models, codes, business rules
    • Define, submit YARN containers specifying Java jar, dictionaries, source files
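The first option, launching several Java sessions side by side, amounts to starting one OS process per source file and waiting for all of them. A hedged sketch of that pattern follows. The jar name, main class, and file names are hypothetical, and the demo runs a do-nothing Python process in place of `java` so the sketch works without a JVM.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_commands(jar, main_class, source_files):
    """One `java -cp <jar> <main_class> <source>` command per source file."""
    return [["java", "-cp", jar, main_class, src] for src in source_files]


def run_parallel(commands, max_workers=4):
    """Run each command as its own OS process; return exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Hypothetical jar, class, and inputs, shown only to illustrate the shape.
commands = build_commands("normalize.jar", "org.example.Normalize",
                          ["fpds_2012.txt", "fpds_2013.txt"])
print(commands)

# Stand-in commands so the sketch runs without a JVM or a real jar.
demo_commands = [[sys.executable, "-c", "pass"] for _ in range(3)]
exit_codes = run_parallel(demo_commands)
print(exit_codes)
```

Threads are enough here because each worker only waits on a child process; the actual parallel work happens in the launched sessions.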
Key Code Analysis
  – Invoice data sets extracted with correlation
    • CAGE: 984274, DUNS: 973437
  – FPDS DUNS and Names extracted & correlated
    • 158181 unique DUNS codes
  – Will be included in normalized composite IT Asset records
  – Composite records for lookup added to Hadoop
    • By DUNS or Global DUNS: get all related DUNS, CAGE, names
    • By CAGE: get all related DUNS, names
    • By name: get all related DUNS, CAGE, names
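The composite lookup described above can be pictured as one index per key type, each pointing back at the records that carry that key. The sketch below is illustrative only: the record layout, function names, and all codes and names in the sample data are hypothetical, and it returns values from records that share the key directly, without following chains of related keys.

```python
from collections import defaultdict

FIELDS = ("duns", "global_duns", "cage", "name")


def build_index(records):
    """Index every record under each of its non-empty key values."""
    index = {field: defaultdict(list) for field in FIELDS}
    for record in records:
        for field in FIELDS:
            if record.get(field):
                index[field][record[field]].append(record)
    return index


def related(index, field, value):
    """Collect all DUNS, CAGE codes and names on records sharing one key."""
    result = {"duns": set(), "cage": set(), "name": set()}
    for record in index[field].get(value, []):
        for out in result:
            if record.get(out):
                result[out].add(record[out])
    return result


# Hypothetical composite records.
records = [
    {"duns": "111111111", "global_duns": "999999999", "cage": "1AAA1", "name": "acme corp"},
    {"duns": "111111111", "global_duns": "999999999", "cage": "1AAA2", "name": "acme corporation"},
    {"duns": "222222222", "global_duns": "999999999", "cage": "2BBB1", "name": "acme services"},
]
index = build_index(records)
print(related(index, "global_duns", "999999999"))
print(related(index, "cage", "1AAA2"))
```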
Number CAGE Per DUNS Code
[Figure: Number DUNS Codes With X CAGE Codes. Log-scale y-axis (0.1 to 1,000,000); x-axis: number of CAGE codes, 1 to 119. Annotation: One DUNS code has 119 CAGE.]
[Figure: CAGE Codes from LookUp File. Bar chart, category: ToWAWF; y-axis 0 to 1.4 millions; series: Found, NotFound.]
[Figure: FPDS Number DUNS with N Global DUNS. Log-scale y-axis (0.1 to 1,000,000); x-axis: N = 0 to 5.]
[Figure: FPDS: Number DUNS with N Names. Log-scale y-axis (0.1 to 100,000); x-axis: N = 1 to 112. Annotation: 6849 instances for code = 123456787.]
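The quantity charted here, how many DUNS codes carry exactly N distinct names, is a count of counts. A minimal sketch of that calculation follows, together with a simple threshold check that would surface a heavily reused code. The function names and the threshold are hypothetical, the sample pairs are a tiny subset for illustration, and the DUNS `000000001` with its name is made up.

```python
from collections import Counter, defaultdict


def names_per_duns(pairs):
    """Count distinct (trimmed, lower-cased) names seen for each DUNS code."""
    names = defaultdict(set)
    for duns, name in pairs:
        names[duns].add(name.strip().lower())
    return {duns: len(seen) for duns, seen in names.items()}


def distribution(counts):
    """Map N to the number of DUNS codes that have exactly N names."""
    return dict(sorted(Counter(counts.values()).items()))


def flag_heavily_reused(counts, threshold):
    """DUNS codes with at least `threshold` names (threshold is a judgment call)."""
    return sorted(duns for duns, n in counts.items() if n >= threshold)


pairs = [
    ("123456787", "miscellaneous foreign contractors"),
    ("123456787", "orbit couriers sa"),
    ("123456787", "comcel"),
    ("123456787", "Comcel "),            # same name after trimming and lower-casing
    ("136666505", "carlene clark"),
    ("136666505", "bunn amy b."),
    ("000000001", "example vendor"),     # hypothetical
]
counts = names_per_duns(pairs)
print(distribution(counts))
print(flag_heavily_reused(counts, threshold=3))
```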
[Figure: FPDS: Number Global DUNS with N DUNS. Log-scale y-axis: Number Global DUNS (0.1 to 10,000); x-axis: Number DUNS (0 to 250).]
[Figure: FPDS: Global DUNS with Multiple Names. Log-scale y-axis: Number Global DUNS (0.1 to 1,000); x-axis: Number Names (0 to 1,400).]
[Figure: FPDS DUNS Code Matches to WAWF Codes. Bar chart, y-axis 0 to 180,000; categories: DUNS, GlobalDUNS; series: Found, NotFound; data labels: 140827, 13302, 17363, 942.]
FPDS DUNS With Most Names
DUNS NGlobalDUNS Nnames
123456787 0 6849
136666505 0 112
790238851 0 96
103933453 1 35
103385519 1 33
005149120 1 27
067641597 1 25
005103494 0 24
332619535 0 24
020751082 1 22
054781240 1 22
621599893 1 21
790238638 0 21
834476079 1 21

123456787 miscellaneous foreign contractors
123456787 etisalat c/o us consulate general dubai
123456787 boswedden house
123456787 turner engine controls b. v.
123456787 swissport hellas cargo s a
123456787 orbit couriers sa
123456787 goldair aviation handling s.a.
123456787 federal egov iae initiative generic duns
123456787 federal egov iae initiative - generic duns
123456787 miscellaneous foreign contractorsan
123456787 prc-desoto
123456787 inversiones sochagota e.u.
123456787 comcel
123456787 transporte y servicio lucio
123456787 jesse james members only maxi taxi svc
123456787 club naval de oficiales
123456787 inchcape shipping services
123456787 dr. thalia abatzi
123456787 central asia development group
123456787 bennett-fouch and associates
123456787 noor al-sabah company
123456787 ait/arc infrasture solutions
123456787 not available
123456787 77 construction company

136666505 adese genc petrol
136666505 amy lily chung
136666505 anderson erin ruth
136666505 andrew william knef
136666505 anduaga-arias laura
136666505 angelica m. de la cruz
136666505 anthony o'brien, 330531-5100194
136666505 batac belle
136666505 bottesini beth ms.
136666505 bouck shannon
136666505 bunn amy b.
136666505 carlene clark
136666505 cho, boong haeng
136666505 choe, sun young
136666505 christina michajlyszyn
136666505 christopher cannon
136666505 christopher l. booth
136666505 chun, kil mo
136666505 conflict + transition consultancies
136666505 cozzone elaine
136666505 deborah p. carney
136666505 denihan patricia joann
136666505 dong sook mcgeorge, 690525-2716816
136666505 dorene d.lukewalton,pharm d.
136666505 dr. terry a. klein
FPDS Global DUNS with Most Names & DUNS

Sorted by Nnames:
GlobalDUNS NDUNS Nnames
877936518 12 27299
624770475 212 21866
148095086 80 21754
027079776 2 17128
103933453 86 17075
026157235 4 15694
963737366 106 15200
134303192 19 14481
067641597 108 13998
064680213 102 13809
077652761 93 12914
002204600 15 12570
039860122 44 12382
805258373 130 11995

Sorted by NDUNS:
GlobalDUNS NDUNS Nnames
624770475 212 21866
805258373 130 11995
012003349 128 9748
877987347 127 8253
057272486 124 6935
007250079 123 9076
071767334 123 9474
158140041 117 6671
019710586 116 8163
091441089 116 7813
616924770 116 7217
067641597 108 13998
Prompted Collaboration and New Business Information
• Showing these results prompted discussions leading to:
  – There are generic DUNS heavily used but these are being removed from use via policy changes
  – System validation rules are not current with all policy
  – Additional "rules" of how to track, audit, align, merge spread by email
    • All put back into Data Normalization system and then into modified Java
• New results available over all data sets in <1 day
ADDITIONAL INFORMATION
Impala
Justin Erickson | Director, Product Management
September 2014
Impala Architecture: Query Execution
• Request arrives via ODBC/JDBC/Hue GUI/Shell
[Diagram: SQL App (ODBC) issuing a SQL request; Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase.]
Impala Architecture: Query Execution
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on impalad's local to data
[Diagram: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase.]
Impala Architecture: Query Execution
• Intermediate results are streamed between impalad's
• Query results are streamed back to client
[Diagram: query results returned to SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase.]
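The three execution slides describe a scatter-gather pattern: split the request into fragments, run each fragment where its data lives, then merge the partial results for the client. The toy model below illustrates only that pattern and is not how Impala is implemented. Node names and rows are hypothetical, threads stand in for separate daemons, and the "query" is a simple count of rows per fiscal year.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows, keyed by the node that stores them.
BLOCKS_BY_NODE = {
    "node1": [{"fiscalyear": "2013"}, {"fiscalyear": "2012"}],
    "node2": [{"fiscalyear": "2013"}, {"fiscalyear": "2013"}],
    "node3": [{"fiscalyear": "2012"}],
}


def run_fragment(rows):
    """Executor step: scan local rows and return a partial count per year."""
    return Counter(row["fiscalyear"] for row in rows)


def coordinate(blocks_by_node):
    """Coordinator step: start one fragment per node, then merge the partials."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(blocks_by_node)) as pool:
        for partial in pool.map(run_fragment, blocks_by_node.values()):
            total.update(partial)
    return dict(total)


# Roughly: SELECT fiscalyear, COUNT(*) ... GROUP BY fiscalyear
result = coordinate(BLOCKS_BY_NODE)
print(result)
```

Each fragment returns a small partial aggregate rather than raw rows, which is why only a little data has to move between nodes before the final merge.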
Try It Out!
• 100% Apache-licensed open source
• Downloads on http://impala.io/:
  – Live online
  – VM
  – Installation
• Questions/comments?
  – Community: http://impala.io/community
  – Email: impala-[email protected]