1. Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors
Mark Rittman, CTO, Rittman Mead
Oracle OpenWorld 2014, San Francisco
T: +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E: [email protected]
W: www.rittmanmead.com
2. About the Speaker
- Mark Rittman, Co-Founder of Rittman Mead
- Oracle ACE Director, specialising in Oracle BI & DW
- 14 years' experience with Oracle technology
- Regular columnist for Oracle Magazine
- Author of two Oracle Press Oracle BI books: Oracle Business Intelligence Developers Guide and Oracle Exalytics Revealed
- Writer for the Rittman Mead blog: http://www.rittmanmead.com/blog
- Email: [email protected]
- Twitter: @markrittman

3. About Rittman Mead
- Oracle BI and DW Gold partner
- Winner of five UKOUG Partner of the Year awards in 2013, including BI
- World-leading specialist partner for technical excellence, solutions delivery and innovation in Oracle BI
- Approximately 80 consultants worldwide, all expert in Oracle BI and DW
- Offices in the US (Atlanta), Europe, Australia and India
- Skills in a broad range of supporting Oracle tools: OBIEE, OBIA, ODI EE, Essbase, Oracle OLAP, GoldenGate, Endeca
4. Traditional Data Warehouse / BI Architectures
- Three-layer architecture: staging, foundation and access/performance
- All three layers stored in a relational database (Oracle)
- ETL used to move data from layer to layer
[Diagram: traditional structured data sources loaded into the Staging, Foundation/ODS and Performance/Dimensional layers of a relational data warehouse via ETL; a BI tool (OBIEE) with a metadata layer reads the warehouse directly, while an OLAP / in-memory tool loads data into its own database]

5. Introducing Hadoop
- A new approach to data processing and data storage
- Rather than a small number of large, powerful servers, it spreads processing over large numbers of small, cheap, redundant servers
- Spreads the data you're processing over lots of distributed nodes
- Has a scheduling/workload process (Job Tracker) that sends parts of a job to each of the nodes - a bit like Oracle Parallel Execution
- Does the processing where the data sits - a bit like Exadata storage servers
- Shared-nothing architecture
- Low-cost and highly horizontally scalable
[Diagram: a Job Tracker farming out work to Task Trackers co-located with Data Nodes across the cluster]
6. Hadoop Tenets: Simplified Distributed Processing
- Hadoop, through MapReduce, breaks processing down into simple stages
- Map: select the columns and values you're interested in, and pass them through as key/value pairs
- Reduce: aggregate the results
- Most ETL jobs can be broken down into filtering, projecting and aggregating
- Hadoop then automatically runs the job on the cluster: shared-nothing small chunks of work, run on the node where the data is, handling faults, and gathering the results back in
[Diagram: several mappers (filter, project) feeding reducers (aggregate); output is one HDFS file per reducer, in a directory]
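As a quick illustration of how filtering, projecting and aggregating map onto those two stages, consider the HiveQL sketch below (Hive itself is introduced a few slides on; the page_views table name and columns are hypothetical). The WHERE filter and column projection run in the map stage, and the GROUP BY aggregation runs in the reduce stage:

-- map stage: filter the rows and project just the columns we need
-- reduce stage: aggregate by the grouping key
SELECT country, COUNT(*) AS views
FROM page_views              -- hypothetical table, for illustration only
WHERE status = 200           -- filtering (map)
GROUP BY country;            -- aggregation (reduce)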
7. HDFS: Low-Cost, Clustered, Fault-Tolerant Storage
- The filesystem behind Hadoop, used to store data for Hadoop analysis
- Unix-like, using commands such as ls, mkdir, chown and chmod
- Fault-tolerant, with rapid fault detection and recovery
- High-throughput, with streaming data access and large block sizes
- Designed for data locality, placing data close to where it is processed
- Accessed from the command line, via the internet (hdfs://), GUI tools etc.

[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle
Found 5 items
drwx------ - oracle hadoop 0 2013-04-27 16:48 /user/oracle/.staging
drwxrwxrwx - oracle hadoop 0 2012-09-18 17:02 /user/oracle/moviedemo
drwxrwxrwx - oracle hadoop 0 2012-10-17 15:58 /user/oracle/moviework
drwxrwxrwx - oracle hadoop 0 2013-05-03 17:49 /user/oracle/my_stuff
drwxrwxrwx - oracle hadoop 0 2012-08-10 16:08 /user/oracle/stage
8. Oracle's Big Data Products
- Oracle Big Data Appliance - Engineered System for big data acquisition and processing
  - Cloudera Distribution of Hadoop
  - Cloudera Manager
  - Open-source R
  - Oracle NoSQL Database Community Edition
  - Oracle Enterprise Linux + Oracle JVM
  - New - Oracle Big Data SQL
- Oracle Big Data Connectors
  - Oracle Loader for Hadoop (Hadoop > Oracle RDBMS)
  - Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS)
  - Oracle Data Integration Adapter for Hadoop
  - Oracle R Connector for Hadoop
- Oracle NoSQL Database (column/key-store DB based on BerkeleyDB)

9. Moving Data In, Around and Out of Hadoop
- Three stages to Hadoop ETL work, with dedicated Apache / other tools
- Load: receive files in batch, or in real time (logs, events)
- Transform: process & transform data to answer questions
- Store / Export: store in structured form, or export to an RDBMS using Sqoop
[Diagram: a Loading stage (real-time logs/events, file/unstructured imports, RDBMS imports), a Processing stage, and a Store/Export stage (file exports, RDBMS exports)]
10. ETL Offloading
- Special use-case: offloading low-value, simple ETL work to a Hadoop cluster
- Receiving, aggregating, filtering and pre-processing data for an RDBMS data warehouse
- Potentially frees up high-value Exadata / RDBMS servers for analytic work

11. Core Apache Hadoop Tools
- Apache Hadoop, including MapReduce and HDFS
  - Scalable, fault-tolerant file storage for Hadoop
  - Parallel programming framework for Hadoop
- Apache Hive
  - SQL abstraction layer over HDFS
  - Perform set-based ETL within Hadoop
- Apache Pig, Spark
  - Dataflow-type languages over HDFS, Hive etc.
  - Extensible through UDFs, streaming etc.
- Apache Flume, Apache Sqoop, Apache Kafka
  - Real-time and batch loading into HDFS
  - Modular, fault-tolerant, wide source/target coverage
12. Hive as the Hadoop Data Warehouse
- MapReduce jobs are typically written in Java, but Hive can make this simpler
- Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
- The Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, and automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables
- Approach used by ODI and OBIEE to gain access to Hadoop data
- Allows Hadoop data to be accessed just like any other data source (sort of...)

13. How Hive Provides SQL Access over Hadoop
- Hive uses an RDBMS metastore to hold table and column definitions in schemas
- Hive tables then map onto HDFS-stored files
  - Managed tables
  - External tables
- Oracle-like query optimizer, compiler and executor
- JDBC and ODBC drivers, plus CLI etc.
[Diagram: the Hive driver (compile, optimize, execute) and metastore sitting over HDFS; managed tables live under /user/hive/warehouse/, with HDFS or local files loaded into the Hive HDFS area using the HiveQL CREATE TABLE command; external tables map onto directories such as /user/oracle/ or /user/movies/data/, where files are loaded into HDFS by an external process and then mapped into Hive using the CREATE EXTERNAL TABLE command]
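To make the two table types concrete, a minimal HiveQL sketch (table names, columns and paths are illustrative, not taken from the example that follows):

-- Managed table: CREATE TABLE, then load files into the Hive warehouse area on HDFS
CREATE TABLE posts_stage (
  post_id INT,
  title   STRING,
  author  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/oracle/incoming/posts.tsv' INTO TABLE posts_stage;

-- External table: metadata only, mapped over files already sitting in HDFS
CREATE EXTERNAL TABLE movie_ratings (
  movie_id INT,
  rating   INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/movies/data/';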
14. Oracle Loader for Hadoop
- Oracle technology for accessing Hadoop data, and loading it into an Oracle database
- Pushes data transformation and heavy lifting to the Hadoop cluster, using MapReduce
- Direct-path loads into Oracle Database, partitioned and non-partitioned
- Online and offline loads
- Key technology for fast load of Hadoop results into Oracle DB

15. Oracle Direct Connector for HDFS
- Enables HDFS as a data source for Oracle Database external tables
- Effectively provides Oracle SQL access over HDFS
- Supports data query, or import into Oracle DB
- Treat HDFS-stored files in the same way as regular files, but with HDFS's low cost and fault tolerance
16. Oracle R Advanced Analytics for Hadoop
- Add-in to R that extends its capability to Hadoop
- Gives R the ability to create Map and Reduce functions
- Extends R data frames to include Hive tables
- Automatically runs R functions on Hadoop by using Hive tables as the source

17. Just Released - Oracle Big Data SQL
- Part of Oracle Big Data Appliance 4.0 (BDA-only)
- Also requires Oracle Database 12c, Oracle Exadata Database Machine
- Extends the Oracle Data Dictionary to cover Hive
- Extends Oracle SQL and SmartScan to Hadoop
- Extends the Oracle security model over Hadoop
  - Fine-grained access control
  - Data redaction, data masking
[Diagram: SQL queries issued against the Exadata database server, with SmartScan running on the Exadata storage servers and, via Oracle Big Data SQL, on the Hadoop cluster]
18. Bringing it All Together: Oracle Data Integrator 12c
- ODI provides an excellent framework for running Hadoop ETL jobs
- ELT approach pushes transformations down to Hadoop - leveraging the power of the cluster
- Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation, whilst still preserving RDBMS push-down
- Extensible to cover Pig, Spark etc.
- Process orchestration
- Data quality / error handling
- Metadata and model-driven

19. The Key to ODI Extensibility - Knowledge Modules
- Divides the ETL process into separate steps - extract (load), integrate, check constraints etc.
- ODI generates native code for each platform, taking a template for each step and adding table names, column names, join conditions etc.
- Easy to extend, easy to read the code
- Makes it possible for ODI to support Spark, Pig etc. in future
- Uses the power of the target platform for integration tasks - Hadoop-native ETL

20. Part of the Wider Oracle Data Integration Platform
- Oracle Data Integrator for large-scale data integration across heterogeneous sources and targets
- Oracle GoldenGate for heterogeneous data replication and changed data capture
- Oracle Enterprise Data Quality for data profiling and cleansing
- Oracle Data Services Integrator for SOA message-based data federation
21. ODI and Big Data Integration Example
- In this example, we'll show an end-to-end ETL process on Hadoop using ODI12c & BDA
- Scenario: load webserver log data into Hadoop, process, enhance and aggregate, then load the final summary table into Oracle Database 12c
- Process using the Hadoop framework
- Leverage Big Data Connectors
- Metadata-based ETL development using ODI12c
- Real-world example

22. ETL & Data Flow through BDA System
- Five-step process to load, transform, aggregate and filter incoming log data
- Leverage ODI's capabilities where possible
- Make use of Hadoop power + scalability
[Diagram of the five-step flow: (1) Apache HTTP Server log files are shipped by Flume agents (Flume messaging over TCP port 4545, for example) to log files in HDFS, and IKM File to Hive using the RegEx SerDe loads them into the hive_raw_apache_access_log Hive table; (2) IKM Hive Control Append joins that table to the posts Hive table and loads the log_entries_and_post_detail Hive table; (3) a Sqoop extract stages the categories_sql_extract Hive table, joined in with IKM Hive Control Append; (4) IKM Hive Transform (Hive streaming through a Python script) geocodes against an IP>Country list Hive table; (5) IKM File / Hive to Oracle bulk-unloads the result to the Oracle DB]
23. ETL Considerations: Using Hive vs. Regular Oracle SQL
- Not all join types are available in Hive - joins must be equality joins
- No sequences, no primary keys on tables
- Generally need to stage Oracle or other external data into Hive before joining to it
- Hive latency - not good for small, microbatch-type work
  - But other alternatives exist - Spark, Impala etc.
- Hive is INSERT / APPEND only - no updates, deletes etc.
  - But HBase may be suitable for CRUD-type loading
- Don't assume that HiveQL == Oracle SQL - test assumptions before committing to the platform
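As a small illustration of the INSERT / APPEND-only point, a HiveQL sketch of the two loading patterns you are left with (table names are hypothetical): rather than updating rows in place, you either append new rows or rebuild the target wholesale.

-- append rows to an existing Hive table
INSERT INTO TABLE access_summary
SELECT author, COUNT(*) FROM access_log GROUP BY author;

-- ...or replace the table contents completely on each run
INSERT OVERWRITE TABLE access_summary
SELECT author, COUNT(*) FROM access_log GROUP BY author;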
24. Five-Step ETL Process
1. Take the incoming log files (via Flume) and load them into a structured Hive table
2. Enhance data from that table to include details on authors and posts from other Hive tables
3. Join to some additional reference data held in an Oracle database, to add author details
4. Geocode the log data, so that we have the country for each calling IP address
5. Output the data in summary form to an Oracle database
25. Using Flume to Transport Log Files to BDA
- Apache Flume is the standard way to transport log files from source through to target
- Initial use-case was webserver log files, but it can transport any file from A to B
- Does not do data transformation, but can send to multiple targets / target types
- Mechanisms and checks to ensure successful transport of entries
- Has a concept of agents, sinks and channels
  - Agents collect and forward log data
  - Sinks store it in the final destination
  - Channels store log data en route
- Simple configuration through INI-style files (a sketch follows below)
- Handled outside of ODI12c
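A minimal sketch of what the source-side agent configuration might look like, assuming a tail of the Apache access log forwarded over Avro to the BDA on port 4545; the agent, host and path names here are illustrative, not taken from the actual project:

# flume.conf (source webserver) - collect the access log and forward it on
source_agent.sources  = apache_log
source_agent.channels = memory_channel
source_agent.sinks    = avro_forward

source_agent.sources.apache_log.type     = exec
source_agent.sources.apache_log.command  = tail -F /var/log/httpd/access_log
source_agent.sources.apache_log.channels = memory_channel

source_agent.channels.memory_channel.type = memory

source_agent.sinks.avro_forward.type     = avro
source_agent.sinks.avro_forward.hostname = bigdatalite.localdomain
source_agent.sinks.avro_forward.port     = 4545
source_agent.sinks.avro_forward.channel  = memory_channel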
26. GoldenGate for Continuous Streaming to Hadoop
- Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop
- Leverages GoldenGate & HDFS / Hive Java APIs
- Sample implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive)
- Likely to be a formal part of GoldenGate in a future release - but usable now
27. Load Incoming Log Files into Hive Table (Step 1)
- First step in the process is to load the incoming log files into a Hive table
- Also need to parse the log entries to extract request, date, IP address etc. columns
- The Hive table can then easily be used in downstream transformations
- Use the IKM File to Hive (LOAD DATA) KM
  - Source can be local files or HDFS
  - Either load the file into the Hive HDFS area, or leave it as an external Hive table
  - Ability to use a SerDe to parse the file data

28. First Though, Need to Set Up Topology and Models
- HDFS data servers (source) defined using the generic File technology
- Workaround to support IKM Hive Control Append
- Leave the JDBC driver blank, put the HDFS URL in the JDBC URL field
29. Defining the Physical Schema and Model for the HDFS Directory
- Hadoop processes typically access a whole directory of files in HDFS, rather than a single one
- Hive, Pig etc. aggregate all files in that directory and treat them as a single file
- ODI models usually point to a single file though - how do you set up access correctly?

30. Defining Topology and Model for Hive Sources
- Hive supported out-of-the-box with ODI12c (but requires an ODIAAH license for the KMs)
- Most recent Hadoop distributions use HiveServer2 rather than HiveServer
- Need to ensure the JDBC drivers support the Hive version
- Use the correct JDBC URL format (jdbc:hive2://...)
31. Final Model and Datastore Definitions
- HDFS files for incoming log data, and any other input data
- Hive tables for ETL targets and downstream processing
- Use RKM Hive to reverse-engineer column definitions from Hive

32. Using IKM File to Hive to Load Web Log File Data into Hive
- Create a mapping to load the file source (a single column for weblog entries) into the Hive table
- The target Hive table should have a column for the incoming log row, plus the parsed columns
33. Specifying a SerDe to Parse Incoming Hive Data
- SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats
- Distributed as a JAR file, giving Hive the ability to parse semi-structured formats
- We can use the RegEx SerDe to parse the Apache combined log format file into columns
- Enabled through the OVERRIDE_ROW_FORMAT option of the IKM File to Hive (LOAD DATA) KM
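For reference, a sketch of the kind of Hive table definition this approach produces; the regular expression is the commonly-used pattern for the Apache combined log format, and the table name and location are illustrative rather than taken from the generated code:

CREATE EXTERNAL TABLE apache_combined_log (
  host     STRING,
  identity STRING,
  username STRING,
  time     STRING,
  request  STRING,
  status   STRING,
  size     STRING,
  referer  STRING,
  agent    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
LOCATION '/user/oracle/incoming_logs/';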
34. Executing the First ODI12c Mapping
- EXTERNAL_TABLE option chosen in IKM File to Hive (LOAD DATA), as Flume will continue writing to the file until the source log rotates
- View the results of the data load in ODI Studio

35. Join to Additional Hive Tables, Transform using HiveQL (Step 2)
- IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, aggregation etc.
- INSERT only, no DELETE, UPDATE etc.
- Not all ODI12c mapping operators are supported, but basic functionality works OK
- Use this KM to join to other Hive tables, adding more details on post, title etc.
- Perform a DISTINCT on the join output, and load into the summary Hive table
36. Joining Hive Tables
- Only equi-joins supported
- Must use ANSI join syntax
- More complex joins may not produce valid HiveQL (subqueries etc.), as illustrated below
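By way of illustration, an ANSI-syntax equi-join of the kind this mapping generates; the column names roughly follow the example tables but are illustrative:

SELECT l.hostname, l.request_date, p.post_id, p.title, p.author
FROM hive_raw_apache_access_log l
JOIN posts p
  ON l.post_id = p.post_id;      -- equality predicate only

-- a range ("theta") join like this is not valid HiveQL (hypothetical columns):
--   ... JOIN geoip g ON l.ip_integer BETWEEN g.range_start AND g.range_end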
37. Filtering, Aggregating and Transforming Within Hive
- Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc. mapping operators can be added to the mapping to manipulate data
- Generates HiveQL functions, clauses etc.

38. Executing the Second Mapping
- ODI IKM Hive to Hive Control Append generates HiveQL to perform the data loading
- In the background, Hive on the BDA creates MapReduce job(s) to load and transform the HDFS data
- Automatically runs across the cluster, in parallel and with fault tolerance and HA
39. Bring in Reference Data from Oracle Database (Step 3)
- In this third step, additional reference data from the Oracle Database needs to be added
- In theory, we should be able to add Oracle-sourced datastores to the mapping and join as usual
- But the Oracle / JDBC-generic LKMs don't work with Hive
40. Options for Importing Oracle / RDBMS Data into Hadoop
- Could export the RDBMS data to a file, and load it using IKM File to Hive
- Oracle Big Data Connectors only export to Oracle, not import to Hadoop
- Best option is to use Apache Sqoop, and the new IKM SQL to Hive-HBase-File knowledge module
  - Hadoop-native, automatically runs in parallel
  - Uses native JDBC drivers, or OraOop (for example)
  - Bi-directional, in and out of Hadoop to the RDBMS
  - Run from the OS command line (a sketch follows below)
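A sketch of a hand-run Sqoop import along the same lines (not necessarily the exact command the KM generates; connection details, credentials and the source table name are placeholders):

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username BLOG_REFDATA --password welcome1 \
  --table POST_CATEGORIES \
  --hive-import --hive-table categories_sql_extract \
  --num-mappers 4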
41. Loading RDBMS Data into Hive using Sqoop
- First step is to stage the Oracle data into an equivalent Hive table
- Use the special LKM SQL Multi-Connect Global load knowledge module for the Oracle source
  - Passes responsibility for the load (extract) to the following IKM
- Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table

42. Join Oracle-Sourced Hive Table to Existing Hive Table
- The Oracle-sourced reference data in Hive can then be joined to the existing Hive table as normal
- Filters, aggregation operators etc. can be added to the mapping if required
- Use IKM Hive Control Append as the integration KM
43. ODI Static and Flow Control: Data Quality and Error Handling
- CKM Hive can be used with IKM Hive to Hive Control Append to filter out erroneous data
- Static controls can be used to create data firewalls
- Flow control is used in the Physical mapping view to handle errors and exceptions
- Example: filter out rows where the IP address is from a test harness

44. Enabling Flow Control in IKM Hive to Hive Control Append
- Check the ENABLE_FLOW_CONTROL option in the KM settings
- Select CKM Hive as the check knowledge module
- Erroneous rows will get moved to an E_ table in Hive, not loaded into the target Hive table
45. Using Hive Streaming and Python for Geocoding Data (Step 4)
- Another requirement we have is to geocode the webserver log entries
- Allows us to aggregate page views by country
- Based on the fact that IP ranges can usually be attributed to specific countries
- Not functionality normally found in Hive etc., but it can be done with add-on APIs
46. How GeoIP Geocoding Works
- Uses the free geocoding API and database from MaxMind
- Convert the IP address to an integer
- Find which integer range our IP address sits within (see the sketch below)
- But Hive can't use BETWEEN in a join
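To make the integer-range idea concrete, a small Python sketch of the conversion involved; in the actual solution the lookup is delegated to the pygeoip library, shown on the next slides, and the sample address is made up:

# convert a dotted-quad IP address into the integer used for range lookups
def ip_to_int(ip):
    a, b, c, d = (int(octet) for octet in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

# an address then belongs to a country when
#   range_start <= ip_to_int(address) <= range_end
# which is exactly the BETWEEN-style predicate Hive can't join on
print ip_to_int('81.136.214.10')    # 1367922186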
47. Solution: IKM Hive Transform
- IKM Hive Transform can pass the output of a Hive SELECT statement through a Perl, Python, shell etc. script to transform the content
- Uses Hive's TRANSFORM ... USING ... AS functionality

hive> add file file:///tmp/add_countries.py;
Added resource: file:///tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
    > using 'add_countries.py'
    > as (hostname,request_date,post_id,title,author,category,country)
    > from access_per_post_categories;
48. Creating the Python Script for Hive Streaming
- The solution requires a Python API to be installed on all Hadoop nodes, along with the geocoding database:

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip install pygeoip

- The Python script then parses the incoming stdin lines using tab-separation of fields, and outputs the same (but with an extra field for the country):

#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
for line in sys.stdin:
    line = line.rstrip()
    hostname,request_date,post_id,title,author,category = line.split('\t')
    country = gi.country_name_by_addr(hostname)
    print hostname+'\t'+request_date+'\t'+post_id+'\t'+title+'\t'+author+'\t'+country+'\t'+category

49. Setting up the Mapping
- Map the source Hive table to the target, which includes a column for the extra country field
- Copy the script + GeoIP.dat file to every node's /tmp directory
- Ensure all Python APIs and libraries are installed on each Hadoop node
50. Configuring IKM Hive Transform
- TRANSFORM_SCRIPT_NAME specifies the name of the script, and the path to the script
- TRANSFORM_SCRIPT has issues with parsing; do not use it - leave it blank and the KM will use the existing script
- Optional ability to specify sort and distribution columns (which can be compound)
- Leave the other options at their defaults

51. Executing the Mapping
- The KM automatically registers the script with Hive (which caches it on all nodes)
- The HiveQL output then runs the contents of the first Hive table through the script, outputting the results to the target table
52. Bulk Unload Summary Data to Oracle Database (Step 5)
- Final requirement is to unload the final Hive table contents to the Oracle Database
- Several use-cases for this:
  - Use Hadoop / BDA for ETL offloading
  - Use the analysis capabilities of BDA, but then output the results to an RDBMS data mart or DW
  - Permit use of more advanced SQL query tools
  - Share results with other applications
- Can use Sqoop for this (sketched below), or use the Oracle Big Data Connectors
  - Fast bulk unload, or transparent Oracle access to Hive
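For comparison, the Sqoop route mentioned above would look something like the following from the command line (connection details, target table and warehouse directory are placeholders); the OLH/ODCH route used in this example is covered on the next slide:

sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username BLOG_DW --password welcome1 \
  --table ACCESS_PER_POST_SUMMARY \
  --export-dir /user/hive/warehouse/access_per_post_summary \
  --input-fields-terminated-by '\001'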
53. IKM File/Hive to Oracle (OLH/ODCH)
- KM for accessing HDFS/Hive data from Oracle
- Either sets up ODCH connectivity, or bulk-unloads via OLH
- Map from the HDFS or Hive source to Oracle tables (via the Oracle technology in Topology)

54. Configuring the KM Physical Settings
- For the access table in the Physical view, change the LKM to LKM SQL Multi-Connect
- Delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle
55. Configuring the KM Physical Settings (continued)
- For the target table, select IKM File/Hive to Oracle
  - Only becomes available to select once LKM SQL Multi-Connect is selected for the access table
- Key option values to set are:
  - OLH_OUTPUT_MODE (use JDBC initially, OCI if the Oracle Client is installed on the Hadoop client node)
  - MAPRED_OUTPUT_BASE_DIR (set to a directory on HDFS that the OS user running ODI can access)

56. Executing the Mapping
- Executing the mapping will invoke OLH from the OS command line
- Hive table (or HDFS file) contents are copied to the Oracle table
57. Create Package to Sequence ETL Steps
- Define a package (or load plan) within ODI12c to orchestrate the process
- Call the package / load plan execution from the command line, a web service call, or a schedule

58. Execute Overall Package
- Each step is executed in sequence
- An end-to-end ETL process, using ODI12c's metadata-driven development process, data quality handling and heterogeneous connectivity, but with Hadoop-native processing
59. Conclusions
- Hadoop, and the Oracle Big Data Appliance, is an excellent platform for data capture, analysis and processing
- Hadoop tools such as Hive, Sqoop, MapReduce and Pig provide the means to process and analyse data in parallel, using languages and an approach familiar to Oracle developers
- ODI12c provides several benefits when working with ETL and data loading on Hadoop
  - Metadata-driven design; data quality handling; KMs to handle technical complexity
- The Oracle Data Integrator Adapter for Hadoop provides several KMs for Hadoop sources
- In this presentation, we've seen an end-to-end example of big data ETL using ODI
  - The power of Hadoop and the BDA, with the ETL orchestration of ODI12c
60. Thank You for Attending!
- Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
- Contact us at [email protected]
- Look out for our book, Oracle Business Intelligence Developers Guide, out now!
- Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)