© 2013 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording, or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract
The Hadoop user guide provides a brief introduction to cloud connectors and their features. The guide provides detailed information on setting up the connector and running data synchronization (DSS) tasks, along with a brief overview of the supported features and task operations that can be performed using the Hadoop connector.

Table of Contents
Overview
Hadoop
Hadoop Plugin
Supported Objects and Task Operations
Enabling Hadoop Connector
Instructions while installing the Secure Agent
Creating a Hadoop Connection as a Source
JDBC URL
JDBC Driver class
Installation Paths
Setting Hadoop Classpath for various Hadoop Distributions
Creating Hadoop Data Synchronization Task (Source)
Enabling a Hadoop Connection as a Target
Creating Hadoop Data Synchronization Task (Target)
Data Filters
Troubleshooting
Increasing Secure Agent Memory
Additional Troubleshooting Tips
Known Issues
Overview

Informatica Cloud connector SDKs are off-cycle, off-release add-ins that provide data integration to SaaS and on-premise applications that are not natively supported by Informatica Cloud. The cloud connectors are specifically designed to address the most common use cases, such as moving data into the cloud and retrieving data from the cloud, for each individual application.

Figure 1: Informatica Cloud Architecture

Once the Hadoop cloud connector is enabled for your ORG Id, you need to create a connection in Informatica Cloud to access the connector.

Hadoop

The Apache Hadoop
project develops open-source software for reliable, scalable,
distributed computing. The Apache Hadoop software library is a
framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than
rely on hardware to deliver high-availability, the library itself
is designed to detect and handle failures at the application layer,
so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures. 4 The project
includes these modules: Hadoop Common: The common utilities that
support the other Hadoop modules. Hadoop Distributed File System
(HDFS): A distributed file system that provides high-throughput
access to application data. Hadoop YARN: A framework for job
scheduling and cluster resource management. Hadoop MapReduce: A
YARN-based system for parallel processing of large data sets. Other
Hadoop-related projects include: Ambari: A web-based tool for
provisioning, managing, and monitoring Apache Hadoop clusters which
includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog,
HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a
dashboard for viewing cluster health such as heatmaps and ability
to view MapReduce, Pig and Hive applications visually alongwith
features to diagnose their performance characteristics in a
user-friendly manner. Avro: A data serialization system. Cassandra:
A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed
systems. HBase: A scalable, distributed database that supports
structured data storage for large tables. Hive: A data warehouse
infrastructure that provides data summarization and ad hoc
querying. Mahout: A Scalable machine learning and data mining
library. Pig: A high-level data-flow language and execution
framework for parallel computation. ZooKeeper: A high-performance
coordination service for distributed applications. Cloudera Impala:
It is the industrys leading massively parallel processing (MPP) SQL
query engine that runs natively in Apache Hadoop. The
Apache-licensed, open source Impala project combines modern,
scalable parallel database technology with the power of Hadoop,
enabling users to directly query data stored in HDFS and Apache
HBase without requiring data movement or transformation. Impala is
designed from the ground up as part of the Hadoop ecosystem and
shares the same flexible file and data formats, metadata, security
and resource management frameworks used by MapReduce, Apache Hive,
Apache Pig and other components of the Hadoop stack. Hadoop Plugin
The Informatica Hadoop connector allows you to perform Query and Insert operations on Hadoop. The plug-in supports Cloudera 5.0, MapR 3.1, Pivotal HD 2.0, Amazon EMR, and HortonWorks 2.1, and has been certified to work on CDH 4.2 and HDP 1.1. The Informatica Cloud Secure Agent must be installed on one of the nodes of the Hadoop cluster where HiveServer or HiveServer2 is running.

The plug-in is used as a target to insert data into Hadoop. The plug-in connects to Hive and Cloudera Impala to perform relevant data operations, and can easily be integrated with Informatica Cloud. The plug-in supports all operators supported in HiveQL.

The plug-in supports the AND conjunction between filters. It supports both AND and OR conjunctions in advanced filters. The plug-in supports filtering on all filterable columns in Hive/Impala tables.
Supported Objects and Task Operations

The table below lists the objects and task operations supported by the Hadoop connector.

Object               | Data Preview | Look Up | DSS Source: Query | DSS Target: Insert | Update | Upsert | Delete
All tables in Hive   | Supported    | NA      | Supported         | Supported          | NA     | NA     | NA
All tables in Impala | Supported    | NA      | Supported         | NA                 | NA     | NA     | NA

NA: Not Applicable

Enabling Hadoop Connector

To enable the Hadoop connector, get in touch with Informatica support or your Informatica representative. It usually takes 15 minutes for the connector to download to the Secure Agent after it is enabled.

Instructions while installing the Secure Agent

Follow the
given instructions while installing the Secure Agent:
You must install the Secure Agent on the Hadoop cluster. If you install it outside the Hadoop cluster, you can only read from Hadoop; you cannot write into Hadoop.
You must also install the Secure Agent on the node where HiveServer2 is running.

Creating a Hadoop Connection as a Source

To use the Hadoop connector in a data synchronization task, you must create a connection in Informatica Cloud. The following steps help you create a Hadoop connection in Informatica Cloud.
1. In the Informatica Cloud home page, click Configure.
2. From the drop-down menu that appears, select Connections.
3. The Connections page appears.
4. Click New to create a connection.
5. The New Connection page appears.

Figure 2: Connection Parameter
6. Specify the values for the connection parameters.

Connection Name: Enter a unique name for the connection.
Description: Provide a relevant description for the connection.
Type: Select Hadoop from the list.
Secure Agent: Select the appropriate Secure Agent from the list.
Username: Enter the username for the schema of the Hadoop component.
Password: Enter the password for the schema of the Hadoop component.
JDBC Connection URL: Enter the JDBC URL used to connect to the Hadoop component. Refer to JDBC URL.
Driver: Enter the JDBC driver class used to connect to the Hadoop component. Refer to JDBC Driver class.
Commit Interval: Enter the commit interval, the batch size (in rows) of data loaded into Hive.
Hadoop Installation Path: Enter the installation path of the Hadoop component used to connect to Hadoop.
Hive Installation Path: Enter the Hive installation path.
HDFS Installation Path: Enter the HDFS installation path.
HBase Installation Path: Enter the HBase installation path.
Impala Installation Path: Enter the Impala installation path.
Miscellaneous Library Path: Enter the miscellaneous library path, an additional library path that can be used to communicate with Hadoop.
Enable Logging: Check the Enable Logging box to enable verbose log messages.

Note: Installation paths are the paths where the Hadoop jars are located. The connector loads the libraries from one or more of these paths before sending any instructions to Hadoop. If you do not want to specify the installation paths, you can generate the setHadoopConnectorClasspath.sh file for Amazon, HortonWorks, and MapR. Refer to Setting Hadoop Classpath for various Hadoop Distributions.

7. Click Test to evaluate the connection.
8. Click OK to save the connection.

JDBC URL

The connector connects to
different components of Hadoop using JDBC. The URL format and parameters vary among components.

Hive uses the JDBC URL format below:

    jdbc:<hive/hive2>://<server>:<port>/<schema>

The URL parameters are as follows:
hive/hive2: Protocol information, depending on the version of the Thrift server used: hive for HiveServer and hive2 for HiveServer2.
server, port: Server and port information where the Thrift server is running.
schema: The Hive schema that the connector needs to access.

For example, jdbc:hive2://invrlx63iso7:10000/default connects to the default schema of Hive, using a HiveServer2 Thrift server running on the server invrlx63iso7 on port 10000.

The Hive Thrift server must be running for the connector to communicate with Hive. The command to start the Thrift server is hive --service hiveserver2.

Cloudera Impala uses the JDBC URL format below:

    jdbc:hive2://<server>:<port>/<schema>;auth=<auth mechanism>

In this case, the auth parameter must be set to the security mechanism used by the Impala server, for example, Kerberos. As an example, jdbc:hive2://invrlx63iso7:21050/;auth=noSasl connects to the default schema of Impala. A quick way to try these URLs outside the connector is shown below.
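For a quick sanity check of a URL before entering it in the connection, you can open the same JDBC URL with the Beeline client that ships with Hive. This is a minimal sketch, assuming Beeline is on the PATH of a cluster node and reusing the example host invrlx63iso7 from above; your host, port, and credentials will differ.

    # Start HiveServer2 if it is not already running
    hive --service hiveserver2 &

    # Connect to the default Hive schema through HiveServer2,
    # using the same JDBC URL and driver class the connector uses
    beeline -d org.apache.hive.jdbc.HiveDriver \
            -u "jdbc:hive2://invrlx63iso7:10000/default" \
            -n <username> -p <password>

    # Impala endpoint with the auth parameter appended
    beeline -d org.apache.hive.jdbc.HiveDriver \
            -u "jdbc:hive2://invrlx63iso7:21050/;auth=noSasl"

If Beeline can open the URL, the connector's Test button should succeed with the same values.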
JDBC Driver class

The JDBC driver class tends to vary among Hadoop components. For example, org.apache.hive.jdbc.HiveDriver is the driver class for Hive and Impala.

Installation Paths

The following table displays sample installation paths for different Hadoop distributions:
Distribution            | Default Hadoop installation path                             | Default Hive installation path
Cloudera 5 VM           | /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop | /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hive
HortonWorks 2.1 Sandbox | /usr/lib/hadoop                                              | /usr/lib/hive
Amazon EMR              | /home/hadoop                                                 | /home/hadoop/hive/hive-0.11.0
MapR 3.1 demo           | /opt/mapr/hadoop/hadoop-0.20.2                               | /opt/mapr/hive/hive-0.12
Pivotal HD 2.0          | /usr/lib/gphd/hadoop                                         | /usr/lib/gphd/hive

Note: When you do not specify the installation paths, you can simply set the classpath and proceed with the connection configuration and creating DSS tasks. A quick way to verify the paths on your cluster is shown below.
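Before saving the connection, it can help to confirm that the installation paths you plan to enter actually exist on the node running the Secure Agent. A minimal sketch, assuming a HortonWorks-style layout; substitute the paths for your distribution from the table above:

    # Verify that the Hadoop and Hive installation paths exist
    ls -d /usr/lib/hadoop /usr/lib/hive

    # The jars the connector loads should be visible under these paths
    ls /usr/lib/hadoop/*.jar /usr/lib/hive/lib/*.jar | head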
Setting Hadoop Classpath for various Hadoop Distributions

If you do not specify the installation paths in the connection parameters, you can perform the connection operations by generating the setHadoopConnectorClasspath.sh file. This section describes how to set the classpath for the supported Hadoop distributions.

Follow this procedure to generate setHadoopConnectorClasspath.sh for Amazon, HortonWorks, and Pivotal:
1. Make changes in the /main/tomcat/saas-infaagentapp.sh file as shown in the figure.
2. Start the Agent as shown in the command prompt below.
3. Create the Hadoop connection using the connector.
4. Test the connection. This generates the setHadoopConnectorClasspath.sh file in the Infa_Agent_DIR/main/tomcat path.
5. Stop the Agent using the Ctrl+C keys.
6. From Infa_Agent_DIR, execute the script with the command: . ./main/tomcat/setHadoopConnectorClasspath.sh
7. Restart the Agent and execute the DSS tasks.
Note: If you want to generate the setHadoopConnectorClasspath.sh file again, delete the existing file and regenerate it.

After executing the above steps, if the Hadoop classpath does not point to the correct classpath, execute the following steps to undo the changes made above:
1. Enter vi saas-infaagentapp.sh.
2. Enter insert mode.
3. Press Delete or Backspace to delete the entries added earlier.
4. Press the Escape key.
5. Type :wq to save and quit.
Once you follow this procedure, the entries are deleted, and you can move on to the section below to direct Hadoop to the correct classpath.

Directing the Hadoop classpath to the correct classpath

In certain cases, Hadoop may point to an incorrect classpath. Follow the procedure below to direct it to the correct classpath; a sketch of the resulting export appears after these steps.
1. Enter the command hadoop classpath in the terminal. This displays the stream of jars.
2. Copy and paste this stream into a text file.
3. Delete the following entries from the file:
a. :/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar
b. :/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar (retain the latest version and delete the previous one)
4. Copy the remaining content and export it to a variable called HADOOP_CLASSPATH.
5. In the command prompt window, specify the path where this file resides, that is, InfaAgentDir/main/tomcat/saas-infaagentapp.sh.
6. Now follow the steps for generating setHadoopConnectorClasspath.sh mentioned above. Refer to Setting Hadoop Classpath for various Hadoop Distributions.
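The export in step 4 can be scripted rather than done by hand. The following is a minimal sketch, assuming the MapR 3.1 layout used in the entries above; the exact jars to prune depend on your distribution:

    # Capture the classpath reported by Hadoop
    CP=$(hadoop classpath)

    # Remove the stale entries called out in step 3 (MapR example paths)
    CP=$(printf '%s' "$CP" \
      | sed -e 's|:/opt/mapr/hadoop/hadoop-0.20.2/bin/\.\./hadoop\*core\*\.jar||' \
            -e 's|:/opt/mapr/hadoop/hadoop-0.20.2/bin/\.\./lib/commons-logging-api-1.0.4\.jar||')

    # Export the cleaned value so the Secure Agent script can use it
    export HADOOP_CLASSPATH="$CP"
    echo "$HADOOP_CLASSPATH"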
Creating Hadoop Data Synchronization Task (Source)

Note: You need to create a connection before getting started with a data synchronization task. The following steps help you set up a data synchronization task in Informatica Cloud. Let us consider the task operation Insert to perform the data synchronization task.
1. In the Informatica Cloud home page, click Applications.
2. From the drop-down menu that appears, select Data Synchronization.
3. The Data Synchronization page appears.
4. Click New to create a data synchronization task.
5. The Definition tab appears.

Figure 3: Definition Tab

6. Specify the Task Name, provide a Description, and select the Task Operation Insert.
7. Click Next.
8. The Source tab appears.

Figure 4: Source Tab

9. Select the source Connection, Source Type, and Source Object to be used for the task.
10. Click Next.
11. The Target tab appears. Select the target Connection and Target Object required for the task.

Figure 5: Target Tab

12. Click Next.
13. In the Data Filters tab, Process all rows is chosen by default.
14. Click Next.
15. In the Field Mapping tab, map source fields to target fields accordingly.

Figure 6: Field Mapping

16. Click Next.
17. The Schedule tab appears.
18. In the Schedule tab, you can schedule the task as per the requirement and save it.
19. If you do not want to schedule the task, click Save and Run the task.

Figure 7: Save and Run the Task

After you Save and Run the task, you are redirected to the monitor log page, where you can monitor the status of data synchronization tasks.
Enabling a Hadoop Connection as a Target

To use the Hadoop connector in a data synchronization task, you must create a connection in Informatica Cloud. The following steps help you create a Hadoop connection in Informatica Cloud.
1. In the Informatica Cloud home page, click Configure.
2. From the drop-down menu that appears, select Connections.
3. The Connections page appears.
4. Click New to create a connection.
5. The New Connection page appears.

Figure 8: Connection Parameter

6. Specify the values for the connection parameters. Refer to Creating a Hadoop Connection as a Source.
7. Click Test to evaluate the connection.
8. Click OK to save the connection.
Creating Hadoop Data Synchronization Task (Target)

Note: You need to create a connection before getting started with a data synchronization task. The following steps help you set up a data synchronization task in Informatica Cloud. Let us consider the task operation Insert to perform the data synchronization task.
1. In the Informatica Cloud home page, click Applications.
2. From the drop-down menu that appears, select Data Synchronization.
3. The Data Synchronization page appears.
4. Click New to create a data synchronization task.
5. The Definition tab appears.

Figure 9: Definition Tab

6. Specify the Task Name, provide a Description, and select the Task Operation Insert.
7. Click Next.
8. The Source tab appears.

Figure 10: Source Tab

9. Select the source Connection, Source Type, and Source Object to be used for the task.
10. Click Next.
11. The Target tab appears. Select the target Connection and Target Object required for the task.

Figure 11: Target Tab

12. Click Next.
13. In the Data Filters tab, Process all rows is chosen by default. See also Data Filters.
14. Click Next.
15. In the Field Mapping tab, map source fields to target fields accordingly.

Figure 12: Field Mapping

16. Click Next.
17. The Schedule tab appears.
18. In the Schedule tab, you can schedule the task as per the requirement and save it.
19. If you do not want to schedule the task, click Save and Run the task.

Figure 13: Save and Run the Task

After you Save and Run the task, you are redirected to the monitor log page, where you can monitor the status of data synchronization tasks.
Data Filters

Data filters help you fetch specific data based on the APIs configured in the Config.csv file. The data synchronization task processes the data based on the assigned filter field.

Note: Advanced data filters are not supported by the Hadoop connector.

The following steps help you use data filters.
1. In the data synchronization task, select the Data Filters tab.
2. The Data Filters tab appears.
3. Click New as shown in the figure below.

Figure 14: Data Filters

4. The Data Filter dialog box appears.

Figure 15: Data Filters-2

5. Specify the following details.

Object: Select the object for which you want to assign filter fields.
Filter By: Select the filter field.
Operator: Select the Equals operator. Only the Equals operator is supported with this release.
Filter Value: Enter the filter value.

6. Click OK. A conceptual sketch of the query a filter produces follows below.
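Conceptually, each filter row becomes a predicate in the query issued against Hive or Impala, with multiple filters joined by AND as described earlier. The sketch below is illustrative only, assuming a hypothetical table customers with a country column and the example host invrlx63iso7; it shows the kind of HiveQL an Equals filter corresponds to, checked here via Beeline:

    # A filter of "country Equals 'US'" corresponds to a predicate like this
    beeline -u "jdbc:hive2://invrlx63iso7:10000/default" \
            -e "SELECT * FROM customers WHERE country = 'US'"

Two filters on the same object would combine as country = 'US' AND status = 'active'.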
Troubleshooting

Increasing Secure Agent Memory

To overcome memory issues faced by the Secure Agent, follow the steps given below.
1. In the Informatica Cloud home page, click Configure.
2. Select Secure Agents.
3. The Secure Agents page appears.
4. From the list of available Secure Agents, select the Secure Agent for which you want to increase memory.
5. Click the pencil icon corresponding to the Secure Agent. The pencil icon is used to edit the Secure Agent.
6. The Edit Agent page appears.
7. In the System Configuration section, select the Type as DTM.
8. Edit JVMOption1 to -Xmx512m as shown in the figure below.

Figure 16: Increasing Secure Agent Memory-1

9. Again in the System Configuration section, select the Type as TomCatJRE.
10. Edit INFA_memory to -Xms256m -Xmx512m as shown in the figure below.

Figure 17: Increasing Secure Agent Memory-2

11. Restart the Secure Agent. The Secure Agent memory has been increased successfully.

Additional Troubleshooting Tips
When the connection is used as a target, the last batch of the insert load is not reflected in the record count. Refer to the session logs for the record count of the last batch inserted. For example, if the commit interval is set to 1 million and the actual rows inserted are 1.1 million, the record count in the UI shows 1 million, and the session logs reveal the row count of the remaining 100k records.

Set the commit interval to the highest value possible before java.lang.OutOfMemoryError is encountered.

When the connection is used as a target to load data into Hadoop, ensure that all the fields are mapped.

After a data load in Hive, Impala needs to be refreshed manually for the latest changes to the table to be reflected in Impala. In the current version, the connector does not automatically refresh Impala upon a Hive dataset insert.
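A hedged example of the manual refresh, assuming the impala-shell client is available on a cluster node and a hypothetical table sales that was just loaded through the connector:

    # Make Impala pick up rows newly inserted into an existing Hive table
    impala-shell -i <impalad-host> -q "REFRESH sales"

    # If the table itself was newly created in Hive, reload its metadata
    impala-shell -i <impalad-host> -q "INVALIDATE METADATA sales"

REFRESH is the lighter operation for newly added data; INVALIDATE METADATA reloads the table metadata entirely.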
Known Issues

The connector is currently certified to work with Cloudera CDH 4.2 and HortonWorks HDP 1.1.
The connector may encounter a java.lang.OutOfMemory exception while fetching large data sets for tables with a large number of columns (for example, 5 million rows for a 15-column table). In such scenarios, restrict the result set by adding appropriate filters or by decreasing the number of field mappings.
The Enable Logging connection parameter is a placeholder for a future release, and its state has no impact on connector functionality.
The connector has been certified and tested in Hadoop's pseudo-distributed mode. Performance is a factor of Hadoop's cluster setup.
Ignore log4j initialization warnings in the session logs.