Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector
Guglielmo Iozzia, Big Data Infrastructure Engineer @ IBM Ireland

Jan 23, 2018

Data Ingestion for Analytics: a real scenario

In the business area my team belongs to (cloud applications), there were many questions to answer. They related to:

● Defect analysis
● Outage analysis
● Cyber-Security

“Data is the second most important thing in analytics”

Data Ingestion: multiple sources...

● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services

… and so many tools available to get the data

What are we going to do with all this data?

Issues

● The need to collect data from multiple sources introduces redundancy, which costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in general) in managing Big Data tools.
● Low budget.

Alternatives

#1 Panic

Alternatives

#2 Cloning team members

Alternatives

#3 Find a smart way to simplify the data ingestion process

A single tool needed...

● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automated error handling and alerting.
● Easy for everyone to use.
● Zero downtime when upgrading the infrastructure, thanks to the logical isolation of each flow stage.
● Open source.

… something like this

Streamsets Data Collector

Streamsets Data Collector: supported origins

Streamsets Data Collector: available destinations

Streamsets Data Collector: available processors

● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser

...and many others
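
To give a feel for what chaining processors means, here is a minimal conceptual sketch in plain Python. It is not Data Collector code and does not use its API: the function names, the record layout (a dict with "raw" and "parsed" keys) and the "status" field are all assumptions made for illustration. It mimics a JSON Parser, a Field Converter and a Stream Selector applied in sequence, with failing records set aside instead of stopping the flow. Routing to the first matching stream is a simplification chosen for brevity; the real processor's routing semantics may differ.

```python
import json


def json_parser(record, field="raw"):
    """Parse the JSON string held in `field` and store it under 'parsed'."""
    out = dict(record)
    out["parsed"] = json.loads(record[field])
    return out


def field_converter(record, field, to_type):
    """Convert one parsed field to the given type (e.g. str -> int)."""
    out = dict(record)
    out["parsed"] = dict(record["parsed"])
    out["parsed"][field] = to_type(record["parsed"][field])
    return out


def stream_selector(record, conditions):
    """Return the name of the first stream whose condition matches."""
    for name, predicate in conditions:
        if predicate(record):
            return name
    return "default"


def run_pipeline(raw_lines):
    """Push each line through the three stages; collect failures separately."""
    streams = {"errors": [], "default": []}
    bad_records = []
    # Hypothetical routing rule: HTTP-like status codes >= 500 go to 'errors'.
    conditions = [("errors", lambda r: r["parsed"]["status"] >= 500)]
    for line in raw_lines:
        try:
            record = json_parser({"raw": line})
            record = field_converter(record, "status", int)
            streams[stream_selector(record, conditions)].append(record["parsed"])
        except (ValueError, KeyError, TypeError) as exc:
            # Hypothetical error handling: keep the bad input and the reason.
            bad_records.append((line, str(exc)))
    return streams, bad_records
```

Calling run_pipeline with a list of JSON log lines returns the routed records per stream plus the list of inputs that could not be processed, which is the behaviour the graphical tool provides without hand-written code.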

Streamsets Data Collector

Demo

Streamsets DC: performance and reliability

● Two execution modes are available: standalone or cluster.
● Implemented in Java, so any performance best practice or recommendation for Java applications applies here.
● REST services are available for performance monitoring.
● Rules and alerts (for both metrics and data).
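
The difference between a metric rule and a data rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Data Collector rule engine or its API: the counter name "error_records", the thresholds and the "user" field are hypothetical. A metric rule looks at pipeline-level counters, while a data rule looks at the content of sampled records.

```python
def metric_rule(metrics, name, threshold):
    """Metric rule: fire when a pipeline counter exceeds a threshold."""
    return metrics.get(name, 0) > threshold


def data_rule(sampled_records, predicate, max_ratio):
    """Data rule: fire when the share of sampled records matching
    `predicate` exceeds `max_ratio`."""
    if not sampled_records:
        return False
    matching = sum(1 for record in sampled_records if predicate(record))
    return matching / len(sampled_records) > max_ratio


def evaluate_alerts(metrics, sampled_records):
    """Evaluate both kinds of rule and return the alert messages raised."""
    alerts = []
    # Hypothetical metric rule: more than 100 error records so far.
    if metric_rule(metrics, "error_records", 100):
        alerts.append("too many error records")
    # Hypothetical data rule: over 10% of sampled records have no user.
    if data_rule(sampled_records, lambda r: r.get("user") is None, 0.1):
        alerts.append("more than 10% of sampled records lack a user")
    return alerts
```

In the tool itself such rules are configured rather than coded; the sketch only shows what each kind of rule evaluates.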

Streamsets Data Collector: security

● Authentication: user accounts can be authenticated against LDAP.
● Authorization: the Data Collector provides several roles (admin, manager, creator, guest).
● You can use Kerberos authentication to connect to origin and destination systems.
● Follow the usual security best practices (iptables, networking, etc.) for Java web applications running on Linux machines.

Useful Links

Streamsets Data Collector:

https://streamsets.com/product/

Thanks!

My contacts:

Linkedin: https://ie.linkedin.com/in/giozzia

Blog: http://googlielmo.blogspot.ie/

Twitter: https://twitter.com/guglielmoiozzia
