Big Data and Lynda.com
Subash DSouza

Jan 26, 2015


Big Data Camp LA 2014: How Lynda.com is getting started with Big Data, by Subash DSouza of Lynda.com
Transcript
Page 1: Big data and lynda_Subash_DSouza.com

1

Big Data and Lynda.com
Subash DSouza

Page 2: Big data and lynda_Subash_DSouza.com

2

Who is Lynda.com?

• lynda.com is an online learning company that helps anyone learn software, design, and business skills to achieve their personal and professional goals.

• Founded in 1995 by Lynda Weinman and Bruce Heavin.

• Went online in 2002.

• As of January 2014, lynda.com offers more than 2,400 courses in business, design, web, programming, photography, video, 3D and animation, audio, education, and CAD.

Page 3: Big data and lynda_Subash_DSouza.com

3

Why Big Data?

• With the growth of users on Lynda.com, data has increased rapidly.

• With the amount of data we collect, there has been a drive to derive more insights from the data.

• We collect data from multiple sources such as Google Analytics, internal logs and user sessions.

Page 4: Big data and lynda_Subash_DSouza.com

4

Current Use cases of Big Data at Lynda.com

• We use MongoDB as a Learning Record Store, to host user configuration for notifications, and as a data source for the localized text on the main website.

• A Learning Record Store (LRS) is a data store that serves as a repository for the learning records needed to use the Tin Can API.
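
As a minimal sketch of what a learning record might look like, the following builds a Tin Can (xAPI) statement as a plain dictionary, the kind of JSON document an LRS stores. The actor/verb/object shape follows the Tin Can specification; the learner, email address, and course URL are made-up placeholders, and nothing here reflects lynda.com's actual schema.

```python
import json


def make_statement(name, email, verb, course_url):
    """Build a minimal Tin Can (xAPI) statement: actor, verb, object."""
    return {
        "actor": {"objectType": "Agent", "name": name, "mbox": f"mailto:{email}"},
        "verb": {
            "id": f"http://adlnet.gov/expapi/verbs/{verb}",
            "display": {"en-US": verb},
        },
        "object": {"objectType": "Activity", "id": course_url},
    }


# Hypothetical learner and course, for illustration only.
statement = make_statement(
    "Example Learner",
    "learner@example.com",
    "completed",
    "http://example.com/courses/hadoop-fundamentals",
)
print(json.dumps(statement, indent=2))
```

Because each statement is already a self-contained JSON document, a document database such as MongoDB can store it without a fixed table schema.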

Page 5: Big data and lynda_Subash_DSouza.com

5

Current Use cases of Big Data at Lynda.com

• Recommendation algorithms using Myrrix. We have data that is fed once a day to our recommendation servers, which run on Myrrix.

• Myrrix was machine learning (“Big Learning”) software built on top of Apache Hadoop and Apache Mahout.

• It was bought by Cloudera last August.

• Succeeded by Oryx, which has tighter integration with CDH.

• Working on migrating to Oryx.
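
To illustrate the idea of a once-a-day batch feed into a recommender, here is a toy item co-occurrence recommender in plain Python. It is not the algorithm Myrrix or Oryx uses; the function names and the sample viewing data are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(views):
    """Count how often each pair of courses was viewed by the same user."""
    co = defaultdict(Counter)
    for courses in views.values():
        for a, b in combinations(sorted(set(courses)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(co, views, user, k=3):
    """Rank courses the user has not seen by co-occurrence with seen ones."""
    seen = set(views.get(user, []))
    scores = Counter()
    for course in seen:
        for other, count in co[course].items():
            if other not in seen:
                scores[other] += count
    return [course for course, _ in scores.most_common(k)]


# Hypothetical daily batch: user -> courses viewed.
daily_views = {
    "u1": ["photoshop", "illustrator"],
    "u2": ["photoshop", "illustrator", "indesign"],
    "u3": ["photoshop", "python"],
}
model = build_cooccurrence(daily_views)
print(recommend(model, daily_views, "u3"))
```

Rebuilding the model from the full day's batch, rather than updating it per event, mirrors the daily feed described above.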

Page 6: Big data and lynda_Subash_DSouza.com

6

The future of Big Data at Lynda.com

• Use the data we collect to gain better insights into our business decision making

• Combine Google Analytics with our own internal logs and User Sessions to understand our users better. This will allow us to create customized experiences for our users.

• A better user experience will keep users on the site longer and should also improve our turnover (churn) rate.
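
Combining Google Analytics with internal logs and user sessions could look roughly like the sketch below, which joins page-view rows with internal session records on a shared session ID. The field names (`session_id`, `page`, `user_id`, `plan`) are assumptions for illustration, not the real export formats.

```python
from collections import defaultdict


def combine(ga_rows, session_rows):
    """Attach Google Analytics page views to internal session records."""
    pages_by_session = defaultdict(list)
    for row in ga_rows:
        pages_by_session[row["session_id"]].append(row["page"])
    combined = []
    for session in session_rows:
        merged = dict(session)
        # Sessions with no matching analytics rows get an empty page list.
        merged["pages"] = pages_by_session.get(session["session_id"], [])
        combined.append(merged)
    return combined


ga_rows = [
    {"session_id": "s1", "page": "/courses/excel"},
    {"session_id": "s1", "page": "/courses/access"},
    {"session_id": "s2", "page": "/home"},
]
session_rows = [
    {"session_id": "s1", "user_id": 42, "plan": "premium"},
    {"session_id": "s3", "user_id": 7, "plan": "basic"},
]
result = combine(ga_rows, session_rows)
for record in result:
    print(record)
```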

Page 7: Big data and lynda_Subash_DSouza.com

7

How are we achieving that?

• Building out Hadoop clusters on YARN

• Use HBase for some of our real-time use cases

• Testing out Spark and Storm

• Still in early stages

Page 8: Big data and lynda_Subash_DSouza.com

8

Agenda

Big Data Overview

• Introduction of Hadoop to lynda.com

Page 9: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

9

[Diagram: Hadoop Stack and Data Access. Labels: Extract/Load/Transform; Propagate; HDFS; Staging; Consumable Data; Hive/Pig; HBase; RDBMS/Files; Services and APIs; API Access; Business Intelligence; Data Chronology; Governance; Security (Kerberos); Meta Data (HCatalog); Job Scheduling (Oozie); Extract to RDBMS (Sqoop); Monitoring Tools (Nagios, Ganglia, Ambari); Direct Access to Raw Data (Hue); Data Serialization (Avro); Data Extraction (Flume); Data Movement (MapReduce). Data sources: Google Analytics, lynda logs, user sessions.]

Page 10: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

10


Data Collecting/Acquisition

• Start with archiving user sessions.

• Data acquisition: Google Analytics, lynda logs.

Page 11: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

11


Page 12: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

12


Staging: Data Processing

ELT: Put the data in one place so that it can be transformed efficiently by another process. This will be the “Extract” and “Load” part of the ELT process.
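
The Extract and Load steps can be sketched as copying raw records into a staging area untouched, leaving transformation to a later process. The sketch below uses a local temporary directory as a stand-in for an HDFS staging path; the directory layout, date, and function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def extract_and_load(raw_lines, staging_dir, source, day):
    """E + L: write raw records to staging exactly as received."""
    target = Path(staging_dir) / source / f"{day}.log"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(raw_lines) + "\n", encoding="utf-8")
    return target


def transform(staged_file):
    """T: a separate, later step parses the staged records."""
    with open(staged_file, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


raw = ['{"user": 1, "event": "play"}', '{"user": 2, "event": "pause"}']
with tempfile.TemporaryDirectory() as staging:
    staged = extract_and_load(raw, staging, "user_sessions", "2014-01-01")
    events = transform(staged)
print(events)
```

Keeping the staged copy byte-for-byte identical to the source means the transform step can be rewritten and rerun later without re-extracting anything.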

Page 13: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

13


HDFS

With HDFS and the other components of the Hadoop stack, lynda.com will be able to acquire and store large amounts of data quickly and accurately.

Page 14: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

14


Consumable Data

This is data that has been transformed and can be consumed by systems outside of Hadoop.

Given our lack of expertise in Java, we will probably rely on transforming data at ingestion, that is, use an ETL rather than an ELT strategy.

Page 15: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

15


HBase

This interface to Hadoop is tightly integrated with HDFS. Hive and Pig do not have this tight integration.

Page 16: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

16


Hive/Pig

Hive and Pig are SQL and scripting interfaces into Hadoop. Both of these interfaces sit outside of Hadoop.

Page 17: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

17


RDBMS/Flat Files

Hadoop data will be “pushed” and/or “pulled” into RDBMSs or flat files for consumption outside of the Hadoop stack.

Page 18: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

18


Services and APIs

APIs will be available for the consumption of data. These APIs will make data available from Hadoop and RDBMSs.

Page 19: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

19


Security

Authentication and access to the HDFS data will be handled with Kerberos.

Note: This security will not be comparable to that of an RDBMS.

Page 20: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

20


HCatalog

HCatalog abstracts data locations and standardizes data types across Pig, Hive, and MapReduce. It is a metadata tool that is part of the Hadoop ecosystem.

Page 21: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

21


MapReduce

With regard to Hadoop and manipulating data in HDFS, this is “lower level” programming. It will be a while before we venture into this area of expertise. It is all written in Java and requires a strong understanding of the Hadoop Distributed File System (HDFS).
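
For readers new to the model, the map, shuffle, and reduce phases can be imitated in a few lines of plain Python. This only sketches the programming model on an in-memory list; real Hadoop MapReduce jobs run distributed over data stored in HDFS, and the two-field log format here is invented.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (page, 1) for the page requested in one log line."""
    _user, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts for one key."""
    return key, sum(values)


log_lines = ["u1 /home", "u2 /courses", "u1 /courses"]
pairs = [pair for line in log_lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```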

Page 22: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

22


Oozie: Scheduling

• MapReduce jobs need scheduling.

• Put MapReduce jobs somewhere for consumption. This could be:
  - In Hadoop itself
  - Oozie (workflow organizer)
  - Python or cron scripts

• Data output of scheduled jobs:
  - Send emails for reports
  - Where the data will be put
  - In what format it will be put, such as a SQL table or a file
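
Whichever scheduler is chosen, the workflow idea is the same: jobs run in dependency order and each job declares where its output goes. Below is a minimal sketch using Python's standard `graphlib` module (Python 3.9+); the job names and output targets are made up, and this is not Oozie's workflow format.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: name -> (jobs it depends on, where its output goes).
JOBS = {
    "ingest_logs": (set(), "hdfs:/staging/logs"),
    "sessionize": ({"ingest_logs"}, "hdfs:/consumable/sessions"),
    "daily_report": ({"sessionize"}, "email:reports"),
    "export_sql": ({"sessionize"}, "sql:dbo.sessions"),
}


def run_order(jobs):
    """Return job names in an order that respects dependencies."""
    graph = {name: deps for name, (deps, _output) in jobs.items()}
    return list(TopologicalSorter(graph).static_order())


order = run_order(JOBS)
for name in order:
    print(f"run {name} -> {JOBS[name][1]}")
```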

Page 23: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

23


Sqoop

Sqoop is an Apache project designed to transfer (“sqoop”) data between Hadoop and relational databases.

Data is “sqooped up” and put into SQL Server or dumped into a file.

Remember: “The tyranny of ‘OR’ and the inclusiveness of ‘AND’.”

We are not going to use SQL Server OR Hadoop. We will use SQL Server AND Hadoop. Facebook has to use both, and when it comes to this technology we are not better than Facebook.
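
As an illustration of the export direction, the helper below assembles a `sqoop export` command line in Python. The flags shown (`--connect`, `--username`, `--table`, `--export-dir`) are standard Sqoop options; the JDBC URL, user, table, and HDFS path are placeholders, and the command is only printed, not executed.

```python
import shlex


def sqoop_export_command(jdbc_url, username, table, export_dir):
    """Build the argument list for exporting an HDFS directory to a table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
    ]


# Placeholder connection details, for illustration only.
cmd = sqoop_export_command(
    "jdbc:sqlserver://dbhost.example.com:1433;databaseName=analytics",
    "etl_user",
    "daily_sessions",
    "/consumable/daily_sessions",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps the semicolon in the JDBC URL from being interpreted by a shell if the list is later passed to `subprocess.run`.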

Page 24: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

24


Flume

Flume is a part of the Hadoop ecosystem used to collect data and/or data files from multiple locations and load them into HDFS.

Page 25: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

25


Nagios, Ganglia, Ambari, Cloudera Manager

Ganglia, Nagios, Ambari, and Cloudera Manager can be used to monitor MapReduce operations. They help ensure that jobs run on time and that alerts are sent when jobs run too long. These tools will also assist in performance monitoring and optimization.
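
The “running too long” alert can be sketched independently of any particular tool. The threshold, job records, and function name below are hypothetical; Nagios, Ganglia, Ambari, and Cloudera Manager each have their own way of configuring this.

```python
from datetime import datetime, timedelta


def overdue_jobs(jobs, now, max_runtime):
    """Return names of jobs still running past the allowed runtime."""
    return [
        job["name"]
        for job in jobs
        if job["finished"] is None and now - job["started"] > max_runtime
    ]


now = datetime(2014, 6, 1, 12, 0)
jobs = [
    {"name": "sessionize", "started": datetime(2014, 6, 1, 9, 0), "finished": None},
    {"name": "daily_report", "started": datetime(2014, 6, 1, 11, 30), "finished": None},
    {"name": "ingest_logs", "started": datetime(2014, 6, 1, 8, 0),
     "finished": datetime(2014, 6, 1, 8, 20)},
]
overdue = overdue_jobs(jobs, now, timedelta(hours=2))
for name in overdue:
    print(f"ALERT: {name} has been running for more than 2 hours")
```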

Page 26: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

26


Services and API Access to Hive/Pig

Page 27: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

27


Hue

Hue aggregates the most common Hadoop components (a file browser for HDFS, a job browser for MapReduce and YARN, HBase, Hive, and Pig) into a single interface.

Page 28: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

28


Avro

Avro uses JSON for defining data types and protocols, and serializes data in a compact binary format. It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
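
To make the “JSON schema, compact binary data” point concrete, the sketch below declares a small Avro record schema as JSON and hand-encodes one record following the Avro binary encoding rules (zig-zag variable-length integers for `long`, length-prefixed UTF-8 for `string`, fields written in schema order). The `UserSession` schema is invented for the example, and in practice an Avro library would do the encoding.

```python
import json

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "UserSession",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "page", "type": "string"}
  ]
}
""")


def encode_long(n):
    """Avro long: zig-zag encode, then emit 7-bit groups, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(record):
    """Write field values in schema order, with no field names or tags."""
    encoders = {"long": encode_long, "string": encode_string}
    return b"".join(
        encoders[field["type"]](record[field["name"]])
        for field in SCHEMA["fields"]
    )


payload = encode_record({"user_id": 64, "page": "/home"})
print(payload.hex())
```

The encoded record is eight bytes long because field names live only in the schema, not in each record.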

Page 29: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

29


Business Intelligence

A B.I. strategy will need to be developed and enabled.

This will be critical because one of the most often cited benefits of Hadoop is discovery. We will need to enable discovery in this paradigm.

Page 30: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

30


Governance

The fundamentals of data governance will need to be established. Core values like “Master Data” will need to be established, and the “Big Data” platform will need to be beholden to and integrated with these data governance values. Issues like data life cycle and entitlements to PII data will be part of the Big Data implementation.

Page 31: Big data and lynda_Subash_DSouza.com

Hadoop Architecture Stack

31

Hadoop Anthology

Process: Implementation

• Ingest: Flume

• Describe: HCatalog

• Compute: MapReduce

• Persist: HDFS/HBase

• Monitor: Nagios

• Propagate: Sqoop

• Develop: Hive/Pig/Avro

Page 32: Big data and lynda_Subash_DSouza.com

32

Thank you!!

• @sawjd22

• [email protected]

• www.linkedin.com/in/sawjd/

• Q&A!!