Big data and lynda_Subash_DSouza.com

1

Big Data and Lynda.com

Subash DSouza

2

Who is Lynda.com?

• lynda.com is an online learning company that helps anyone learn software, design, and business skills to achieve their personal and professional goals.

• Founded in 1995 by Lynda Weinman and Bruce Heavin.

• Went online in 2002.

• As of January 2014, lynda.com offers more than 2,400 courses in business, design, web, programming, photography, video, 3D and animation, audio, education, and CAD.

3

Why Big Data?

• With the growth of users on Lynda.com, data has increased rapidly.

• With the amount of data we collect, there has been a drive to derive more insights from the data.

• We collect data from multiple sources such as Google Analytics, internal logs and user sessions.

4

Current Use cases of Big Data at Lynda.com

• We use MongoDB as a Learning Record Store, to host user configuration for notifications, and as a data source for the localized text on the main web site.

• A Learning Record Store (LRS) is a data store that serves as a repository for the learning records needed to use the Tin Can API.
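To make the idea of a learning record concrete, here is a minimal sketch of a Tin Can (xAPI) statement built as a plain Python dictionary. A statement is organized around an actor, a verb, and an object. The learner, course URL, and course title are made-up placeholders, and nothing here reflects lynda.com's actual record layout or its MongoDB collections.

```python
import json
import uuid
from datetime import datetime, timezone


def build_statement(email, name, verb, course_url, course_title):
    """Build a minimal Tin Can (xAPI) statement: actor, verb, object."""
    return {
        "id": str(uuid.uuid4()),
        "actor": {"objectType": "Agent", "name": name, "mbox": f"mailto:{email}"},
        "verb": {
            "id": f"http://adlnet.gov/expapi/verbs/{verb}",
            "display": {"en-US": verb},
        },
        "object": {
            "objectType": "Activity",
            "id": course_url,  # hypothetical course identifier
            "definition": {"name": {"en-US": course_title}},
        },
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder learner and course, for illustration only.
statement = build_statement(
    "learner@example.com", "Example Learner", "completed",
    "http://example.com/courses/hadoop-fundamentals", "Hadoop Fundamentals",
)
print(json.dumps(statement, indent=2))
```

A document of this shape could be stored as-is in a document database, which is one reason a store like MongoDB is a natural fit for an LRS.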

5

Current Use cases of Big Data at Lynda.com

• Recommendation algorithms using Myrrix. Data is fed once a day to our recommendation servers, which run on Myrrix.

• Myrrix was “Big Learning” machine learning software built on top of Apache Hadoop and Apache Mahout.

• It was bought by Cloudera last August.

• Succeeded by Oryx, which has tighter integration with CDH.

• We are working on migrating to Oryx.
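As a rough illustration of a once-a-day feed, the sketch below collapses raw viewing events into one row per user and course, with the number of views as the preference strength. The event shape, the `user,course,strength` column order, and the use of a view count as the strength are all assumptions made for this example, not the format our servers actually consume.

```python
import csv
import io
from collections import Counter


def build_daily_feed(events):
    """Collapse (user_id, course_id) view events into user,course,strength rows."""
    strengths = Counter((user, course) for user, course in events)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for (user, course), count in sorted(strengths.items()):
        writer.writerow([user, course, count])
    return out.getvalue()


# Hypothetical events: (user_id, course_id) pairs seen during one day.
events = [(1, 101), (1, 101), (2, 101), (2, 205)]
feed = build_daily_feed(events)
print(feed)
```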

6

The future of Big Data at Lynda.com

• Use the data we collect to gain better insights into our business decision making

• Combine Google Analytics with our own internal logs and User Sessions to understand our users better. This will allow us to create customized experiences for our users.

• A better user experience will keep users on the site longer and should also improve the turnover rate.

7

How are we achieving that?

• Building out Hadoop clusters on YARN

• Using HBase for some of our real-time use cases

• Testing out Spark and Storm

• Still in early stages

8

Agenda

• Introduction of Hadoop to lynda.com

• Big Data Overview

Hadoop Architecture Stack

9

Hadoop Stack and Data Access

[Diagram: Extract, Load, Transform pipeline. Labels shown:]

• Data sources: Google Analytics, lynda logs, user sessions
• Data Extraction: Flume
• Data Movement: MapReduce
• HDFS: Staging, Consumable Data
• HBase
• Hive/Pig
• Propagate: RDBMS/Files, Services and APIs
• API Access
• Business Intelligence
• Security: Kerberos
• Meta Data: HCatalog
• Job Scheduling: Oozie
• Extract to RDBMS: Sqoop
• Monitoring Tools: Nagios, Ganglia, Ambari
• Direct Access to Raw Data: Hue
• Data Serialization: Avro
• Data Chronology
• Governance


10



Data Collecting/Acquisition

Start with archiving user sessions.

Data acquisition:

• Google Analytics

• Lynda logs


11




12



Staging

Data Processing (ELT): Put the data in one place so that it can be transformed efficiently by another process. This is the “Extract” and “Load” part of the ELT process.
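The “Extract” and “Load” steps can be sketched as copying raw files, unchanged, into a date-partitioned staging area so that a later process can transform them. In this sketch a local temporary directory stands in for HDFS, and the `staging/<source>/dt=<date>/` layout, the source name, and the sample log line are all assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def stage_file(raw_file, staging_root, source, day):
    """Copy a raw file, unmodified, into staging/<source>/dt=<day>/."""
    target_dir = Path(staging_root) / source / f"dt={day.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(raw_file, target_dir))


# A local temp directory stands in for HDFS in this sketch.
work = Path(tempfile.mkdtemp())
raw = work / "sessions.log"
raw.write_text("user=1 course=101\n")
staged = stage_file(raw, work / "staging", "user_sessions", date(2014, 1, 15))
print(staged.relative_to(work))
```

No transformation happens at this point; the file lands exactly as it was extracted, which is what distinguishes ELT from ETL.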


13



HDFS

With HDFS and the other components of the Hadoop stack, lynda.com will be able to acquire and store large amounts of data quickly and accurately.


14



Consumable Data

This is data that has been transformed and can be consumed by systems outside of Hadoop.

Given our lack of expertise in Java, we will probably rely on transforming data during ingestion, that is, use an ETL rather than an ELT strategy.


15



HBase

This interface to Hadoop is tightly integrated with HDFS. Hive and Pig do not have this tight integration.


16



Hive/Pig

Hive and Pig are SQL and scripting interfaces into Hadoop. Both of these interfaces sit outside of Hadoop.


17



RDBMS/Flat Files

Hadoop data will be “pushed” and/or “pulled” into RDBMSs or flat files for consumption outside of the Hadoop stack.


18



Services and APIs

APIs will be available for the consumption of data. These APIs will make data available from Hadoop and RDBMSs.


19



Security

Authentication and access to the HDFS data will be handled with Kerberos.

Note: This security will not be comparable to that of an RDBMS.


20



HCatalog

HCatalog abstracts data locations and standardizes data types across Pig, Hive, and MapReduce. It is a metadata tool that is part of the Hadoop ecosystem.


21



MapReduce

With regard to Hadoop and manipulating data in HDFS, this is “lower-level” programming. It will be a while before we venture into this area of expertise. It is all written in Java and requires a strong understanding of the Hadoop Distributed File System (HDFS).
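The programming model itself is small, even if production jobs are written in Java against HDFS. The sketch below imitates the map, shuffle, and reduce phases in plain Python on a handful of in-memory log lines. The `user_id,course_id` line format is a made-up example, and this is only a single-process illustration of the model, not how a Hadoop job is actually written or run.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (course_id, 1) for each log line of the form 'user_id,course_id'."""
    _user_id, course_id = line.strip().split(",")
    yield course_id, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts for one key."""
    return key, sum(values)


# Hypothetical log lines: user_id,course_id
log_lines = ["1,101", "2,101", "2,205", "3,101"]
mapped = [pair for line in log_lines for pair in map_phase(line)]
views = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(views)
```

In Hadoop, the map and reduce functions run in parallel across the cluster and the framework performs the shuffle; Hive and Pig generate jobs of this kind so that they do not have to be written by hand.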


22



Oozie

Scheduling: MapReduce jobs need scheduling. Put MapReduce jobs somewhere for consumption. This could be:

• In Hadoop itself

• Oozie, a workflow organizer

• Python or cron scripts

Data Output: the output of scheduled jobs.

• Send emails for reports

• Where the data will be put

• In what format it will be put, such as a SQL table or a file
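Whichever scheduler is chosen, its core task is to run jobs in dependency order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). The job names are hypothetical, and this is an illustration of dependency ordering in general, not of Oozie's own workflow definition format.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
# Job names are hypothetical.
workflow = {
    "ingest_logs": set(),
    "aggregate_views": {"ingest_logs"},
    "export_to_sql": {"aggregate_views"},
    "email_report": {"export_to_sql"},
}

# static_order() yields every job after all of its prerequisites.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)
```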


23



Sqoop

Sqoop is an Apache project designed to transfer (“sqoop”) data between Hadoop and relational databases.

Data is “sqooped up” and put into SQL Server or dumped into a file.

Remember the tyranny of “OR” and the inclusiveness of “AND”.

We are not going to use SQL Server OR Hadoop. We will use SQL Server AND Hadoop. Facebook has to use both, and when it comes to this technology we are not better than Facebook.
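As a sketch of the Hadoop-to-SQL Server direction, the helper below assembles a `sqoop export` command line without running it. The flags used (`--connect`, `--username`, `--table`, `--export-dir`) are standard Sqoop export arguments; the host, database, table, directory, and user names are placeholders, and a real invocation would also need credentials and the SQL Server JDBC driver on the cluster.

```python
import shlex


def sqoop_export_command(jdbc_url, table, export_dir, username):
    """Build (but do not run) a `sqoop export` command line."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,            # destination table in the RDBMS
        "--export-dir", export_dir,  # source directory in HDFS
    ]


# All connection details below are placeholders.
cmd = sqoop_export_command(
    "jdbc:sqlserver://sqlhost.example.com:1433;databaseName=reports",
    "course_views",
    "/consumable/course_views",
    "etl_user",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can later be handed to a scheduler or to `subprocess.run` without shell quoting problems.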


24



Flume

Flume is a part of the Hadoop ecosystem used to collect data and/or data files from multiple locations and load them into HDFS.


25



Nagios, Ganglia, Ambari, Cloudera Manager

Ganglia, Nagios, Ambari, and Cloudera Manager can be used to monitor MapReduce operations. This will ensure that jobs run on time and that alerts are sent when jobs run too long. These tools will also assist in performance monitoring and optimization.
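The “running too long” alert boils down to comparing each job's elapsed time against a threshold. The sketch below shows that check in plain Python. The job names, start times, and two-hour limit are invented for the example, and none of this uses the actual interfaces of Nagios, Ganglia, Ambari, or Cloudera Manager.

```python
from datetime import datetime, timedelta


def overdue_jobs(running_jobs, now, max_runtime):
    """Return names of jobs that have been running longer than max_runtime."""
    return [name for name, started in running_jobs.items()
            if now - started > max_runtime]


# Hypothetical snapshot of running jobs and their start times.
now = datetime(2014, 1, 15, 6, 0)
running = {
    "aggregate_views": datetime(2014, 1, 15, 5, 30),
    "export_to_sql": datetime(2014, 1, 15, 2, 0),
}
print(overdue_jobs(running, now, timedelta(hours=2)))
```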


26



Services and API Access to Hive/Pig


27



Hue aggregates the most common Hadoop components (e.g., the HDFS file browser, the job browser for MapReduce and YARN, HBase, Hive, and Pig) into a single interface.


28



Avro

Avro uses JSON for defining data types and protocols, and serializes data in a compact binary format. It can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.
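To show what “JSON schema, compact binary data” means in practice, the sketch below defines a small record schema in JSON and hand-encodes one record using Avro's binary rules for `long` (zigzag plus variable-length bytes) and `string` (length prefix plus UTF-8). The `CourseView` record and its fields are hypothetical, and the encoder covers only these two types; real code would use an Avro library rather than this hand-rolled version.

```python
import json

# Avro schemas are written in JSON. This record type is hypothetical.
SCHEMA = json.loads("""
{"type": "record", "name": "CourseView",
 "fields": [{"name": "user", "type": "string"},
            {"name": "seconds_watched", "type": "long"}]}
""")


def encode_long(n):
    """Avro long: zigzag-encode, then write 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(record):
    """Avro record: field values concatenated in schema order, no field names."""
    encoders = {"string": encode_string, "long": encode_long}
    return b"".join(encoders[f["type"]](record[f["name"]]) for f in SCHEMA["fields"])


payload = encode_record({"user": "ann", "seconds_watched": 64})
print(payload.hex())
```

The field names never appear in the payload; the reader recovers them from the schema, which is where the compactness comes from.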


29



Business Intelligence

A B.I. strategy will need to be developed and enabled.

This will be critical because one of the cited “greatest” benefits of Hadoop is discovery. We will need to enable discovery in this paradigm.


30



Governance

The fundamentals of data governance will need to be established. Core values like “Master Data” will need to be established, and the “Big Data” platform will need to adhere to and be integrated with these data governance values. Issues like data life cycle and entitlements to PII data will be part of the Big Data implementation.


31

Hadoop Anthology

Process: Implementation

• Ingest: Flume

• Describe: HCatalog

• Compute: MapReduce

• Persist: HDFS/HBase

• Monitor: Nagios

• Propagate: Sqoop

• Develop: Hive/Pig/Avro

32

Thank you!!

• @sawjd22

• [email protected]

• www.linkedin.com/in/sawjd/

• Q&A!!
