INTRO TO APACHE SPARK BIG DATA FOR THE BUSINESS ANALYST Created by / Gus Cavanaugh @GusCavanaugh
Page 1: Spark For The Business Analyst

INTRO TO APACHE SPARK

BIG DATA FOR THE BUSINESS ANALYST

Created by Gus Cavanaugh / @GusCavanaugh

Page 2: Spark For The Business Analyst

WHY ARE WE HERE?

Business analysts use data to inform business decisions.

Spark is one of many tools that can help you do that.

Page 3: Spark For The Business Analyst

SO LET'S DIVE RIGHT IN

val input = sc.textFile("file:///test.csv")

input.collect().foreach(println)

This code just loads a file and prints each line to the screen

Page 4: Spark For The Business Analyst

BIG CAVEAT

We will be coding

No, there is no other way

Yes, it will be hard

But you can do it

Page 5: Spark For The Business Analyst

HERE'S HOW I KNOW...

Excel formulas are super hard

=VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE)

=SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))

If you learned how to write VLOOKUPs, you can learn to code
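To make the comparison concrete, here is a rough Python sketch of what the two formulas above do. The data, the helper names (`vlookup`, `sum_where`) and the column layout are all made up for illustration; they are assumptions, not anything from the original spreadsheet.

```python
# Hypothetical stand-in for the 'Raw Data' sheet: (key, description, price).
raw_data = [
    ("A-100", "Widget", 19.99),
    ("A-200", "Gadget", 34.50),
]

def vlookup(key, table, col_index):
    """Exact-match lookup, like VLOOKUP(key, table, col_index, FALSE)."""
    for row in table:
        if row[0] == key:
            return row[col_index - 1]  # Excel counts columns from 1
    return None  # Excel would show #N/A here

# Hypothetical sales rows: (make, month, units).
sales = [
    ("Ford", "June", 10),
    ("Ford", "July", 4),
    ("Honda", "June", 7),
    ("Ford", "June", 5),
]

def sum_where(rows, make, month):
    """Like SUMPRODUCT((A="Ford")*(B="June")*(C)): add up C where both tests pass."""
    return sum(units for m, mo, units in rows if m == make and mo == month)

price = vlookup("A-200", raw_data, 3)
ford_june_units = sum_where(sales, "Ford", "June")
print(price, ford_june_units)
```

Neither helper is harder to read than the formula it mirrors, which is the point of the slide.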

Page 6: Spark For The Business Analyst

DISTINCTION: WE ARE NOT ENGINEERS

We are not building production applications

We just want to answer questions with data rather than with speculation

Page 7: Spark For The Business Analyst

WE MAY SHARE TOOLS WITH ENGINEERS, BUT OUR PROCESS IS DIFFERENT

Principally, we emphasize interactive analysis

This means we want the flexibility to change the questions we ask as we work

Page 8: Spark For The Business Analyst

AND THE ABILITY TO STOP OUR ANALYSIS AT ANY POINT

We are not doing analysis for the sake of doing analysis

Good may be the enemy of great, but better is the enemy of done

Page 9: Spark For The Business Analyst

IN BUSINESS LANGUAGE

We want the highest analytic return for our time investment

Page 10: Spark For The Business Analyst

OUR ANALYTIC PROCESS

Don't measure, just cut

Google is your best friend

You don't have to know how to do anything

You just have to be able to find out

Page 11: Spark For The Business Analyst

WHAT IS SPARK?

Spark is an open-source processing framework designed for cluster computing

Page 12: Spark For The Business Analyst

WHY IS IT POPULAR?

Super fast...

Plays well with Hadoop

Native APIs for analyst-friendly languages like Python and R

Page 13: Spark For The Business Analyst

WAIT... I'VE HEARD THIS BEFORE

Sounds like the original promise of Hadoop...

How is Spark different?

Page 14: Spark For The Business Analyst

FAST REVIEW OF HADOOP

Google was indexing the web every day

They wrote some custom software to store and process those documents (web pages)

Hadoop is the open source project modeled on that software

Page 15: Spark For The Business Analyst

HADOOP CONSISTS OF TWO MAIN PIECES

The Hadoop Distributed File System: HDFS

And a processing framework called MapReduce

HDFS enabled fault-tolerant storage on commodity servers at scale

And MapReduce allowed you to process what you stored in parallel
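The MapReduce idea can be sketched in a few lines of plain Python. This is only a single-process toy under stated assumptions: the three input lines are invented, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative labels, not Hadoop APIs. On a real cluster the map and reduce steps run on many machines at once against blocks stored in HDFS.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one line of text into (word, 1) pairs."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key (word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Reduce: collapse the grouped values for one key into a single total."""
    return word, sum(counts)

# Hypothetical input; imagine each line stored on a different server.
documents = ["spark is fast", "hadoop is batch", "spark is friendly"]

mapped = [pair for line in documents for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

Because each line is mapped independently and each word is reduced independently, the work splits cleanly across machines, which is what makes the parallelism possible.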

Page 16: Spark For The Business Analyst

THIS IS A BIG DEAL...

Companies storing ever increasing amounts of data could:

Do so much cheaper

With more flexibility

Page 17: Spark For The Business Analyst

HADOOP CAME WITH A COST

Parallel processing, but not necessarily fast (batch processing)

Difficult to program

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // ...

Page 18: Spark For The Business Analyst

NOT INTERACTIVE

Writing MapReduce jobs in Java is an inefficient way for business analysts to process data in parallel

We get the parallel processing speed, but the development time is long (or the time spent asking a dev to write it...)

Page 19: Spark For The Business Analyst

BUT WHAT ABOUT PIG..?

Pig is a sort of scripting language for Hadoop with friendly syntax that lets you read from any data source

A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './wordcount';

While it works well, it's another language to learn and it is only used in Hadoop

Page 20: Spark For The Business Analyst

BUT WHAT ABOUT SQL-ON-HADOOP?

A few options: Hive, Impala, Big SQL

If you have these options, use them

But they all involve substantial ETL and (maybe) additional hardware

In D.C. we know what that means: you get it on next year's contract

Page 21: Spark For The Business Analyst

WHAT IS ETL? AND WHY WOULD WE NEED IT?

Because, unlike in most Hadoop tutorials, the data analysts need to access is not in flat files

For analytics, it is very likely you'll want data from your Hadoop application's database

But what is your Hadoop application's database?

Page 22: Spark For The Business Analyst

HBASE - THE HADOOP DATABASE

One big freakin' table

No joins - row keys are everything

Great for applications, terrible for analysts
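A toy model shows why row keys suit applications and frustrate analysts. The sketch below is plain Python, not the HBase client API; the table contents, the `customer#date` key layout and the function names are all hypothetical.

```python
# Toy stand-in for an HBase table: one big mapping from row key to columns.
orders = {
    "cust001#2015-06-01": {"make": "Ford", "amount": 250},
    "cust001#2015-06-15": {"make": "Honda", "amount": 100},
    "cust002#2015-06-03": {"make": "Ford", "amount": 75},
}

def get_row(table, row_key):
    """Application-style read: jump straight to one row by its key."""
    return table.get(row_key)

def scan_prefix(table, prefix):
    """Range read: rows whose key starts with a prefix, in key order."""
    return [(key, table[key]) for key in sorted(table) if key.startswith(prefix)]

def total_by_make(table, make):
    """Analyst-style question: no key helps, so every row has to be checked."""
    return sum(row["amount"] for row in table.values() if row["make"] == make)

one_order = get_row(orders, "cust002#2015-06-03")
cust001_orders = scan_prefix(orders, "cust001#")
ford_total = total_by_make(orders, "Ford")
print(one_order, len(cust001_orders), ford_total)
```

The first two reads are cheap because the key already encodes the question. The third is the kind of question analysts ask all day, and with no join or secondary lookup to lean on it turns into a scan of the whole table.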

Page 23: Spark For The Business Analyst

WHY AM I TALKING ABOUT HBASE DURING A SPARK PRESENTATION?

Because I want you to know that your data will not be in the format you want

ETL (Extract, Transform, Load) is a real process that engineers will have to spend time on to get your data into a SQL-friendly environment

This will not be an application feature, but an analytics one (so don't be surprised if this gets skipped)

Page 24: Spark For The Business Analyst

MY RAMBLING POINT IS THAT YOU WILL HAVE MESSY DATA

Neither Hadoop, Spark, Tableau, nor anything else will solve that

You still have to rely on the tools you use for data wrangling

Like Python and R

Page 25: Spark For The Business Analyst

TOOL COMPARISON

Tool       Powerful?   Friendly?
Excel      No          Hell yes
Python/R   Meh...      Yes
Hadoop     Yes         Hell no
Spark      Hell yes    Just right

Page 26: Spark For The Business Analyst

IDEAL SCENARIO

I can write the same Python scripts that I use to process data on my local machine

Page 27: Spark For The Business Analyst

SPARK IS OUR BEST ANSWER

You can write Python, and iterative computations are processed in memory, so jobs are easier to write and much faster than MapReduce

Page 28: Spark For The Business Analyst

HOW YOU CAN GET STARTED

Big Data University

Spark on Bluemix