The Briefing Room
Transcript
Page 1: Self-Service Access and Exploration of Big Data

The Briefing Room

Page 2: Self-Service Access and Exploration of Big Data

Twitter Tag: #briefr


Welcome

Host: Eric Kavanagh

[email protected]

Page 3: Self-Service Access and Exploration of Big Data


Mission

•  Reveal the essential characteristics of enterprise software, good and bad
•  Provide a forum for detailed analysis of today’s innovative technologies
•  Give vendors a chance to explain their product to savvy analysts
•  Allow audience members to pose serious questions... and get answers!

Page 4: Self-Service Access and Exploration of Big Data


December: Innovators

January: Big Data

February: Analytics

March: Data in Motion

Page 5: Self-Service Access and Exploration of Big Data


Innovators

•  Charles Babbage conceived the Analytical Engine in 1834.
•  Automation and ease of use have driven innovation in computing ever since.
•  The Cloud and Big Data are raising the bar.

Page 6: Self-Service Access and Exploration of Big Data


Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group.

[email protected]

Page 7: Self-Service Access and Exploration of Big Data


Cirro

•  Cirro provides a single method to access any type of data, on any platform, in any environment.
•  Its product suite consists of Cirro Data Hub, Analyst for Excel and Multi Store, all designed to remove complexity from Big Data analytics.
•  Cirro’s products are cloud-based and can run in public, private and on-premises environments.

Page 8: Self-Service Access and Exploration of Big Data


Mark Theissen

Mark is CEO at Cirro. He is a respected analytics and data warehousing expert with more than 22 years in the industry. Most recently, Mark was the worldwide data warehousing technical lead at Microsoft following the acquisition of DATAllegro. At DATAllegro, Mark was the COO and a member of the board of directors. Prior to joining DATAllegro, Mark was Vice President and Research Lead at META Group (Gartner Group) for Enterprise Analytics Strategies, covering the data warehousing, business intelligence and data integration markets. Before META, Mark was VP of Professional Services at Accruent, where he was responsible for domestic and overseas services and operations. Mark has a BS in Computer Information Systems from Chapman University and an MBA from the University of California, Irvine.

Page 9: Self-Service Access and Exploration of Big Data

©2012 Cirro Inc. All rights reserved.

Corporate Overview

Bringing Big Data to the Desktop

Page 10: Self-Service Access and Exploration of Big Data


The Big Data Dilemma

Page 11: Self-Service Access and Exploration of Big Data


The Big Data Dilemma

Page 12: Self-Service Access and Exploration of Big Data


The Big Data Dilemma

Page 13: Self-Service Access and Exploration of Big Data


Accessing Big Data

Page 14: Self-Service Access and Exploration of Big Data


Accessing Big Data

Incumbent  Approach   Hadoop  Approach  

Page 15: Self-Service Access and Exploration of Big Data


Accessing Big Data

Incumbent  Approach   Hadoop  Approach  

Page 16: Self-Service Access and Exploration of Big Data


Accessing Big Data

Incumbent  Approach   Hadoop  Approach  

Page 17: Self-Service Access and Exploration of Big Data


What the Market Needs

An enterprise data hub to access any type of data, on any platform, in any environment

Page 18: Self-Service Access and Exploration of Big Data


The Enterprise Data Hub

Page 19: Self-Service Access and Exploration of Big Data


Simplifying the Access to Your Data

Conventional Approach: People manage the access to data

•  Structured - Unstructured Mashups
•  SQL (multiple versions)
•  Java
•  Sqoop
•  MapReduce
•  Hive
•  Hadoop Install & Config
•  Hive – Sqoop Install & Config
•  Source Control
•  Database Management

Cirro Approach: Cirro Data Hub manages access to data

•  Cirro Data Hub
•  Access tool

Page 20: Self-Service Access and Exploration of Big Data


Architecture Overview

Cirro Data Hub
•  Cost-based federation optimizer
•  Smart caching
•  Dynamic optimization
•  Normalized cost estimates
•  Metadata for unstructured sources

Cirro Function Library
•  Library of functions
•  Logic to build complex specific formulas

Cirro Analyst
•  Excel plug-in that allows analysts to explore & process Big Data and traditional data

Cirro Multi Store (optional)
•  Pre-built structured/unstructured data store
•  Used for holding data or additional workspace
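How a cost-based federation optimizer with normalized cost estimates might compare plans can be illustrated with a toy model. Nothing here reflects Cirro's actual optimizer: the cost formula, the per-row weights, the row counts and all names (`Source`, `plan_cost`, `choose_plan`) are hypothetical, chosen only to show the idea of putting estimates from different engines on one scale before picking where a join should run.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    rows: int                # estimated rows the query touches
    cost_per_row: float      # hypothetical normalized cost of scanning one row
    transfer_per_row: float  # hypothetical cost of shipping one row elsewhere

def plan_cost(execute_at: Source, other: Source) -> float:
    """Cost of joining two sources when the join runs at `execute_at`."""
    scan = execute_at.rows * execute_at.cost_per_row + other.rows * other.cost_per_row
    ship = other.rows * other.transfer_per_row  # the other side's rows must be moved
    return scan + ship

def choose_plan(a: Source, b: Source) -> Source:
    """Pick the execution site with the lower normalized cost."""
    return min((a, b), key=lambda site: plan_cost(site, b if site is a else a))

# Made-up estimates: a large Hadoop data set joined to a small MySQL table.
hadoop = Source("hadoop", rows=10_000_000, cost_per_row=0.5, transfer_per_row=2.0)
mysql = Source("mysql", rows=1_000, cost_per_row=1.0, transfer_per_row=2.0)

best = choose_plan(hadoop, mysql)
print(best.name)
```

With these made-up numbers the cheaper plan ships the small keyword table to the large data set rather than the other way round, which is the kind of decision a federation optimizer is meant to automate.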

 

Page 21: Self-Service Access and Exploration of Big Data


Typical Deployment

IT Staff
•  Programmers
•  Developers
•  DBAs
•  Extend, add proprietary functions to the Cirro Function Library (CFL)

Excel Analyst Users
•  Design views
•  Minimal IT support
•  Publish views
•  Data exploration
•  Analysis

Data Consumers (Tableau, Business Objects, other BI tools)
•  Access Cirro Data Hub (CDH) views via ODBC & JDBC across all data types

RDBMS: Oracle, Teradata, MySQL, SQL, Vertica

HQL

NoSQL: Splunk, Cassandra, MongoDB

MapReduce  

Cirro Data Hub
•  Cirro Function Library
•  Proprietary MapReduce
•  Custom Views

MapReduce

Hadoop Distributed File System

Hive

Page 22: Self-Service Access and Exploration of Big Data


Sample Use Case

Summarize the number of tweets per hour that contain certain keywords from a raw Twitter feed.

Requirements:
•  Use raw Twitter data files in Hadoop
•  Keywords stored in SQL table for easy manipulation
•  Results into Tableau or Excel for visualization

Page 23: Self-Service Access and Exploration of Big Data


Too Many Skills, Coding, Processing

Write mapper/reducer in Java using development tool:
•  parse Twitter text - convert to lower case - parse words - exclude common words - group words by hour

Import Java classes into Hadoop

Execute command line Hadoop using CLI
•  bin/hadoop jar TwitterParse /home/cloudera/WordCount.jar /usr/tweet/input /usr/local/output -libjars

Move result into Hive using JDBC SQL tool
•  create table output1 (text STRING, created_at STRING, count BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
•  LOAD DATA INPATH '/usr/data/1-88f1-864e22e77801/part*' OVERWRITE INTO TABLE output1

Move SQL table with keywords to Hive through Sqoop using CLI
•  export --connect jdbc:mysql://10.17.185.44/mytable --password mypasswd --username root --table words --export-dir '/home/cloudera/inputfile'
•  create table mytable (word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
•  LOAD DATA INPATH '/home/cloudera/inputfile/part*' OVERWRITE INTO TABLE mytable

Run Hive query using JDBC SQL tool
•  select a.text, a.created_at, a.count from output1 a join mytable b on (a.text = b.word)

Import results into Excel using Excel
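The mapper/reducer logic named in the first step can be sketched in plain Python, without Hadoop. This is only an illustration of the map and reduce phases the slide lists (lower-case, split into words, drop common words, group by hour). The sample tweets, the stop-word list, the timestamp format and the function names `map_tweet` and `reduce_counts` are assumptions, not taken from the presentation.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical stop-word list; the slide only says "exclude common words".
COMMON_WORDS = {"the", "a", "is", "and", "to", "of"}

def map_tweet(text, created_at):
    """Map phase: emit ((word, hour), 1) for every non-common word in a tweet."""
    # Assumed timestamp format; a real Twitter feed uses a different layout.
    hour = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")
    for word in text.lower().split():
        word = word.strip(".,!?#@:;\"'")
        if word and word not in COMMON_WORDS:
            yield (word, hour), 1

def reduce_counts(pairs):
    """Reduce phase: sum the counts for each (word, hour) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Made-up sample input standing in for the raw Twitter files in Hadoop.
tweets = [
    ("Hadoop is the new reservoir", "2012-12-04 10:05:00"),
    ("Big Data and Hadoop", "2012-12-04 10:40:00"),
    ("Hadoop to the desktop", "2012-12-04 11:15:00"),
]

mapped = [pair for text, ts in tweets for pair in map_tweet(text, ts)]
counts = reduce_counts(mapped)
print(counts[("hadoop", "2012-12-04 10:00")])
```

In a real Hadoop job the framework performs the shuffle between the two phases; here the list `mapped` plays that role.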

Page 24: Self-Service Access and Exploration of Big Data


Too Many Skills, Coding, Processing


B1=TwitterParse("/user/twitter/sample","text,created_at")
B2=ToLower(B1,"text")
B3=WordSeparate(B2,"text")
B4=Exclude(B3,"text")
B5=GroupBy(B4,"text,created_at")
B6=Cirro_Match(B5,"text","MYSQL.KeyWords","word",C9)

Results displayed at cell C9
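To make the data flow through the six cells concrete, here is a rough Python analogue of the chain. It is not Cirro's implementation or API: the function bodies, the in-memory rows, the stop-word list and the keyword list (standing in for the MYSQL.KeyWords table) are all assumptions made for illustration.

```python
from collections import Counter

# Assumed stand-ins: the slides show neither the data nor the stop-word list.
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "in"}
KEYWORDS = ["hadoop", "cirro"]  # plays the role of MYSQL.KeyWords.word

def twitter_parse(rows):   # B1: keep text and created_at, truncated to the hour
    return [{"text": r["text"], "created_at": r["created_at"][:13]} for r in rows]

def to_lower(rows):        # B2: lower-case the text column
    return [{**r, "text": r["text"].lower()} for r in rows]

def word_separate(rows):   # B3: one output row per word
    return [{**r, "text": w} for r in rows for w in r["text"].split()]

def exclude(rows):         # B4: drop common words
    return [r for r in rows if r["text"] not in STOP_WORDS]

def group_by(rows):        # B5: count rows per (word, hour)
    return Counter((r["text"], r["created_at"]) for r in rows)

def match(counts, keywords):  # B6: keep only rows whose word is a keyword
    wanted = set(keywords)
    return {key: n for key, n in counts.items() if key[0] in wanted}

raw = [
    {"text": "Cirro and Hadoop in the cloud", "created_at": "2012-12-04 10:05:00"},
    {"text": "Hadoop is hot", "created_at": "2012-12-04 10:50:00"},
    {"text": "Excel to Hadoop", "created_at": "2012-12-04 11:20:00"},
]

result = match(group_by(exclude(word_separate(to_lower(twitter_parse(raw))))), KEYWORDS)
for (word, hour), n in sorted(result.items()):
    print(word, hour, n)
```

Each function takes the previous step's output as its input, mirroring how each spreadsheet cell refers to the cell above it.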

Page 25: Self-Service Access and Exploration of Big Data


Corporate Overview

Bringing Big Data to the Desktop

Page 26: Self-Service Access and Exploration of Big Data


Analyst: Robin Bloor

Perceptions & Questions

Page 27: Self-Service Access and Exploration of Big Data

The Bloor Group

Big Data, Hot Data?

Page 28: Self-Service Access and Exploration of Big Data


Hadoop & The Big Data Dynamic

Hadoop has become the de facto reservoir for data

Page 29: Self-Service Access and Exploration of Big Data


Hadoop & The Big Data Dynamic

– We witnessed something like this a long time ago, with ISAM files - before the advent of RDBMS

– The difference this time is that Hadoop has an ecosystem and it is growing

–  Big Data (usually caught first by Hadoop) is mostly new data and mostly event data

– Hadoop is not (yet) a performance engine. It is an all-purpose capability

–  It is delivering business benefits in a big way: it is hot….

Page 30: Self-Service Access and Exploration of Big Data


BI Categories

HINDSIGHT: Regular reporting/operational BI, Excel

OVERSIGHT: Dashboards, OLAP, BPM, Excel

INSIGHT: Data mining, statistical analysis (trends and relationships)

FORESIGHT: Predictive analytics

Page 31: Self-Service Access and Exploration of Big Data


The New BI Universe (?)

Page 32: Self-Service Access and Exploration of Big Data


Data Sources

Hadoop and Hadoop ++

Standard SQL

NoSQL

Graph DBMS, XML DBMS, Flat files

Metadata Hub?

Page 33: Self-Service Access and Exploration of Big Data


Problems Of The Data Layer

Hadoop is capable of ETL and often used for ETL, but that usually involves coding of a kind

A connectivity architecture is needed

IT REQUIRES SIMPLE CONNECTORS

Point to point connectivity usually was, is and may always be a bad idea

BI tools, which had good-enough interfaces to RDBMS, don’t link to Hadoop directly, and probably shouldn’t

The data layer is more complicated than it was and its complexity is increasing

Hadoop is multi-role and hence can spawn multiple instances
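The case against point-to-point connectivity comes down to simple counting. The sketch below compares the number of connectors needed when every BI tool links directly to every data source with the number needed when both sides connect through a hub. The tool and source names come from the earlier deployment slide, but the selection and the function names are illustrative assumptions.

```python
def point_to_point(tools: int, sources: int) -> int:
    """Every tool needs its own connector to every source."""
    return tools * sources

def via_hub(tools: int, sources: int) -> int:
    """Each tool and each source needs one connector to the hub."""
    return tools + sources

tools = ["Excel", "Tableau", "Business Objects"]
sources = ["Oracle", "Teradata", "MySQL", "Vertica", "Hadoop", "Cassandra", "MongoDB"]

direct_links = point_to_point(len(tools), len(sources))
hub_links = via_hub(len(tools), len(sources))
print(direct_links, hub_links)  # 21 10
```

Adding an eighth source costs three new connectors in the point-to-point design and one in the hub design, and the gap widens as the data layer grows.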

Page 34: Self-Service Access and Exploration of Big Data


•  How would one use the Cirro Multi Store?
•  Which companies/products do you regard as competitors (either directly or close competitors)?
•  How does a Cirro implementation proceed, i.e., where do you start, what are the medium term goals, what do you replace?
•  Conceptually a hub for the data layer is attractive. But how well does it scale out?

Page 35: Self-Service Access and Exploration of Big Data


•  Can the hub be physically distributed, i.e., one logical instance with multiple physical instances?
•  How does your proprietary MapReduce differ from Hadoop MapReduce?
•  Is there any aspect of BI that you don’t or can’t cater for (CEP, Data governance, MDM, etc.)?

Page 36: Self-Service Access and Exploration of Big Data


Page 37: Self-Service Access and Exploration of Big Data


Upcoming Topics

January: Big Data

February: Analytics

March: Data in Motion

2013 Editorial Calendar www.insideanalysis.com

Page 38: Self-Service Access and Exploration of Big Data


Thank You for Your Attention