Understanding the Elements of Big Data: More than a Hadoop Distribution

White Paper

Prepared by: Martin Hall, Founder, Karmasphere
May 2011


Table of Contents

Executive Summary
    The Elements of Big Data
    Big Data Challenges
    Big Data Ecosystem
    Who Should Read This White Paper
Situation Analysis and Industry Trends
    Who Employs Big Data Technology and Techniques?
    Big Data Macro Trends
    Hadoop Adoption
The Elements of Big Data
    Architecture
    Data Management
    Data Analysis
    Data Use
The Players
    Open Source Projects, Developers and Communities
    Big Data Developers, Analysts and Other End-Users
    Commercial Suppliers
Conclusion
    Choices
    Short Path to Big Data Insight
    Learn More
Glossary


Executive Summary

It is perhaps no coincidence that the Hadoop mascot is an elephant. Big Data can seem like the proverbial pachyderm as described by blindfolded observers. The definition of “Big Data” varies greatly depending upon which part of the “animal” you touch, and where your interests lie.

The “big” in Big Data refers to unprecedented quantities of information – terabytes, petabytes and more of new and legacy data generated by today’s fast-moving businesses and technology. In many instances, data collected over the course of days or weeks exceeds the entire corpus of legacy data in a given domain – examples abound in retail, social media and financial services, as well as in scientific disciplines like genetics, astronomy and climate science. The data deluge is even challenging the physical logistics of storage.

It is very sad that today there is so little truly useless information. – Oscar Wilde, 1894

A distinguishing feature of Big Data is a mixture of traditional structured data together with massive amounts of unstructured information. The data can come from legacy databases and data warehouses, from web server logs of ecommerce companies and other high-traffic web sites, from M2M (Machine-to-Machine) data traffic and sensor nets.

This white paper outlines the structure of Big Data solutions based on Hadoop and explores the particulars of the elements that comprise it.

The Elements of Big Data

At the highest level, Big Data presents three top-level elements:

• Data Management – data storage infrastructure, and resources to manipulate it

• Data Analysis – technologies and tools to analyze the data and glean insight from it

• Data Use – putting Big Data insights to work in Business Intelligence and end-user applications

Underlying and pervading these high-level categories are the data (legacy and new, structured and unstructured) and the IT infrastructure that supports managing and operating upon it.

Figure 1 – Key Elements of Big Data

Source: Karmasphere


Big Data Challenges

Besides the obvious difficulty of storing and parsing terabytes and exabytes of mostly unstructured information, Big Data itself – the platforms and tools – presents developers and analysts with important challenges:

• Despite fast-growing deployment, Hadoop and other Big Data technologies are still time-consuming to set up, deploy and use

• Building and running Hadoop jobs and queries is non-trivial for developers and analysts. They need “deep” understanding of Hadoop particulars – cluster size and structure, job performance, etc.

• Analyzing and iterating queries and results with Hadoop does not leverage existing skills and tools for Business Intelligence

Many companies and open source projects are being launched to ease entry into Big Data and to ensure higher success rates of Hadoop-based data mining. To understand the impact and value-add for these technologies and products, it is important to comprehend the audiences they address.

Big Data Ecosystem

In each element of Big Data (Figure 1), there are multiple participants with complex relationships among them. Under Data Management there are suppliers of Hadoop-based solutions and other MapReduce technology suppliers with both Cloud and data center solutions. There are offerings in Big Data Analytics that address specific development and analysis requirements, complementing one another and addressing multiple phases in the Big Data application life cycle. And while most Big Data applications reflect and support the operations of particular end-users, companies and products, there are others that cross industry and corporate boundaries.

In assimilating the particulars of the ecosystem, the players and the layers and the niches within it, you should always remember that:

• Not everyone who works with Hadoop is in competition

• Not everyone in the ecosystem is a Hadoop distribution vendor

• While building upon open source technologies like Hadoop, Hive and Java, the value in Big Data offerings encompasses an increasingly rich mix of services and commercial software that goes beyond that open source core

Who Should Read This White Paper

This White Paper provides a pragmatic vision and realistic overview of the elements that comprise Big Data. Its intended audience comprises both business people new to Big Data and technologists looking for perspective on this emerging industry.

In particular, this white paper speaks to:

• Data Scientists

• Big Data applications developers

• Big Data Analysts and the IT staff that support them


Situation Analysis and Industry Trends

Big Data is not just defined by the sheer volume of information, but also by the trends in the growth of that data and how the IT industry and its customers are meeting the Big Data challenge.

Who Employs Big Data Technology and Techniques?

Not every massive data store or data-intensive segment is ready to embrace Big Data. However, numerous industries and segments stand out as leading deployers of Big Data platforms and analytics (Figure 2).

Big Data Macro Trends

Cross-industry and IT industry-wide trends show data creation and consumption overwhelming conventional (legacy) approaches to data management, demanding new approaches:

• Growth in all types of data collection is estimated at 60% CAGR, and the $100B information management industry is growing at 10% CAGR [1]

• Information generation is outstripping growth in storage capacity by a factor of two, and the gap continues to grow [2] (however, old data never dies – retention of historical data is on the upswing as well)

• Sources of Big Data are becoming more varied, e.g., sensor nets and mobile devices: there are today 4.5B mobile phone subscribers globally

• There are almost 2B regular Internet users globally, and total Internet data traffic will top 667 exabytes by 2013 [3]

• Data marketplaces (the places you go to get the data you need [4]) are growing – third-party data availability is on the rise, with the estimated worldwide market valued at $100B [5]

• Hadoop is increasingly the preferred Big Data Management platform for applications and analytics

[1] IDC
[2] Ibid.
[3] Cisco
[4] Stratus Security
[5] BuzzData

Figure 2 – Industries Deploying Hadoop

Source: Karmasphere


Hadoop Adoption

While a number of technologies fall under the Big Data label, Hadoop is the Big Data mascot.

• Hadoop adoption impetus is greatest when projects combine “Big Analytics” (fast, comprehensive analysis of complex data) and massive, unstructured data sets (Figure 3)

• Hadoop forms the foundation of infrastructure at leading social media companies Facebook, LinkedIn and Twitter

• Hadoop is the fastest growing Big Data technology, with 26% of organizations using it today in data centers and in the Cloud, and an additional 45% seriously considering its deployment (Figure 4)

• Hadoop downloads increased 300% from 2009 to 2010 [6]

• Google searches for the term “Hadoop” outstrip all other related queries – in fact, “Hadoop” searches outnumber even those for “Big Data” by a factor of four [7]

• Hadoop-related hiring (job descriptions) rose 7,074% between Q3 2009 and Q1 2011 [8]

• Sold-out attendance at Hadoop and Big Data conferences such as Hadoop Summit, Hadoop World, Strata, Data Scientist Summit

[6] DBTA Survey, Q1 2011
[7] 451 Group, 2010
[8] Hive, HBase, Pig, Hadoop Job Trends – SimplyHired.com

Figure 3 – Data Set Attributes and Hadoop Adoption “Sweet Spot”

Figure 4 – Hadoop Adoption Trends

Source: Karmasphere and Booz Allen Hamilton

Source: Karmasphere


The Elements of Big Data

Architecture

Figure 5 details the key elements of Big Data and the relationships among them. The following sections of this white paper explore these elements. Later in the document, we’ll also examine some of the companies and communities implementing the “fabric” of Big Data and contributing to it.

Figure 5 – Big Data Architecture

Data Management

Data Management is the logical starting place in exploring Big Data. It is where the data “lives” and where analytics acts upon it.

Legacy Systems

For the last two decades, Data Management has built upon three related primary technologies:

• Relational Data Base Management Systems – to store and manipulate structured data

• MPP Systems – to crunch increasingly massive data sets and scale with data growth

• Data Warehousing – to subset and host data for subsequent reporting

[Figure 5 content: structured data (RDBMS, DW, MPP) and unstructured data feed a Hadoop (MapReduce & HDFS) and NoSQL “Big Data” management and storage layer, supported by system tools, ETL and data integration products, and workflow/scheduler products used by administrators. On the “Big Analytics” side, analytics products, developer environments, and BI and visualization tools and apps serve data analysts, developers, business analysts and end users, with operational data flowing back into RDBMS, DW, MPP and non-Hadoop NoSQL stores. Source: Karmasphere]


Limitations in Legacy Systems

While these technologies remain important within Big Data, their role is more circumscribed by several limitations:

• Scalability: as data sets on RDBMSs grow, performance slows, requiring major (not incremental) investments in compute capacity. These investments are today outstripping the budgets of organizations, especially as data grows exponentially.

• Representative Data: With declining ability to process whole data sets, information in Data Warehouses is no longer statistically representative of the data from which it is derived. As such, business intelligence derived from it is less reliable.

• Unstructured Data: RDBMS and Data Warehousing are definitively structured-data technologies. However, data growth is concentrated in unstructured data by a factor of 20:1.

RDBMS, MPP and DW are not going away any time soon. Rather, they are taking on new roles in support of Big Data management, most importantly to process and host the output of paradigms such as MapReduce and to continue to provide input to BI software and to applications.

The Data

The “Data” in Big Data originates from a wide variety of sources and can be organized into two broad categories: structured and unstructured data.

Structured Data

Structured Data by definition already resides in formal data stores, typically in an RDBMS, a Data Warehouse or an MPP system, and accounts for approximately 5% of the total data deluge [9] (the rest is unstructured). It is often categorized as “legacy data” carried forward from before Big Data had currency, but can also be recently derived data stored in pre-Big Data paradigms (RDBMS, DW, MPP, etc.). The “structure” typically refers to formal data groupings into database records with named fields and/or row and column organization, with established associations among the data elements.

While most Big Data discussions see Structured Data as an input, Big Data Management derives Structured Data sets as an output as well (Operational Data).

Unstructured Data

Unstructured Data, by contrast, comprises data collected during other activities and stored in amorphous logs or other files in a file system. Unstructured data can include raw text or binary and contain a rich mix of lexical information and/or numerical values, with or without delimitation, punctuation or metadata.

Figure 6 – Data Sources and Operational Data

[9] The Economist

Source: Karmasphere


Figure 7 – Typical Unstructured Data – Web Server Log Files

Data Sources and Size

To comprehend the extent and challenges of handling Big Data, it is imperative to understand where the data comes from, its scope and scale.

Unstructured Data

• Web server and search engine logs (“data exhaust”)

• Logs from other types of servers (e.g., telecom switches and gateways)

• E-Commerce / Web Commerce records

• Social Media / Gaming messages

• Multimedia – voice, video, images

• Sensor data / M2M communications

Structured Data

• Customer Databases

• Legacy BI / CRM / ERP systems

• Inventory and Supply Chain

Figure 8 – Sources for Structured and Unstructured Data

The “Big” in Big Data is to some degree in the eye of the beholder, but generally refers to data sets in the range of terabytes and beyond, composed of unstructured and structured data. These data sets can emanate from massive short-term activity (e.g., traffic on popular web sites or real-time telemetry from thousands of sensors) or from more modest collection of data over longer time periods (e.g., decade-scale climate data or long-term health studies).

Hadoop and MapReduce

Apache Hadoop and other MapReduce implementations constitute the core of modern Data Management. Hadoop and its underlying distributed file system (HDFS) offer numerous advantages over legacy Data Management, in particular:

[Figure 7 content: a sample of raw Apache web server access log entries – client IP addresses, timestamps, HTTP requests, status codes, byte counts, referrers and user-agent strings. For example:

69.178.92.118 - - [15/Mar/2011:04:29:05 -0400] "GET /images/logos/fst-website-logo-01.png HTTP/1.0" 200 8507 "http://www.linuxpundit.com/about.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15 (.NET CLR 3.5.30729) SearchToolbar/1.2"

67.195.112.226 - - [15/Mar/2011:04:34:27 -0400] "GET /robots.txt HTTP/1.1" 200 442 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"]
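Unstructured log lines like those in Figure 7 can be turned into structured records with a simple pattern match. The sketch below is illustrative only (plain Java with a regular expression, not Karmasphere tooling or a Hadoop job) and assumes the Apache “combined” log format:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration: extracting structured fields from one
// unstructured Apache "combined" format log line.
public class LogLineParser {
    // host ident authuser [timestamp] "request" status bytes "referer" "user-agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    static Map<String, String> parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.find()) {
            return null; // line does not match the combined format
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("host", m.group(1));
        fields.put("timestamp", m.group(4));
        fields.put("request", m.group(5));
        fields.put("status", m.group(6));
        fields.put("bytes", m.group(7));
        fields.put("userAgent", m.group(9));
        return fields;
    }

    public static void main(String[] args) {
        String line = "69.178.92.118 - - [15/Mar/2011:04:29:05 -0400] "
            + "\"GET /robots.txt HTTP/1.0\" 200 442 \"-\" \"Mozilla/5.0\"";
        System.out.println(parse(line));
    }
}
```

At Big Data scale, logic of this kind would run inside the map phase of a Hadoop job rather than over a single line.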


• Scalability – Hadoop and HDFS have been proven to scale up to 2000 working nodes in a data management cluster, and beyond

• Reliability – HDFS is architected to be fault-resilient and self-repairing with minimal or no operator intervention for node failover

• Data Centric – Big Data is almost always larger in size and scope than the software that processes it. Hadoop architecture recognizes this fact and distributes Hadoop jobs to where the data resides instead of vice-versa.

• Cost – Because Hadoop clusters are built from freely distributable open source software running on standard PC-type compute blades, they are remarkably cost-effective, and capacity scales with linear incremental investment

• Innovation – Hadoop is an active open source project with a dynamic developer and user community. Hadoop, like its project parent Apache and also like Linux, benefits from this worldwide network, rapidly advancing in capability and code quality, several steps ahead of competing Data Management paradigms and many times faster than legacy solutions.

Hadoop is an open source project under the umbrella of the Apache Foundation. Later in this document we’ll review commercial suppliers of Hadoop distributions and other MapReduce implementations.

Operational Data

Within Hadoop clusters (or adjacent to them), there exist multiple options for storing and manipulating structured data created from execution of Hadoop jobs. This structured data can represent Big Data outcomes or intermediate stages of complex multi-stage jobs and queries.

As with most other technologies that interoperate with it, Hadoop is fairly agnostic to the choice of non-relational database (hence the term “NoSQL”) and scalable document store. These database technologies include:

• HBase – the standard Hadoop database, an open-source, distributed, versioned, column-oriented store, providing Bigtable-like capabilities over Hadoop. HBase includes base classes for backing Hadoop MapReduce jobs; query predicate push-down; optimizations for real-time queries; a Thrift gateway and a RESTful web service to support XML, Protobuf, and binary data encoding; an extensible JRuby-based (JIRB) shell; and support for the Hadoop metrics subsystem. Like Hadoop, HBase is an Apache project, hosted at http://hbase.apache.org/

• Cassandra – Apache Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model. The Cassandra project lives at http://cassandra.apache.org. A good example of using Cassandra together with Hadoop is the DataStax Brisk platform – learn more at http://www.datastax.com

• Mongo – an open source, scalable, high-performance, schema-free, document-oriented database written in C++. The MongoDB project is hosted at http://www.mongodb.org/. To use Mongo and Hadoop together, check out https://github.com/mongodb/mongo-hadoop

• CouchDB - Apache CouchDB is a document-oriented database supporting queries and indexing in a MapReduce fashion using JavaScript. CouchDB provides APIs that can be accessed via HTTP requests to support web applications. Learn more at http://couchdb.apache.org
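To make the “schema-free, document-oriented” idea concrete, here is a toy sketch in plain Java. It is not the API of any of the stores above, only an illustration of the model: each document is a free-form map of fields, and documents in the same collection need not share a schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of document-oriented, schema-free storage
// (NOT the API of MongoDB, CouchDB or any real store).
public class ToyDocumentStore {
    private final Map<String, Map<String, Object>> collection = new HashMap<>();

    void put(String id, Map<String, Object> document) {
        collection.put(id, document);
    }

    // Find all documents whose named field equals the given value;
    // documents lacking the field simply never match (no schema to violate).
    List<Map<String, Object>> findByField(String field, Object value) {
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> doc : collection.values()) {
            if (value.equals(doc.get(field))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyDocumentStore store = new ToyDocumentStore();
        store.put("u1", Map.of("name", "alice", "country", "US"));
        store.put("u2", Map.of("name", "bob")); // no "country" field: fine
        System.out.println(store.findByField("country", "US").size()); // 1
    }
}
```

Real document stores add persistence, replication, indexing and query languages on top of this basic model; the contrast with fixed-schema RDBMS rows is the point here.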

Data Management Infrastructure

The most salient characteristics of Big Data deal with “What” and “How,” but “Where” can be equally important. While Big Data is mostly “agnostic” or orthogonal to infrastructure, the underlying platforms present implications for cost, scalability and performance.


Physical and Virtual

MapReduce, Hadoop and other Big Data technologies originally evolved as internal projects at companies like Google and Yahoo that needed to scale massively with low incremental cost. They were designed to take advantage of “standard” hardware – primarily Intel Architecture blades – running the FOSS Linux and open application platforms like Java, in local, and later, remote data centers.

Rapidly maturing, Big Data infrastructure proved a perfect candidate for public and private Cloud hosting, and so Big Data users frequently leverage PaaS (Platform as a Service) instead of actual data centers. Leading this trend is Amazon, whose Web Services and Elastic MapReduce (EMR) greatly simplify companies’ first forays into Big Data and also provide for tremendous scalability throughout the lifetime of Big Data projects.

Hosting Trends

The Hadoop project website states “GNU/Linux is supported as a development and production platform” and indeed most Hadoop installations, physical and virtual, build on Linux infrastructure. While most code in Hadoop and related projects can migrate to other UNIX-type platforms (Solaris, etc.), Microsoft Windows hosting is more challenging. Hadoop core code exhibits dependencies primarily on Java, but traditionally needs support from UNIX shells, SSH and other utilities. As such, Windows hosting, predicated upon availability and stability of the Cygwin emulation environment, is not supported as a production environment.

Big Data developers and analysts, however, make extensive use of other development hosts. Data collected by Karmasphere for its Karmasphere Studio Community Edition and professional products shows developer host distribution of 45% from Windows, 34% from Linux and 22% from MacOS.

Data Analysis

Analysis is where companies begin to extract value from Big Data. Distinct from Business Intelligence (see Data Use below), Big Data analysis involves developing applications and using those apps to gain insight into massive data sets.

Development

Big Data developers resemble other enterprise IT software engineers in many aspects: in particular, they

• Use the same programming languages, starting with Java, augmented with higher-level languages like Pig Latin and Hive

• Develop in the same environments, especially the Eclipse and Netbeans IDEs

• Build applications that manipulate data stores, in some cases using SQL

However, today’s Big Data developers diverge from traditional enterprise IT programmers in key aspects of their trade:

• Their audience is more specialized – not average enterprise end users, but data analysts

• The software they create must manipulate orders of magnitude larger data sets, increasingly with seemingly exotic programming constructs like MapReduce

• They rely on batched execution, with unique and complex job execution sequences (most resembling High Performance Computing)

Hadoop Programming

While a tutorial on Hadoop is beyond the scope of this white paper, it is useful to understand the core programming tasks faced by Big Data developers. To gain insight from Big Data with Hadoop, developers must bootstrap Hadoop clusters, set up input sources to distribute data across the Hadoop file system, create code to


implement the elements of MapReduce (mappers, partitioners, comparators, combiners, reducers, etc.), successfully build, deploy and run those jobs, and dispose of output to intermediate data stores or structured data storage (RDBMSs, etc.) for subsequent analysis.
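Stripped of all Hadoop particulars, the MapReduce flow those elements implement can be sketched in plain Java. This is a toy illustration of the paradigm only – a real Hadoop job would extend Hadoop’s Mapper and Reducer classes and run distributed across a cluster:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Framework-free sketch of the MapReduce data flow: map -> shuffle -> reduce,
// using word count, the canonical example. Illustrative only.
public class ToyMapReduce {

    // Map phase: each input record emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group emitted pairs by key. Reduce: sum each group's values.
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            counts.put(group.getKey(), group.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data big analytics", "big insight")));
        // {analytics=1, big=3, data=1, insight=1}
    }
}
```

In Hadoop, the shuffle is performed by the framework across machines, and partitioners, comparators and combiners tune how pairs are routed, ordered and pre-aggregated – that orchestration is precisely what the tools discussed next aim to hide.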

Big Data developers, especially ones new to the Hadoop framework, need to focus their energies on optimizing MapReduce, not on dealing with the intricacies of Hadoop implementation. Karmasphere and other companies offer a range of products to simplify the Hadoop development process. In particular, Karmasphere Studio provides a graphical environment to develop, debug, deploy and monitor MapReduce jobs, cutting time, cost and effort to get results from Hadoop.

Learn more about Karmasphere Studio at http://karmasphere.com/Products-Information/karmasphere-studio.html.

Analytics

Building and executing jobs for Hadoop is only half the challenge of Big Data analysis.

The outcome of Hadoop job execution, while greatly condensed and more structured, does not automatically yield insight to guide business decisions. Ideally, Big Data Analysts should be able to use familiar tools to characterize, query and visualize data sets coming out of Hadoop.

Karmasphere and other suppliers offer Big Data analysts software platforms and tools to simplify and streamline interaction with Hadoop clusters, extract data sets and glean insight from that data. In particular, Karmasphere Analyst provides Big Data analysts with quick, efficient SQL access to, and insight into, Big Data on any Hadoop cluster from within a graphical desktop environment. Working with structured and unstructured data and automatically discovering its schema, it lets analysts, SQL programmers, developers and DBAs develop and debug SQL against any Hadoop cluster.

Learn more about Karmasphere Analyst at http://karmasphere.com/Products-Information/karmasphere-analyst.html.

Data Use

If Big Analytics is about mining Big Data for insights, Data Use (consumption) is about acting upon those discoveries. Data Use falls into two rough categories:

Business Intelligence and Visualization – feeding into traditional BI suites and into OLAP, the output of Big Data provides business analysts with comprehensive data sets, not just statistically selected subsets that fit into legacy databases and schemas. By improving the scope and quality of data, Big Data greatly enhances the reliability of conclusions drawn from it and improves BI outcomes.

Big Data Applications – using Big Data outcomes to drive applications in web commerce, social gaming, data visualization, search, etc. Businesses in these and other areas are drawing upon Big Data not just for high-level business insights, but to provide concrete input to user-facing applications.

The Players

For each Big Data element (Figure 9), there are multiple participants, with complex relationships among them. Under Data Management there are suppliers of Hadoop distributions as well as MapReduce technology suppliers with both Cloud and data center solutions. There are offerings in Big Analytics that fulfill specific development and analysis requirements, complementing one another and addressing multiple phases in the Big Data application life cycle. And while most Big Data applications reflect and support the operations of particular end-users, companies and products, there are others that cross industry and corporate boundaries.

In comprehending the elements of Big Data – the players, the layers and the niches within it – you should always remember that:


• Not everyone who works with Hadoop is in competition

• Not everyone in Big Data is a Hadoop distribution vendor

• While building upon open source technologies like Hadoop, Hive and Java, the value in Big Data offerings encompasses an increasingly rich mix of services and commercial software that goes beyond that open source core

Open Source Projects, Developers and Communities

Unlike dominant legacy data technologies (proprietary RDBMS, etc.), Big Data has strong ties to Free and Open Source Software (FOSS) and to the community development model. Indeed, the technologies at the center of Big Data are primarily FOSS, many of them under the Apache project umbrella:

• Hadoop – the data management platform at the core of Big Data. Key corporate contributors include Cloudera, Facebook, LinkedIn and Yahoo.

• HDFS – the Apache Hadoop distributed file system

• Hive – open source data warehouse and query infrastructure built on top of Hadoop

• Java – the language of Hadoop and of Hadoop job programming, originally developed and maintained by Sun Microsystems (now Oracle)

• Linux – for hosting Hadoop clusters and also as a development host [10], with perhaps the largest global developer community of any FOSS project. Red Hat and Canonical [11] in particular are investing in supporting Big Data

• Eclipse and NetBeans – common development environments (IDEs) for Big Data applications and analytics

• NoSQL – multiple implementations used in Big Data infrastructure, including projects such as Apache HBase, CouchDB, Cassandra and MongoDB

These projects boast communities of hundreds and in some cases thousands of developers and user/contributors, along with a smaller cadre of core maintainers/committers who guide project evolution and vet patches. A large swath of FOSS Big Data project developers participate under the corporate banner of their employers, while others toil away out of personal or academic interest.

Big Data Developers, Analysts and other End-Users

Given the open source nature of much Big Data technology, the term “developer” ends up being overloaded. “Big Data developers” in common parlance are not the programmers building the software described in the previous section, but rather those building software for and on it, to create:

• Hadoop / MapReduce jobs

• Analytic queries and report software

• Web and mobile apps and enterprise applications realizing value from and presenting Big Data outcomes
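A minimal sketch of the first item above – a Hadoop/MapReduce job – in the style of a Hadoop Streaming word count, written in Python. The function names and sample input are illustrative, and the run at the bottom simulates the job locally rather than on a cluster:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word. The sort mimics
    Hadoop's shuffle/sort, which groups pairs by key before reducing."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation; under Hadoop Streaming the mapper and reducer
    # would run as separate scripts reading stdin and writing stdout.
    print(dict(reducer(mapper(["big data big insight"]))))
```

On a real cluster, Hadoop runs many mapper and reducer instances in parallel across nodes; the logic per record is exactly this simple.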

Examples of companies performing Big Data development and realizing value from it include:

• TidalTV - video advertising, optimization, and yield management – http://www.tidaltv.com

• XGraph – connected audience marketing – http://www.xgraph.com

[10] About one-third of Big Data development occurs on Linux workstations

[11] Cloudera and Karmasphere both host their software on Ubuntu Linux


Commercial Suppliers

While many of the underlying Big Data technologies are developed as open source software, the elements of Big Data include a rich mix of commercial software and services suppliers.

Figure 9 – Big Data Players

ISVs

Companies deploying Hadoop increasingly turn to commercial Independent Software Vendors (ISVs) for both fully-supported base platforms and value-added capabilities beyond those included in community-based Big Data software.

Hadoop Distribution Suppliers / Integrators – aggregate, integrate and productize Hadoop, Hive and other elements for easy installation and use on Linux-based clusters and other host systems. These companies add value to Hadoop with “one-stop shopping” and through the addition of complementary software and services. The leading Hadoop distribution suppliers are Cloudera (http://www.cloudera.com), DataStax (http://www.datastax.com) and IBM (http://www.ibm.com/software/data/infosphere/hadoop/). EMC also entered the space in May 2011 (http://www.emc.com/about/news/press/2011/20110509-03.htm).

Stay tuned for announcements from more players.

Big Data Analytics Solutions and Tools Providers – these companies offer products that streamline development of Hadoop applications and simplify interaction with, and visualization of, Hadoop outcomes by supporting familiar SQL and spreadsheet-style interfaces to Hadoop. Analytics suppliers include Karmasphere (http://www.karmasphere.com) and a few companies offering analytics for non-Hadoop database solutions (MPP, RDBMS, etc.).

Big Data Hosting Companies and Service Providers – “Hadoop On Demand”

Many first-generation deployers of Hadoop invested in their own local infrastructure (clusters of standard hardware running Hadoop over Linux). Increasingly, companies are also looking outside their own data centers for Big Data hosting, turning to remote data centers (collocation), platform-level Cloud hosting and “Big Data as a Service.” The best example of this last paradigm is Amazon Web Services’ Elastic MapReduce – learn more at http://aws.amazon.com/elasticmapreduce/.

Source: Karmasphere


Conclusion

To outside observers and first-time visitors, the elements of Big Data, the players that implement and supply them, and the transactions among those players display their own peculiar logic. Depending on one’s point of introduction, it is easy to miss the forest while focusing on interesting and feature-rich trees. For business people and technologists already engaged in Big Data, a practical “heads down” approach can often limit point of view and obscure commercial and technical opportunities.

Choices

For Big Data users, success comes down to two key choices:

• Infrastructure – where and how to host a project, which technologies to deploy, and how to scale

• Value Extraction – paths to insight and methods for analysis and consumption

Comparably, Big Data suppliers need to ask themselves:

• What do Big Data developers and analysts really need?

• How to add value to Hadoop and other Open Source Big Data projects?

• How to accommodate new requirements emerging from Hadoop and other open source projects while meeting expectations for familiar/legacy capabilities?

Today’s data collection trends stagger the imagination. Companies and research projects now routinely collect terabytes of data, gathering more volume in days or weeks than did all of human civilization over thousands of years. In order that this data storm neither overwhelms its potential users nor goes unexamined in virtual warehouses, data mining and exploration are undergoing a complete reinvention. This Big Data shift is changing not just data processing methods but indeed the entire data management paradigm, moving to build on Hadoop and other Big Data tools and platforms.

The sheer volume of data involved and the tantalizing possibility of game-changing insight buried within it give businesses a sense of great urgency. The pressure for insight and innovation leads many to jump into Big Data before they understand the scope of the challenges facing them. As a result, many organizations expend inordinate time and resources – hundreds of person-months and tens of thousands of dollars – on setting up and tweaking Hadoop and other infrastructure before arriving at any actionable insight.

Short Path to Big Data Insight

The main purpose of this White Paper has been to educate readers about the elements of Big Data in general and, in particular, to suggest shorter paths across the Big Data landscape. At Karmasphere, our mission is to bring the power of Apache Hadoop to developers and analysts and to enable companies to unlock competitive advantage from their datasets with easy-to-use solutions.

A great way to get started is by downloading Karmasphere Studio Community Edition and by attending webinars and tutorials sponsored by Karmasphere and its Big Data partners.

Learn More

Visit http://www.karmasphere.com to learn more.


Glossary

Amazon Web Services (AWS) – a set of Cloud-based services hosted by Amazon that together form a reliable, scalable, and inexpensive computing platform. More at http://aws.amazon.com/

Data Warehouse (DW) – a structured database used for reporting, offloaded from other operational systems. DW comprises three layers: staging (used to store raw data for use by developers), integration (integrates data and provides abstraction), and access (actually supplying data to system users).

Elastic MapReduce (EMR) - Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to manage vast amounts of data. EMR utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). More at http://aws.amazon.com/elasticmapreduce/

Hadoop - a software framework that supports data-intensive distributed applications. Hadoop enables applications to work with thousands of nodes and petabytes of data. Hadoop is built in and uses the Java programming language and is maintained as a top-level Apache.org project being built and used by a global community of contributors. More at http://hadoop.apache.org/

Hadoop Distributed File System (HDFS) – the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, rapid computations. More at http://hadoop.apache.org/hdfs/
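To make the replication idea concrete, here is a toy Python sketch of block placement. The round-robin policy below is a deliberate simplification, not HDFS's actual rack-aware placement algorithm, and all block and node names are invented:

```python
# Toy sketch of HDFS-style block replication: each block is copied to
# several nodes so that losing one node loses no data.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

layout = place_replicas(["blk_1", "blk_2"], ["n1", "n2", "n3", "n4"])
print(layout)  # {'blk_1': ['n1', 'n2', 'n3'], 'blk_2': ['n2', 'n3', 'n4']}
```

With a replication factor of 3 (HDFS's default), any single node can fail and every block still has two live copies.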

Hive – a data warehouse infrastructure built on top of Hadoop. Hive provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to query and analyze large data sets stored in Hadoop. Hive defines a simple SQL-like query language (Hive QL) to let users familiar with SQL make Hadoop queries. Hive QL also allows programmers familiar with MapReduce to plug in custom mappers and reducers to perform more sophisticated analysis, extending the language. More at http://hive.apache.org/
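Hive QL's appeal is that it reads like ordinary SQL. The hypothetical query below stands in for a Hive QL statement; it runs against Python's built-in SQLite purely for illustration, whereas Hive would compile the same kind of GROUP BY into MapReduce jobs over data stored in Hadoop:

```python
import sqlite3

# Invented table and data: page-view clicks to be counted per page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, visitor TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("home", "a"), ("home", "b"), ("buy", "a")])

# The analyst writes familiar SQL; Hive QL for this query would look
# essentially identical.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('buy', 1), ('home', 2)]
```

This is the core value proposition of Hive: SQL-literate analysts can query petabyte-scale Hadoop data without writing MapReduce code by hand.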

MapReduce – a software framework for distributed computing on large data sets on clusters of computers. The framework is inspired by the “map and reduce” functions in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms.
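The functional-programming inspiration mentioned above can be seen with Python's own map and reduce primitives; this illustrates the origin of the names, not how Hadoop itself executes jobs:

```python
from functools import reduce

# "map" transforms each element independently (and so parallelizes
# naturally across nodes); "reduce" folds the mapped results together.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)         # 30
print(total)
```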

Massively Parallel Processing (MPP) – a distributed-memory computer system with many individual nodes, each an independent computer in itself. In the context of Big Data, MPP connotes a database processing system with hundreds or thousands of nodes, but with centralized storage.

NoSQL – a broad class of non-relational database management systems built on distributed data stores that eschew the relational model and Structured Query Language (SQL)

On-line Analytical Processing (OLAP) – an approach to answering multi-dimensional analytical database queries. Databases configured for OLAP use multidimensional data models, borrowing aspects of navigational and hierarchical databases. OLAP query output is typically displayed in a matrix format. OLAP is part of the broader category of business intelligence, which also encompasses data mining.
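A tiny Python sketch of the multidimensional roll-up idea behind OLAP, using invented sales facts; real OLAP engines operate on cubes with many dimensions and pre-aggregated levels:

```python
from collections import defaultdict

# Hypothetical sales facts, each a (region, product, revenue) tuple.
facts = [
    ("east", "widget", 100),
    ("east", "gadget", 150),
    ("west", "widget", 200),
]

def rollup(facts, dim):
    """Aggregate revenue along one dimension, OLAP roll-up style."""
    totals = defaultdict(int)
    for region, product, revenue in facts:
        totals[region if dim == "region" else product] += revenue
    return dict(totals)

print(rollup(facts, "region"))   # {'east': 250, 'west': 200}
print(rollup(facts, "product"))  # {'widget': 300, 'gadget': 150}
```

Each call answers one "slice" of the cube; an OLAP query that groups by both dimensions at once would produce the matrix-format output the definition describes.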

Pig – a platform for analyzing large data sets consisting of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig’s program structure is amenable to parallelization, enabling Pig to handle very large data sets.

Structured Query Language (SQL) – a standard database computer language designed for managing data in relational database management systems (RDBMS)