The Big Deal About Big Data
Dean Compher, Data Management Technical Professional for UT, NV
@db2Dean | facebook.com/db2Dean | www.db2Dean.com | [email protected]
Slides Created and Provided by: Paul Zikopoulos, Tom Deustch
Why Big Data / How We Got Here
In August of 2010, Adam Savage of "MythBusters" took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account with the phrase "Off to work."
Because the photo was taken with his smartphone, the image contained metadata revealing the exact geographical location where it was taken.
By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
Retailers collect click-stream data from Web site interactions and loyalty card data
– This traditional POS information is used by retailers for shopping-basket analysis, inventory replenishment, and more
– But the data is also being provided to suppliers for customer buying analysis
Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized
Science is increasingly dominated by big-science initiatives
– Large-scale experiments generate over 15 PB of data a year and can't be stored within the data center; the data is sent out to laboratories
Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading
Improved instrument and sensor technology
– The Large Synoptic Survey Telescope's GPixel camera generates 6 PB+ of image data
Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google Map/Reduce and Google File System papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
Hadoop's paradigm is that you send your application to the data rather than sending the data to the application.
Cost effective / Linear scalability
– Hadoop brings massively parallel computing to commodity servers. You can start small and scale linearly as your work requires
– Storage and modeling at Internet scale rather than small sampling
– A commodity cost profile for super-computer-level compute capabilities
– Cost per TB of storage enables a superset of information to be modeled
Mixing structured and unstructured data
– Hadoop is schema-less, so it doesn't care what form the stored data is in, and thus allows a superset of information to be commonly stored. Further, MapReduce can be run effectively on any type of data and is really limited only by the creativity of the developer
– Structure can be introduced at MapReduce run time based on the keys and values defined in the MapReduce program. Developers can create jobs that run against structured, semi-structured, and even unstructured data (see the sketch after this list)
Inherently flexible in what is modeled and what analytics are run
– Ability to change direction literally on a moment's notice without any design or operational changes
– Since Hadoop is schema-less and can introduce structure on the fly, the type of analytics and the nature of the questions being asked can be changed as often as needed without up-front cost or latency
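To make the "structure at run time" point concrete, here is a minimal sketch of a mapper that imposes a schema on raw text as it reads it. The tab-delimited click-stream layout and the class name are assumptions made for illustration, not anything specified in the slides.

// Hypothetical sketch: imposing structure on schema-less text at MapReduce run time.
// Assumes tab-delimited click-stream lines of the form
//   <timestamp>\t<customerId>\t<productId>   (layout is an illustration only)
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ClickParseMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text productId = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    if (fields.length < 3) {
      return;                       // quietly skip malformed records
    }
    productId.set(fields[2]);       // structure is decided here, not in storage
    context.write(productId, ONE);  // emit <productId, 1> for later aggregation
  }
}

Changing the analysis is just a matter of writing a different mapper over the same stored bytes; no reload or schema migration is needed.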
So How Does It Do That? At its core, Hadoop is made up of:
Map/Reduce
– How Hadoop understands and assigns work to the nodes (machines)
Hadoop Distributed File System (HDFS)
– Where Hadoop stores data
– A file system that runs across the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one large file system
The HDFS file system stores data across multiple machines. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
– Default is 3 copies (see the sketch after this list)
• Two on the same rack, and one on a different rack
The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS
– They also serve the data over HTTP, allowing access to all content from a web browser or other client
– Data nodes can talk to each other to rebalance data and to move copies around
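As a rough illustration of how an application reaches HDFS, the following sketch uses the Java FileSystem API to write a small file and request 3 block replicas (the default the slides mention). The paths and file contents are placeholders, not anything from the slides.

// Illustrative sketch of writing a file to HDFS and setting its replication factor.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/demo/sample.txt");   // placeholder path
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Ask the NameNode to keep 3 block replicas of this file (the HDFS default).
    fs.setReplication(file, (short) 3);

    System.out.println("Stored " + file + " with replication "
        + fs.getFileStatus(file).getReplication());
  }
}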
"Map" step: – The program is chopped up into many smaller sub-
problems.• A worker node processes some subset of the smaller
problems under the global control of the JobTracker node and stores the result in the local file system where a reducer is able to access it.
"Reduce" step:– Aggregation
• The reduce aggregates data from the map steps. There can be multiple reduce tasks to parallelize the aggregation, and these tasks are executed on the worker nodes under the control of the JobTracker.
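A hedged sketch of the "Reduce" aggregation step: this reducer sums the <key, 1> pairs that a mapper like the one sketched earlier emits. The class name is an assumption for illustration.

// Minimal sketch of the reduce aggregation: sum the 1s emitted for each key by the map tasks.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();               // aggregate all values seen for this key
    }
    total.set(sum);
    context.write(key, total);      // emit <key, total> as the job's output
  }
}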
Map-Reduce applications specify the input/output locations and supply map and reduce functions via implementations of appropriate Hadoop interfaces, such as Mapper and Reducer.
These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker
The JobTracker then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
The Map/Reduce framework operates exclusively on <key, value> pairs — that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
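Putting those pieces together, here is a rough sketch of a job client: it supplies the mapper and reducer classes sketched above, declares the <key, value> output types, points at input and output locations, and submits the whole configuration to the cluster. Class names and paths are placeholders, and Job.getInstance is simply the newer-API spelling of the job configuration the slides describe.

// Hypothetical job client: build the job configuration and submit it to the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "click-count");   // the job configuration

    job.setJarByClass(ClickCountDriver.class);
    job.setMapperClass(ClickParseMapper.class);
    job.setReducerClass(CountSumReducer.class);

    // The framework operates on <key, value> pairs; declare the output types.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input/output locations supplied by the application (placeholders here).
    FileInputFormat.addInputPath(job, new Path("/data/clickstream"));
    FileOutputFormat.setOutputPath(job, new Path("/results/click-count"));

    // Submitting hands the jar plus configuration to the cluster's scheduler
    // (the JobTracker, in the Hadoop 1.x terminology these slides use).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}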
The vast majority of Map-Reduce applications executed on the grid do not directly implement the low-level Map-Reduce interfaces; rather, they are implemented in a higher-level language, such as Jaql, Pig, or BigSheets.
Taken Together - What Does This Result In?
Easy to Scale
– Simply add machines as your data and jobs require
Fault Tolerant and Self-Healing
– Hadoop runs on commodity hardware and provides fault tolerance through software
– Hardware losses are expected and tolerated
– When you lose a node, the system just redirects work to another location of the data, and nothing stops, nothing breaks; jobs, applications, and users don't even know
Hadoop Is Data Agnostic
– Hadoop can absorb any type of data, structured or not, from any number of sources
– Data from many sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide
– Hadoop results can be consumed by any system necessary if the output is structured appropriately
Hadoop Is Extremely Flexible
– Start small, scale big
– You can turn nodes "off" and use them for other needs if required (really)
– Throw any data, in any form or format, at it
– What you use it for can be changed on a whim
SLA-Driven Workloads
– Guaranteed job completion
– Job completion within operational windows
Data Security Requirements
– Problematic if it fails or loses data
– True DR becomes a requirement
– Data quality becomes an issue
– Secure data marts become a hard requirement
Integration With the Rest of the Enterprise
– Workload integration becomes an issue
Efficiency Becomes a Hot Topic
– Inefficient utilization on 20 machines isn't an issue; on 500 or 1,000+ it is
Relatively few are really here yet outside of Facebook, Yahoo, LinkedIn, etc.