A Framework for Implementing NoSQL, Hadoop
Big Data and NoSQL continue to make headlines everywhere. However, most of what has been written about these topics is focused on the hardware, services, and scale out. But what about a Big Data and NoSQL Strategy, one that supports your business strategy? Virtually every major organization thinking about these data platforms is faced with the challenge of figuring out the appropriate approach and the requirements. This presentation will provide guidance on how to think about and establish realistic Big Data management plans and expectations. We will introduce a framework for evaluating the various choices when it comes to implementing and succeeding with Big Data/NoSQL and show how to demonstrate a sample use case.
Takeaways:
• A Framework for evaluating Big Data techniques
• Deciding on a Big Data platform – How do you know which one is a good fit for you?
• The means by which big data techniques can complement existing data management practices
• The prototyping nature of practicing big data techniques
• The distinct ways in which utilizing Big Data can generate business value
Date: June 9, 2015
Time: 2:00 PM ET/11:00 AM PT
Presenter: Peter Aiken, Ph.D. & Josh Bartels
Copyright 2015 by Data Blueprint
Data-Ed Webinar: A Framework for Implementing NoSQL, Hadoop
• Big Data could know us better than we know ourselves – Dan Gardner
• We'll see this as the time in history when the world's information was transformed from an inert, passive state, and put into a unified system that brings that information alive – Michael Nielsen
• We now have a chance to become the center of our own knowledge universe, one that constantly reconfigures itself to match our needs – Michael S. Malone
• Today a street stall in Mumbai can access more information, maps, statistics, academic papers, price trends, futures markets, and data than a U.S. President could only a few decades ago – Juan Enriquez
• Not everything that can be counted counts, and not everything that counts can be counted – Albert Einstein
• Soon we will salt the oceans, the land, and the sky with uncounted numbers of sensors invisible to the eyes but visible to one another – Esther Dyson
• We've reached a tipping point in history: today more data is being manufactured by machines, servers, and cell phones, than by people – Michael E. Driscoll
• Every century, a new technology (steam power, electricity, atomic energy, or microprocessors) has swept away the old world with a vision of a new one. Today, we seem to be entering the era of Big Data – Michael Coren
Shannon Kempe
Executive Editor at DATAVERSITY.net
Steven MacLauchlan
• 10 years of experience in Application Development and Data Modeling with a focus on Healthcare solutions.
• Delivers tailored data management solutions that provide focus on data’s business value while enhancing clients’ overall capability to manage data
• Certified Data Management Professional (CDMP)
• Computer Science degree from Virginia Commonwealth University
• Most recent focus: Understanding emerging data modeling trends and how these can best be leveraged for the Enterprise.
Get Social With Us!
Live Twitter Feed
Join the conversation! Follow us: @datablueprint @paiken
Ask questions and submit your comments: #dataed
Like Us on Facebook
www.facebook.com/datablueprint
Post questions and comments
Find industry news, insightful content and event updates.
Join the Group
Data Management & Business Intelligence
Ask questions, gain insights and collaborate with fellow
Big Data (has something to do with Vs - doesn't it?)
• Volume– Amount of data
• Velocity– Speed of data in and out
• Variety– Range of data types and sources
• 2001 Doug Laney
• Variability– Many options or variable interpretations confound analysis
• 2011 ISRC
• Vitality– A dynamically changing Big Data environment in which analysis and predictive models must continually be updated as changes occur to seize opportunities as they arrive
• 2011 CIA
• Virtual– Scoping the discussion to only include online assets
• 2012 Courtney Lambert
• Value/Veracity
• Stuart Madnick (John Norris Maguire Professor of Information Technology, MIT Sloan School of Management & Professor of Engineering Systems, MIT School of Engineering)
Defining Big Data
• Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
– Gartner 2012
• Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
– IBM 2012
• An all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications
– Wikipedia 2014
• Shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions.
– NY Times 2012
• The broad range of new and massive data types that have appeared over the last decade
– Tom Davenport 2014
• Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.
– Oxford English Dictionary 2014
• Big data is about putting the "I" back into IT.
– Peter Aiken 2007
Big Data Techniques
• New techniques available to impact the productivity (order of magnitude) of any analytical insight cycle that complement, enhance, or replace conventional (existing) analysis methods
• Big data techniques are currently characterized by:
– Continuous, instantaneously available data sources
– Non-von Neumann Processing (defined later in the presentation)
– Capabilities approaching or past human comprehension
– Architecturally enhanceable identity/security capabilities
– Other tradeoff-focused data processing
• So a good question becomes "where in our existing architecture can we most effectively apply Big Data Techniques?"
Big Data Technologies, by themselves, are a One-Legged Stool
Governance is the major means of preventing over-reliance on one-legged stools!
The Big Data Landscape
Copyright Dave Feinleib, bigdatalandscape.com
• Caution:
– Don’t fall victim to SOS (Shiny Object Syndrome)
– A lot of money is being invested, but is it generating the expected return?
– Gartner Hype Cycle suggests results are going to be disappointing
http://www.businessinsider.com/enterprise-big-data-spending-2012-11#ixzz2cdT8shhe
http://www.inc.com/kathleen-kim/big-data-spending-to-increase-for-it-industry.html
• In considering any new subject, there is frequently a tendency first to overrate what we find to be already interesting or remarkable, and secondly - by a sort of natural reaction - to undervalue the true state of the case.
• Augusta Ada King, Countess of Lovelace - aka Ada Lovelace, publisher of the first computing program
Technology Trigger: A potential technology breakthrough kicks things off. Early proof-of-concept stories and media interest trigger significant publicity. Often no usable products exist and commercial viability is unproven.
Peak of Inflated Expectations: Early publicity produces a number of success stories, often accompanied by scores of failures. Some companies take action; many do not.
Trough of Disillusionment: Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investments continue only if the surviving providers improve their products to the satisfaction of early adopters.
Slope of Enlightenment: More instances of how the technology can benefit the enterprise start to crystallize and become more widely understood. Second- and third-generation products appear from technology providers. More enterprises fund pilots; conservative companies remain cautious.
Plateau of Productivity: Mainstream adoption starts to take off. Criteria for assessing provider viability are more clearly defined. The technology’s broad market applicability and relevance are clearly paying off.
"A focus on big data is not a substitute for the fundamentals of information management."
2012 Big Data in Gartner’s Hype Cycle
2013 Big Data in Gartner’s Hype Cycle
2014 Big Data in Gartner’s Hype Cycle
Big Data Gartner Hype Cycle
Myth #3: Big Data is innovative
Fact:
• Big Data techniques are innovative
• ROI and insights depend on the size of the business and the amount of data used and produced, e.g.
– Local pizza place vs. Papa John’s
– Retail
My Barn must pass a foundation inspection
• Before further construction can proceed
• No IT equivalent in most organizations
Frameworks
• A system of ideas for guiding analyses
• A means of organizing project data
• Data integration priorities decision making framework
• A means of assessing progress
"There’s now a blurring between the storage world and the memory world"
• Faster processors outstripped not only the hard disk, but main memory
– Hard disk too slow
– Memory too small
• Flash drives remove both bottlenecks
– Combined, Apple and Yahoo have spent more than $500 million to date
• Make it look like traditional storage or more system memory
– Minimum 10x improvements
– Dragonstone server is 3.2 TB flash memory (Facebook)
• Bottom line - new capabilities!
Non-von Neumann Processing/Efficiencies
• von Neumann bottleneck (computer science)
– "An inefficiency inherent in the design of any von Neumann machine that arises from the fact that most computer time is spent in moving information between storage and the central processing unit rather than operating on it" [http://encyclopedia2.thefreedictionary.com/von+Neumann+bottleneck]
• Michael Stonebraker
– Ingres (Berkeley/MIT)
– Modern database processing is approximately 4% efficient
• Many big data architectures are attempts to address this, but:
– Zero sum game
– Trade characteristics
What is NoSQL?
• Commonly interpreted as "Not Only SQL"
• Broad class of database management technologies that provide a mechanism for storage and retrieval of data that doesn’t follow traditional relational database methodology.
• Motivations
– Simplicity of design
– Horizontal scaling
– Finer control over availability of the data.
• The data structures used by NoSQL databases differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases.
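The point about differing data structures can be made concrete with a small sketch. It is not tied to any particular database product: plain Python lists and dicts stand in for relational tables and a document store, and the customer and order data are hypothetical.

```python
# Hypothetical data: the same customers and orders held in two shapes.

# Relational shape: two flat "tables" linked by customer_id.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
    {"order_id": 12, "customer_id": 2, "total": 15.0},
]

# Document shape: one self-contained record per customer, keyed by id.
documents = {
    1: {"name": "Ada", "orders": [{"order_id": 10, "total": 25.0},
                                  {"order_id": 11, "total": 40.0}]},
    2: {"name": "Grace", "orders": [{"order_id": 12, "total": 15.0}]},
}


def customer_with_orders_relational(customer_id):
    """Needs a join: find the customer row, then scan the orders table."""
    customer = next(c for c in customers if c["customer_id"] == customer_id)
    matching = [o for o in orders if o["customer_id"] == customer_id]
    return {"name": customer["name"], "orders": matching}


def customer_with_orders_document(customer_id):
    """One key lookup returns the customer and the nested orders together."""
    return documents[customer_id]


def revenue_relational():
    """Cross-customer aggregate: a single pass over one flat table."""
    return sum(o["total"] for o in orders)


def revenue_document():
    """Same aggregate: every document must be opened and unnested."""
    return sum(o["total"] for d in documents.values() for o in d["orders"])
```

Fetching one customer with all orders is a single lookup in the document shape, while the cross-customer total is the more natural operation on the flat table. Real systems add indexes and query planners, so this only illustrates the direction of the tradeoff.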
What is Hadoop?
• A data storage and processing system that runs on clusters of commodity servers.
• Able to store any kind of data in its native format.
• Performs a wide variety of analyses and transformations.
• Stores terabytes, and even petabytes, of data inexpensively.
• Handles hardware and system failures automatically, without losing data or interrupting data analyses.
• Critical components of Hadoop:
– HDFS - The Hadoop Distributed File System is the storage system for a Hadoop cluster, responsible for distribution of data across the servers.
– MapReduce - The inner workings of Hadoop that allow for distributed and parallel analytical job execution.
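To show the idea behind MapReduce, here is a minimal single-process sketch of the map, shuffle, and reduce steps using the classic word-count example. It does not use the Hadoop API; the function names are illustrative, and on a real cluster each phase would run in parallel across many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across servers, which is what makes the model suitable for large data sets.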
Why NoSQL? Why Hadoop?
• Large number of users (read: the internet)
• Rapid app development and deployment
• Large number of mission critical writes (sensors/etc)
• Small, continuous reads and writes, especially where “Consistency” is less important (social networks)
• Hadoop solves the hard scaling problems caused by large amounts of complex data.
• As the amount of data in a cluster grows, new servers can be added to a Hadoop cluster incrementally and inexpensively to store and analyze it.
Some Big Data Limitations
• Data analysis struggles with the social
– Your brain is excellent at social cognition - people can
• Mirror each other’s emotional states
• Detect uncooperative behavior
• Assign value to things through emotion
– Data analysis measures the quantity of social interactions but not the quality
• Map interactions with co-workers you see during work days
• Can't capture devotion to childhood friends seen annually
– When making (personal) decisions about social relationships, it’s foolish to swap the amazing machine in your skull for the crude machine on your desk
• Data struggles with context
– Decisions are embedded in sequences and contexts
– Brains think in stories - weaving together multiple causes and multiple contexts
– Data analysis is pretty bad at
• Narratives / Emergent thinking / Explaining
• Data creates bigger haystacks
– More data leads to more statistically significant correlations
– Most are spurious and deceive us
– Falsity grows exponentially with the greater amounts of data we collect
• Big data has trouble with big problems
– For example: the economic stimulus debate
– No one has been persuaded by data to switch sides
• Data favors memes over masterpieces
– Detect when large numbers of people take an instant liking to some cultural product
– Products are hated initially because they are unfamiliar
• Data obscures values
– Data is never raw; it’s always structured according to somebody’s predispositions and values
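The "bigger haystacks" point can be sketched numerically. The example generates columns of pure random noise and counts how many pairs look correlated anyway. Everything here is an assumption made for illustration: the sample size of 20, the cutoff of 0.444 (approximately the two-tailed 5% critical value of Pearson's r for 20 samples), and the function names.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spurious_pairs(num_variables, num_samples=20, threshold=0.444, seed=0):
    """Count pairs of unrelated random columns whose |r| exceeds the cutoff."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(num_samples)]
               for _ in range(num_variables)]
    return sum(1 for a, b in combinations(columns, 2)
               if abs(pearson(a, b)) > threshold)
```

Since the number of pairs grows roughly with the square of the number of variables, `spurious_pairs(40)` examines 780 pairs where `spurious_pairs(10)` examines 45, so far more "significant" correlations appear even though no column is related to any other.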
Myth #4: Big Data is just another IT project
Copyright 2013 by Data Blueprint
Fact:
• Big Data is not your typical IT project
– Does not answer typical IT questions
– Trend analysis, agile, actionable, etc.
– Fundamentally different approach
• Big Data Projects are exploratory
• Big Data enables new capabilities
• Big Data can be a disruptive technology
• It might sound simple but that doesn’t mean it’s easy
• Beware of SOS (Shiny Object Syndrome)
The Bills of Mortality was an Early Data Collection
Mortality Geocoding
Where is it happening?
("Whereas of the Plague")
Plague Peak
When is it happening?
Black Rats or Rattus Rattus
Why is it happening?
What will happen?
Formalizing Data Management
• Defend the Realm: The authorized history of MI5 by Christopher Andrew
• World War I
• 1914
• At war with much of Europe
• 14,000,000 Germans living in the United Kingdom
• How to efficiently and effectively manage information on that many individuals?
• The Security Service is responsible for "protecting the UK against threats to national security from espionage, terrorism and sabotage, from the activities of agents of foreign powers, and from actions intended to overthrow or undermine parliamentary democracy by political, industrial or violent means."
“As a final thought, how about a machine that would send, via closed-circuit television, visual and oral information needed immediately at high-level conferences or briefings? Let’s say that a group of senior officers are contemplating a covert action program for Afghanistan. Things go well until someone asks “Well, just how many schools are there in the country, and what is the literacy rate?” No one in the room knows. (Remember, this is an imaginary situation). So the junior member present dials a code number into a device at one end of the table. Thirty seconds later, on the screen overhead, a teletype printer begins to hammer out the required data. Before the meeting is over, the group has been given, through the same method, the names of countries that have airlines into Afghanistan, a biographical profile of the Soviet ambassador there, and the Pakistani order of battle along the Afghanistan frontier. Neat, no?”
• Predicted use of not just computing in the intelligence community
• Also forecast predictive analytics
• Accompanying privacy challenges
A Framework for Implementing NoSQL, Hadoop
Demystifying Big Data 2.0: Developing the Right Approach for Implementing Big Data Techniques
• Big Data Context
– We are using the wrong vocabulary to discuss this topic
• More Precise Definitions
– Framework
– Non-von Neumann Architectures
– Hadoop/NoSQL
• How can data be leveraged in exploring
– External marketplace
• Analyze opportunities and threats
– Internal efficiencies
• Analyze strengths and weaknesses
Example: 2012 Olympic Summer Games
1. Volume: 845 million FB users averaging 15 TB+ of data/day
2. Velocity: 60 GB of data per second
3. Variety: 8.5 billion devices connected
4. Variability: Sponsor data, athlete data, etc.
5. Vitality: Data Art project “Emoto”
6. Virtual: Social media
• Based on my 6 V analysis, do I need a Big Data solution or does my current BI solution address my business opportunity?
– Do the 6 Vs indicate general Big Data characteristics?
– What are the limitations of my current BI environment? (Technology constraint)
– What are my budgetary restrictions? (Financial constraint)
– What is my current Big Data knowledge base? (Knowledge constraint)
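One way to make the 6 V checklist repeatable is to record the answers in a small structure and apply the same rule each time. The sketch below is only an illustration: the `Assessment` fields, the `min_vs` cutoff of 2, and the wording of the recommendations are assumptions, not part of the framework as presented.

```python
from dataclasses import dataclass

SIX_VS = ("volume", "velocity", "variety", "variability", "vitality", "virtual")


@dataclass
class Assessment:
    vs_beyond_current_bi: set   # Vs the current BI environment cannot handle
    budget_available: bool      # financial constraint
    team_has_skills: bool       # knowledge constraint


def recommend(a: Assessment, min_vs: int = 2) -> str:
    """Apply a simple, hypothetical decision rule to the checklist answers."""
    unknown = a.vs_beyond_current_bi - set(SIX_VS)
    if unknown:
        raise ValueError(f"not one of the 6 Vs: {sorted(unknown)}")
    if len(a.vs_beyond_current_bi) < min_vs:
        return "stay with current BI"
    if not a.budget_available:
        return "defer: financial constraint"
    if not a.team_has_skills:
        return "prototype and build skills first"
    return "evaluate Big Data techniques"
```

For example, `recommend(Assessment({"volume", "velocity"}, True, False))` returns the "prototype and build skills first" answer, because the technology constraint is real but the knowledge constraint has not yet been addressed.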
• MUST have both Foundational and Technical practice expertise
• Data Strategy
• Data Governance
• Data Architecture
• Data Education
• Data Quality
• Data Integration
• Data Platforms
• BI/Analytics
• Needs to be actionable
• Generally well understood by business
• Document what has been learned
• Perfect results are not necessary
• Reiterate and refine
• Iterative process to reach decision point
• Use as feedback for next exploration
Myth #7: You need Big Data for Insights
Fact:
• Distinction between Big Data and doing analytics
– Big Data is defined by the technology stack that you use
– Big Data is used for predictive and prescriptive analytics
• Use existing data for reporting, figure out bottlenecks and optimize current business model
• Understand how your data is structured, architected and stored
A Framework for Implementing NoSQL, Hadoop
Demystifying Big Data 2.0: Developing the Right Approach for Implementing Big Data Techniques
• Big Data Context
– We are using the wrong vocabulary to discuss this topic
• More Precise Definitions
– Framework
– Non-von Neumann Architectures
– Hadoop/NoSQL
• Big Data
– Historical Perspective
• Big Data Approach
– Crawl, Walk, Run
• Framework Examples
– Social
– Operational BWB
• Take Aways and Q&A
Tweeting now at: #dataed
Social Sentiment Analysis
• One of the burgeoning areas for use of Big Data / Hadoop platforms.
• Allows for the landing of multiple sources of unstructured data (Twitter, Facebook, LinkedIn, etc.)
• Data that can be analyzed with algorithms looking for keywords that determine positive/negative feedback
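The keyword approach described above can be sketched in a few lines. This is a deliberately naive illustration, not a production sentiment model: the word lists are hypothetical, and it ignores negation, sarcasm, and context.

```python
import re

# Hypothetical keyword lists; a real deployment would use a curated lexicon.
POSITIVE = {"love", "great", "good", "excellent", "happy"}
NEGATIVE = {"hate", "bad", "terrible", "awful", "slow"}


def sentiment_score(text):
    """Positive keyword hits minus negative keyword hits."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def classify(text):
    """Label a post as positive, negative, or neutral from its score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Applied to a stream of posts landed from several social sources, the same function can be run over each post independently, which is what makes this kind of analysis a natural fit for a parallel platform.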
Operational Use
• Utilize real time pricing data from multiple sources to dynamically update the pricing for books in the Amazon Marketplace.
• Ingested data from multiple sources looking for real time changes in price.
• Would apply predictive model to determine best price point and set price of their books on the marketplace.
• Increased conversion rate, but created a race to the bottom situation if not monitored
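The "race to the bottom" risk suggests a guard rail on any automated repricing rule. The sketch below swaps the predictive model mentioned above for a much simpler rule (undercut the lowest competitor by a fixed amount) and adds a price floor. The function name, the 10% minimum margin, and the one-cent undercut are all hypothetical.

```python
def next_price(competitor_prices, cost, min_margin=0.10, undercut=0.01):
    """Undercut the cheapest competitor, but never go below cost plus margin."""
    floor = round(cost * (1 + min_margin), 2)
    if not competitor_prices:
        return floor
    target = round(min(competitor_prices) - undercut, 2)
    # The floor is the monitoring guard: it stops automated undercutting
    # from chasing competitors down to unprofitable prices.
    return max(target, floor)
```

With a cost of 8.00, a competitor at 12.00 leads to a small undercut, while a competitor at 8.50 leaves the price at the floor instead of following the competitor down.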
Healthcare Example: Patient Data
• Clinical data:
– Diagnosis/prognosis/treatment
– Genetic data
• Patient demographic data
• Insurance data:
– Insurance provider
– Claims data
• Prescriptions & pharmacy information
• Physical fitness data
– Activity tracking through smartphone apps & social media
• The Human Face of Big Data, Rick Smolan & Jennifer Erwitt, First Edition (November 20, 2012)
• McKinsey: Big Data: The next frontier for innovation, competition and productivity (http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation?p=1)
• The Washington Post: Five Myths about Big Data (http://articles.washingtonpost.com/2013-08-16/opinions/41416362_1_big-data-data-crunching-marketing-analytics)
• Gartner: Gartner’s 2013 Hype Cycle for Emerging Technologies Maps Out Evolving Relationship Between Humans and Machines (http://www.gartner.com/newsroom/id/2575515)
• The New York Times | Opinion Pages: What Data Can’t Do (http://www.nytimes.com/2013/02/19/opinion/brooks-what-data-cant-do.html?_r=1&)
• CIO.com: Five Steps for How to Better Manage Your Data (http://www.cio.com.au/article/429681/five_steps_how_better_manage_your_data/)
• Business Insider: Enterprises Aren’t Spending Wildly on ‘Big Data’ But Don’t Know If It’s Worth It Yet (http://www.businessinsider.com/enterprise-big-data-spending-2012-11#ixzz2cdT8shhe)
• Inc.com: Big Data, Big Money: IT Industry to Increase Spending (http://www.inc.com/kathleen-kim/big-data-spending-to-increase-for-it-industry.html)
• Forbes: Big Data Boosts Customer Loyalty. No, Really. (http://www.forbes.com/sites/xerox/2013/09/27/big-data-boosts-customer-loyalty-no-really/)
• We are at an inflection point: The sheer volume of data generated, stored, and mined for insights has become economically relevant to businesses, government, and consumers (McKinsey)
• We believe the same important principles still apply:
– What problem are you trying to solve for your business? Your solution needs to fit your problem
– Doing data for (big) data’s sake is not going to solve any problems
– Risk of spending a lot of money on chasing Big Data that will realize little to no returns - especially at this hype cycle stage
• Directional accuracy is the goal
• Focus on your most important data assets and ensure our solutions address the root cause of any quality issues – so that your data is correct when it is first created
• Experience has shown that organizations can never get in front of their data quality issues if they only use the ‘find-and-fix’ approach
Data Quality Considerations
• Big Data is trying to be predictive
• What are the questions you are trying to answer?
– What level of accuracy are you looking for?
– What confidence levels?
– Example: Do I need to know exactly what the customer is going to buy or do I just need to know the range of products he/she is going to choose from?
Technical Practice: Data Platforms
• Do you want to measure critical operational process performance?
• No one data platform can answer all your questions. This is commonly misunderstood and often leads to very expensive, bloated and ineffective data platforms.
• Understand the questions that need to be asked and how to build the right data platform or how to optimize an existing one
Data Platforms Considerations
• Commonalities between most big data stacks with file storage, columnar store, querying engine, etc.
• Big data stack generally looks the same until you get into appliances
– Algorithms are built into appliances themselves (e.g. Netezza, Teradata, etc.)
• Ask these questions:
– Do you want insights on your customer’s behavior?
– Do you need real-time customer transactional information?
– Do you need historical data or just access to the latest transactions?
– Where do you go to find the single