Top Banner
Scaling up Data Management: From Data to Big Data
49

Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Apr 16, 2018

Download

Documents

vudieu
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Scaling up Data Management:

From Data to Big Data

Page 2: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Management: Evolution •  60s

o  Access data in files o  Computerized databases started – shared access o  Network model (CODASYL) – Integrated Data Store (IDS) o  Hierarchical model (IMS) -- Information Management System o  SABRE was created to manage airline reservations

•  70s o  Relational model o  ACM SIGMOD and VLDB started (1975) o  ER model o  System R, Ingres o  SQL

•  80s o  Databases for PCs o  DB2, Oracle, Sybase, Informix o  SQL standard o  RDBMS became a success o  Expert systems, OODBMS, distributed databases

Page 3: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Management: Evolution •  90s

o  Expensive products -- database for the ‘rich’ o  Internet database connectors; features for spatial, temporal,

multimedia data; active and deductive capabilities o  Exploit massively parallel processors

•  2000s o  Oracle, IBM and Microsoft are the major RDBMS vendors o  Main-memory databases

•  2010s o  Open source – databases for all o  Big Data o  NoSQL – do not attempt to provide atomicity, consistency,

isolation and durability o  NewSQL – SQL + NoSQL

Page 4: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Management Software Revenue Global database market reached over $40 billion in 2015 Business analytics software market 2013: $37 billion

http://www.infoworld.com/article/2916057/open-source-software/open-source-threatens-to-eat-the-database-market.html

Page 5: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Big Data Technology

A new forecast from International Data Corporation (IDC ) sees the big data technology and services market growing at a compound annual growth rate (CAGR) of 23.1% over the 2014-2019 forecast period with annual spending reaching $48.6 billion in 2019

http://www.idc.com/getdoc.jsp?containerId=prUS40560115

Page 6: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Big Data

Page 7: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

4 Vs of Big Data

8

Page 8: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Big Data: New Applications •  Google: many billions of pages indexed, products, structured

data •  Facebook: 1.5 billion users using the site each month •  Twitter: 517 million accounts, 320 million monthly active

users, 500 million tweets/day

Page 9: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Big Data: New Computing Infrastructure •  Meet the cloud! •  “[Hardware, Infrastructure, Platform] as a service” •  Utility Computing: pay-as-you-go computing

o  Illusion of infinite resources o  No up-front cost o  Fine-grained billing (e.g., hourly)

Page 10: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Cloud Computing: Why Now?

•  Experience with very large data centers o  Unprecedented economies of scale o  Transfer of risk

•  Technology factors o  Pervasive broadband Internet o  Maturity in virtualization technology

•  Business factors o  Minimal capital expenditure o  Pay-as-you-go billing model

Agrawal et al., VLDB 2010 Tutorial

Page 11: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Warehouse Scale Computing

16 Million Nodes per building

Google’s data center in Oregon

Agrawal et al., VLDB 2010 Tutorial

Page 12: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Unused resources

Economics of Cloud Users

•  Pay by use instead of provisioning for peak

Static data center Data center in the cloud

Demand

Capacity

Time

Res

ourc

es

Demand

Capacity

Time R

esou

rces

Slide Credits: Berkeley RAD Lab Agrawal et al., VLDB 2010 Tutorial

Page 13: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Unused resources

Economics of Cloud Users

•  Risk of over-provisioning: underutilization

Static data center

Demand

Capacity

Time

Res

ourc

es

Slide Credits: Berkeley RAD Lab Agrawal et al., VLDB 2010 Tutorial

Page 14: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Economics of Cloud Users

•  Heavy penalty for under-provisioning

Lost revenue

Lost users

Res

ourc

es

Demand

Capacity

Time (days) 1 2 3

Res

ourc

es

Demand

Capacity

Time (days) 1 2 3

Res

ourc

es

Demand

Capacity

Time (days) 1 2 3

Slide Credits: Berkeley RAD Lab Agrawal et al., VLDB 2010 Tutorial

Page 15: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Cloud Computing: Hype or Reality •  Unlike the earlier attempts:

o  Distributed Computing o  Distributed Databases o  Grid Computing

•  Cloud Computing is REAL: o  Organic growth: Google, Yahoo, Microsoft, and Amazon o  Poised to be an integral aspect of National Infrastructure in

US and elsewhere

Agrawal et al., VLDB 2010 Tutorial

Page 16: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Cloud Computing Modalities

•  Hosted Applications and services

•  Pay-as-you-go model •  Scalability, fault-tolerance,

elasticity, and self-manageability

•  Very large data repositories •  Complex analysis •  Distributed and parallel data

processing

“Can we outsource our IT software and hardware infrastructure?”

“We have terabytes of click-stream data – what can we do with it?”

Agrawal et al., VLDB 2010 Tutorial

Page 17: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Who are our lowest/highest margin

customers ? Who are my customers

and what products are they buying?

Which customers are most likely to go to the competition ?

What impact will new products/services

have on revenue and margins?

What product promotions have the biggest

impact on revenue?

What is the most effective distribution

channel?

Why Data Analysis?

Agrawal et al., VLDB 2010 Tutorial

Business have been doing this for a long time!

Page 18: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Decision Support •  Data analysis in the enterprise context emerged:

o  As a tool to build decision support systems o  Data-centric decision making instead of using intuition o  New term: Business Intelligence

•  Used to manage and control business •  Data is historical or point-in-time •  Optimized for inquiry rather than update •  Use of the system is loosely defined and can be ad-hoc •  Used by managers and end-users to understand the

business and make judgments

Agrawal et al., VLDB 2010 Tutorial

Page 19: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Decision Support •  Traditional approach:

o  Decision makers wait for reports from disparate OLTP systems

o  Put it all together in a spreadsheet o  Manual process

•  There are many commercial systems that support analytics and decision support

Agrawal et al., VLDB 2010 Tutorial

Page 20: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Decision Support •  Traditional approach:

o  Decision makers wait for reports from disparate OLTP systems

o  Put it all together in a spreadsheet o  Manual process

•  There are many commercial systems that support analytics and decision support

Modified from Agrawal et al., VLDB 2010 Tutorial

Page 21: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Analytics in the Big Data Era •  Lots of open data available on the Web! •  Data capture at the user interaction level:

o  In contrast to the client transaction level in the Enterprise context

o  The amount of data increases significantly o  Need to analyze such data to understand user behavior

•  Cannot afford expensive warehouse solutions

Page 22: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Where are our lowest/highest margin

passengers?

What is the distribution of trip lengths?

Where should driversgo to get

passengers? What impact will

the introduction of additional medallions

have?

What is the quickestroute from midtown

to downtown at 4pm onMonday?

What would the impacts be of fare change?

Why Data Analysis? Now, many more stakeholders want to do this too!

Page 23: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Analytics in the Cloud •  Scalability to large data volumes:

o  Scan 100 TB on 1 node @ 50 MB/sec = 23 days o  Scan 100 TB on 1000-node cluster = 33 minutes

Divide-And-Conquer (i.e., data partitioning)

•  Cost-efficiency: o  Commodity nodes (cheap, but unreliable) o  Commodity network o  Automatic fault-tolerance (fewer admins) o  Easy to use (fewer programmers)

Agrawal et al., VLDB 2010 Tutorial

Page 24: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Platforms for Large-scale Data Analysis •  Parallel DBMS technologies

o  Proposed in the late eighties o  Matured over the last two decades o  Multi-billion dollar industry: Proprietary DBMS Engines

intended as Data Warehousing solutions for very large enterprises

•  Map Reduce o  pioneered by Google o  popularized by Yahoo! (open-source Hadoop)

Agrawal et al., VLDB 2010 Tutorial

Page 25: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Parallel DBMS technologies •  Popularly used for more than two decades

o  Research Projects: Gamma, Grace, … o  Commercial: Multi-billion dollar industry but access to only a

privileged few •  Relational Data Model •  Indexing •  Familiar SQL interface •  Advanced query optimization •  Well understood and studied •  Very reliable!

Agrawal et al., VLDB 2010 Tutorial

Page 26: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Parallel Databases

DBMS hides the complexity from the client application DBA does most of the work – data partitioning, optimization, etc.

Page 27: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

MapReduce •  Overview:

o  Data-parallel programming model o  An associated parallel and distributed implementation for

commodity clusters •  Pioneered by Google

o  Processing 20 PB of data per day (circa 2008)

[Dean et al., OSDI 2004, CACM Jan 2008, CACM Jan 2010]

Agrawal et al., VLDB 2010 Tutorial

Page 28: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Hadoop•  Open source of MapReduce framework of Apache Project •  Used by Yahoo!, Facebook, Amazon, and the list is

growing … •  Key components

o  MapReduce - distributes applications o  Hadoop Distributed File System (HDFS) - distributes data

•  Hadoop Distributed File System (HDFS) o  Store big files across machines o  Store each file as a sequence of blocks o  Blocks of a file are replicated for fault tolerance

•  Distribute processing of large data across thousands of commodity machines

•  You have to program your data processing and analysis

Page 29: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Word Count in Python def word_count_dict(filename):  """Returns a word/count dict for this filename."""  # Utility used by count() and Topcount().  word_count = {} # Map each word to its count  input_file = open(filename, 'r')  for line in input_file:    words = line.split()    for word in words:      word = word.lower()      # Special case if we're seeing this word for the first time.      if not word in word_count:        word_count[word] = 1      else:        word_count[word] = word_count[word] + 1  input_file.close() # Not strictly required, but good form.  return word_counthttps://github.com/mlafeldt/google-python-class/blob/master/basic/solution/wordcount.py

Page 30: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

MapReduce Programming Model•  Borrows primitives from functional programming •  Users should implement two primary methods:

o Map: (key1, val1) → [(key2, val2)] o Reduce: (key2, [val, val, val,…]) → [(key3, val3)]

© Kyuseok Shim (VLDB 2012 TUTORIAL)

Page 31: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Word Counting with MapReduce

Documents

Financial, IMF, Economics, Crisis

Financial, IMF, Crisis

Map

Key Value

Financial 1

IMF 1

Economics 1

Crisis 1

Financial 1

IMF 1

Crisis 1

Key Value

Economics 1

Harry 1

Financial 1

Harry 1

Potter 1

Film 1

Crisis 1

Harry 1

Potter 1

`

`

`

`

`

`

`Documents

Economics, Harry

Financial, Harry, Potter, Film

Crisis, Harry, Potter

Map

Doc1

Doc2

Doc3

Doc4

Doc5

M1

M2

© Kyuseok Shim (VLDB 2012 TUTORIAL)

Page 32: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Word Counting with MapReduce

Reduce

Key Value

Financial 3

IMF 2

Economics 2

Crisis 3

Harry 3

Film 1

Potter 2

Key Value

Financial 1

Financial 1

Financial 1

IMF 1

IMF 1

Economics 1

Economics 1

Key Value

Crisis 1

Crisis 1

Crisis 1

Harry 1

Harry 1

Harry 1

Film 1

Potter 1

Potter 1

Before reduce functions are called, for each distinct key, a list of associated values is generated

Key Value list

Financial 1, 1, 1

IMF 1, 1

Economics 1, 1

Crisis 1, 1, 1

Harry 1, 1, 1

Film 1

Potter 1, 1

`

Documents

Financial, IMF, Economics, Crisis

Financial, IMF, Crisis

Map

Documents

Economics, Harry

Financial, Harry, Potter, Film

Crisis, Harry, Potter

Map

Reduce

Doc1

Doc2

Doc3

Doc4

Doc5

© Kyuseok Shim (VLDB 2012 TUTORIAL)

Page 33: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

MapReduce Advantages •  Automatic Parallelization:

o  Depending on the size of RAW INPUT DATA à instantiate multiple MAP tasks

o  Similarly, depending upon the number of intermediate <key, value> partitions à instantiate multiple REDUCE tasks

•  Run-time: o  Data partitioning o  Task scheduling o  Handling machine failures o  Managing inter-machine communication

•  Completely transparent to the programmer/analyst/user

Agrawal et al., VLDB 2010 Tutorial

Page 34: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

MapReduce Experience •  Runs on large commodity clusters:

o  1000s to 10,000s of machines •  Processes many terabytes of data •  Easy to use since run-time complexity hidden from the

users •  1000s of MR jobs/day at Google (circa 2004) •  100s of MR programs implemented (circa 2004)

Agrawal et al., VLDB 2010 Tutorial

Page 35: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

The Need •  Special-purpose programs to process large amounts of

data: crawled documents, Web Query Logs, etc. •  At Google and others (Yahoo!, Facebook):

o  Inverted index o  Graph structure of the WEB documents or social network o  Summaries of #pages/host, set of frequent queries, etc. o  Ad Optimization o  Spam filtering o  …

Agrawal et al., VLDB 2010 Tutorial

Page 36: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Takeaway •  MapReduce’s data-parallel programming model

hides complexity of distribution and fault tolerance

•  Principal philosophies: o Make it scale, so you can throw hardware at problems o Make it cheap, saving hardware, programmer and

administration costs (but requiring fault tolerance) •  MapReduce is not suitable for all problems, but

when it works, it may save you a lot of time

Agrawal et al., VLDB 2010 Tutorial

Page 37: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Map Reduce vs Parallel DBMSParallel DBMS MapReduce

Schema Support ü Not out of the box

Indexing ü Not out of the box

Programming Model Declarative(SQL)

Imperative(C/C++, Java, …)

Extensions through Pig and Hive

Optimizations (Compression, Query

Optimization)ü Not out of the box

Flexibility Not out of the box ü

Fault Tolerance Coarse grained techniques ü

Agrawal et al., VLDB 2010 Tutorial

[Pavlo et al., SIGMOD 2009, Stonebraker et al., CACM 2010, …]

Page 38: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

MapReduce: A step backwards? •  Don’t need 1000 nodes to process petabytes:

o  Parallel DBs do it in fewer than 100 nodes •  No support for schema:

o  Sharing across multiple MR programs is difficult •  No indexing:

o  Wasteful access to unnecessary data •  Non-declarative programming model:

o  Requires highly-skilled programmers •  No support for JOINs:

o  Requires multiple MR phases for the analysis

Agrawal et al., VLDB 2010 Tutorial

We will study this in more detail!

Page 39: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

MapReduce and Big Data •  MapReduce programming model •  Hadoop infrastructure – HDFS, NoSQL stores •  Data management and query processing in Hadoop

environments •  Spark: processing engine compatible with Hadoop data

o  Supports streaming data, interactive queries, and machine learning

o  http://spark.apache.org

•  SQL vs. NoSQL: Big Data Hype and Reality [Tutorial by C. Mohan] o  Need to look back at the lessons learned in database design o  http://bit.ly/CMnMDS

Page 40: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Analysis and Mining

Page 41: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Mining •  Discovery of patterns and models that are

o  Valid – applicable to new data with some certainty o  Useful o  Unexpected o  Understandable to people

•  Confluence of different areas: databases, machine learning, visualization, statistics

•  We will study aspects from these areas, but focus on: o  Scalability o  Algorithms and architectures

Page 42: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Analysis and Mining •  Many challenges, even when data is not big… •  Data cleaning and curation: Bad data à bad results

o  Detection and correction of errors in data, e.g., number of passengers = 255, taxis in the river.

o  Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company

Page 43: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Analysis and Mining •  Many challenges, even when data is not big… •  Data cleaning and curation: Bad data à bad results

o  Detection and correction of errors in data E.g., number of passengers = 255, taxis in the river.

o  Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company

•  Sometimes it can be hard to distinguish between errors and outliers!

Page 44: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Analysis and Mining •  Many challenges, even when data is not big… •  Data cleaning and curation:

o  Detection and correction of errors in data E.g., number of passengers = 255, taxis in the river.

o  Entity resolution and disambiguation, e.g., apple the fruit vs. Apple the company

•  Sometimes it can be hard to distinguish between errors and outliers!

•  Visualization: Pictures help us to think o  Substitute perception for cognition o  External memory: free up limited cognitive/memory resources for

higher-level problems •  Mining: Discovery of useful, possibly unexpected, patterns

in data

Page 45: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Analysis and Mining •  In exploratory tasks, change is the norm!

o  Data analysis and mining are iterative processes o  Many trial-and-error steps

Data Data Product

Specification

Knowledge Process Perception & Cognition

Exploration

Data Manipulation User

Figure modified from J. van Wijk, IEEE Vis 2005

Page 46: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Data Analysis and Mining •  In exploratory tasks, change is the norm!

o  Data analysis and mining are iterative processes o  Many trial-and-error steps, easy to get lost…

•  Need to manage the data exploration process: o  Guide users – support for reflective reasoning o  Need provenance for reproducibility [Freire et al., CISE 2008]

Data Data Product

Specification

Knowledge Process Perception & Cognition

Exploration

Data Manipulation User

Figure modified from J. van Wijk, IEEE Vis 2005

Page 47: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Sharing and Collaboration •  Result transparency

o  Show me your work! o  Allow results to be verified à trust the results Keep track of what you do and the steps you follow – the provenance of your work

•  Hard data science problems require people with different expertise to collaborate o  Need to share work, but this can be challenging… o  E.g., A sends their analysis script to B, but B cannot run it…

•  Missing or incorrect versions of libraries •  Hard-coded file names: /home/A/myinputfile.txt •  …

Follow best practices for sharing and reproducibility

Page 48: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Analyzing and Mining Big Data: Issues •  Scalability for algorithms and computations: need to

design/extend algorithms to leverage new computing model o  We will cover this in the third module of our course

•  A big data-mining risk is that you will“discover” patterns that are meaningless – watch out for bogus patterns/events

•  Bonferroni correction gives a statistically sound way to avoid most of these bogus positive responses

Page 49: Scaling up Data Management: From Data to Big Datajuliana/courses/BigData2016/... Big Data Technology A new forecast from International Data Corporation (IDC ) sees the big data technology

Bonferroni’s Principle •  Calculate the expected number of occurrences of the

events you are looking for, assuming that data is random •  If this number is significantly larger than the number of real

instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.

•  Read textbook! o  Chapter 1 of Mining of Massive Data Analysis