[Chapter-2]-slide DS[Data Engineering]

DATA ENGINEERING

CHAPTER-2

OUTLINE

• Defining big data

• Sources of big data

• Distinguishing data science and data engineering

• Solutions for big data problems

ABOUT BIG DATA…?

• Hype around BIG DATA!

• What is BIG DATA?

• Where does BIG DATA come from?

• How can we use BIG DATA?

• What are the roles of data engineers and data scientists in BIG DATA?

• How can you leverage BIG DATA and DATA SCIENCE to improve your life and business workflow?

DEFINING BIG DATA

• BIG DATA: data that exceeds the processing capacity of conventional database systems because it's too big (terabytes to petabytes), it moves too fast, or it doesn't fit the structural requirements of traditional database architectures.

CHARACTERISTICS OF BIG DATA

• 4 Vs

• Volume: the lower limit of big data volumes ranges from a few terabytes up to tens of petabytes, and there is no upper limit

• Velocity: data volume per unit time, ranging from 30 kilobytes (KB) per second up to 30 gigabytes (GB) per second. High-velocity, real-time data streams present an obstacle to timely decision making, and the capabilities of data-handling and data-processing technologies often limit data velocities.

• Variety: unstructured and semi-structured data mixed in with structured datasets, generated by social networks or by automated machinery

• Structured data—derived from all sorts of sources, from click-streams and web-based forms to point of sale transactions and sensors

• Unstructured data—derived from blog posts, emails, and Word documents

• Semi-structured data—stored as log files, XML files, or JSON data files

• Value: most big data is low value. In other words, the value-to-data-quantity ratio is low in raw big data. Big data is composed of huge numbers of very small transactions that come in a variety of formats. It produces true value only after it is rolled up and analyzed.
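
To make the variety and value points concrete, here is a minimal Python sketch. It assumes a hypothetical log of tiny point-of-sale events stored as JSON lines; the field names `store` and `amount` and the sample values are invented for illustration. Each record on its own says very little, while the rolled-up totals per store are what an analyst would actually use.

```python
import json
from collections import defaultdict

# Hypothetical semi-structured log: one small JSON transaction per line.
RAW_LOG = """\
{"store": "A", "amount": 2.50}
{"store": "B", "amount": 1.25}
{"store": "A", "amount": 4.00}
{"store": "B", "amount": 3.75}
{"store": "A", "amount": 0.50}
"""

def roll_up(raw_log):
    """Sum transaction amounts per store from JSON-lines text."""
    totals = defaultdict(float)
    for line in raw_log.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        totals[record["store"]] += record["amount"]
    return dict(totals)

totals = roll_up(RAW_LOG)
print(totals)
```

At real big data scale the same roll-up would run on a distributed system rather than in a single Python process; the sketch only shows the idea.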

SOURCES OF BIG DATA

• Big data is being generated by humans, machines, and sensors everywhere, on a continual basis.

• Typical sources include data from social media, financial transactions, health records, click-streams, log files, and the internet of things.

A diagram of popular big data sources.

DIFFERENCE BETWEEN DATA SCIENCE AND DATA ENGINEERING

• Defining data science—the scientific domain that’s dedicated to knowledge discovery via data analysis. (Remarks: domain-specific refers to the industry sector or subject matter domain that data science methods are being used to explore)

• Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big data sets.

• In business: the purpose of data science is to empower business and organizational processes for maximum efficiency and revenue generation.

• In science—data science methods are used to derive results and develop protocols for achieving the specific scientific goal at hand.

• A data scientist should have expertise in math and statistics, computer programming, and their own domain-specific subject matter.

• With these skills, a data scientist can…

• Use machine learning to optimize energy usage and lower corporate carbon footprints

• Optimize tactical strategies to achieve goals in business and in science

• Predict unknown contaminant levels from sparse environmental datasets

• Design automated theft and fraud prevention systems to detect anomalies and trigger alarms based on algorithmic results.

• Implement and interpret predictive analytics and forecasting techniques for net increases in business value

DIFFERENCE BETWEEN DATA SCIENCE AND DATA ENGINEERING

• Defining data engineering: the engineering domain that's dedicated to overcoming data-processing bottlenecks and data-handling problems for applications that utilize big data.

• Both in business and in science, data science methods can provide more robust decision-making capabilities.

• A data engineer should have experience working with and designing real-time processing frameworks and Massively Parallel Processing (MPP) platforms, as well as experience with RDBMSs, Java, C++, Python, Hadoop, and MapReduce.

• With these skills, a data engineer can…

• Build large-scale Software as a Service (SaaS) applications,

• Build and customize Hadoop and MapReduce applications,

• Design and build relational databases and highly scaled distributed architectures for processing big data,

• Extract, transform, and load (ETL) data from one database into another.
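
The ETL bullet can be illustrated with a small sketch. This is only a toy example under stated assumptions: it uses two in-memory SQLite databases from Python's standard `sqlite3` module, and the table names (`raw_orders`, `orders`), column names, and sample rows are hypothetical. A production pipeline would target real source and destination systems and handle errors, batching, and schema changes.

```python
import sqlite3

def etl(source, target):
    """Copy orders from source to target, cleaning names and converting cents to dollars."""
    # Extract: read raw rows from the source database.
    rows = source.execute(
        "SELECT id, customer, amount_cents FROM raw_orders"
    ).fetchall()
    # Transform: normalize customer names and convert units.
    cleaned = [(oid, name.strip().title(), cents / 100.0) for oid, name, cents in rows]
    # Load: write the cleaned rows into the target database.
    target.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount_usd REAL)"
    )
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
    target.commit()
    return len(cleaned)

# Hypothetical source database with messy data.
source_db = sqlite3.connect(":memory:")
source_db.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
)
source_db.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "  alice ", 1250), (2, "BOB", 399)],
)
source_db.commit()

target_db = sqlite3.connect(":memory:")
loaded = etl(source_db, target_db)
result = target_db.execute(
    "SELECT id, customer, amount_usd FROM orders ORDER BY id"
).fetchall()
print(loaded, result)
```

Keeping extract, transform, and load as three visible steps makes it easy to swap any one of them, for example to read from a log file instead of a database.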

TAKE A BREAK!! WITH …

MAP-REDUCE AND HADOOP

DIGGING INTO MAP-REDUCE

• MapReduce is a programming paradigm that enables parallel, distributed processing of large sets of data by converting them to sets of tuples and then combining and reducing those tuples into smaller sets of tuples

• Map the data…

• incoming data must first be converted into key-value pairs and divided into fragments.

• each computing cluster is assigned a number of map tasks

• as the key-value pairs are processed, intermediate key-value pairs are generated; the intermediate key-value pairs are sorted by their key values, and this list is divided into a new set of fragments

• Reduce the data…

• each reduce task processes its fragment and produces an output, which is also a key-value pair,

• reduce tasks are distributed among the different nodes of the cluster,

• the final output is written onto a file system.
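
The map and reduce stages can be sketched in a few lines of Python. This is a single-process illustration of the paradigm using the classic word-count example, not Hadoop's actual API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample fragments are assumptions made for this sketch. In a real cluster the map and reduce tasks would run on different nodes.

```python
from collections import defaultdict

def map_phase(fragment):
    """Emit an intermediate (word, 1) pair for every word in a fragment."""
    return [(word.lower(), 1) for word in fragment.split()]

def shuffle(pairs):
    """Sort intermediate pairs by key and group the values for each key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single output key-value pair."""
    return key, sum(values)

# Incoming data divided into fragments (one per hypothetical map task).
fragments = ["big data is big", "data moves fast"]

intermediate = [pair for fragment in fragments for pair in map_phase(fragment)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```

Because each fragment is mapped independently and each key is reduced independently, both stages can be spread across many machines without changing the logic.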

UNDERSTANDING HADOOP

• Hadoop is …

• an open-source data processing tool

• currently the go-to program for handling huge volumes and varieties of data, because it was designed to make large-scale computing more affordable and flexible

UNDERSTANDING HADOOP

• Hadoop can offer a great solution to handle, process, and group mass streams of structured, semi-structured, and unstructured data.

• Hadoop provides a map-and-reduce layer that’s capable of handling the data processing requirements of most big data projects.

HADOOP'S COMPONENTS

• A distributed processing framework: Hadoop uses MapReduce as its distributed processing framework, a powerful framework in which processing tasks are distributed across clusters of nodes so that large data volumes can be processed very quickly.

• A distributed file system: Hadoop uses the Hadoop Distributed File System (HDFS) as its distributed file system.

• Workloads of applications that run on Hadoop are divided among the nodes of the Hadoop cluster, and then the output is stored on the HDFS.

• Remarks: Hadoop processes data in batch. If you're working with real-time, streaming data, you won't be able to use Hadoop to handle your big data issues.

IDENTIFYING ALTERNATIVE BIG DATA SOLUTIONS

• Real-time processing frameworks

• Massively Parallel Processing (MPP) platforms

• NoSQL databases

REAL-TIME PROCESSING FRAMEWORKS

• Recall: Hadoop is a batch processor and can't process real-time, streaming data. Sometimes you need to query big data streams in real time.

• A real-time processing framework: a framework that is able to process data in real time, as the data streams and flows into the system
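
The contrast with batch processing can be sketched with plain Python generators. This is not how any particular streaming framework is implemented; `sensor_stream` is a hypothetical stand-in for an unbounded data source, and the readings are invented. The point is that each result is produced as soon as a new reading arrives, without waiting for the whole dataset.

```python
from collections import deque

def rolling_average(stream, window=3):
    """Yield the average of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)

def sensor_stream():
    """Hypothetical data source; a real stream would never end."""
    for value in [10, 20, 30, 40]:
        yield value

averages = list(rolling_average(sensor_stream(), window=3))
print(averages)
```

A batch job would instead collect all readings first and compute one answer at the end, which is why it cannot support decisions made while the data is still arriving.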

CATEGORIES OF REAL-TIME PROCESSING FRAMEWORKS

• Frameworks that lower the overhead of MapReduce tasks to increase the overall time efficiency of the systems (Apache Storm/Apache Spark)

• Frameworks that deploy innovative querying methods to facilitate real-time querying of big data (Google Dremel, Apache Drill, Shark for Apache Hive, Cloudera’s Impala)

MASSIVELY PARALLEL PROCESSING (MPP) PLATFORMS

• MPP: an alternative approach for distributed data processing (Teradata platform, Greenplum DCA of EMC, Vertica of HP, Netezza of IBM, Exadata of Oracle)

• MPP runs parallel computing tasks on costly, custom hardware, whereas MapReduce runs them on cheap commodity servers.

• MPP (queried with SQL) is quicker and easier to use than standard MapReduce (programmed in Java)
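
To see why a SQL interface is considered easier, here is a small sketch of a word-count aggregation expressed as a single SQL query. It uses SQLite through Python's standard `sqlite3` module only because it is readily available: SQLite is a single-node embedded database, not an MPP platform, and the table name `words` and the sample text are hypothetical. An MPP platform would accept a similar declarative query and parallelize it internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(w,) for w in "big data is big data moves fast".split()],
)

# One declarative statement replaces hand-written map and reduce functions.
sql_counts = dict(
    conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word").fetchall()
)
print(sql_counts)
```

The query states what result is wanted and leaves the execution plan to the database engine, whereas a native MapReduce job spells out each processing step in code.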

NOSQL DATABASES

• RDBMSs aren't equipped to handle big data

• RDBMSs are not designed to handle unstructured and semi-structured data

• RDBMSs don't have the processing and handling capabilities needed for big data volume and velocity.

• NoSQL: Not Only SQL (such as MongoDB), a class of non-relational, distributed database systems

• NoSQL facilitates non-SQL data querying of non-relational or schema-free, semi-structured and unstructured data

• NoSQL offers four categories of non-relational databases: graph databases, document databases, key-value stores, and column family stores.

• NoSQL offers very efficient storage and retrieval functionality.
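
The four NoSQL categories differ mainly in how a record is shaped. The sketch below models the same hypothetical user in each style using plain Python dictionaries and lists. It does not use any real database's API, and all keys, field names, and values are invented for illustration.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:1": json.dumps({"name": "Ann", "city": "Bangkok"})}

# Document store: nested, schema-free documents that can be queried by field.
documents = [
    {"_id": 1, "name": "Ann", "address": {"city": "Bangkok"}, "tags": ["admin"]},
]

# Column family store: row key -> column family -> individual columns.
column_family = {
    "user:1": {
        "profile": {"name": "Ann", "city": "Bangkok"},
        "activity": {"last_login": "2024-01-01"},
    },
}

# Graph database: nodes plus labeled edges between them.
nodes = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
edges = [(1, "FOLLOWS", 2)]

def followers_of(node_id):
    """Return the names of nodes with a FOLLOWS edge pointing at node_id."""
    return [
        nodes[src]["name"]
        for src, label, dst in edges
        if label == "FOLLOWS" and dst == node_id
    ]

print(followers_of(2))
```

Which shape fits best depends on the queries: lookups by key favor key-value stores, while relationship-heavy questions such as "who follows whom" favor graph databases.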

WRAPPING UP YOUR KNOWLEDGE!!!

–Komate AMPHAWAN, komate(at)gmail.com

“Any feedback can help to improve quality of my teaching”
