DATA ENGINEERING CHAPTER-2

[Chapter-2]-slide DS[Data Engineering]

Apr 03, 2022


dariahiddleston

DATA ENGINEERING CHAPTER-2


OUTLINE

• Defining big data

• Sources of big data

• Distinguishing data science and data engineering

• Solutions for big data problems


ABOUT BIG DATA…?

• Hype around BIG DATA!

• What is BIG DATA?

• How can we use BIG DATA?

• Where does BIG DATA come from?

• How to use BIG DATA?

• What are the roles of data engineers and data scientists in BIG DATA?

• How to leverage BIG DATA and DATA SCIENCE to improve your life and business workflow?


DEFINING BIG DATA

• BIG DATA: data that exceeds the processing capacity of conventional database systems because it's too big (terabytes to petabytes), it moves too fast, or it doesn't fit the structural requirements of traditional database architectures.


CHARACTERISTICS OF BIG DATA

• 4 Vs

• Volume: the lower limit of big data volume ranges from a few terabytes up to tens of petabytes; there is no upper limit.

• Velocity: data volume per unit time, ranging from 30 kilobytes (KB) per second up to 30 gigabytes (GB) per second. High-velocity, real-time data streams present an obstacle to timely decision making, and the capabilities of data-handling and data-processing technologies often limit data velocities.


• Variety: unstructured and semi-structured data mixed in with structured datasets, generated by sources such as social networks and automated machinery

• Structured data—derived from all sorts of sources, from click-streams and web-based forms to point of sale transactions and sensors

• Unstructured data—derived from blog posts, emails, and Word documents

• Semi-structured data—stored as log files, XML files, or JSON data files


• Value: most big data is low value; in other words, the value-to-quantity ratio of raw big data is low. Big data consists of huge numbers of very small transactions that come in a variety of formats, and it produces true value only after it is rolled up and analyzed.
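The point about rolling up many small, differently formatted transactions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names `store` and `amount`, and the helper names `parse_records` and `roll_up` are all hypothetical and do not come from the slides.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical raw records: the same kind of small transaction arriving
# in two formats (semi-structured JSON lines and structured CSV rows).
JSON_LINES = """\
{"store": "A", "amount": 2.50}
{"store": "B", "amount": 1.25}
{"store": "A", "amount": 4.00}
"""

CSV_ROWS = """\
store,amount
B,3.75
A,0.50
"""


def parse_records(json_lines, csv_rows):
    """Yield (store, amount) pairs from both input formats."""
    for line in json_lines.splitlines():
        if line.strip():
            record = json.loads(line)
            yield record["store"], float(record["amount"])
    for row in csv.DictReader(io.StringIO(csv_rows)):
        yield row["store"], float(row["amount"])


def roll_up(records):
    """Sum the amounts per store; the aggregate is where the value appears."""
    totals = defaultdict(float)
    for store, amount in records:
        totals[store] += amount
    return dict(totals)


totals = roll_up(parse_records(JSON_LINES, CSV_ROWS))
print(totals)
```

No single record says much on its own; the per-store totals are the kind of rolled-up result that can inform a decision.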


SOURCES OF BIG DATA

• Big data is being generated by humans, machines, and sensors everywhere, on a continual basis.

• Typical sources include data from social media, financial transactions, health records, click-streams, log files, and the internet of things.


A diagram of popular big data sources.


DIFFERENCE BETWEEN DATA SCIENCE AND DATA ENGINEERING

• Defining data science: the scientific domain that's dedicated to knowledge discovery via data analysis. (Remark: "domain-specific" refers to the industry sector or subject-matter domain that data science methods are being used to explore.)

• Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big data sets.


• In business, the purpose of data science is to empower business and organizational processes for maximum efficiency and revenue generation.

• In science, data science methods are used to derive results and develop protocols for achieving the specific scientific goal at hand.

• A data scientist should have expertise in math and statistics, computer programming, and their own domain-specific subject matter.


• With these skills, a data scientist can…

• Use machine learning to optimize energy usage and lower corporate carbon footprints

• Optimize tactical strategies to achieve goals in business and in science

• Predict unknown contaminant levels from sparse environmental datasets

• Design automated theft and fraud prevention systems to detect anomalies and trigger alarms based on algorithmic results

• Implement and interpret predictive analytics and forecasting techniques for net increases in business value


DIFFERENCE BETWEEN DATA SCIENCE AND DATA ENGINEERING

• Defining data engineering: the engineering domain that's dedicated to overcoming data-processing bottlenecks and data-handling problems for applications that utilize big data.

• Both in business and in science, data science methods can provide more robust decision-making capabilities.


• A data engineer should have experience working with and designing real-time processing frameworks and Massively Parallel Processing (MPP) platforms, as well as with RDBMSs, Java, C++, Python, Hadoop, and MapReduce.


• With these skills, a data engineer can…

• Build large-scale Software as a Service (SaaS) applications,

• Build and customize Hadoop and MapReduce applications,

• Design and build relational databases and highly scaled distributed architectures for processing big data,

• Extract, transform, and load (ETL) data from one database into another.
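The ETL item in the list above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production pipeline: the table names `raw_sales` and `sales`, the columns, the sample rows, and the `etl` function are hypothetical, and two in-memory SQLite databases stand in for the source and target systems.

```python
import sqlite3


def etl(source, target):
    """Copy cleaned sales rows from the source database into the target."""
    # Extract: read the raw rows from the source database.
    rows = source.execute("SELECT name, amount_cents FROM raw_sales").fetchall()

    # Transform: tidy the names, convert cents to currency units,
    # and drop rows whose amount is missing.
    cleaned = [
        (name.strip().title(), cents / 100)
        for name, cents in rows
        if cents is not None
    ]

    # Load: write the cleaned rows into the target database.
    target.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    target.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    target.commit()
    return len(cleaned)


# Hypothetical source data, including one row with a missing amount.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_sales (name TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("  alice ", 1250), ("BOB", 300), ("carol", None)],
)

target = sqlite3.connect(":memory:")
loaded = etl(source, target)
result = target.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded, result)
```

At big data scale the same three steps apply, but each one would typically run on a distributed system rather than in a single process.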


TAKE A BREAK!! WITH …

MAP-REDUCE AND HADOOP


DIGGING INTO MAP-REDUCE

• MapReduce is a programming paradigm for parallel, distributed processing of large datasets: it converts the data into sets of tuples, and then combines and reduces those tuples into smaller sets of tuples.


• Map the data…

• Incoming data is first converted into key-value pairs and divided into fragments.

• Each computing cluster is assigned a number of map tasks.

• As the key-value pairs are processed, intermediate key-value pairs are generated; these intermediate pairs are sorted by key, and the sorted list is divided into a new set of fragments.


• Reduce the data…

• Each reduce task processes a fragment and produces an output, which is also a key-value pair.

• Reduce tasks are distributed among the different nodes of the cluster.

• The final output is written to a file system.
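The map and reduce steps above can be sketched in a single Python process using the classic word-count task. This is a minimal illustration, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample fragments are assumptions, and a real cluster would run the map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: turn one input fragment into intermediate (key, value) pairs."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Sort the intermediate pairs by key and group the values per key."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single output pair."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment would be handled by its own map task on a cluster;
    # here the map tasks simply run one after another.
    intermediate = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


fragments = ["big data big value", "data moves fast"]
counts = word_count(fragments)
print(counts)
```

The output of each reduce call is again a key-value pair, which matches the description above; on Hadoop the final pairs would be written to a file system instead of printed.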


UNDERSTANDING HADOOP

• Hadoop is …

• an open-source data processing tool

• currently a widely used program for handling huge volumes and varieties of data, because it was designed to make large-scale computing more affordable and flexible


• Hadoop can offer a great solution to handle, process, and group mass streams of structured, semi-structured, and unstructured data.

• Hadoop provides a map-and-reduce layer that’s capable of handling the data processing requirements of most big data projects.


HADOOP'S COMPONENTS

• A distributed processing framework: Hadoop uses MapReduce as its distributed processing framework, a powerful framework in which processing tasks are distributed across clusters of nodes so that large data volumes can be processed very quickly.

• A distributed file system: Hadoop uses the Hadoop Distributed File System (HDFS).


• Workloads of applications that run on Hadoop are divided among the nodes of the Hadoop cluster, and then the output is stored on the HDFS.

• Remark: Hadoop processes data in batches. If you're working with real-time, streaming data, you won't be able to use Hadoop to handle your big data issues.


IDENTIFYING ALTERNATIVE BIG DATA SOLUTIONS

• Real-time processing frameworks

• Massively Parallel Processing (MPP) platforms

• NoSQL databases


REAL-TIME PROCESSING FRAMEWORKS

• Recall: Hadoop is a batch processor and can't process real-time, streaming data. Sometimes, however, you need to query big data streams in real time.

• A real-time processing framework: a framework that is able to process data in real time, as the data streams and flows into the system.
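The difference between batch and real-time processing can be sketched with a Python generator that updates its answer as each value arrives, instead of waiting for the whole dataset. This is a toy illustration under stated assumptions: the `rolling_mean` function, the window size, and the sample readings are hypothetical and are not tied to any particular streaming framework.

```python
from collections import deque


def rolling_mean(stream, window=3):
    """Yield the mean of the most recent `window` values after each arrival."""
    recent = deque(maxlen=window)  # old values fall out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings arriving one at a time.
readings = [10, 20, 30, 40]
means = list(rolling_mean(readings, window=3))
print(means)
```

A batch job would compute one result after all readings were collected; here a fresh result is available after every reading, which is the property that real-time querying relies on.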


CATEGORIES OF REAL-TIME PROCESSING FRAMEWORKS

• Frameworks that lower the overhead of MapReduce tasks to increase the overall time efficiency of the systems (Apache Storm/Apache Spark)

• Frameworks that deploy innovative querying methods to facilitate real-time querying of big data (Google Dremel, Apache Drill, Shark for Apache Hive, Cloudera’s Impala)


MASSIVELY PARALLEL PROCESSING (MPP) PLATFORMS

• MPP: an alternative approach for distributed data processing (Teradata platform, Greenplum DCA of EMC2, Vertica of HP, Netezza of IBM, Exadata of Oracle)

• MPP runs parallel computing tasks on costly, custom hardware, whereas MapReduce runs them on cheap commodity servers.

• MPP (based on SQL) is quicker and easier to use than standard MapReduce (based on Java).
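To give a feel for why a SQL interface is considered easier, the sketch below expresses a per-group total as one declarative query. Note the assumptions: Python's built-in `sqlite3` is used only as a stand-in for a SQL interface (SQLite is not an MPP platform), and the `sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
)

# One declarative statement: the engine decides how to group and sum.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

With hand-written MapReduce, the same aggregation needs an explicit map function, a grouping step, and a reduce function, which is where the extra effort comes from.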


NOSQL DATABASES

• RDBMSs aren't equipped to handle big data

• RDBMSs cannot handle unstructured and semi-structured data

• RDBMSs don't have processing and handling capabilities for big data volume and velocity.

• NoSQL (Not Only SQL): non-relational, distributed database systems such as MongoDB

• NoSQL facilitates non-SQL data querying of non-relational or schema-free, semi-structured and unstructured data


• NoSQL offers four categories of non-relational databases: graph databases, document databases, key-value stores, and column family stores.

• NoSQL offers very efficient storage and retrieval functionality.
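The four categories above differ mainly in how a record is shaped. The sketch below mimics each shape with plain Python dictionaries; it is only an analogy under stated assumptions, since the keys, names, and values are hypothetical and no real NoSQL database or driver is involved.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": "Ada"}

# Document database: a self-describing, nested record with no fixed schema.
document = {
    "_id": 42,
    "name": "Ada",
    "tags": ["engineer"],
    "address": {"city": "London"},
}

# Column family store: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Ada"},
        "activity": {"last_login": "2022-04-03"},
    }
}

# Graph database: nodes plus the relationships between them
# (stored here as an adjacency list).
graph = {"Ada": ["Charles"], "Charles": ["Ada"]}

print(key_value["user:42"])
print(document["address"]["city"])
print(column_family["user:42"]["activity"]["last_login"])
print(graph["Ada"])
```

None of these shapes needs a table schema defined up front, which is what lets such systems hold semi-structured and unstructured data.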


WRAPPING UP YOUR KNOWLEDGE!!!


–Komate AMPHAWAN, komate(at)gmail.com

“Any feedback can help to improve quality of my teaching”
