T13 Big Data 10/6/16 13:30 The Four V's of Big Data Testing: Variety,Volume, Velocity, and Veracity Presented by: Jaya Bhagavathi Bhallamudi Tata Consultancy Services Brought to you by: 350 Corporate Way, Suite 400, Orange Park, FL 32073 8882688770 9042780524 [email protected] http://www.starwest.techwell.com/

The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Jan 22, 2018


Software

TechWell
Transcript
Page 1: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

T13 Big Data 10/6/16 13:30

The Four V's of Big Data Testing: Variety, Volume, Velocity, and Veracity

Presented by:

Jaya Bhagavathi Bhallamudi

Tata Consultancy Services

Brought to you by:

350 Corporate Way, Suite 400, Orange Park, FL 32073

888-268-8770 · 904-278-0524 - [email protected] - http://www.starwest.techwell.com/

 

   

Page 2: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

   

Jaya Bhagavathi Bhallamudi

A senior consultant in the assurance services unit of Tata Consultancy Services, Jaya Bhagavathi Bhallamudi heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development on Big Data technologies. Jaya has been in the test automation, testing services, and solutions innovation space for fifteen of her seventeen years in IT. She enjoys building test automation frameworks and accelerators for various testing services. Contact Jaya at [email protected] or on LinkedIn.

Page 3: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

The Four V’s of Big Data Testing: Variety, Volume, Velocity & Veracity

Jayabhagavathi Bhallamudi – Head, Big Data COE, TCS

October 6, 2016

TCS Confidential | Copyright © 2016 Tata Consultancy Services Limited

Page 4: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

With you today…

Jayabhagavathi Bhallamudi, Head – Big Data Testing COE, Assurance Services, TCS

• Jaya is a Senior Consultant at TCS and currently heads the Big Data and Analytics Assurance Center of Excellence, which focuses on R&D, test process definitions, test automation solution development, and competency development

• Jaya has 18+ years of experience in the IT industry, with 15+ years in test automation and in testing services and solutions innovation

• Jaya holds a Master’s degree in Computer Applications from Osmania University, Hyderabad, India

TCS Confidential Information – Not to be shared

Page 5: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Today we will cover…

1 Need for Big Data Assurance

2 Tester’s Dilemma

3 Framework to tackle the problem

Page 6: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

BIG DATA, BIGGER DILEMMA

NEED FOR BIG DATA ASSURANCE

Page 7: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Big Data Analytics

INPUT: Non-traditional internal data & uncontrolled external data

OUTPUT: Complex non-traditional analytical models

Page 8: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Garbage in equals Garbage out

IN = OUT

Increased Risk

Page 9: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

How this impacts your business

Bad Data

Wrong Insights

Business / Brand Image Losses

Incorrect Processing

Page 10: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Appropriate Big Data Assurance ensures

Good Data

Relevant Actionable Insights

Business Growth

Reliable Processing

Page 11: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

BIG DATA, BIGGER DILEMMA

TESTER’S DILEMMA

Page 12: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Scope in terms of data flow

Ingestion, Integration, Migration, Homogenization, Standardization, Storage, Analytics, Apps, Insights

Raw Data, Transformed Data

Page 13: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Focus in terms of V’s

BIG DATA: VERACITY, VALUE, VELOCITY, VOLUME, VARIETY, VARIABILITY

Labels on the diagram: TBs; RDBMS, txt, xml, json, bson, orc, rc…; Inconsistency; Reliability; Relevancy; Performance

Page 14: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Ingestion, Integration, Migration, Homogenization, Standardization, Storage, Analytics, Apps, Insights

When should we focus on which ‘V’?

Or should we focus on all V’s all the time?

Page 15: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

A FRAMEWORK TO TACKLE THE PROBLEM

Page 16: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 1: Understand the architecture of the integrated data enterprise

Page 17: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 1: Understand the architecture

[Architecture diagram]

Non-Hadoop: Databases, Files, Near real-time data streams

Hadoop: HDFS (Raw data); HIVE / HBASE (Standardized data); HIVE / HBASE (Data for creating analytical models); HIVE / HBASE (Data for applying analytical models)

Other boxes: DWHs, Apps, Analytics, Analytics

Page 18: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 2: Identify the testing interfaces

Page 19: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 2: Identify testing interfaces

[Architecture diagram with interfaces labeled a, b, c, d, e, f, g, h, i, j, k, l, m, n]

Non-Hadoop: Databases, Files, Near real-time data streams

Hadoop: HDFS (Raw data); HIVE / HBASE (Standardized data); HIVE / HBASE (Data for creating analytical models); HIVE / HBASE (Data for applying analytical models)

Other boxes: DWHs, Apps, Analytics, Analytics

Page 20: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type relevant to the interface

Page 21: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface a: Databases to HDFS (Raw data)

Testing types @ a:

Data ingestion testing

Data migration testing

Data integration testing

Page 22: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface b: Files to HDFS (Raw data)

Testing types @ b:

Data ingestion testing

Data migration testing

Data integration testing

Page 23: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface c: Near real-time data streams to HDFS (Raw data)

Testing types @ c:

Data ingestion testing

Data integration testing

Page 24: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface d: HDFS (Raw data) to HIVE / HBASE (Standardized data)

Testing types @ d:

Data homogenization testing

Page 25: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface e: HIVE / HBASE (Standardized data)

Testing types @ e:

Data standardization testing

Page 26: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface f: HIVE / HBASE (Standardized data) to HIVE / HBASE (Data for creating analytical models)

Testing types @ f:

Data migration testing

Data integration testing

Page 27: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface g: HIVE / HBASE (Data for creating analytical models)

Testing types @ g:

Analytical model validation

Page 28: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface h: HIVE / HBASE (Standardized data) to HIVE / HBASE (Data for applying analytical models)

Testing types @ h:

Data migration testing

Data integration testing

Page 29: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface i: HIVE / HBASE (Data for applying analytical models)

Testing types @ i:

Analytical model effectiveness testing

Page 30: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interfaces j, k, l: Hadoop HIVE / HBASE (Data for applying analytical models) with Analytics, Apps, Analytics

Testing types @ j, k, l:

Data provision testing

Page 31: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interface k: Hadoop HIVE / HBASE (Data for applying analytical models) and DWHs

Testing types @ k:

Data migration testing

Data ingestion testing

Data integration testing

Page 32: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 3: Identify testing type

Interfaces n, o: DWHs, Apps, Analytics

Testing types @ n, o:

Data Provisioning Testing

Page 33: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Identify the V to be prioritized for the testing type

Page 34: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Data Ingestion Testing

VARIETY: High priority for file-based data ingestions

VELOCITY: High priority for real time data ingestions
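
One possible way to put a number on velocity for real time ingestion is to compare each record's event time with the time it landed in the raw layer. This is a minimal sketch under assumptions: the field names `event_at` and `landed_at`, the 30 second threshold, and the `late_records` helper are all hypothetical.

```python
from datetime import datetime, timedelta


def late_records(records, max_lag):
    """Return records whose landing time trails the event time by more than max_lag."""
    return [r for r in records if r["landed_at"] - r["event_at"] > max_lag]


# Hypothetical sample: two stream records with their event and landing timestamps.
t0 = datetime(2016, 10, 6, 13, 30, 0)
records = [
    {"id": 1, "event_at": t0, "landed_at": t0 + timedelta(seconds=2)},
    {"id": 2, "event_at": t0, "landed_at": t0 + timedelta(seconds=45)},
]

# Assumed service level: a record should land within 30 seconds of its event time.
late = late_records(records, max_lag=timedelta(seconds=30))
print([r["id"] for r in late])
```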

Page 35: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Data Migration Testing

VOLUME: High priority for historical data migrations
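
For historical migrations, a full row-by-row comparison is often impractical at volume, so a common approach is to reconcile row counts plus an order-independent fingerprint of the rows on each side. The deck does not prescribe this technique; the sketch below is one assumed way to do it, and `fingerprint` and `migration_matches` are hypothetical names.

```python
import hashlib

MODULUS = 1 << 256  # keeps the running sum the same width as a SHA-256 digest


def fingerprint(rows):
    """Return (row count, order-independent digest sum) for an iterable of row tuples."""
    count = 0
    total = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Addition is commutative, so row order does not affect the result.
        total = (total + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, total


def migration_matches(source_rows, target_rows):
    """True when both sides have the same number of rows and the same fingerprint."""
    return fingerprint(source_rows) == fingerprint(target_rows)


source = [(1, "a"), (2, "b"), (3, "c")]
target_ok = [(3, "c"), (1, "a"), (2, "b")]   # same rows, different order
target_bad = [(1, "a"), (2, "b"), (3, "x")]  # one value changed in flight

print(migration_matches(source, target_ok))
print(migration_matches(source, target_bad))
```

In practice each side would compute its fingerprint where the data lives (for example in the source database and in Hive), and only the two small results would be compared.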

Page 36: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Data Integration Testing

VARIABILITY: Inconsistency / non-compliance checks

• High priority for data acquired from multiple sources into a single target

• High priority for data acquired from external sources such as social media
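
An inconsistency check for multi-source integration can be sketched as: group records from every source by a shared key and flag keys whose sources disagree on a field. The source names, field names, and the `conflicting_keys` helper below are assumptions made for illustration.

```python
from collections import defaultdict


def conflicting_keys(sources, key, field):
    """Return {key value: set of differing field values} for keys whose sources disagree."""
    seen = defaultdict(set)
    for rows in sources.values():
        for row in rows:
            seen[row[key]].add(row[field])
    return {k: values for k, values in seen.items() if len(values) > 1}


# Hypothetical data: an internal system and an external feed describing the same customers.
sources = {
    "crm": [
        {"customer_id": 1, "country": "US"},
        {"customer_id": 2, "country": "IN"},
    ],
    "social": [
        {"customer_id": 1, "country": "USA"},
        {"customer_id": 2, "country": "IN"},
    ],
}

conflicts = conflicting_keys(sources, key="customer_id", field="country")
print(conflicts)
```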

Page 37: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Data Homogenization Testing

VARIETY: High priority for unstructured or semi-structured to structured data format conversions
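
A basic homogenization check for semi-structured to structured conversion is that no field is lost on the way: every leaf value in the nested input should appear as exactly one column in the flat output. The sketch below assumes JSON input and dotted column names; `flatten` and `count_leaves` are hypothetical helpers, and list values are treated as single leaf values.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted column names."""
    flat = {}
    for name, value in record.items():
        column = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


def count_leaves(record):
    """Count the non-dict values in a nested dict."""
    return sum(count_leaves(v) if isinstance(v, dict) else 1 for v in record.values())


raw = json.loads('{"id": 7, "user": {"name": "A", "geo": {"city": "Hyderabad"}}}')
row = flatten(raw)

print(row)
# The check itself: the structured row must carry every leaf of the raw record.
print(len(row) == count_leaves(raw))
```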

Page 38: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Data Standardization Testing

VOLUME: High priority for any pre-existing data to be checked for conformance to data standards & industry compliance requirements
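
Conformance to data standards can be expressed as one rule per column and a scan that reports every value breaking its rule. The two rules below (an ISO-style date and a two-letter country code) are assumed examples of such standards, not standards named in the deck; `RULES` and `violations` are hypothetical names.

```python
import re

# Assumed data standards, one compiled pattern per column.
RULES = {
    "order_date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # e.g. 2016-10-06
    "country": re.compile(r"[A-Z]{2}"),              # e.g. US
}


def violations(rows, rules=RULES):
    """Return (row index, column, value) for every value that fails its rule."""
    found = []
    for index, row in enumerate(rows):
        for column, pattern in rules.items():
            value = str(row.get(column, ""))
            if not pattern.fullmatch(value):
                found.append((index, column, value))
    return found


rows = [
    {"order_date": "2016-10-06", "country": "US"},
    {"order_date": "06/10/2016", "country": "usa"},
]

for problem in violations(rows):
    print(problem)
```

Because pre-existing data is checked in full, a scan like this would normally run inside the cluster rather than on a sample pulled to a workstation.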

Page 39: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Analytical Model Validation

Analytical models based on historical data

VOLUME: To identify data patterns which were not considered in development of the model; entire historical data to be considered for testing
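
One simple reading of "data patterns which were not considered in development" is: combinations of attribute values that occur in the full history but never occurred in the data the model was built on. The sketch below assumes that reading; the column names and the `unseen_patterns` helper are hypothetical.

```python
def unseen_patterns(development_rows, historical_rows, columns):
    """Return value combinations present in the full history but absent from development data."""
    def pattern(row):
        return tuple(row[c] for c in columns)

    seen = {pattern(r) for r in development_rows}
    return sorted({pattern(r) for r in historical_rows} - seen)


# Hypothetical data: the development sample is a subset of the full history.
development = [
    {"channel": "web", "segment": "retail"},
    {"channel": "web", "segment": "corporate"},
]
history = development + [
    {"channel": "mobile", "segment": "retail"},
    {"channel": "web", "segment": "retail"},
]

print(unseen_patterns(development, history, columns=["channel", "segment"]))
```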

Page 40: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Analytical Model Validation

Analytical models not based on historical data

VERACITY: High priority to identify the data patterns that are not relevant for the business

VALUE: High priority to identify the data patterns that do not bring any value to the business

Page 41: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity

Step 4: Prioritize V’s

Analytical Model Effectiveness Testing

VOLUME: High priority to identify wrong predictions, unidentified data patterns

If the actual data, on which the model needs to be run, is available

Page 42: The Four V’s of Big Data Testing: Variety, Volume, Velocity, and Veracity


Thank you!

For more information, please write to me at [email protected]

Visit TCS at booth # 1