Big Data Curation Webinar 19/12/2013 BIG Big Data Public Private Forum Big Data Curation Edward Curry (Insight @ NUI Galway) Project co-funded by the European Commission within the 7th Framework Program (Grant Agreement No. 257943)


Big Data Curation Webinar 19/12/2013

BIG Big Data Public Private Forum

Big Data Curation

Edward Curry (Insight @ NUI Galway)

Project co-funded by the European Commission within the 7th Framework Program (Grant Agreement No. 257943)

BIG DATA INSIGHTS

The Data Landscape

▶  Coping with data variety and verifiability are central challenges and opportunities for Big Data
▶  The long tail of data variety is a major shift in the data landscape
▶  Need for scalable approaches to cope with data under different format and semantic assumptions

The Solution Space

▶  Lowering the usability barrier for data tools is a major requirement across all sectors. Users should be able to directly manipulate the data
▶  Blended human and algorithmic data processing approaches are a trend for coping with data acquisition, transformation, curation, access, and analysis challenges for Big Data
▶  Solutions based on large communities (crowd-based approaches) are emerging as a trend to cope with Big Data challenges

THE DATA VALUE CHAIN

Value Chain stages, with the topics of the Technical Working Groups:

Data Acquisition
•  Structured data
•  Unstructured data
•  Event processing
•  Sensor networks
•  Streams
•  Multimodality

Data Analysis
•  Data preprocessing
•  Semantic analysis
•  Sentiment analysis
•  Data correlation
•  Pattern recognition
•  Realtime analysis
•  Machine learning

Data Curation
•  Trust
•  Provenance
•  Data augmentation
•  Annotation
•  Data validation
•  Redundancy elimination
•  Keep up-to-date
•  Consistency

Data Storage
•  In-Memory Technology
•  HANA
•  Column DB
•  NoSQL
•  Cloud storage
•  Compression

Data Usage
•  Decision support
•  Predictions
•  Simulation
•  Exploration
•  Modelling
•  Control
•  Domain-specific usage

DATA CURATION

Value Chain: Data Acquisition, Data Analysis, Data Curation, Data Storage, Data Usage

THE PROBLEM: DATA QUALITY

Source A

ID    PNAME      PCOLOR  PRICE
APNR  iPod Nano  Red     150
APNS  iPod Nano  Silver  160

Source B

<Product name="iPod Nano">
    <Items>
        <Item code="IPN890">
            <price>150</price>
            <generation>5</generation>
        </Item>
    </Items>
</Product>

Schema Difference? (Data Developer)

Combined records:

APNR       iPod Nano  Red     150
APNR       iPod Nano  Silver  160
iPod Nano  IPN890     150     5

Value Conflicts? Entity Duplication? (Data Steward)

Roles: Data Developer, Data Steward, Business Users
Expertise labels: Technical, (Technical) Domain, Domain
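The three questions on this slide (schema difference, value conflicts, entity duplication) can be made concrete with a small sketch. It assumes, purely for illustration, that Source B's product name corresponds to PNAME and its price to PRICE in Source A. The function names and the matching rule are hypothetical and not taken from the webinar.

```python
import xml.etree.ElementTree as ET

# Source A: flat table rows, as shown on the slide.
SOURCE_A = [
    {"ID": "APNR", "PNAME": "iPod Nano", "PCOLOR": "Red", "PRICE": 150},
    {"ID": "APNS", "PNAME": "iPod Nano", "PCOLOR": "Silver", "PRICE": 160},
]

# Source B: nested XML, as shown on the slide.
SOURCE_B = """
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
"""


def flatten_source_b(xml_text):
    """Map Source B onto Source A's column names (assumed mapping)."""
    product = ET.fromstring(xml_text.strip())
    rows = []
    for item in product.iter("Item"):
        rows.append({
            "ID": item.get("code"),
            "PNAME": product.get("name"),
            "PCOLOR": None,  # Source B carries no colour field
            "PRICE": int(item.findtext("price")),
        })
    return rows


def find_issues(rows_a, rows_b):
    """Flag candidate duplicates and price conflicts for a data steward."""
    issues = []
    for a in rows_a:
        for b in rows_b:
            if a["PNAME"] != b["PNAME"]:
                continue  # different product names: not compared
            if a["PRICE"] == b["PRICE"]:
                kind = "possible duplicate"
            else:
                kind = "value conflict"
            issues.append((a["ID"], b["ID"], kind))
    return issues


issues = find_issues(SOURCE_A, flatten_source_b(SOURCE_B))
for issue in issues:
    print(issue)
```

The sketch only raises candidates. Deciding whether IPN890 really is the red or the silver iPod Nano still needs the domain knowledge of a data steward or business user.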

DATA CURATION OVERVIEW

Definition

▶  Digital Curation: “Selection, preservation, maintenance, collection, and archiving of digital assets”
▶  Data Curation: “Active management of data over its life-cycle”

Who?

▶  Individual Curators
▶  Curation Departments
▶  Community-based (Emerging trend)

How?

▶  Manual Curation
▶  (Semi-)Automated
▶  Sheer Curation
▶  Collaborative Data Management (Crowdsourcing)

Why?

▶  Accessible
▶  Authenticity
▶  Collaboration
▶  Discoverability
▶  Fitness for Use
▶  Integrity
▶  Reusability
▶  Security
▶  Sustainability
▶  Trustworthy

ALGORITHM + CROWD

Figure labels: Data Sources, Data Quality Algorithms, Human Computation, Internal Community, External Crowd, Developers, Data Governance, Clean Data
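One way to read the algorithm-plus-crowd figure is as a routing rule: data quality algorithms handle the records they are confident about, and the remainder goes to human computation. The sketch below is a minimal illustration under that assumption. The rule in `algorithmic_check`, the `threshold` parameter and the `fake_crowd` stand-in are all hypothetical.

```python
def algorithmic_check(record):
    """Toy data quality rule: return (cleaned_record, confidence in [0, 1])."""
    price = record.get("price")
    if isinstance(price, (int, float)) and price >= 0:
        return record, 1.0
    if isinstance(price, str) and price.strip().isdigit():
        # A numeric string can be repaired automatically.
        return {**record, "price": int(price)}, 0.9
    return record, 0.0  # the rule cannot decide


def curate(records, ask_human, threshold=0.8):
    """Keep confident algorithmic results; send the rest to human computation."""
    clean = []
    for record in records:
        candidate, confidence = algorithmic_check(record)
        if confidence >= threshold:
            clean.append(candidate)
        else:
            clean.append(ask_human(candidate))
    return clean


def fake_crowd(record):
    """Stand-in for a task sent to an internal community or external crowd."""
    return {**record, "price": 150, "reviewed_by": "crowd"}


result = curate(
    [{"id": 1, "price": 150}, {"id": 2, "price": " 160 "}, {"id": 3, "price": "n/a"}],
    ask_human=fake_crowd,
)
```

Here only the third record reaches a person. In practice the threshold trades crowd cost against the risk of accepting a wrong automatic repair.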

MIXED HUMAN-COMPUTER INTELLIGENCE

Key Points

▶  Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t)
▶  A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals

Related Areas

▶  Collective Intelligence
▶  Social Computing
▶  Human Computation
▶  Data Mining & Machine learning
▶  Natural Language Processing
▶  Speech recognition & Computer vision

HUMAN VS MACHINE AFFORDANCES

Human

✓ Visual perception
✓ Visuospatial thinking
✓ Audiolinguistic ability
✓ Sociocultural awareness
✓ Creativity
✓ Domain knowledge

Machine

✓ Large-scale data manipulation
✓ Collecting and storing large amounts of data
✓ Efficient data movement
✓ Bias-free analysis

WHEN COMPUTERS WERE HUMAN

Maskelyne 1760

▶  Used human computers to create an almanac of moon positions
▶  Used for shipping/navigation
▶  Quality assurance
    ▶  Do calculations twice
    ▶  Compare to third verifier
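The quality assurance scheme on this slide (do each calculation twice, then have a third party compare) maps directly onto redundant task assignment. The sketch below is one reading of it, not a description of the historical procedure: two independent workers compute every task, and disagreements are set aside for the verifier. All function names are hypothetical.

```python
def double_compute(tasks, computer_a, computer_b):
    """Run every task twice and separate agreed results from disputed ones."""
    accepted = {}
    disputed = []
    for task in tasks:
        result_a = computer_a(task)
        result_b = computer_b(task)
        if result_a == result_b:
            accepted[task] = result_a
        else:
            # The third party (the verifier) reviews these by hand.
            disputed.append((task, result_a, result_b))
    return accepted, disputed


def square(n):
    """A careful worker."""
    return n * n


def sloppy_square(n):
    """A worker who makes an arithmetic slip on one input."""
    return n * n + (1 if n == 3 else 0)


accepted, disputed = double_compute([1, 2, 3], square, sloppy_square)
print(accepted)
print(disputed)
```

Agreement between two workers does not prove a result correct, since both can make the same mistake, but it cheaply catches independent errors.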

WHEN COMPUTERS WERE HUMAN

BIG DATA CURATION EXEMPLARS

TAG A TUNE

PEEKABOOM

FOLDIT

RECAPTCHA

▶  OCR
    ▶  ~1% error rate
    ▶  20%-30% for 18th and 19th century books
▶  40 million ReCAPTCHAs every day (2008)
▶  Fixing 40,000 books a day

BIG DATA CURATION IN ENTERPRISES

Product Categorization

▶  Categorize millions of products with accurate and complete attributes
▶  Combine the crowd with machine learning to create an affordable and flexible catalog quality system

Sentiment Analysis

▶  Understanding customer sentiment for worldwide launch of new product
▶  Implemented 24/7 sentiment analysis system using workers from around the world

BIG DATA CURATION USE CASES

▶  Telco, Media, & Entertainment
▶  Manufacturing, Retail, Energy & Transport
▶  Public Sector
▶  Life Sciences

COMMUNITY, CROWDS, & OPEN DATA

Community & Crowds

▶  Leverage online community to curate large datasets
▶  Natural Language Processing, Computer Vision, Classification, Verification, Enrichment, Judgments, etc.

Emerging Economic Model for Open Data

▶  Pre-competitive collaboration efforts
▶  Share costs, risks, & technical challenges
▶  Benefit from collective wisdom and network effect for curated dataset
▶  Pistoia Alliance (pharmaceutical data)

FUTURE REQUIREMENTS OF BIG DATA CURATION

Curation at Scale

▶  Increase in the need for automation
▶  Trust and provenance capture/management

Access Management

▶  Interfaces which can cope with different levels of expertise and responsibility
▶  Discoverability of data items
▶  Fine-grained control over accessibility of various data items

Variety of Expertise

▶  Enable contribution from a wide range of human resources such as programmers, domain experts, non-expert contributors, and crowds
▶  Distribute curation tasks while considering the abilities of persons and the complexities of tasks

Multimedia & Text

▶  Data curation infrastructure focused on multimedia and unstructured resources
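The "Variety of Expertise" requirement, distributing curation tasks according to people's abilities and task complexity, can be sketched as a simple matching procedure. The 1-to-3 skill and complexity scale, the class names and the greedy strategy below are assumptions made for illustration. The webinar does not prescribe an algorithm.

```python
from dataclasses import dataclass


@dataclass
class Contributor:
    name: str
    skill: int      # hypothetical scale: 1 = crowd worker, 3 = domain expert
    capacity: int   # number of tasks this person can take


@dataclass
class Task:
    task_id: str
    complexity: int  # same hypothetical 1-3 scale


def distribute(tasks, contributors):
    """Give each task to the least skilled contributor who is still qualified."""
    remaining = {c.name: c.capacity for c in contributors}
    by_skill = sorted(contributors, key=lambda c: c.skill)
    assignment = {}
    # Place the hardest tasks first so that expert capacity goes to them
    # before easier tasks can spill over.
    for task in sorted(tasks, key=lambda t: t.complexity, reverse=True):
        for person in by_skill:
            if person.skill >= task.complexity and remaining[person.name] > 0:
                assignment[task.task_id] = person.name
                remaining[person.name] -= 1
                break
        else:
            assignment[task.task_id] = None  # nobody qualified or available
    return assignment


people = [
    Contributor("crowd", skill=1, capacity=2),
    Contributor("programmer", skill=2, capacity=1),
    Contributor("expert", skill=3, capacity=1),
]
work = [Task("t1", 1), Task("t2", 3), Task("t3", 2), Task("t4", 1)]
assignment = distribute(work, people)
print(assignment)
```

Preferring the least skilled qualified contributor keeps scarce experts free for the tasks only they can do, which is the point of considering abilities and complexities together.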

SUMMARY

The Data Landscape

▶  Coping with data variety and verifiability are central challenges and opportunities for Big Data
▶  The long tail of data variety is a major shift in the data landscape
▶  Need for scalable approaches to cope with data under different format and semantic assumptions

The Solution Space

▶  Lowering the usability barrier for data tools is a major requirement across all sectors. Users should be able to directly manipulate the data
▶  Blended human and algorithmic data processing approaches are a trend for coping with data acquisition, transformation, curation, access, and analysis challenges for Big Data
▶  Solutions based on large communities (crowd-based approaches) are emerging as a trend to cope with Big Data challenges
▶  Principled semantic and standardized data representation models are central to cope with data heterogeneity

BIG DATA CURATION INTERVIEW SERIES

http://big-project.eu/text-interviews

Future Interviews: More to come in 2014…