Mini-Project on Web Data Analysis DANIEL DEUTCH

Mini-Project on Web Data Analysis DANIEL DEUTCH. Data Management “Data management is the development, execution and supervision of plans, policies, programs.

Dec 31, 2015

Page 1

Mini-Project on Web Data Analysis

DANIEL DEUTCH

Page 2

Data Management

“Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets”

(DAMA Data Management Body of Knowledge)

A major success:

the relational model of databases

Page 3

Relational Databases

• Introduced by Codd (1970), who later received the Turing Award (1981) for his contributions to database management

• Huge success and impact:

‒ The vast majority of organizational data today is stored in relational databases

‒ Implementations include MS SQL Server, MS Excel, Oracle DB, MySQL, …

‒ 2 Turing Award winners (Edgar F. Codd and Jim Gray)

• Basic idea: data is organized in tables (=relations)

• Relations can be derived from other relations using a set of operations called the relational algebra

‒ SQL is largely based on it
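
To make the idea of deriving relations from other relations concrete, here is a minimal sketch of three relational algebra operators (selection, projection, natural join) over relations stored as lists of dictionaries. The function names, the `sid`/`cid` attributes and the sample rows in `takes` are assumptions made up for illustration; they do not come from the slides or from any particular database system.

```python
# Relations are modeled as lists of dicts (one dict per tuple).
# All names and sample rows below are hypothetical.

def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the given attributes, dropping duplicate tuples."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def natural_join(left, right):
    """Natural join: combine tuples that agree on all shared attributes."""
    result = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[a] == r[a] for a in shared):
                result.append({**l, **r})
    return result

students = [
    {"sid": 1, "name": "Charles", "category": "undergrad"},
    {"sid": 2, "name": "Dan", "category": "grad"},
]
takes = [
    {"sid": 1, "cid": 10},
    {"sid": 2, "cid": 10},
    {"sid": 2, "cid": 20},
]

# A derived relation: names of grad students with the courses they take.
grads = select(students, lambda row: row["category"] == "grad")
grad_courses = project(natural_join(grads, takes), ["name", "cid"])
print(grad_courses)
```

Each operator takes relations and returns a relation, so the operators compose freely. That closure property is what lets a query be written as a nested expression and rewritten by an optimizer.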

Page 4

Research in Data(base) Management

• 1970- : Relational Databases (tables)

‒ Indexing, Tuning, Query Languages, Optimizations, Expressive Power, …

• ~20 years ago: Emergence of the Web and research on Web data

‒ XML, text databases, web graph, …

‒ Google is a product of this research (by Stanford PhD students Brin and Page)

• Recent years: hot topics include distributed databases, data privacy, data integration, social networks, web applications, crowdsourcing, trust, …

‒ Foundations taken from “classical” database research

• Theoretical foundations with practical impact

Page 5

Web 2.0

• “Old” web (“Web 1.0”): static pages

– News, encyclopedic knowledge...

– No, or very little, interaction between the web page and the user.

• Web 2.0: A term used very broadly for websites that use newer technologies (Ajax, JS, ...), allowing interaction with the user.

– “Network as platform” computing

– The “participatory Web”

Page 7

Online shopping

Page 8

Advertisements

Page 9

Social Networks

Page 10

Crowd Sourcing

Page 11

Data is all around

• Web graph

• “Social graph”

• Pictures, videos, notifications, messages, ...

• Data that the application processes

• Advertisements

• Even the application structure itself

Page 12

(A small portion of) the web graph

Page 13

Need to Analyze

• Huge amount of data out there

– Est. 13.68 billion web pages and counting

– Half a billion tweets per day and counting

– An average user “sees” about 600 tweets per day

• Most of it is irrelevant to you; some of it is incorrect

Page 14

Filter, Rank, Explain

• Filter

– Select the portion of data that is relevant

– Group similar results

• Rank

– Rank data by trustworthiness, relevance, recency...

– Present the highest-ranked first

• Explain

– An explanation of why the data is considered relevant/highly ranked

– An explanation of how the data has propagated

• “Why do I see this?”
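
The three steps can be sketched as a tiny pipeline. This is only an illustration under made-up assumptions: the `Item` fields, the topic-equality filter, and the 0.7/0.3 weighting of trust against recency are hypothetical choices, not a method proposed in the slides.

```python
from dataclasses import dataclass

@dataclass
class Item:
    text: str
    topic: str
    trust: float      # hypothetical trust score in [0, 1]
    age_hours: float  # time since the item was posted

def filter_items(items, topic):
    """Filter: keep only the items relevant to the requested topic."""
    return [item for item in items if item.topic == topic]

def score(item):
    """Rank criterion: an assumed weighted mix of trust and recency."""
    recency = 1.0 / (1.0 + item.age_hours)
    return 0.7 * item.trust + 0.3 * recency

def rank(items):
    """Rank: highest score first."""
    return sorted(items, key=score, reverse=True)

def explain(item, topic):
    """Explain: a human-readable answer to 'Why do I see this?'."""
    return (
        f"Shown because its topic {item.topic!r} matches your query {topic!r}; "
        f"trust={item.trust:.2f}, age={item.age_hours:.0f}h, score={score(item):.3f}"
    )

items = [
    Item("New index structure released", "databases", trust=0.9, age_hours=10),
    Item("Rumor about a database outage", "databases", trust=0.4, age_hours=0),
    Item("Match results", "sports", trust=0.8, age_hours=1),
]

ranked = rank(filter_items(items, "databases"))
for item in ranked:
    print(item.text, "|", explain(item, "databases"))
```

Keeping the score in a separate function means the explanation can report exactly the quantities that determined the ranking, which is the link between "Rank" and "Explain".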

Page 15

Main topics

• Analysis of Tables and Links on the Web

• Trust Management

• Explanation (Provenance)

• Information Extraction

• Social Networks

• Crowd-sourcing

• Distributed Query Evaluation

Page 16

Approach

• Leverage knowledge from “classic” database research

• Account for the new challenges

• Do so in a generic manner

• Leverage unique features such as collaborative contribution, distribution, etc.

Page 17

Students

SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…            …

Physical Storage, Indexing, Distribution, ...

Data model, Query language: Select … From … Where …

[Figure: query plan over Students, Takes and Courses, with operators labeled sid=sid, cid=cid, name=“Mary” and sname]
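
As a rough illustration of the Select … From … Where pattern and the sid=sid / cid=cid joins named on this slide, the sketch below runs one such query with Python's built-in `sqlite3` module. The column layout of Takes and Courses, the `category` filter, and all rows other than the two student names are assumptions for the example; it does not try to reproduce the exact plan in the slide's figure.

```python
import sqlite3

# In-memory database with a hypothetical three-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (sid INTEGER PRIMARY KEY, sname TEXT, category TEXT);
    CREATE TABLE Courses  (cid INTEGER PRIMARY KEY, cname TEXT);
    CREATE TABLE Takes    (sid INTEGER, cid INTEGER);
    """
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(1, "Charles", "undergrad"), (2, "Dan", "grad")],
)
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?)",
    [(10, "Databases"), (20, "Algorithms")],
)
conn.executemany(
    "INSERT INTO Takes VALUES (?, ?)",
    [(1, 10), (2, 10), (2, 20)],
)

# Select-From-Where: join the three tables on sid and cid,
# then keep only students in the requested category.
query = """
    SELECT s.sname, c.cname
    FROM Students s, Takes t, Courses c
    WHERE s.sid = t.sid
      AND t.cid = c.cid
      AND s.category = ?
    ORDER BY c.cname
"""
rows = conn.execute(query, ("grad",)).fetchall()
print(rows)
conn.close()
```

The query says only what result is wanted. How the tables are stored, indexed, or distributed is left to the system, which is the separation between data model and physical storage that the slide depicts.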

Page 18

Foundations

• Model

• Query Language

• Query evaluation algorithms

• Prototype implementation and optimizations

• Getting Data and Testing

Page 19

Project Requirements

• Read a paper (or a bunch of papers) in the area

• Likely to require that you follow citations and read earlier papers!

• Think of an application based on the paper ideas

• Does not have to be exactly the application described in the paper!

• E.g. you do not have to use relational databases

• Think of how you would get/generate data

• Implement, test

• Submit an application + report

Page 20

Report

• An integral part of the project submission

• Should include:

• A detailed description of the model and algorithms that you have implemented

• A detailed description of the application

• Code design

• Use cases

• Difficulties that you have encountered and how you addressed them

Page 21

Timeline

• By 20/3 (1 week from now): send me an ordered list of 3 preferred papers

• Email title includes the words “mini-project”

• Body includes the names and IDs of the pair

• Shortly after Passover (date TBA): each pair gives a 7-10 minute presentation on the expected project

• One slide on each of the issues listed on the Project Requirements slide

• 1 week before the last week of the semester: short project presentations (including screenshots or a live demo)