http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics

Analytics Building Blocks

Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Building blocks. Not Rigid “Steps”.

Can skip some

Can go back (two-way street)

• Data types inform visualization design

• Data size informs choice of algorithms

• Visualization motivates more data cleaning

• Visualization challenges algorithm assumptions (e.g., user finds that results don’t make sense)


How does “big data” affect the process? (Hint: almost everything is harder!)

The Vs of big data (3Vs originally, then 7, now 42)

Volume: “billions”, “petabytes” are common

Velocity: think Twitter, fraud detection, etc.

Variety: text (webpages), video (youtube)…

Veracity: uncertainty of data

Variability

Visualization

Value

Sources:
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://dataconomy.com/seven-vs-big-data/
https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx

Two Example Projects from Polo Club

Apolo Graph Exploration: Machine Learning + Visualization

Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. CHI 2011.

Beautiful

Hairball

Death Star

Spaghetti

Finding More Relevant Nodes

Apolo uses guilt-by-association (Belief Propagation)

HCI Paper

Data Mining Paper

Citation network

Demo: Mapping the Sensemaking Literature

Nodes: 80k papers from Google Scholar (node size: number of citations)

Edges: 150k citations

Key Ideas (Recap)

• Specify exemplars

• Find other relevant nodes (BP)

What did Apolo go through?

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Scrape Google Scholar. No API. 😩

Design inference algorithm (Which nodes to show next?)

Paper, talks, lectures

Interactive visualization you just saw

You will use a new Apolo prototype (called Argo)

Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. ACM Conference on Human Factors in Computing Systems (CHI) 2011. May 7-12, 2011.

NetProbe: Fraud Detection in Online Auction

NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

NetProbe: The Problem

Find bad sellers (fraudsters) on eBay who don’t deliver their items

Buyer

$$$

Seller

Non-delivery fraud is a common auction fraud.

Source: https://www.fbi.gov/contact-us/field-offices/portland/news/press-releases/fbi-tech-tuesday---building-a-digital-defense-against-auction-fraud

NetProbe: Key Ideas

• Fraudsters fabricate their reputation by “trading” with their accomplices

• Fake transactions form near-bipartite cores

• How to detect them?

NetProbe: Key Ideas

Use Belief Propagation

Node states: F = Fraudster, A = Accomplice, H = Honest

Darker means more likely
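
The idea on this slide can be sketched as loopy belief propagation over the three node states. This is only a minimal illustration, not NetProbe's actual implementation: the compatibility values in PSI, the function name belief_propagation, and the toy graph are all assumptions made up for this example.

```python
from collections import defaultdict

STATES = ("fraudster", "accomplice", "honest")

# Hypothetical compatibility table: PSI[i][j] is how plausible it is for a
# node in state i to trade with a neighbor in state j. Illustrative values
# only; each row sums to 1.
PSI = [
    [0.05, 0.90, 0.05],    # fraudsters trade mostly with accomplices
    [0.50, 0.10, 0.40],    # accomplices trade with fraudsters and honest users
    [0.05, 0.475, 0.475],  # honest users rarely trade with fraudsters
]


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def belief_propagation(edges, priors, iterations=10):
    """Loopy sum-product BP on an undirected graph.

    edges: iterable of (u, v) pairs.
    priors: dict node -> prior over STATES (nodes not listed get uniform).
    Returns dict node -> normalized belief over STATES.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    k = len(STATES)
    uniform = [1.0 / k] * k
    phi = {n: priors.get(n, uniform) for n in neighbors}
    # messages[(u, v)] is the message u sends to v; start uniform.
    messages = {(u, v): uniform[:] for u in neighbors for v in neighbors[u]}

    for _ in range(iterations):
        new_messages = {}
        for (u, v) in messages:
            # Combine u's prior with messages from all neighbors except v.
            pre = phi[u][:]
            for w in neighbors[u]:
                if w != v:
                    pre = [p * m for p, m in zip(pre, messages[(w, u)])]
            out = [sum(pre[i] * PSI[i][j] for i in range(k)) for j in range(k)]
            new_messages[(u, v)] = normalize(out)
        messages = new_messages

    beliefs = {}
    for n in neighbors:
        b = phi[n][:]
        for w in neighbors[n]:
            b = [p * m for p, m in zip(b, messages[(w, n)])]
        beliefs[n] = normalize(b)
    return beliefs


# Toy auction graph (made up): f1 and f2 trade only with a1 and a2,
# forming a small bipartite core; a1 and a2 also trade with honest users.
toy_edges = [("f1", "a1"), ("f1", "a2"), ("f2", "a1"), ("f2", "a2"),
             ("a1", "h1"), ("a2", "h2"), ("h1", "h2")]
toy_beliefs = belief_propagation(toy_edges, {"f1": [0.9, 0.05, 0.05]})
for node, belief in sorted(toy_beliefs.items()):
    print(node, [round(p, 3) for p in belief])
```

Only f1 is given a prior here; every other node starts uniform, so any suspicion on its neighbors comes purely from the graph structure and the assumed compatibility table.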

NetProbe: Main Results

“Belgian Police”

What did NetProbe go through?

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Scraping (built a “scraper”/“crawler”)

Design detection algorithm

Not released

Paper, talks, lectures

NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. International Conference on World Wide Web (WWW) 2007. May 8-12, 2007. Banff, Alberta, Canada. Pages 201-210.

Homework 1 (out next week; tasks subject to change)

• Simple “End-to-end” analysis

• Collect data using Twitter API

• Store in SQLite database

• Create graph from data

• Analyze, using SQL queries (e.g., create graph’s degree distribution)

• Visualize graph using Gephi

• Describe your discoveries
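
As a rough illustration of the "store in SQLite, then analyze with SQL" steps above, the sketch below loads a tiny edge list into an in-memory SQLite database and computes the graph's degree distribution with one query. The table name edges, its columns, and the sample data are assumptions for this example; the actual homework schema and data may differ.

```python
import sqlite3

# Made-up toy edge list standing in for data collected from an API.
edges = [("alice", "bob"), ("alice", "carol"),
         ("bob", "carol"), ("carol", "dave")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source TEXT, target TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

# Treat the graph as undirected: list every endpoint of every edge,
# count occurrences per node (its degree), then count nodes per degree.
query = """
SELECT degree, COUNT(*) AS num_nodes
FROM (
    SELECT node, COUNT(*) AS degree
    FROM (
        SELECT source AS node FROM edges
        UNION ALL
        SELECT target AS node FROM edges
    )
    GROUP BY node
)
GROUP BY degree
ORDER BY degree
"""
degree_distribution = conn.execute(query).fetchall()
conn.close()

for degree, num_nodes in degree_distribution:
    print(f"degree {degree}: {num_nodes} node(s)")
```

Each row of degree_distribution is a (degree, number of nodes) pair. For a directed graph, in-degree and out-degree would be counted separately by grouping on target or source alone.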
