Big Data and Programming 4 February 2015

Dec 17, 2015


Roxanne Carter

Big Data and Programming

4 February 2015


Today’s Agenda
- A Short Introduction to Big Data
- A Big Data Project: People In Motion
- Next week
  - Meet Monday here at 2:30 for ca. 60-75 minutes
  - Meet Wednesday ca. 2:30-4:30 in Library 034a (north stairs, go to basement)


Data Deluge
- Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
- Library of Congress = 200 terabytes
  - “Transferring ‘Libraries of Congress’ of Data”
- IP traffic is around 667 exabytes
- It’s a deluge...
- “Big Data”: too large for current software to handle
- Don’t be intimidated
  - Not all DH sources (yet)
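
To get a feel for these units, the two figures on this slide can be put on the same scale. The sketch below assumes decimal (SI) units, where each step is a factor of 1,000; the names `UNITS`, `BYTES_PER` and `to_bytes` are illustrative and do not come from the slides.

```python
# Decimal (SI) byte units: each step up the list is a factor of 1,000.
# (Binary units such as KiB and MiB would use factors of 1,024 instead.)
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]
BYTES_PER = {name: 10 ** (3 * i) for i, name in enumerate(UNITS)}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to a number of bytes."""
    return amount * BYTES_PER[unit]


library_of_congress = to_bytes(200, "terabyte")  # figure from the slide
ip_traffic = to_bytes(667, "exabyte")            # figure from the slide

# How many "Libraries of Congress" fit into the IP traffic figure?
libraries = ip_traffic // library_of_congress
print(f"{libraries:,} Libraries of Congress")
```

With these two figures, the IP traffic number works out to a little over three million Libraries of Congress.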


Big Data for History
- Tools for journalists, literature scholars and others
- Where does history fit in?
  - Graham, Milligan, & Weingart: “Will Big Data have a revolutionary impact on the epistemological foundation of history?”
  - Will it get us closer to the past?
- Networks
  - A whole world of fun!
- Visualization is also a whole new world
  - See: David McCandless, “The Beauty of Data Visualization”
  - What does it tell us?


New approaches: Crowdsourcing
- An “online, distributed problem-solving and production model.”
- Examples:
  - Wikipedia
  - reCAPTCHA (Luis von Ahn)
  - Others...


A Database for Your Project?
- Think about how you might use a database
  - but perhaps not too big!
- Databases can be very small and still be DH-worthy
- Are there public docs out there that you can digest?
- Resources:
  - Programming Historian
  - MS Excel (spreadsheet), Access (relational database), Google Refine
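
As a rough idea of how small a useful database can be, here is a minimal sketch using Python's built-in `sqlite3` module instead of the tools listed above. The table name, the columns and the three rows are placeholders invented for illustration; they are not real project data.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE letters (id INTEGER PRIMARY KEY, sender TEXT, year INTEGER)"
)

# Placeholder rows for illustration only (not real data).
rows = [("Sender A", 1871), ("Sender B", 1881), ("Sender A", 1881)]
conn.executemany("INSERT INTO letters (sender, year) VALUES (?, ?)", rows)

# A typical question for a small database: how many items per sender?
counts = conn.execute(
    "SELECT sender, COUNT(*) FROM letters GROUP BY sender ORDER BY sender"
).fetchall()
print(counts)
```

Even a table of a few dozen rows supports this kind of counting and grouping, which is often all a small project needs.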


People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph


What we are working towards
- A comprehensive infrastructure of longitudinal data
- ‘Unbiased’ links connecting individuals/households over several census years
[Figure: linked censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916 Census; US 1880 and US 1900 Census]


Stage 1: 1871 to 1881
[Figure: 100% of 1871 Census and 100% of 1881 Census feed into Automatic Linking; record counts shown: 4,277,807 records and 3,601,663 records]
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta


Teaching a Computer to be a genealogist
- Training with existing manually-created (True) links
  - Ontario Industrial Proprietors – 8429 links
  - Logan Township – 1760 links
  - St. James Church, Toronto – 232 links
  - Quebec City Boys – 1403 links
- Bias concerns
  - Think of any?
[Map labels: Logan Twp, Guelph]


Attributes for Automatic Linking
- Last Name – string
- First Name – string
- Gender – binary
- Birthplace – code
- Age – number
- Marital status – single, married, divorced, widowed, unknown
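
The attribute list above maps naturally onto a small record type. This is only a sketch: the class and field names, the "M"/"F" encoding for gender and the use of a string for the birthplace code are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MaritalStatus(Enum):
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    last_name: str                 # string
    first_name: str                # string
    gender: str                    # binary; encoded here as "M" or "F" (assumed)
    birthplace: str                # a code; the code scheme is not specified here
    age: Optional[float]           # number, in years; None if missing
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN

    def __post_init__(self):
        # Reject anything outside the assumed two-value gender encoding.
        if self.gender not in ("M", "F"):
            raise ValueError(f"unexpected gender value: {self.gender!r}")
```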


Automatic Linkage

The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense

The system:


Data Cleaning and Standardization
- Cleaning
  - Names – remove non-alphanumeric characters; remove titles
  - Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  - All attributes – deal with English/French notations (e.g. days/jours, married/mariée)
- Standardization
  - Birthplace codes and granularity
  - Marital status
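
The cleaning steps above might look roughly like this in code. It is a sketch under stated assumptions: the list of titles, the unit words and the French/English marital-status vocabulary are illustrative guesses, and all function names are hypothetical.

```python
import re
import unicodedata

# Assumed vocabularies; the project's real lists are not given in the slides.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}
UNITS_PER_YEAR = {"days": 365, "jours": 365, "months": 12, "mois": 12,
                  "years": 1, "ans": 1}
MARITAL = {"single": "single", "celibataire": "single",
           "married": "married", "marie": "married", "mariee": "married",
           "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
           "widowed": "widowed", "veuf": "widowed", "veuve": "widowed"}


def strip_accents(text):
    """Turn accented letters into their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw):
    """Remove non-alphanumeric characters and titles; lower-case the rest."""
    text = re.sub(r"[^A-Za-z0-9 ]", "", strip_accents(raw)).lower()
    return " ".join(word for word in text.split() if word not in TITLES)


def parse_age(raw):
    """Return an age in years, or None if the value cannot be interpreted."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", raw.strip().lower())
    if match is None:
        return None
    number, unit = int(match.group(1)), match.group(2)
    if unit == "":
        return float(number)          # a bare number is taken as years
    if unit not in UNITS_PER_YEAR:
        return None
    return number / UNITS_PER_YEAR[unit]


def standardize_marital(raw):
    """Map English or French marital-status words to one vocabulary."""
    return MARITAL.get(strip_accents(raw).strip().lower(), "unknown")
```

For example, `parse_age("3 months")` and `parse_age("3 mois")` both give a quarter of a year, and `standardize_marital("Mariée")` gives `"married"`.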


Computational Expense
- Very expensive to compare all the possible pairs of records
- Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
- Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 81 days, i.e. 40.5 days per attribute compared. (Big Data)
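
The run-time estimate can be checked with a few lines of arithmetic. Evaluated with two attributes per pair, the expression comes to about 81 days; the 40.5-day figure corresponds to a single attribute comparison per pair. The variable and function names below are illustrative.

```python
records_1871 = 3_500_000            # approximate size of the 1871 census
records_1881 = 4_000_000            # approximate size of the 1881 census
comparisons_per_second = 4_000_000  # throughput assumed on the slide
seconds_per_day = 60 * 60 * 24


def days_needed(attributes_compared):
    """Days needed to compare every 1871 record with every 1881 record."""
    pairs = records_1871 * records_1881
    comparisons = pairs * attributes_compared
    return comparisons / comparisons_per_second / seconds_per_day


print(round(days_needed(1), 1))  # one attribute per pair
print(round(days_needed(2), 1))  # two attributes per pair
```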


Managing Computational Expense
- Blocking
  - By first letter of last name
  - By birthplace
- Using HPC
  - Running the system on multiple processors in parallel
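
Blocking can be sketched as grouping records by a key and only comparing records that share it. The example below combines both keys from the slide (first letter of last name and birthplace) into one block key, which is an assumption; the dictionary-based record layout, the function names and the sample names are likewise illustrative placeholders.

```python
from collections import defaultdict


def block_key(record):
    """Records are only compared if they share this key."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(census_a, census_b):
    """Yield only the pairs that fall into the same block."""
    blocks = defaultdict(list)
    for rec_b in census_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in census_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Placeholder records for illustration only (not real census data).
census_1871 = [{"last_name": "Smith", "birthplace": "ON"},
               {"last_name": "Tremblay", "birthplace": "QC"}]
census_1881 = [{"last_name": "Smyth", "birthplace": "ON"},
               {"last_name": "Stone", "birthplace": "QC"},
               {"last_name": "Taylor", "birthplace": "QC"},
               {"last_name": "Brown", "birthplace": "ON"}]

pairs = list(candidate_pairs(census_1871, census_1881))
print(len(pairs), "pairs instead of", len(census_1871) * len(census_1881))
```

Because blocks are independent of one another, each block could in principle be handed to a separate processor, which is one way the parallel step might be organized.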


Record Comparison
- Comparing strings
  - String measures: first letter, “edit distance”, sound
- Age +/- 2 years
- Required exact matches
  - Gender
  - Birthplace
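
The comparison rules above could be sketched as follows. The edit distance here is the standard Levenshtein distance; the similarity threshold of 0.8, the ten-year gap between censuses and the dictionary-based record layout are assumptions for illustration, and the sound-based (phonetic) measure is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """1.0 for identical names, falling towards 0.0 as they diverge."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


def is_candidate_match(rec_a, rec_b, years_between=10,
                       age_tolerance=2, min_similarity=0.8):
    """Apply the exact-match, age and name rules to one pair of records."""
    # Required exact matches: gender and birthplace.
    if rec_a["gender"] != rec_b["gender"]:
        return False
    if rec_a["birthplace"] != rec_b["birthplace"]:
        return False
    # Age should have advanced by the census gap, give or take the tolerance.
    expected_age = rec_a["age"] + years_between
    if abs(rec_b["age"] - expected_age) > age_tolerance:
        return False
    # Both names must be close enough by edit distance.
    return (name_similarity(rec_a["last_name"], rec_b["last_name"]) >= min_similarity
            and name_similarity(rec_a["first_name"], rec_b["first_name"]) >= min_similarity)
```

Checking the cheap exact-match rules first means the more expensive string comparison only runs on pairs that could still be the same person.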


Linkage Results 1871-81-91-1901
- Over 500,000 links…
- About 20%


Coding Workshop
- Go to http://www.codecademy.com/learn
- Scroll down to “Goals”
- Pick one of the three activities
  - Animate your Name
  - About You
  - Sun, Earth and Code
- After 30 minutes, be prepared to present!