Big Data and Programming (History 9808A) 27 October 2014
Dec 18, 2015

Kelley Brooks
Big Data and Programming (History 9808A)

27 October 2014

Today’s Agenda
Proposals
  How are we with the due date?
A Short Introduction to Big Data
A Big Data Project: People In Motion

Data Deluge
Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
Library of Congress = 200 terabytes
  “Transferring ‘Libraries of Congress’ of Data”
IP traffic is around 667 exabytes
It’s a deluge...
“Big Data”
  too large for current software to handle
Don’t be intimidated
  Not all DH sources (yet)
Instructive video – David McCandless, “The Beauty of Data Visualization”
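To put the two figures on this slide on the same scale, here is a minimal sketch that converts both to bytes and divides. It assumes decimal (SI) units, where each step up is a factor of 1000; the function name `to_bytes` is only illustrative.

```python
# Decimal (SI) byte units are assumed; binary units (1 TB = 2**40 bytes)
# would give slightly different figures.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def to_bytes(value, unit):
    """Convert a quantity in the named unit to bytes (factor of 1000 per step)."""
    return value * 1000 ** UNITS.index(unit)

library_of_congress = to_bytes(200, "terabyte")  # figure from the slide
ip_traffic = to_bytes(667, "exabyte")            # figure from the slide

libraries = ip_traffic / library_of_congress
print(f"{libraries:,.0f} Libraries of Congress")
```

Under these assumptions the traffic figure works out to a little over three million Libraries of Congress.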

Big Data for History
Tools for journalists, lit scholars and others
Where does history fit in?
  “Digital history does not offer truths, but only a new way of interpreting and understanding traces of the past.” (S. Graham, I. Milligan, & S. Weingart)
Blog Leaders
Taryn
  “…we have to have a better understanding of how programming works so we can at least engage with Computer Scientists to help develop the complex systems required…”
Tamar
  The Strange Case of Belgium/Ancestry.com
Nick K.
  The Case of the Missing API

New approach: Crowdsourcing
An “online, distributed problem-solving and production model.”
Examples:
  Wikipedia
  reCAPTCHA
    Luis von Ahn
  Others...
    Transcribe Bentham
    Census transcription

A Database for Your Project?
Think about how you might use a database
  but perhaps not too big!
Databases can be very small and still be DH-worthy
Are there public docs out there that you can digest?
  Google Refine
Incorporate a search function into your website?
Resources
  MS Excel (spreadsheet)
  MS Access (relational database)
  Google Refine
    Cleaning data
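As a rough illustration of how small a useful database can be, the sketch below uses Python's built-in sqlite3 module in place of the tools listed above. The table layout, the sample rows, and the function `search_by_last_name` are all invented for the example.

```python
import sqlite3

# In-memory database; the table and the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, birth_year INTEGER)"
)
conn.executemany(
    "INSERT INTO people (last_name, first_name, birth_year) VALUES (?, ?, ?)",
    [("Smith", "Mary", 1850), ("Tremblay", "Jean", 1862), ("Smythe", "Ann", 1871)],
)

def search_by_last_name(prefix):
    """Return rows whose last name starts with the given prefix."""
    cursor = conn.execute(
        "SELECT last_name, first_name, birth_year FROM people "
        "WHERE last_name LIKE ? ORDER BY last_name",
        (prefix + "%",),
    )
    return cursor.fetchall()

print(search_by_last_name("Sm"))
```

A search box on a project website could pass its input to a query like this one; the `?` placeholder keeps user input out of the SQL text itself.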

People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph

‘Unbiased’ links connecting individuals/households over several census years
A comprehensive infrastructure of longitudinal data
What we are working towards
[Diagram: linked census files for 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916, plus the US 1880 and US 1900 censuses]

Stage 1: 1871 to 1881
[Diagram: 100% of 1871 Census (3,601,663 records) -> Automatic Linking -> 100% of 1881 Census (4,277,807 records)]
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Teaching a Computer to be a Genealogist
Training with existing manually-created (True) links
  Ontario Industrial Proprietors – 8429 links
  Logan Township – 1760 links
  St. James Church, Toronto – 232 links
  Quebec City Boys – 1403 links
Bias concerns
  Think of any?
[Map labels: Logan Twp, Guelph]

Attributes for Automatic Linking
Last Name – string
First Name – string
Gender – binary
Birthplace – code
Age – number
Marital status – single, married, divorced, widowed, unknown
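One way to picture these attributes is as a typed record. The sketch below is an assumption about how such a record might be modelled, not the project's actual schema; the class names, field names, and the numeric birthplace code are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MaritalStatus(Enum):
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"

@dataclass(frozen=True)
class CensusRecord:
    last_name: str               # string
    first_name: str              # string
    gender: str                  # two-value code, e.g. "M" or "F"
    birthplace_code: int         # hypothetical numeric code for the birthplace
    age: Optional[float]         # age in years; None when not recorded
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN

example = CensusRecord("Smith", "Mary", "F", 15, 21.0, MaritalStatus.SINGLE)
print(example)
```

Giving each attribute a fixed type up front makes it clearer which comparison applies to it later: string measures for names, a numeric tolerance for age, equality for the coded fields.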

Automatic Linkage

The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense

The system:

Data Cleaning and Standardization
Cleaning
  Names – remove non-alphanumeric characters; remove titles
  Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  All attributes – deal with English/French notations (e.g. days/jours, married/mariée)
Standardization
  Birthplace codes and granularity
  Marital status
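The cleaning steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the title list, the unit table, and the function names `clean_name` and `parse_age` are made up for the example, and ages are expressed in years.

```python
import re

TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}  # illustrative list only

# English and French age units, expressed as fractions of a year (assumed values).
UNIT_IN_YEARS = {
    "year": 1.0, "years": 1.0, "an": 1.0, "ans": 1.0,
    "month": 1 / 12, "months": 1 / 12, "mois": 1 / 12,
    "week": 1 / 52, "weeks": 1 / 52, "semaine": 1 / 52, "semaines": 1 / 52,
    "day": 1 / 365, "days": 1 / 365, "jour": 1 / 365, "jours": 1 / 365,
}

def clean_name(raw):
    """Drop titles and any character that is not a letter, digit, or space."""
    kept = "".join(ch for ch in raw if ch.isalnum() or ch.isspace())
    words = [w for w in kept.split() if w.lower() not in TITLES]
    return " ".join(words).upper()

def parse_age(raw):
    """Turn '34', '3 months' or '10 jours' into years; None if unparseable."""
    match = re.fullmatch(r"\s*(\d+)\s*([^\W\d_]*)\s*", raw)
    if not match:
        return None
    number, unit = int(match.group(1)), match.group(2).lower()
    if unit == "":
        return float(number)          # a bare number is taken to be years
    factor = UNIT_IN_YEARS.get(unit)
    return None if factor is None else number * factor

print(clean_name("Mrs. Mary-Ann O'Brien"))
print(parse_age("3 months"), parse_age("10 jours"), parse_age("34"))
```

Returning None for anything unrecognised, rather than guessing, leaves the decision about unusual entries to a person.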

Computational Expense
Very expensive to compare all the possible pairs of records
Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = about 81 days, or 40.5 days per attribute. (Big Data)
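The run-time arithmetic can be checked with a few lines of Python. The function name and parameters are illustrative; the inputs are the rounded figures from the slide. With those figures the formula comes to about 81 days when two attributes are compared, and about 40.5 days for a single attribute.

```python
def linkage_days(n_a, n_b, attributes, comparisons_per_second):
    """Days needed to compare every record in one file with every record in the other."""
    comparisons = n_a * n_b * attributes
    seconds = comparisons / comparisons_per_second
    return seconds / 60 / 60 / 24

two = linkage_days(3_500_000, 4_000_000, 2, 4_000_000)
one = linkage_days(3_500_000, 4_000_000, 1, 4_000_000)
print(f"two attributes: {two:.1f} days; one attribute: {one:.1f} days")
```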

Managing Computational Expense
Blocking
  By first letter of last name
  By birthplace
Using HPC
  Running the system on multiple processors in parallel
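Blocking means comparing only records that share a cheap key, instead of every possible pair. A minimal sketch, assuming records are plain dictionaries with hypothetical `last_name` and `birthplace` fields and that both blocking criteria are combined into one key:

```python
from collections import defaultdict

def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])

def candidate_pairs(file_a, file_b):
    """Yield only the pairs of records that fall in the same block."""
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in file_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b

# Invented sample records, for illustration only.
census_1871 = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
    {"last_name": "Brown", "birthplace": "ON"},
]
census_1881 = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Tremblay", "birthplace": "QC"},
    {"last_name": "Brown", "birthplace": "NS"},
    {"last_name": "Scott", "birthplace": "ON"},
]

pairs = list(candidate_pairs(census_1871, census_1881))
print(len(pairs), "candidate pairs instead of", len(census_1871) * len(census_1881))
```

The saving has a cost: a person whose birthplace or initial was recorded differently in the two censuses (the two Brown records here) is never compared at all. Because blocks are independent of one another, they are also a natural unit to hand to separate processors.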

Record Comparison
Comparing Strings
  String measures: first letter, “edit distance”, sound
Age
  +/- 2 years
Required exact matches
  Gender
  Birthplace
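The comparison rules can be sketched as below. This is a simplified, rule-based illustration, not the project's system, which the earlier slides suggest was trained on known true links. The edit distance shown is the standard Levenshtein distance; the threshold of two edits, the ten-year gap between censuses, and the field names are assumptions, and the sound-based measure is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_candidate_link(rec_a, rec_b, years_apart=10, max_name_edits=2, age_tolerance=2):
    """Apply the slide's rules to an earlier record (rec_a) and a later one (rec_b)."""
    # Gender and birthplace must match exactly.
    if rec_a["gender"] != rec_b["gender"] or rec_a["birthplace"] != rec_b["birthplace"]:
        return False
    # Age must agree within the tolerance once the years between censuses are added.
    if abs(rec_a["age"] + years_apart - rec_b["age"]) > age_tolerance:
        return False
    # Names: same first letter of the last name, and few edits apart.
    if rec_a["last_name"][:1] != rec_b["last_name"][:1]:
        return False
    return (edit_distance(rec_a["last_name"], rec_b["last_name"]) <= max_name_edits
            and edit_distance(rec_a["first_name"], rec_b["first_name"]) <= max_name_edits)

# Invented records for illustration.
person_1871 = {"last_name": "SMITH", "first_name": "MARY", "gender": "F",
               "birthplace": "ON", "age": 21}
person_1881 = {"last_name": "SMYTH", "first_name": "MARY", "gender": "F",
               "birthplace": "ON", "age": 32}
print(edit_distance("SMITH", "SMYTH"), is_candidate_link(person_1871, person_1881))
```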

Linkage Results
1871-81-91-1901
  Over 500,000 links…
  About 20%

Coding Playtime
W3C tutorials
The Programming Historian
  http://programminghistorian.org/
Codecademy
  http://www.codecademy.com/learn