Big Data and Programming 4 February 2015
Dec 17, 2015
Today’s Agenda
A Short Introduction to Big Data
A Big Data Project: People In Motion
Next week
Meet Monday here at 2:30 for ca. 60-75 minutes
Meet Wednesday ca. 2:30-4:30 in Library 034a (north stairs, go to basement)
Data Deluge
Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
Library of Congress = 200 terabytes
“Transferring ‘Libraries of Congress’ of Data”
IP traffic is around 667 exabytes
It’s a deluge...
“Big Data”: too large for current software to handle
Don’t be intimidated
Not all DH sources (yet)
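To get a feel for the scale of these units, here is a rough Python sketch that expresses the 667 exabytes of IP traffic in "Libraries of Congress" of 200 terabytes each. It assumes decimal (power-of-1000) units; the names `BYTES_PER` and `to_bytes` are made up for this illustration.

```python
# Decimal (SI) byte units, each 1000 times larger than the previous one.
# Assumption: the slide's figures use decimal rather than binary units.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]
BYTES_PER = {name: 1000 ** power for power, name in enumerate(UNITS)}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to a number of bytes."""
    return amount * BYTES_PER[unit]


library_of_congress = to_bytes(200, "terabyte")
ip_traffic = to_bytes(667, "exabyte")

# Integer division keeps the arithmetic exact.
libraries = ip_traffic // library_of_congress
print(f"{libraries:,} Libraries of Congress")
```

Under these assumptions the traffic figure comes to a few million Libraries of Congress.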
Big Data for History
Tools for journalists, literature scholars and others
Where does history fit in?
Graham, Milligan, & Weingart: “Will Big Data have a revolutionary impact on the epistemological foundation of history?”
Will it get us closer to the past?
Networks
A whole world of fun!
Visualization is also a whole new world
See: David McCandless, “The Beauty of Data Visualization”
What does it tell us?
New approaches: Crowdsourcing
An “online, distributed problem-solving and production model.”
Examples:
Wikipedia
reCAPTCHA
Luis von Ahn
Others...
A Database for Your Project?
Think about how you might use a database, but perhaps not too big!
Databases can be very small and still be DH-worthy
Are there public docs out there that you can digest?
Resources: Programming Historian, MS Excel (spreadsheet), Access (relational database), Google Refine
People in Motion: Longitudinal Data from the Canadian Census
A Big Data Project at the University of Guelph
‘Unbiased’ links connecting individuals/households over several census years
A comprehensive infrastructure of longitudinal data
What we are working towards
[Diagram: Canadian censuses of 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916; US censuses of 1880 and 1900]
Stage 1: 1871 to 1881
[Diagram: 100% of 1871 Census, Automatic Linking, 100% of 1881 Census; 4,277,807 records; 3,601,663 records]
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta
Teaching a Computer to be a Genealogist
Training with existing manually-created (True) links
Ontario Industrial Proprietors – 8429 links
Logan Township – 1760 links
St. James Church, Toronto – 232 links
Quebec City Boys – 1403 links
Bias concerns Think of any?
Logan Twp
Guelph
Attributes for Automatic Linking
Last Name – string
First Name – string
Gender – binary
Birthplace – code
Age – number
Marital status – single, married, divorced, widowed, unknown
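As a rough illustration of how these six attributes might be held in code, here is a minimal Python sketch. The class and field names, the "M"/"F" gender encoding, the integer birthplace code and the example values are all assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class MaritalStatus(Enum):
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    last_name: str    # string
    first_name: str   # string
    gender: str       # binary; "M"/"F" is an assumed encoding
    birthplace: int   # code; an integer code is assumed here
    age: float        # number (fractions allow ages given in months)
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN


# A made-up record, only to show the shape of the data.
example = CensusRecord("Tremblay", "Marie", "F", 15, 32, MaritalStatus.MARRIED)
print(example)
```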
Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system:
Data Cleaning and Standardization
Cleaning
Names – remove non-alphanumeric characters; remove titles
Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
Standardization
Birthplace codes and granularity
Marital status
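The cleaning steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the project's code: the title list, the unit words, the marital-status spellings and all function names are hypothetical.

```python
import re
import unicodedata

# Hypothetical title list; the slides do not say which titles were removed.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}

# Assumed English/French unit words, as "how many of this unit make a year".
UNITS_PER_YEAR = {
    "day": 365, "days": 365, "jour": 365, "jours": 365,
    "month": 12, "months": 12, "mois": 12,
    "year": 1, "years": 1, "an": 1, "ans": 1,
}

# Assumed English/French spellings mapped to one standard value.
MARITAL_STATUS = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
}


def strip_accents(text):
    """Drop accents so that 'mariée' and 'mariee' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw):
    """Remove non-alphanumeric characters and titles, then upper-case."""
    no_punctuation = re.sub(r"[^\w\s]|_", "", raw)
    words = [w for w in no_punctuation.split() if w.lower() not in TITLES]
    return " ".join(words).upper()


def parse_age(raw):
    """Return the age in years, or None if the text is not understood."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]*)", raw.strip().lower())
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit == "":
        return value
    if unit not in UNITS_PER_YEAR:
        return None
    return value / UNITS_PER_YEAR[unit]


def standardize_marital_status(raw):
    """Map English/French notations to one value; anything else is unknown."""
    return MARITAL_STATUS.get(strip_accents(raw.strip().lower()), "unknown")


print(clean_name("Mr. John O'Neil"))
print(parse_age("3 months"))
print(standardize_marital_status("Mariée"))
```

Birthplace standardization is left out here because the slides do not describe the code table or its levels of granularity.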
Computational Expense
Very expensive to compare all the possible pairs of records
Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = about 81 days, or about 40.5 days for each attribute compared. (Big Data)
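The run-time arithmetic can be checked with a few lines of Python. The function name `runtime_days` is made up for this sketch; the inputs are the rounded figures from the slide. With these inputs, one attribute per pair works out to about 40.5 days and two attributes to about 81 days.

```python
RECORDS_1871 = 3_500_000
RECORDS_1881 = 4_000_000
COMPARISONS_PER_SECOND = 4_000_000


def runtime_days(n_left, n_right, n_attributes, rate_per_second):
    """Days needed to compare every left record with every right record."""
    comparisons = n_left * n_right * n_attributes
    seconds = comparisons / rate_per_second
    return seconds / 60 / 60 / 24


one_attribute = runtime_days(RECORDS_1871, RECORDS_1881, 1, COMPARISONS_PER_SECOND)
two_attributes = runtime_days(RECORDS_1871, RECORDS_1881, 2, COMPARISONS_PER_SECOND)

print(f"1 attribute:  {one_attribute:.1f} days")
print(f"2 attributes: {two_attributes:.1f} days")
```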
Managing Computational Expense
Blocking
By first letter of last name
By birthplace
Using HPC
Running the system on multiple processors in parallel
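Blocking can be sketched as follows: records are grouped by a key, and only records that share a key are ever compared. This Python sketch assumes the two blocking criteria are combined into one key (the slides may have applied them separately), and the dictionary field names and sample records are invented for illustration.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def build_blocks(records):
    """Group records by blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    return blocks


def candidate_pairs(left_records, right_records):
    """Yield only the pairs whose records fall in the same block."""
    right_blocks = build_blocks(right_records)
    for left in left_records:
        for right in right_blocks.get(block_key(left), []):
            yield left, right


# Made-up records, only to show the effect of blocking.
census_1871 = [
    {"last_name": "Smith", "birthplace": 1},
    {"last_name": "Roy", "birthplace": 2},
]
census_1881 = [
    {"last_name": "Smyth", "birthplace": 1},
    {"last_name": "Roy", "birthplace": 2},
    {"last_name": "Stone", "birthplace": 2},
]

pairs = list(candidate_pairs(census_1871, census_1881))
all_pairs = len(census_1871) * len(census_1881)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Because each block is independent of the others, the blocks are also a natural unit of work to spread over multiple processors.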
Record Comparison
Comparing Strings
String measures: first letter, “edit distance”, sound
Age: +/- 2 years
Required exact matches: gender, birthplace
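A minimal Python sketch of these comparison rules is shown below. It is illustrative only: the edit distance used is the standard Levenshtein distance, the sound-based measure is left out, and the threshold `max_name_distance`, the ten-year gap between censuses and the field names are assumptions rather than the project's actual settings.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_candidate_match(earlier, later, years_between=10,
                       age_tolerance=2, max_name_distance=1):
    """Apply the exact-match, age and string rules to one pair of records."""
    # Required exact matches: gender and birthplace.
    if earlier["gender"] != later["gender"]:
        return False
    if earlier["birthplace"] != later["birthplace"]:
        return False
    # Age: within +/- 2 years of the age expected in the later census.
    expected_age = earlier["age"] + years_between
    if abs(later["age"] - expected_age) > age_tolerance:
        return False
    # String measures: same first letter, then a small edit distance.
    for field in ("last_name", "first_name"):
        name_a, name_b = earlier[field].upper(), later[field].upper()
        if name_a[:1] != name_b[:1]:
            return False
        if edit_distance(name_a, name_b) > max_name_distance:
            return False
    return True


# Made-up records for illustration.
person_1871 = {"last_name": "Smith", "first_name": "John",
               "gender": "M", "birthplace": 1, "age": 25}
person_1881 = {"last_name": "Smyth", "first_name": "John",
               "gender": "M", "birthplace": 1, "age": 36}

print(edit_distance("SMITH", "SMYTH"))
print(is_candidate_match(person_1871, person_1881))
```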
Coding Workshop
Go to http://www.codecademy.com/learn
Scroll down to “Goals”
Pick one of the three activities:
Animate your Name
About You
Sun, Earth and Code
After 30 minutes, be prepared to present!