BACK TO BASICS: BIG DATA AND EDUCATION IN THE SOCIAL SCIENCES. Matthew S. Weber, Rutgers University. AEJMC 2014, Montreal, Canada

AEJMC 2014 - Big Data and Education

Jun 02, 2015


Education

mwe400

The role of big data in education for the social sciences.
Transcript
Page 1: AEJMC 2014 - Big Data and Education

BACK TO BASICS: BIG DATA AND EDUCATION IN THE SOCIAL SCIENCES

Matthew S. Weber
Rutgers University

AEJMC 2014
Montreal, Canada

Page 2: AEJMC 2014 - Big Data and Education


Page 3: AEJMC 2014 - Big Data and Education
Page 4: AEJMC 2014 - Big Data and Education
Page 5: AEJMC 2014 - Big Data and Education


Breaking down the walls of big data?

Page 6: AEJMC 2014 - Big Data and Education


http://archivehub.rutgers.edu

Page 7: AEJMC 2014 - Big Data and Education

EXAMPLE: Undergraduates

Page 8: AEJMC 2014 - Big Data and Education

Learning About Your Network

• By being aware of your connections, you can take an active role in managing your connections
– Be aware of the connections that you have, and what they contribute to your “network”
– Seek out networking opportunities
– Forge connections with people you admire and respect

Page 9: AEJMC 2014 - Big Data and Education
Page 10: AEJMC 2014 - Big Data and Education

LinkedIn Network Maps

Page 11: AEJMC 2014 - Big Data and Education
Page 12: AEJMC 2014 - Big Data and Education

Assignment Prompt

Prompt: Use www.touchgraph.com/facebook to generate a map of your Facebook network. Spend some time exploring your different connections, and then respond to the following:
• What different types of clusters do you see? Be specific in identifying at least 2 – 3 different clusters.
• Is there someone in your network you forgot about? Who? Why?
• Identify 2 people who you feel are the most useful connections in your network based on where they are positioned. Who are they and why are they useful?
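One way to make "useful based on where they are positioned" concrete is betweenness centrality, which counts how many shortest paths between other people run through a given person. The sketch below is only an illustration: the friendship map and every name in it are hypothetical, and betweenness is one possible reading of the prompt, not a measure the assignment itself prescribes. It uses Brandes' algorithm for an unweighted, undirected network.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    `graph` maps each person to a list of their connections and is
    assumed to be undirected (every tie is listed in both directions).
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths.
        order = []                         # nodes in order of discovery
        preds = {v: [] for v in graph}     # predecessors on shortest paths
        sigma = {v: 0 for v in graph}      # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical map: a "school" cluster and a "work" cluster joined by Gus.
friends = {
    "Ana": ["Ben", "Cat"],
    "Ben": ["Ana", "Cat"],
    "Cat": ["Ana", "Ben", "Gus"],
    "Gus": ["Cat", "Dev"],
    "Dev": ["Gus", "Eli", "Fay"],
    "Eli": ["Dev", "Fay"],
    "Fay": ["Dev", "Eli"],
}

scores = betweenness(friends)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

In this toy map the person who bridges the two clusters ranks first, followed by the two people who connect each cluster to that bridge. People with ties only inside one cluster score zero.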


Page 13: AEJMC 2014 - Big Data and Education

EXAMPLE: PhD

Page 14: AEJMC 2014 - Big Data and Education
Page 15: AEJMC 2014 - Big Data and Education

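-- Use 30 reducers by default.
SET DEFAULT_PARALLEL 30;

-- Load selected fields from a WAT file (JSON metadata derived from a web archive).
-- The org.sci.historycrawl UDFs used below must already be registered; the slide omits that step.
titles = LOAD '/home/hai/Projects/HistoryCrawl/Data/IA/2_26_2014/nsf1.wat.gz'
    USING org.archive.hadoop.ArchiveJSONViewLoader(
        'Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata',
        'Envelope.WARC-Header-Metadata.WARC-Target-URI',
        'Envelope.WARC-Header-Metadata.WARC-Date',
        'Envelope.WARC-Header-Metadata.Content-Type',
        'Envelope.WARC-Header-Metadata.Content-Length')
    AS (links:chararray, target:chararray, date:chararray, contenttype:chararray, contentlength:chararray);

-- Keep only records that carry link metadata.
nonnulls = FILTER titles BY links IS NOT NULL;

-- Parse the links, then flatten to one row per link with a formatted date.
paths = FOREACH nonnulls GENERATE org.sci.historycrawl.parser($0,$1,$2), $2, $3, $4;
i6 = FOREACH paths GENERATE bagwati.url, $1, $2, $3;
i7 = FOREACH i6 GENERATE FLATTEN($0) AS words, org.sci.historycrawl.formatdate(SUBSTRING($1,0,10)), $2, $3;

-- Columns: source URL, destination URL, link text, date, content type, bytes.
i8 = FOREACH i7 GENERATE org.sci.historycrawl.getsourceURL($0), org.sci.historycrawl.getdstURL($0), org.sci.historycrawl.getText($0), $1, $2, (long)$3;

-- Group by (source, destination, date); keep one text and one content type per group,
-- count the links, and sum the bytes.
i9 = GROUP i8 BY ($0,$1,$3);
i10 = FOREACH i9 GENERATE FLATTEN(group), FLATTEN(TOP(1,0,i8.$2)), COUNT(i8), FLATTEN(TOP(1,0,i8.$4)), SUM(i8.$5);

-- Drop rows with a missing source or destination, then write the result.
i11 = FILTER i10 BY $0 IS NOT NULL;
i12 = FILTER i11 BY $1 IS NOT NULL;
STORE i12 INTO '/home/hai/Projects/HistoryCrawl/Data/IA/2_26_2014/HC_Output' USING PigStorage();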
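For readers who do not use Pig, the grouping step of the script (the relations i9 through i12) can be approximated in plain Python. This is a rough sketch under stated assumptions, not the project's implementation: the record layout follows the column order of i8 (source, destination, text, date, content type, bytes), the sample records are hypothetical, and `max()` stands in for Pig's `TOP(1, 0, ...)`, which keeps the highest-ranked value in each group.

```python
from collections import defaultdict


def aggregate_links(records):
    """Summarize link records per (source, destination, date).

    Each record is a tuple in the order
    (source, destination, text, date, content_type, n_bytes).
    Returns rows of
    (source, destination, date, text, frequency, content_type, total_bytes).
    """
    groups = defaultdict(list)
    for source, destination, text, date, content_type, n_bytes in records:
        # The script drops rows without a source or destination after
        # grouping; filtering first gives the same surviving rows.
        if source is None or destination is None:
            continue
        groups[(source, destination, date)].append((text, content_type, n_bytes))

    rows = []
    for (source, destination, date), items in groups.items():
        texts = [t for t, _, _ in items if t is not None]
        types = [c for _, c, _ in items if c is not None]
        rows.append((
            source,
            destination,
            date,
            max(texts) if texts else None,   # one representative link text
            len(items),                      # how often the link appears
            max(types) if types else None,   # one representative content type
            sum(b for _, _, b in items),     # total bytes across captures
        ))
    return rows


# Hypothetical records, for illustration only.
records = [
    ("http://a.example", "http://b.example/story", "Story headline", "2012-10-22", "text/html", 1200),
    ("http://a.example", "http://b.example/story", "Story headline", "2012-10-22", "text/html", 800),
    ("http://a.example", "http://c.example", "Other link", "2012-10-23", "text/html", 500),
    (None, "http://c.example", "Orphan", "2012-10-23", "text/html", 100),
]

rows = aggregate_links(records)
for row in rows:
    print(row)
```

The two captures of the same link on the same day collapse into one row with a frequency of 2, and the record without a source is dropped.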

Page 16: AEJMC 2014 - Big Data and Education


Page 17: AEJMC 2014 - Big Data and Education


Page 18: AEJMC 2014 - Big Data and Education


Source | Destination | Date | Frequency | Content Type | Bytes | Descriptive Text

Link Data:

http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football

Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag

http://gawker.com

2012-10-22
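Link records of this shape can be read into a weighted edge list for network analysis. The sketch below makes several assumptions: the file is tab-delimited (the default delimiter of Pig's PigStorage), the columns follow the order in the header above, and the sample rows are hypothetical placeholders. The `Link` record and both function names are illustrative only.

```python
import csv
import io
from typing import NamedTuple


class Link(NamedTuple):
    source: str
    destination: str
    date: str
    frequency: int
    content_type: str
    n_bytes: int
    text: str


def read_links(handle):
    """Parse tab-delimited link rows, skipping rows with the wrong column count."""
    links = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 7:
            continue
        source, destination, date, frequency, content_type, n_bytes, text = row
        links.append(Link(source, destination, date, int(frequency),
                          content_type, int(n_bytes), text))
    return links


def edge_weights(links):
    """Sum link frequency per (source, destination) pair across all dates."""
    weights = {}
    for link in links:
        key = (link.source, link.destination)
        weights[key] = weights.get(key, 0) + link.frequency
    return weights


# Hypothetical rows, for illustration only.
SAMPLE = (
    "http://a.example\thttp://b.example/story\t2012-10-22\t2\ttext/html\t2000\tStory headline\n"
    "http://a.example\thttp://b.example/story\t2012-10-23\t1\ttext/html\t900\tStory headline\n"
    "http://a.example\thttp://c.example\t2012-10-23\t1\ttext/html\t500\tOther link\n"
)

links = read_links(io.StringIO(SAMPLE))
weights = edge_weights(links)
print(weights)
```

Summing frequency over dates gives one weight per directed link. Keeping the date in the key instead would preserve the time dimension for longitudinal analysis.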

Page 19: AEJMC 2014 - Big Data and Education


Dataset | Research Potential | Dates | Captures | Unique URLs

Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740

Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455

US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397

US House | | | 51,840,777 | 12,410,014

Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655

US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823

Page 20: AEJMC 2014 - Big Data and Education

• Email me! [email protected]
• ArchiveHub: http://archivehub.rutgers.edu
• The Team
– Kris Carpenter, Vinay Goel, Internet Archive
– David Lazer, Katherine Ognyanova, Northeastern University
– Allie Kosterich, Hai Nguyen, Luan Nguyen, Marya Doerfel, Rutgers University
– Peter Monge, Ayushman Datta, Kristen Guth, USC


Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers