Top Banner
The Record of our Recent Past Large-Scale Text Mining in Canadian Web Archives Ian Milligan Assistant Professor
34

Congress text-mining-event

Jul 24, 2015

Download

Internet

Ian Milligan
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Congress text-mining-event

The Record of our Recent Past

Large-Scale Text Mining in Canadian Web Archives

Ian Milligan Assistant Professor

Page 2: Congress text-mining-event

Why?The sheer amount of social,

cultural, and political information generated every

day presents new opportunities for historians.

Page 3: Congress text-mining-event

Could one even study the 1990s

and beyond without

web archives?

Page 4: Congress text-mining-event

No.

Historians need to do this now, or we’re going to be left behind.

Page 5: Congress text-mining-event

One Case Study

• Archive-It Research Services/University of Toronto Libraries: “Canadian Political Parties and Political Interest Groups” collection

• 2005 - 2015

• ARC/WARC/WAT Files

Page 6: Congress text-mining-event

Problem One:Historians want content, but we

can only locally work with metadata

Page 7: Congress text-mining-event

WAT files vs.

ARC/WARC files

Page 8: Congress text-mining-event

Do we want metadata or content analysis?

Page 9: Congress text-mining-event

Historians NEED content, but metadata can help us find and contextualize it

Page 10: Congress text-mining-event

Metadata Extraction

Page 11: Congress text-mining-event

Metadata Extraction

Page 12: Congress text-mining-event

Metadata Extraction

Page 13: Congress text-mining-event

Metadata Extraction

• Results @ http://ianmilligan.ca/2015/02/05/topic-modeling-web-archive-modularity-classes/

Page 14: Congress text-mining-event

Metadata Extraction• Conservative themes (2014): economic

development, family, immigration, legislation, women’s issues, senior issues, Ukrainians, constituency offices, some prominent (and not-so-prominent) MPs, and of course, our economic action plan.

• Liberal themes (2014): Justin Trudeau (the new leader), cuts to social programs, child poverty, mental health, municipal issues, labour, workers, Stop the Cuts, and housing.

Page 15: Congress text-mining-event

Metadata Extraction

• Conservative themes (2006): education, university, but tons of information on Aboriginal issues;

• Liberal themes (2006): community questions, electoral topics, universities, human rights, child care support.

Page 16: Congress text-mining-event

As well as short stories..

Page 17: Congress text-mining-event
Page 18: Congress text-mining-event
Page 19: Congress text-mining-event
Page 20: Congress text-mining-event
Page 21: Congress text-mining-event
Page 22: Congress text-mining-event
Page 23: Congress text-mining-event
Page 24: Congress text-mining-event

2005 Canadian Federal Election

Page 25: Congress text-mining-event

WATs help us find the files we need to use - and to

contextualize them

Page 26: Congress text-mining-event

Problem Two:You can do amazing things with the content (WARCs), but you

need a cluster.

Page 27: Congress text-mining-event
Page 28: Congress text-mining-event

WARC Analysis• 2005-2009: 244 GB of content;

2.9 GB of plain text

• 10,606,822 websites

• On a local powerful node (3 Ghz 8-Core Intel Xeon E5/64 GB RAM, data on SSD), about three to four hours per query

• On a cluster, about ~10-20 minutes per query, depending on traffic

Page 29: Congress text-mining-event

Large-Scale Text Analysis

• With Hadoop about 15-20 minutes to extract all plain-text from any specified queries: i.e. all pages belonging to Green Party, Liberal Party, Conservative Party, Council of Canadians, etc.

• Compared to “out of memory”/go home for an extended weekend on a local node

Page 30: Congress text-mining-event

Large-Scale Text Analysis

• NER/LDA/Keyword Frequency broken down by scrape date: i.e. scrape carried out 2005-10, see change over time;

• Downside: not everything is optimized for parallel environment; if not, it crawls (there goes a day)

• Downside: scrape date != creation date, requiring temporal analysis

Page 31: Congress text-mining-event

Trial Three: Culturomics for Web

Archives?

Page 32: Congress text-mining-event

(switch to browser)

Page 33: Congress text-mining-event

Some code/walkthroughs/sample data available at

https://github.com/ianmilligan1/WAHR

Page 34: Congress text-mining-event

Thank you!

Ian Milligan Assistant Professor

https://uwaterloo.ca/web-archive-group/