Page 1
CS6604 Digital Libraries
Global Events Team Final Presentation
Presenters: Liuqing Li, Islam Harb, Andrej Galad
{liuqing, iharb, agalad}@vt.edu
Instructor: Dr. Edward A. Fox
Virginia Polytechnic Institute and State University
Blacksburg, VA, 24061
April 27, 2017
Page 2
Outline
• Background
• Implementation
  • Data Collection
  • Data Processing
  • Data Visualization
• Future Work
• Acknowledgement
Page 3
Background
• GETAR*
  • Global Event and Trend Archive Research
  • Architecture
* Edward A. Fox, Donald Shoemaker, Chandan Reddy, Andrea Kavanaugh, III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR), NSF grant IIS-1619028, 2017-2019. http://eventsarchive.org
Page 4
Implementation – Architecture
[Architecture diagram]
• Components: Event Focused Crawler (EFC), WARC Files, CDX Writer, CDX Files, ArchiveSpark, Apache Spark, Stanford NER, Regular Expression, Score Function, Entity-based Results, Standalone HBase, Web Application
• Stages: Data Collection, Data Processing, Data Visualization
Page 5
Events of Interest
School Shooting Events                 Year
Virginia Tech Shooting                 2007
Northern Illinois University Shooting  2008
Dunbar High School Shooting            2009
University of Alabama Shooting         2010
Worthing High School Shooting          2011
Sandy Hook Elementary School Shooting  2012
Sparks Middle School Shooting          2013
Reynolds High School Shooting          2014
Umpqua Community College Shooting      2015
Townville Elementary School Shooting   2016
Page 6
Focused Crawler – Collecting / Archiving
[Flowchart]
• START
• Manually Curate Seeds
• URLs Queue
• Download Page
• Process Page & Convert into WARC Format
• Extract URLs
• Calculate Relevancy
• Relevant? (Yes / No)
• Discard
• Append Result: warc.gz Event File
• All URLs? (Yes / No)
• END
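The crawl loop in the flowchart could be sketched roughly as follows. This is a minimal illustration, not the team's crawler: the keyword-overlap `relevancy` score, the `threshold`, the `max_pages` cap, and the choice to follow links only from relevant pages are all assumptions, and `fetch` is a hypothetical callable standing in for the download step.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def relevancy(text, keywords):
    """Hypothetical score: fraction of event keywords found in the page."""
    lowered = text.lower()
    hits = sum(1 for k in keywords if k.lower() in lowered)
    return hits / len(keywords) if keywords else 0.0


def focused_crawl(seeds, fetch, keywords, threshold=0.5, max_pages=100):
    """Crawl from curated seeds and return the URLs judged relevant.

    `fetch` is any callable mapping a URL to an HTML string (or None).
    """
    queue = deque(seeds)          # URLs queue, seeded manually
    seen = set(seeds)
    kept = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        html = fetch(url)         # download page
        if html is None:
            continue
        if relevancy(html, keywords) < threshold:
            continue              # Relevant? No -> discard
        kept.append(url)          # Relevant? Yes -> append to the event file
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links: # extract URLs and enqueue unseen ones
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real run would plug in an HTTP client and write each kept page to the event's WARC file.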
Page 7
WARC Libraries
• Wget (version 1.14 or later)
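As a rough illustration of Wget's WARC output, the sketch below only builds a command line and does not run it. The helper name `wget_warc_command`, the crawl depth, and the example URL and file name are hypothetical; `--warc-file`, `--recursive`, `--level`, and `--page-requisites` are standard Wget options.

```python
import shlex


def wget_warc_command(url, warc_name, depth=1):
    """Build a Wget command that saves the crawl as <warc_name>.warc.gz.

    Assumes GNU Wget 1.14 or later, as stated on the slide.
    """
    return [
        "wget",
        "--recursive",
        f"--level={depth}",
        "--page-requisites",
        f"--warc-file={warc_name}",  # Wget adds the .warc.gz suffix itself
        url,
    ]


cmd = wget_warc_command("http://example.org/", "example_2017")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```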
Page 8
WARC Libraries
• Wpull
Page 9
WARC Libraries
• WARCIO: WARC (and ARC) Streaming Library
  • Python 2.7+ and 3.3+
  • Post-Processing: Read/Write WARC format
Page 10
Ten Events Collections
• Naming Convention
  • [location]_[year].warc.gz
Page 11
Tools for Data Processing
• ArchiveSpark
  • Apache Spark framework for Web Archives
  • Easy data extraction
  • Input: WARC and CDX files
• CDX Writer
  • Python script to create CDX files of WARC files
  • Format: CDX N b a m s k r M S V g
    • e.g., edu,vt,cnre)/ 20170422005601 http://cnre.vt.edu text/html 200 BT3ILJXROIILHBKQPNYDUCUVZRDKG3OA - - 947820104749 data/Virginia-Tech-Shooting_20070416.warc.gz
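To make the CDX format line concrete, here is a small sketch that maps each field letter in `CDX N b a m s k r M S V g` to a named value. The field meanings follow the legacy Internet Archive CDX legend; the dictionary keys and the `parse_cdx_line` helper are illustrative names, and the sample line uses invented values rather than data from the collection.

```python
CDX_HEADER = "CDX N b a m s k r M S V g"

# Field-letter meanings from the legacy CDX legend (key names are my own).
FIELD_NAMES = {
    "N": "massaged_url",
    "b": "timestamp",
    "a": "original_url",
    "m": "mime_type",
    "s": "status",
    "k": "digest",
    "r": "redirect",
    "M": "meta_tags",
    "S": "compressed_size",
    "V": "offset",
    "g": "filename",
}


def parse_cdx_line(line, header=CDX_HEADER):
    """Split one space-delimited CDX line into a dict keyed by field name."""
    letters = header.split()[1:]  # drop the leading "CDX"
    values = line.split()
    if len(values) != len(letters):
        raise ValueError(f"expected {len(letters)} fields, got {len(values)}")
    return {FIELD_NAMES[letter]: value for letter, value in zip(letters, values)}


# Hypothetical record: every value below is made up for illustration.
sample = (
    "org,example)/ 20170422005601 http://example.org/ text/html 200 "
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1234 5678 data/example_2017.warc.gz"
)
record = parse_cdx_line(sample)
print(record["original_url"], record["offset"], record["filename"])
```

The offset and compressed size are what let a reader such as ArchiveSpark seek directly to one record inside the WARC file instead of scanning it.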
Page 12
Data Preprocessing
• Webpage Cleaning
  • Extract Raw Text
    • payload.string.html.body.text
  • Remove jQuery & JavaScript
    • {WPGroHo.syncProfileData(hash,id);}, …
  • Remove tags
    • <br>, <p>, …
  • Remove markers
    • *, |, +, …
  • Remove stop words
    • a, about, the, …
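The cleaning steps above might look like this in plain Python. It is a sketch using only the standard library, not the ArchiveSpark pipeline itself; the `STOP_WORDS` set is a tiny illustrative list and the helper names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "about", "an", "and", "in", "of", "on", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_page(html):
    """Strip scripts, tags, markers, and stop words from one HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)        # tags are gone, scripts skipped
    text = re.sub(r"[*|+]", " ", text)   # remove markers
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)


page = (
    "<html><body><p>About the shooting</p>"
    "<script>WPGroHo.syncProfileData(hash,id);</script>"
    "<br>* campus | news +</body></html>"
)
print(clean_page(page))
```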
Page 13
Data Processing
• Entity Extraction
  • Basic Parsing
    • event name and date
  • Stanford NER (Integrated model)
    • entities, shooter name
  • Regular Expression
    • event date
    • shooter name and age
    • number of victims
    • weapon list
  • Score Function
    • 𝑡𝑓 ∗ 𝑑𝑓
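A sketch of the regular-expression step and the score function follows. The patterns are illustrative guesses at what such rules could look like, not the team's actual expressions, and the definitions of tf (total occurrences across pages) and df (number of pages containing the term) are assumptions, since the slide gives only the product.

```python
import re
from collections import Counter

# Illustrative patterns only; the real rules are not shown on the slide.
AGE_RE = re.compile(r"\b(\d{1,2})-year-old\b")
VICTIMS_RE = re.compile(r"\b(\d+)\s+victims\b", re.IGNORECASE)
DATE_RE = re.compile(
    r"\b(January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+(\d{1,2}),\s+(\d{4})\b"
)


def extract_features(text):
    """Return the first match for each pattern, or None when absent."""
    age = AGE_RE.search(text)
    victims = VICTIMS_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "shooter_age": age.group(0) if age else None,
        "shooting_victims": victims.group(0) if victims else None,
        "date": date.group(0) if date else None,
    }


def score_entities(docs):
    """Score each term as tf * df over a list of tokenized documents.

    Assumed definitions: tf = total occurrences across all documents,
    df = number of documents that contain the term.
    """
    tf = Counter()
    df = Counter()
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    return {term: tf[term] * df[term] for term in tf}


features = extract_features(
    "On April 16, 2007, a 23-year-old gunman left 32 victims."
)
scores = score_entities([["Virginia", "Tech", "Virginia"], ["Virginia", "campus"]])
print(features, scores)
```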
Page 14
HBase
• Built-in ImportTsv Utility
  • Import Data into HBase

Table Name     globalevents
Row_Key        Event_Date + Event Hash Value    20070416217787922
Column Family  event
Column         event: name              Virginia Tech Shooting
               event: date              20070416
               event: shooter_age       23-year-old
               event: shooting_victims  32 victims
               event: entities          Virginia;Tech;VA;University;…
               event: entities_count    146900;62415;13940;7732;…
               event: entities_url      url1,url2,url3,url4,url6;url2,url3,url4,url5;url1,url3,url4,url5,url6;…
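The row layout above could be serialized for ImportTsv along these lines. This is a sketch under assumptions: the slide does not say which hash produces the row-key suffix, so CRC32 is used purely as a stand-in, and only four of the columns are written to keep the example short.

```python
import zlib

# Subset of the "event" column family, in the order written to the TSV.
COLUMNS = ["name", "date", "shooter_age", "shooting_victims"]


def row_key(event_date, event_name):
    """Event date followed by a hash of the event name.

    CRC32 is a stand-in; the slide does not say which hash was used.
    """
    return f"{event_date}{zlib.crc32(event_name.encode('utf-8'))}"


def tsv_line(event):
    """One tab-separated line: row key first, then the event columns."""
    key = row_key(event["date"], event["name"])
    return "\t".join([key] + [event[c] for c in COLUMNS])


def importtsv_columns():
    """Value for ImportTsv's -Dimporttsv.columns option."""
    return ",".join(["HBASE_ROW_KEY"] + [f"event:{c}" for c in COLUMNS])


event = {
    "name": "Virginia Tech Shooting",
    "date": "20070416",
    "shooter_age": "23-year-old",
    "shooting_victims": "32 victims",
}
print(tsv_line(event))
print(importtsv_columns())
```

The second printed value is what would be passed as `-Dimporttsv.columns=...` when running `hbase org.apache.hadoop.hbase.mapreduce.ImportTsv` against the `globalevents` table and the generated TSV file.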
Page 15
Data Processing – Demo
• Key Stages
  • Initialization
    • Create Spark Session
    • Create NLP Core
    • Create Storage
  • Processing
    • Extract Event Name / Date / URL
    • Extract Named Entities
    • Extract Other Event Features
  • Export and Import
    • Generate TSV file
    • Import TSV file into HBase
Page 16
Global Events Viewer
• Efficient visualization of long-term global events
  • Show representative terms -> link to corresponding URLs
  • Visualize events’ trends over time (time series)
• Java 7 Spring Boot Web application
  • Build system - Gradle
  • Embedded Tomcat Web server
  • Backend - HBase, in-memory
  • Frontend - D3.js, Bootstrap
https://github.com/dedocibula/global-events-viewer
Page 17
Global Events Viewer – Demo
• Key Components
  • Word Cloud, Range Selection, URL List, Trends
Page 18
Problems Faced
Data Collection
• Encoding problems (UTF-8, ASCII and others)
• Get more relevant seeds for old events
Data Processing
• Lack of documentation (ArchiveSpark)
• Version conflict (CDX Writer, Kernel in Jupyter)
• JVM issue (Spark)
Data Visualization
• Spring Boot
• IntelliJ setup
• jQuery UI
Page 19
Lessons Learned
Data Collection
• WARCIO
• Focused Crawler
Data Processing
• ArchiveSpark
• Spark & Scala (Map/Reduce Process)
Data Visualization
• D3 Word Cloud
• D3 Dynamic Line Charts
Page 20
Future Work
Data Collection
• Wayback Machine
• Automatic Routine for Focused Crawler
• Event Extension (Sources, Time, Space)
Data Processing
• Standalone Mode -> Cluster Mode
• Named Entity Recognizer
• Automatic Processing (CDX Writer and HBase)
Data Visualization
• Localization – Datamaps
• Weapons
Page 21
Acknowledgement
Projects
• NSF IIS-1319578 III: Small: Integrated Digital Event Archiving and Library (IDEAL)
• NSF IIS-1619028 III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
Organizations
• Internet Archive
• L3S Research Center
Persons
• Instructor: Dr. Edward A. Fox
• Alumnus: Dr. Mohamed Magdy Farag
• Labmates: Prashant Chandrasekar, Xuan Zhang
Page 22
Thank you!
Questions?