Web Archive Profiling for Efficient Memento Aggregation
Sawood Alam
Old Dominion University, Norfolk, Virginia - 23529
Advisor: Michael L. Nelson
Doctoral Consortium, TPDL’16, September 5, 2016
Supported in part by the International Internet Preservation Consortium (IIPC)
TPDL 2016 Doctoral Consortium - Web Archive Profiling
Hi Gina, I'll investigate. memgator is software that one of my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/. Can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not actually be scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily.
I am thinking that the ability to remove source archives quickly is an important aspect of an aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)
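The request above, removing a source archive without restarting the aggregator, can be illustrated with a minimal sketch. This is not MemGator's actual mechanism; the class `ArchiveRegistry`, its methods, and the `example.org` TimeMap URLs are all hypothetical names introduced for illustration. The idea is only that in-flight requests read a snapshot of the archive list while an operator edits the list under a lock.

```python
import threading


class ArchiveRegistry:
    """Hypothetical in-memory list of source archives, editable at runtime."""

    def __init__(self, archives):
        self._lock = threading.Lock()
        # Maps an archive name to its (assumed) TimeMap base URL.
        self._archives = dict(archives)

    def add(self, name, timemap_base):
        with self._lock:
            self._archives[name] = timemap_base

    def remove(self, name):
        """Drop an archive; return True if it was present."""
        with self._lock:
            return self._archives.pop(name, None) is not None

    def snapshot(self):
        """Return a copy so a request in progress is unaffected by later edits."""
        with self._lock:
            return dict(self._archives)


registry = ArchiveRegistry({
    "archive_a": "https://archive-a.example.org/timemap/link/",
    "archive_b": "https://archive-b.example.org/timemap/link/",
})

targets = registry.snapshot()   # used by a request already in progress
registry.remove("archive_b")    # operator removes a struggling archive
print(sorted(targets))          # the earlier snapshot still lists both
print(sorted(registry.snapshot()))
```

New requests take a fresh snapshot and therefore skip the removed archive, with no restart involved.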
● What do individual web archives hold?
● How much do we need to know about an archive’s holdings?
● What is the optimal level of summarization for better accuracy and increased freshness?
● What are various ways to learn about archives’ holdings?
● How to store and update archives’ profiles to efficiently scale?
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other
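As a rough illustration of how a profile might predict the presence of mementos for a URI-R, the sketch below reduces a URI to a coarse key and checks it against a small table of keys and counts. The key format (reversed host labels plus an optional number of path segments), the function names `uri_key` and `may_hold`, and the sample data are assumptions made for this example, not the profile format used in the work itself.

```python
from urllib.parse import urlsplit


def uri_key(uri, max_path_segments=0):
    """Build a SURT-like key: reversed host labels, then a few path segments."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    key = ",".join(reversed(host.split("."))) + ")/"
    segments = [s for s in parts.path.split("/") if s][:max_path_segments]
    return key + "/".join(segments)


def may_hold(profile, uri, max_path_segments=0):
    """Predict presence: True if the URI's key appears in the profile."""
    return uri_key(uri, max_path_segments) in profile


# Hypothetical profile: key -> number of mementos summarized under that key.
profile = {"uk,co,bbc)/": 120, "edu,odu)/": 45}

print(uri_key("http://www.bbc.co.uk/news"))             # uk,co,bbc)/
print(may_hold(profile, "http://www.bbc.co.uk/news"))   # True
print(may_hold(profile, "http://example.com/"))         # False
```

A shorter key keeps the profile small but makes more false-positive predictions; keeping more path segments does the opposite.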
✓ Baseline Profiling Through CDX Files
✓ Profile Serialization
✓ Fulltext Search Profiling
✓ Sample URI Dataset
➢ Instrumenting Memento Aggregator
➢ Multidimensional Profiling
Publications
TPDL15 Web Archive Profiling Through CDX Summarization
TCDL15 Profiling Web Archives - For Efficient Memento Query Routing
IJDL16 Web Archive Profiling Through CDX Summarization
JCDL16 Poster: MemGator - A Portable Concurrent Memento Aggregator
TPDL16 Web Archive Profiling Through Fulltext Search
RFC Object Resource Stream (ORS) and CDX-JSON (CDXJ) Formats
C4LJ MemGator - A Portable Concurrent Memento Aggregator Architecture
JCDL17 Scalable, Maintainable, and Extensible Web Archive Profile Serialization for Efficient Lookup
JCDL17 URI, Time, and Language Profiling from Live Archives via URI Sampling and Fulltext Search
SIGIR17 Memento Aggregator Routing Based on Probability Distribution of Memento Availability with Archive Profiles
IJDL17 Archive X-Ray - Web Archive Profiling for Efficient Memento Aggregation
Future Work
● Language profiles
● Evaluation of combination profiles such as URI-Key along with Datetime
● Utilize archive profiles to generate a rank-ordered list of archives
● Profiles for usage other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)
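The rank-ordered list mentioned above could be approached in many ways. One minimal sketch, assuming each archive's profile is a mapping from a URI-Key to a memento count, orders archives by that count and drops archives with no matching key. The function `rank_archives`, the archive names, and the counts are hypothetical and serve only to make the idea concrete.

```python
def rank_archives(profiles, key):
    """Order archives by the memento count their profile reports for `key`.

    Archives whose profile lacks the key are left out; ties break by name.
    """
    scores = {name: prof.get(key, 0) for name, prof in profiles.items()}
    candidates = [name for name, count in scores.items() if count > 0]
    return sorted(candidates, key=lambda name: (-scores[name], name))


# Hypothetical profiles for three archives.
profiles = {
    "archive_a": {"uk,co,bbc)/": 120},
    "archive_b": {"uk,co,bbc)/": 300, "edu,odu)/": 5},
    "archive_c": {"edu,odu)/": 45},
}

print(rank_archives(profiles, "uk,co,bbc)/"))  # ['archive_b', 'archive_a']
print(rank_archives(profiles, "edu,odu)/"))    # ['archive_c', 'archive_b']
```

An aggregator could then query only the top few archives in the list, trading some recall for less traffic to archives unlikely to hold the URI.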
Conclusions
● Generated profiles with different policies for three archives
● Examined cost-precision tradeoffs of various policies
● Related CDX Size, URI-M, URI-R, and URI-Key
● Gained up to 80% routing accuracy with <1% relative cost