1. Web Image Size Prediction for Efficient Focused Image Crawling
Katerina Andreadou, Symeon Papadopoulos and Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH), Information Technologies Institute (ITI)
CBMI 2015, June 11, 2015, Prague, Czech Republic

2. Challenges in Crawling Web Images
- Web pages contain loads of images
- A large number of HTTP requests need to be issued to download all of them
- Yet, the majority of small images are either irrelevant or correspond to decorative elements
3. The Problem
Improve the performance of our focused image crawler, which crawls images related to a given set of keywords.
- Typical focused crawling metrics:
  - Harvest rate: the number of relevant web pages discovered
  - Target precision: the number of relevant crawl links
- Proposed evaluation criteria for images:
  - Does the alternate text contain any of the keywords?
  - Does the web page title contain any of the keywords?
- It is very time consuming to download and evaluate the whole HTML content and all available images.
4. Objective: Predict Web Image Size
- Predict the size of images based solely on the image URL, the HTML metadata and the surrounding HTML elements (number of DOM siblings, depth of the DOM tree, parent text, etc.)
- Classify the images into two groups:
  - SMALL: width and height smaller than 200 pixels
  - BIG: width and height bigger than 400 pixels
5. Benefits of Predicting Image Size
Substantial gains in time for the image crawler. We used Apache Benchmark to time random image requests:
- average download time for an image: 300 msec
- average classification time for an image: 10 msec
- For all images in Common Crawl (720 million), 10 download threads on a single core: 35 weeks
- For just the big images using our method, 10 download threads on a single core: less than 3 weeks
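A quick back-of-the-envelope check of these figures (a sketch; the numbers come from the slides, and the 10 msec classification pass is assumed to overlap with or be negligible next to downloading):

```java
// Reproduces the crawl-time estimates on this slide.
public class CrawlTimeEstimate {
    public static void main(String[] args) {
        long images = 720_000_000L;   // images in Common Crawl
        double downloadSec = 0.300;   // avg download time per image
        int threads = 10;
        double bigShare = 0.08;       // ~8% of images are BIG (slide 8)
        double secPerWeek = 3600 * 24 * 7;

        double weeksAll = images * downloadSec / threads / secPerWeek;
        double weeksBig = images * bigShare * downloadSec / threads / secPerWeek;
        System.out.printf("all images:      %.1f weeks%n", weeksAll); // ~35.7
        System.out.printf("big images only: %.1f weeks%n", weeksBig); // ~2.9
    }
}
```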
6. Related Work (Focused Crawling / Image Crawling)
- Link context algorithms rely on the lexical content of the URL within its parent page
  - The shark-search algorithm (Hersovici et al., 1998)
- Graph structure algorithms take advantage of the structure of the Web around a page
  - Focused crawling: a new approach to topic-specific Web resource discovery (Chakrabarti et al., 1999)
- Semantic analysis algorithms utilize ontologies for semantic classification
  - Ontology-focused crawling (Maedche et al., 2002)
7. Data Collection
- We used data from the July 2014 Common Crawl set
  - petabytes of data collected during the last 7 years
  - contains raw web page data, extracted metadata and text
  - lives on Amazon S3 as part of the Amazon Public Datasets
- We created a MapReduce job to parse all images and videos using EMR, as sketched below.
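A minimal sketch of what such a mapper can look like, assuming an input format that delivers one page's HTML per record (Common Crawl WARC readers provide this) and using Jsoup for parsing; the class name and emitted fields are illustrative, not the actual job:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageExtractMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        Document doc = Jsoup.parse(value.toString());
        for (Element img : doc.select("img")) {
            String src = img.attr("src");
            if (src.isEmpty()) continue;
            // emit the image URL together with metadata the classifier will use
            String meta = img.attr("width") + "\t" + img.attr("height")
                        + "\t" + img.attr("alt") + "\t" + doc.title();
            context.write(new Text(src), new Text(meta));
        }
    }
}
```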
8. Statistics on Common Crawl Dataset
- 266 TB in size, containing 3.6 billion web pages from 78.5M unique domains
- 8% of images are big, 40% of images are small, 20% of images have no dimension information
- We chose 400 pixels as the threshold to characterize big images.
9. Common Crawl and Big Data Analytics
- Used in combination with a Wikipedia dump to investigate the frequency distribution of numbers
  - Number frequency on the Web (van Hage et al., 2014)
- Question whether the heavy-tailed distributions observed in many Web crawls are inherent in the network or a side-effect of the crawling process
  - Graph structure in the Web (Meusel et al., 2014)
- Analyze the challenges of marking up content with microdata
  - Integrating product data from websites offering microdata markup (Petrovski et al., 2014)
10. Method Overview
We propose a supervised machine learning approach for web image size prediction using different features:
1. the n-grams extracted from the image URL;
2. the tokens extracted from the image URL;
3. the HTML metadata and surrounding HTML elements;
4. the combination of textual and non-textual features (hybrid).
11. Method I: NG
- An n-gram is a contiguous sequence of n characters from the given image URL
- Our main hypothesis: URLs that correspond to BIG and SMALL images differ substantially in wording
  - BIG: large, x-large, gallery
  - SMALL: logo, avatar, small, thumb, up, down
- First attempt: use the most frequent n-grams
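A minimal sketch of the character n-gram extraction behind this feature set (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class UrlNgrams {
    // All contiguous n-character substrings of the (lowercased) URL.
    static List<String> ngrams(String url, int n) {
        List<String> grams = new ArrayList<>();
        String s = url.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "thumb" in the URL is a typical SMALL cue, per the hypothesis above
        System.out.println(ngrams("http://a.com/img/thumb_up.gif", 5));
    }
}
```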
12. Method II: NG-TRF (term relative frequency)
1. Collect the most frequent n-grams (n = {3, 4, 5}) for both classes (BIG and SMALL)
2. Rank the two separate lists by frequency
3. Discard n-grams below a threshold for every list (e.g., less than 50 occurrences in 500K images)
4. For every n-gram, compute a correlation score
5. Rank the two lists again by this score
6. Pick an equal number of n-grams from both lists to create a feature vector (e.g., 500 SMALL n-grams and 500 BIG n-grams for a 1000-dimensional vector)
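A compact sketch of steps 3-6. The slide does not define the correlation score, so the in-class vs. other-class frequency ratio below is a stand-in assumption, not the paper's formula:

```java
import java.util.*;

public class NgTrfFeatures {
    // Selects the top-k n-grams for one class given per-class frequency counts.
    static List<String> topGrams(Map<String, Integer> own,
                                 Map<String, Integer> other,
                                 int minCount, int k) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : own.entrySet()) {
            if (e.getValue() >= minCount) kept.add(e.getKey());   // step 3
        }
        kept.sort((a, b) -> Double.compare(score(b, own, other),  // step 5
                                           score(a, own, other)));
        return kept.subList(0, Math.min(k, kept.size()));         // step 6
    }

    // Stand-in correlation score: how much more often the gram occurs
    // in its own class than in the other class.
    static double score(String g, Map<String, Integer> own, Map<String, Integer> other) {
        return own.get(g) / (double) (other.getOrDefault(g, 0) + 1);
    }

    // Binary feature vector: 1 if the URL contains the selected n-gram.
    static double[] vector(String url, List<String> selected) {
        double[] v = new double[selected.size()];
        for (int i = 0; i < v.length; i++) {
            v[i] = url.toLowerCase().contains(selected.get(i)) ? 1 : 0;
        }
        return v;
    }
}
```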
13. Method III: TOKENS-TRF
- Same as before, but with tokens
- To produce the tokens, we split the image URL on all non-alphanumeric characters (\W+)
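For example, in Java (note that \W excludes the underscore, so underscore-joined tokens survive the split):

```java
public class UrlTokens {
    public static void main(String[] args) {
        String url = "http://example.com/images/thumb_small-logo.png";
        // split on runs of non-word characters
        String[] tokens = url.split("\\W+");
        // -> [http, example, com, images, thumb_small, logo, png]
        System.out.println(String.join(" ", tokens));
    }
}
```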
14. Method IV: NG-TSRF-IDF
- Stands for term squared relative frequency, inverse document frequency
- If an n-gram is very frequent in both classes, we should discard it.
- If an n-gram is not very frequent overall but is very class-specific, we should include it.
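The slide names the score but not its exact formula; the sketch below is one plausible reading of "term squared relative frequency, inverse document frequency", offered as an assumption rather than the paper's definition:

```java
public class TsrfIdf {
    // Assumed composition: squared in-class relative frequency, damped by an
    // inverse-document-frequency factor so grams frequent everywhere score low.
    static double score(int countInClass, int classTotal,
                        int docsContaining, int totalDocs) {
        double trf = countInClass / (double) classTotal;          // relative frequency
        double idf = Math.log(totalDocs / (double) (1 + docsContaining));
        return trf * trf * idf;
    }
}
```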
15. Method V: HTML metadata features
HTML metadata features may reveal cues about the image size. Examples:
- Photos are more likely than graphics to have an alt text.
- Most photos are in JPG or PNG format.
- Most icons and graphics are in BMP or GIF format.
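A minimal sketch of extracting such cues from a parsed <img> element with Jsoup (the feature names are illustrative, not the paper's exact set):

```java
import java.util.HashMap;
import java.util.Map;
import org.jsoup.nodes.Element;

public class MetadataFeatures {
    // Extracts a few of the cues listed above from an <img> element.
    static Map<String, Object> extract(Element img) {
        Map<String, Object> f = new HashMap<>();
        String src = img.attr("src").toLowerCase();
        f.put("hasAltText", !img.attr("alt").trim().isEmpty());
        f.put("isJpgOrPng", src.endsWith(".jpg") || src.endsWith(".jpeg")
                         || src.endsWith(".png"));
        f.put("isBmpOrGif", src.endsWith(".bmp") || src.endsWith(".gif"));
        f.put("domDepth", img.parents().size());            // depth in the DOM tree
        f.put("numSiblings", img.siblingElements().size()); // DOM siblings
        return f;
    }
}
```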
16. Evaluation
- Training: 1M images (500K small / 500K big)
- Testing: 200K images (100K small / 100K big)
- Random Forest classifier (Weka)
  - Experimented with LibSVM and RandomTree, but RF achieved the best trade-off between accuracy and training time
  - Tested with 10, 30 and 100 trees
- Performance measure:
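A minimal sketch of that Weka setup (the .arff file names are placeholders; setNumIterations sets the tree count in recent Weka versions, older ones expose setNumTrees instead):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainRandomForest {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train_1M.arff").getDataSet();
        Instances test  = new DataSource("test_200K.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);  // class: SMALL vs BIG
        test.setClassIndex(test.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        rf.setNumIterations(100);        // number of trees (10/30/100 were tried)
        rf.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(rf, test);
        System.out.println(eval.toSummaryString());
    }
}
```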
17. Results
- Doubling the number of n-gram features improves the performance
- Adding more trees to the Random Forest classifier improves the performance
- NG-tsrf-idf and TOKENS-trf have the best performance, followed closely by NG-trf
18. Results: Hybrid method
- The hybrid method takes into account both textual and non-textual features.
- Hypothesis: the two methods will complement each other when aggregating their outputs.
- The adv parameter allows giving an advantage to one of the two classifiers.
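The slide does not spell out the aggregation rule; below is a minimal sketch of one weighted-vote reading of the adv parameter, offered as an assumption:

```java
public class HybridVote {
    // pBigTextual / pBigMetadata: each classifier's probability that the
    // image is BIG; adv in [0,1] shifts weight between the two classifiers.
    static boolean predictBig(double pBigTextual, double pBigMetadata, double adv) {
        double pBig = adv * pBigTextual + (1 - adv) * pBigMetadata;
        return pBig >= 0.5;
    }

    public static void main(String[] args) {
        // adv = 0.6 gives a slight advantage to the textual classifier
        System.out.println(predictBig(0.70, 0.40, 0.6)); // true (0.58 >= 0.5)
    }
}
```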
19. Conclusion - Contributions
- A supervised machine learning approach for automatically classifying Web images according to their size.
- Assessment of textual and non-textual features.
- A statistical analysis and evaluation on a sample of the Common Crawl set.
20. Future Work
- Apply the n-grams and tokens approaches to the alternate and parent text
  - create two additional classifiers and combine them with the existing ones
- Detect more fine-grained characteristics
  - landscape vs. portrait, photographs vs. graphics
21. Thank you!
Resources:
- Slides: http://www.slideshare.net/KaterinaAndreadou1/kandreadou-cbmi-59
- Code: https://github.com/MKLab-ITI/reveal-media-webservice/tree/year2/src/main/java/gr/iti/mklab/reveal/clustering
- Common Crawl: http://commoncrawl.org/
Get in touch:
- @kandreads / [email protected]
- @sympapadopoulos / [email protected]