Web image size prediction for efficient focused image crawling
Katerina Andreadou, Symeon Papadopoulos and Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
CBMI 2015, June 11, 2015, Prague, Czech Republic

Web image size prediction for efficient focused image crawling

Transcript
Page 1: Web image size prediction for efficient focused image crawling

Web image size prediction for efficient focused image crawling
Katerina Andreadou, Symeon Papadopoulos and Yiannis Kompatsiaris

Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)

CBMI 2015, June 11, 2015, Prague, Czech Republic

Page 2

Challenges in Crawling Web Images


• Web pages contain loads of images
• A large number of HTTP requests need to be issued to download all of them
• Yet, the majority of small images
– are either irrelevant, or
– correspond to decorative elements

Page 3

The Problem

• Improve the performance of our focused image crawler, which crawls images related to a given set of keywords

• Typical focused crawling metrics
– Harvest rate: the number of relevant web pages discovered
– Target precision: the number of relevant crawled links

• Proposed evaluation criteria for images
– Does the alternate text contain any of the keywords?
– Does the web page title contain any of the keywords?

It is very time consuming to download and evaluate the whole HTML content and all available images.
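The two relevance criteria above amount to a simple substring check; a minimal sketch, with an illustrative function name and arguments not taken from the paper:

```python
def is_relevant(alt_text, page_title, keywords):
    """An image counts as relevant if any crawl keyword appears in its
    alternate text or in the title of the page that embeds it."""
    haystack = f"{alt_text or ''} {page_title or ''}".lower()
    return any(kw.lower() in haystack for kw in keywords)
```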

Page 4

Objective: Predict Web Image Size

• Predict the size of images based solely on
– the image URL and
– the HTML metadata and surrounding HTML elements (number of DOM siblings, depth of the DOM tree, parent text, etc.)

• Classify the images into two groups
– SMALL: width and height smaller than 200 pixels
– BIG: width and height bigger than 400 pixels
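The two-class labeling rule above can be sketched directly; images between the thresholds fall into neither class:

```python
def label_image(width, height):
    """Label per the slide's thresholds: SMALL if both dimensions are
    under 200 px, BIG if both exceed 400 px, otherwise unlabeled (None)."""
    if width < 200 and height < 200:
        return "SMALL"
    if width > 400 and height > 400:
        return "BIG"
    return None
```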

Page 5

Benefits of Predicting Image Size

• Substantial gains in time for the image crawler

• We used the Apache Benchmark to time random image requests
– average download time for an image: 300 msec
– average classification time for an image: 10 msec

• For all images in Common Crawl (720 million)
– 10 download threads on a single core: 35 weeks

• For just the big images using our method
– 10 download threads on a single core: less than 3 weeks
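The time estimates above follow from back-of-the-envelope arithmetic, assuming the quoted averages and 10 parallel download threads (classification, roughly 30 times cheaper per image, is left out of the download estimates):

```python
# Assumptions taken from the slides: 720M images, 300 ms per download,
# 8% of images are big, 10 parallel download threads on one core.
TOTAL_IMAGES = 720_000_000
BIG_FRACTION = 0.08
DOWNLOAD_S = 0.3
THREADS = 10
SECONDS_PER_WEEK = 7 * 24 * 3600

# Downloading everything: ~35.7 weeks, matching the "35 weeks" figure.
all_images_weeks = TOTAL_IMAGES * DOWNLOAD_S / THREADS / SECONDS_PER_WEEK

# Downloading only the predicted-big 8%: under 3 weeks.
big_only_weeks = all_images_weeks * BIG_FRACTION
```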

Page 6

Related Work (Focused Crawling / Image Crawling)

• Link context algorithms rely on the lexical content of the URL within its parent page
– The shark-search algorithm (Hersovici et al., 1998)

• Graph structure algorithms take advantage of the structure of the Web around a page
– Focused crawling: A new approach to topic-specific web resource discovery (Chakrabarti et al., 1999)

• Semantic analysis algorithms utilize ontologies for semantic classification
– Ontology-focused crawling (Maedche et al., 2002)

Page 7

Data Collection


• We used data from the July 2014 Common Crawl set
– petabytes of data during the last 7 years
– contains raw web page data, extracted metadata and text
– lives on Amazon S3 as part of the Amazon Public Datasets

• We created a MapReduce job to parse all images and videos using Amazon Elastic MapReduce (EMR)
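The EMR job itself is not reproduced in the slides; a minimal sketch of what its map step could look like, assuming each record is a (page URL, raw HTML) pair and emitting one record per `<img>` tag (class and function names are illustrative):

```python
from html.parser import HTMLParser

class ImgExtractor(HTMLParser):
    """Collects the attribute dicts of all <img> tags on a page."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))

def map_page(page_url, html):
    """Map step: emit one (page_url, img_attributes) record per image."""
    parser = ImgExtractor()
    parser.feed(html)
    return [(page_url, img) for img in parser.images]
```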

Page 8

Statistics on Common Crawl Dataset


266 TB in size, containing 3.6 billion web pages:

• 78.5M unique domains
• 8% of images are big
• 40% of images are small
• 20% of images have no dimension information

We chose 400 pixels as the threshold to characterize big images.

Page 9

Common Crawl and Big Data Analytics

• Used in combination with a Wikipedia dump to investigate the frequency distribution of numbers
– Number frequency on the Web (van Hage et al., 2014)

• Question whether the heavy-tailed distributions observed in many Web crawls are inherent in the network or a side-effect of the crawling process
– Graph structure in the Web (Meusel et al., 2014)

• Analyze the challenges of marking up content with microdata
– Integrating product data from websites offering microdata markup (Petrovski et al., 2014)

Page 10

Method Overview

We propose a supervised machine learning approach for web image size prediction using different features:
1. The n-grams extracted from the image URL;
2. The tokens extracted from the image URL;
3. The HTML metadata and surrounding HTML elements;
4. The combination of textual and non-textual features (hybrid).

Page 11

Method I: NG

• An n-gram is a contiguous sequence of n characters from the given image URL

• Our main hypothesis: “URLs that correspond to BIG and SMALL images differ substantially in wording”
– BIG: large, x-large, gallery
– SMALL: logo, avatar, small, thumb, up, down

• First attempt: use the most frequent n-grams
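Character n-gram extraction from a URL is a one-liner; a sketch for the n = {3, 4, 5} setting used later in the talk:

```python
def char_ngrams(url, ns=(3, 4, 5)):
    """All contiguous character n-grams (n = 3, 4, 5) of an image URL."""
    return [url[i:i + n] for n in ns for i in range(len(url) - n + 1)]
```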

Page 12

Method II: NG-TRF (term relative frequency)

1. Collect the most frequent n-grams (n = {3, 4, 5}) for both classes (BIG and SMALL)
2. Rank the two separate lists by frequency
3. Discard n-grams below a threshold for every list (e.g., less than 50 occurrences in 500K images)
4. For every n-gram, compute a correlation score
5. Rank the two lists again by this score
6. Pick an equal number of n-grams from both lists to create a feature vector (e.g., 500 SMALL n-grams and 500 BIG n-grams for a 1000-dimensional vector)
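The six steps above can be sketched as follows; the correlation score used here (share of an n-gram's occurrences that fall in one class) is illustrative, as the paper's exact formula is not reproduced on the slide:

```python
from collections import Counter

def top_class_ngrams(counts, other, min_count, k):
    """Score each sufficiently frequent n-gram by the share of its
    occurrences falling in this class, then keep the top k."""
    scored = {g: c / (c + other.get(g, 0))
              for g, c in counts.items() if c >= min_count}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def trf_vocabulary(big_ngrams, small_ngrams, min_count=50, k=500):
    """Steps 1-6: per-class counts, rarity threshold, class-correlation
    score, and an equal number of n-grams picked from each list."""
    big, small = Counter(big_ngrams), Counter(small_ngrams)
    return (top_class_ngrams(big, small, min_count, k)
            + top_class_ngrams(small, big, min_count, k))
```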

Page 13

Method III: TOKENS-TRF


• Same as before, but with tokens
• To produce the tokens we split the image URL by all non-alphanumeric characters (\W+)
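The tokenization step is a single regular-expression split (note that `\W` keeps underscores, since they count as word characters):

```python
import re

def url_tokens(url):
    """Split the image URL on runs of non-alphanumeric characters (\\W+),
    dropping any empty strings produced at the boundaries."""
    return [t for t in re.split(r"\W+", url) if t]
```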

Page 14

Method IV: NG-TSRF-IDF


• Stands for Term Squared Relative Frequency, Inverse Document Frequency.

• If an n-gram is very frequent in both classes, we should discard it.

• If an n-gram is not very frequent overall but is highly class-specific, we should include it.
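One way to capture both rules in a single score, in the spirit of the TSRF-IDF name; this formula is an illustration, not the paper's exact definition:

```python
def tsrf_idf(count_in_class, class_total, count_in_other, other_total):
    """Illustrative score: squared relative frequency within one class,
    discounted by the n-gram's overall frequency. An n-gram frequent in
    both classes scores low compared to an equally frequent but
    class-specific one."""
    rf = count_in_class / class_total
    overall = (count_in_class + count_in_other) / (class_total + other_total)
    return rf ** 2 / overall if overall else 0.0
```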

Page 15

Method V: HTML metadata features


HTML metadata features may reveal cues about the image size.

Examples:
• Photos are more likely than graphics to have an alt text.
• Most photos are in JPG or PNG format.
• Most icons and graphics are in BMP or GIF format.
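The cues above translate into simple features computed from an `<img>` tag's attributes; the feature names here are illustrative, not the paper's:

```python
def metadata_features(img_attrs):
    """Illustrative non-textual features: alt-text presence and
    file-format hints, following the cues listed on the slide."""
    src = img_attrs.get("src", "").lower()
    ext = src.rsplit(".", 1)[-1] if "." in src else ""
    return {
        "has_alt": bool(img_attrs.get("alt")),
        "photo_format": ext in ("jpg", "jpeg", "png"),   # typical of photos
        "graphic_format": ext in ("bmp", "gif"),         # typical of icons/graphics
    }
```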

Page 16

Evaluation


• Training: 1M images (500K small / 500K big)
• Testing: 200K images (100K small / 100K big)

• Random Forest classifier (Weka)
• Experimented with LibSVM and RandomTree, but RF achieved the best trade-off between accuracy and training time

• Tested with 10, 30, 100 trees

• Performance measure (formula shown on the slide, not preserved in this transcript)
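The authors trained the Random Forest in Weka; an equivalent minimal setup with scikit-learn, on a toy feature matrix standing in for the real n-gram, token, or metadata vectors:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy two-feature vectors; real inputs would be the feature vectors
# described in the method slides.
X = [[0, 1], [0, 1], [1, 0], [1, 0]]
y = ["SMALL", "SMALL", "BIG", "BIG"]

# The slides report trying 10, 30 and 100 trees; 100 used here.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
```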

Page 17

Results


• Doubling the number of n-gram features improves the performance

• Adding more trees to the Random Forest classifier improves the performance

• NG-TSRF-IDF and TOKENS-TRF have the best performance, followed closely by NG-TRF


Page 18

Results: Hybrid method


• The hybrid method takes into account both textual and non-textual features.

• Hypothesis: the two methods will complement each other when their outputs are aggregated.

• The adv parameter allows giving an advantage to one of the two classifiers.
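One simple aggregation consistent with the description above is a weighted average of the two classifiers' BIG-probabilities, with adv shifting the advantage toward one of them; the paper's exact rule is not reproduced on the slide:

```python
def hybrid_score(p_text, p_meta, adv=0.5):
    """Weighted aggregation of the textual and metadata classifiers'
    BIG-probabilities; adv in [0, 1] favours the textual classifier
    as it grows. Illustrative, not the paper's exact formula."""
    return adv * p_text + (1 - adv) * p_meta

def hybrid_label(p_text, p_meta, adv=0.5):
    return "BIG" if hybrid_score(p_text, p_meta, adv) > 0.5 else "SMALL"
```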

Page 19

Conclusion - Contributions

• A supervised machine learning approach for automatically classifying Web images according to their size.

• Assessment of textual and non-textual features.

• A statistical analysis and evaluation on a sample of the Common Crawl set.

Page 20

Future Work

• Apply the n-grams and tokens approaches to the alternate and parent text
– create two additional classifiers and combine them with the existing ones

• Detect more fine-grained characteristics
– landscape vs. portrait
– photographs vs. graphics
