
Scalable high-dimensional indexing with Hadoop



Talk given at CBMI 2013 (Veszprém, Hungary) on 19.06.2013
Transcript
Page 1: Scalable high-dimensional indexing with Hadoop

Scalable high-dimensional indexing with Hadoop
TEXMEX team, INRIA Rennes, France

Denis Shestakov, PhD
denis.shestakov at {aalto.fi,inria.fr}
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov

Denis Shestakov, Diana Moise, Gylfi Gudmundsson, Laurent Amsaleg

Page 2: Scalable high-dimensional indexing with Hadoop

Outline
● Motivation
● Approach overview: scaling indexing & searching using Hadoop
● Experimental setup: datasets, resources, configuration
● Results
● Observations & implications
● Things to share
● Future directions

Page 3: Scalable high-dimensional indexing with Hadoop

Motivation
● Big data is here
  ○ Lots of multimedia content
  ○ Even setting aside the 'big' companies, 1TB/day of multimedia is now common for many parties
● Solution: apply more computational power
  ○ Luckily, such power is now easier to access via grid/cloud resources
● Applications:
  ○ Large-scale image retrieval: e.g., detecting copyright violations in huge image repositories
  ○ Google Goggles-like systems: annotating the scene

Page 4: Scalable high-dimensional indexing with Hadoop

Our approach
● Index & search a huge image collection using the MapReduce-based eCP algorithm
  ○ See our ICMR'13 paper: Indexing and searching 100M images with MapReduce [7]
  ○ See Section II for a quick overview
● Use the Grid5000 platform
  ○ Distributed infrastructure available to French researchers & their partners
● Use the Hadoop framework
  ○ The most popular open-source implementation of the MapReduce model
  ○ Data stored in HDFS, which splits it into chunks (64MB by default, often set bigger) and distributes them across nodes
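To make the indexing map phase concrete, here is a minimal sketch, assuming the record layout and the loadLeafCentroids helper below (both assumptions; the actual MapReduce-eCP code in [7] differs): each mapper loads the index-tree leaf centroids once, then assigns every incoming SIFT descriptor to its nearest leaf, with a flat scan standing in for the real tree descent.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: assigns each SIFT descriptor to its nearest eCP leaf cluster.
    public class AssignToClusterMapper
        extends Mapper<IntWritable, BytesWritable, IntWritable, BytesWritable> {

      private float[][] leaves;  // one 128-d centroid per leaf cluster

      // Hypothetical loader: the real system ships the index tree to every
      // node (e.g., via the DistributedCache) and reads it here.
      static float[][] loadLeafCentroids(Configuration conf) {
        return new float[0][];  // details omitted in this sketch
      }

      @Override
      protected void setup(Context ctx) {
        leaves = loadLeafCentroids(ctx.getConfiguration());
      }

      @Override
      protected void map(IntWritable imageId, BytesWritable desc, Context ctx)
          throws IOException, InterruptedException {
        byte[] d = desc.getBytes();  // one SIFT descriptor: 128 unsigned bytes
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < leaves.length; c++) {  // flat scan stands in for
          float dist = 0f;                         // the index-tree descent
          for (int i = 0; i < 128; i++) {
            float diff = (d[i] & 0xff) - leaves[c][i];
            dist += diff * diff;                   // squared L2 distance
          }
          if (dist < bestDist) { bestDist = dist; best = c; }
        }
        // Shuffle groups descriptors by leaf id; reducers then write each
        // cluster out as a contiguous part of the index on HDFS.
        ctx.write(new IntWritable(best), desc);
      }
    }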

Page 5: Scalable high-dimensional indexing with Hadoop

Our approach
● Hadoop used for both indexing and searching
● Our search scenario:
  ■ Searching for batches of images
    ● Thousands of images in one run
    ● Focus on throughput, not on response time for individual images
  ■ Use case: copyright violation detection (vote counting sketched below)
● Note: the indexed dataset can be searched on a single machine with adequate disk capacity if necessary
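A minimal sketch of the vote-counting reduce step in batch search (an assumed shape, not the authors' code): each key is a query image id, each value is the id of a database image matched by one of the query's descriptors, and the top-voted database image is reported per query.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: tallies, per query image, the votes cast for database images.
    public class VoteCountReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, Text> {

      @Override
      protected void reduce(IntWritable queryImage, Iterable<IntWritable> votes,
          Context ctx) throws IOException, InterruptedException {
        Map<Integer, Integer> tally = new HashMap<Integer, Integer>();
        for (IntWritable v : votes) {
          Integer n = tally.get(v.get());
          tally.put(v.get(), n == null ? 1 : n + 1);
        }
        int best = -1, bestVotes = -1;  // top-voted database image (ties: first seen)
        for (Map.Entry<Integer, Integer> e : tally.entrySet()) {
          if (e.getValue() > bestVotes) { bestVotes = e.getValue(); best = e.getKey(); }
        }
        ctx.write(queryImage, new Text(best + " (" + bestVotes + " votes)"));
      }
    }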

Page 6: Scalable high-dimensional indexing with Hadoop

Experimental setup
● Used the Grid5000 platform:
  ○ Nodes at the rennes site of Grid5000
    ■ Up to 110 nodes available
    ■ Node capacity/performance varied
      ● Heterogeneous, coming from three clusters
      ● From 8 to 24 cores per node
      ● From 24GB to 48GB RAM per node
● Hadoop ver. 1.0.1
  ○ (!) No changes to Hadoop internals
    ■ Pros: easy for others to migrate, try and compare
    ■ Cons: not top performance

Page 7: Scalable high-dimensional indexing with Hadoop

Experimental setup
● Over 100 million images (~30 billion SIFT descriptors)
  ○ Collected from the Web and provided by one of the partners in the Quaero project
    ■ One of the largest datasets reported in the literature
  ○ Images resized to 150px on the largest side
  ○ Worked with
    ■ The whole set (~4TB)
    ■ A 20-million-image subset (~1TB)
  ○ Used as the distractor dataset

Page 8: Scalable high-dimensional indexing with Hadoop

Experimental setup
● For evaluation of indexing quality:
  ○ Added to the distractor dataset:
    ■ INRIA Copydays (127 images)
  ○ Queried for
    ■ Copydays batch (3055 images = 127 original images and their associated variants, incl. strong distortions, e.g. print-crumple-scan)
    ■ 12k batch (12081 images = 245 random images from the dataset and their variants)
  ○ Checked whether the original images were returned as the top-voted search results

Page 9: Scalable high-dimensional indexing with Hadoop

Results: workflow overview
● Experiment on indexing & searching 1TB took 5-6 hours

Page 10: Scalable high-dimensional indexing with Hadoop

Results: indexing 1TB

Page 11: Scalable high-dimensional indexing with Hadoop

Results: indexing 4TB
● 4TB
● 100 nodes
● Used the tuned parameters
  ○ Except a change in #mappers/#reducers per node
    ■ To fit the bigger index tree (for 4TB) into RAM
    ■ 4 mappers / 2 reducers (see the config sketch below)
● Time: 507 min
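The slides do not show the exact settings; in Hadoop 1.x the per-node slot counts are fixed per TaskTracker, so a plausible mapred-site.xml fragment for the 4-mappers/2-reducers setup (an assumption, not the experiment's actual config file) would be:

    <!-- Plausible per-node slot settings (Hadoop 1.x property names);
         the actual experiment configuration may have differed. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>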

Page 12: Scalable high-dimensional indexing with Hadoop

Results: search quality

Page 13: Scalable high-dimensional indexing with Hadoop

Results: search scalability

Page 14: Scalable high-dimensional indexing with Hadoop

Results: search execution
● Searching the 12k batch over 1TB using 100 nodes

Page 15: Scalable high-dimensional indexing with Hadoop

Results: searching 4TB
● 4TB
● 87 nodes
● Copydays query batch (3k images)
  ○ Throughput: 460ms per image
● 12k query batch
  ○ Throughput: 210ms per image
● Bigger batches improve throughput only marginally
  ○ bigger batch -> bigger lookup table -> more RAM required per mapper -> fewer mappers per node
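Reading "throughput" as total batch wall time divided by batch size (an interpretation; the slide does not define it), the per-image figures translate into batch times as follows:

    // Back-of-the-envelope batch times implied by the per-image throughput
    // figures above (interpretation: throughput = wall time / batch size).
    public class BatchTimes {
      public static void main(String[] args) {
        System.out.printf("Copydays: 3055 x 0.460s  = %.0f min%n", 3055 * 0.460 / 60);
        System.out.printf("12k batch: 12081 x 0.210s = %.0f min%n", 12081 * 0.210 / 60);
        // ~23 min vs ~42 min: a batch ~4x bigger takes under 2x longer.
      }
    }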

Page 16: Scalable high-dimensional indexing with Hadoop

Observations & implications
● HDFS block size limits scalability
  ○ 1TB dataset => 1186 blocks of 1024MB size
  ○ Assuming 8-core nodes and the reported searching method: no scaling beyond 149 nodes (since 8x149 = 1192 map slots already cover all 1186 blocks; see the check below)
  ○ Solutions:
    ■ Smaller HDFS blocks, e.g., scaling up to 280 nodes with 512MB blocks
    ■ Re-visit the search process: e.g., partial loading of the lookup table
● Big data is here, but the resources to process it often are not
  ○ E.g., indexing & searching >10TB was not possible with the resources we had
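A quick sanity check of that ceiling, assuming one map task per HDFS block and one map slot per core (the slide's 8-core case):

    // Scaling ceiling: once there are more map slots than HDFS blocks,
    // extra nodes only add idle slots (assumes one map task per block).
    public class ScalingCeiling {
      public static void main(String[] args) {
        int blocks = 1186;     // 1TB dataset in 1024MB HDFS blocks
        int coresPerNode = 8;  // one map slot per core assumed
        int ceiling = (blocks + coresPerNode - 1) / coresPerNode;
        System.out.println("No scaling past " + ceiling + " nodes ("
            + ceiling * coresPerNode + " slots for " + blocks + " blocks)");
        // Prints 149 nodes (8 x 149 = 1192 slots), matching the slide;
        // halving the block size roughly doubles this ceiling.
      }
    }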

Page 17: Scalable high-dimensional indexing with Hadoop

Things to share
● Our methods/system can be applied to audio datasets
  ○ No major changes expected
  ○ Contact me if interested
● Code for the MapReduce-eCP algorithm available on request
  ○ Should run smoothly on your Hadoop cluster
  ○ Interested in comparisons
● Hadoop job history logs behind our experiments (not only those reported at CBMI) available on request
  ○ They describe the indexing/searching of our dataset, with details on map/reduce task execution
  ○ Insights on better analysis/visualization are welcome
  ○ Job logs for the CBMI'13 experiments: http://goo.gl/e06wE

Page 18: Scalable high-dimensional indexing with Hadoop

Future directions
● Deal with big batches of query images
  ○ ~200k query images
● Share auxiliary data (index tree, lookup table) among mappers
  ○ Multithreaded map tasks (see the sketch below)
● (environment-specific) Test scalability on more nodes
  ○ Use several sites of the Grid5000 infrastructure
    ■ rennes + nancy sites (up to 300 nodes) -- in progress
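One plausible way to realize the multithreaded option (an assumption, not the slides' stated design) is Hadoop's stock MultithreadedMapper, which runs several map threads inside one JVM so a single in-RAM copy of the index tree and lookup table, held in static fields, can serve all threads on a node; SearchMapper below is a hypothetical stand-in for the actual search mapper.

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedSearch {

      // Hypothetical stand-in for the real search mapper; auxiliary data held
      // in static fields would be loaded once per JVM and shared by threads.
      public static class SearchMapper
          extends Mapper<IntWritable, BytesWritable, IntWritable, IntWritable> {}

      public static void configure(Job job) {
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, SearchMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);  // e.g., one per core
      }
    }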

Page 19: Scalable high-dimensional indexing with Hadoop

Acknowledgements
● TEXMEX team, INRIA Rennes, http://www.irisa.fr/texmex/index_en.php
● Quaero project, http://www.quaero.org/
● Grid5000 infrastructure & its Rennes maintenance team, https://www.grid5000.fr

Page 20: Scalable high-dimensional indexing with Hadoop

Thank you!
Questions?