
Scalable high-dimensional indexing with Hadoop



Talk given at CBMI 2013 (Veszprém, Hungary) on 19.06.2013
Transcript
Page 1: Scalable high-dimensional indexing with Hadoop

Scalable high-dimensional indexing with Hadoop
TEXMEX team, INRIA Rennes, France

Denis Shestakov, PhD
denis.shestakov at {aalto.fi,inria.fr}
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov

Denis Shestakov, Diana Moise, Gylfi Gudmundsson, Laurent Amsaleg

Page 2: Scalable high-dimensional indexing with Hadoop

Outline
● Motivation
● Approach overview: scaling indexing & searching using Hadoop
● Experimental setup: datasets, resources, configuration
● Results
● Observations & implications
● Things to share
● Future directions

Page 3: Scalable high-dimensional indexing with Hadoop

Motivation
● Big data is here
  ○ Lots of multimedia content
  ○ Even setting aside the 'big' companies, 1TB/day of multimedia is now common for many parties
● Solution: apply more computational power
  ○ Luckily, such power is now easier to access via grid/cloud resources
● Applications:
  ○ Large-scale image retrieval: e.g., detecting copyright violations in huge image repositories
  ○ Google Goggles-like systems: annotating the scene

Page 4: Scalable high-dimensional indexing with Hadoop

Our approach
● Index & search a huge image collection using the MapReduce-based eCP algorithm
  ○ See our ICMR'13 paper: Indexing and searching 100M images with MapReduce [7]
  ○ See Section II for a quick overview
● Use the Grid5000 platform
  ○ Distributed infrastructure available to French researchers & their partners
● Use the Hadoop framework
  ○ The most popular open-source implementation of the MapReduce model
  ○ Data stored in HDFS, which splits it into chunks (64MB by default, often set bigger) and distributes them across nodes
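To make the indexing map phase concrete, here is a minimal sketch, assuming the record layout and the loadLeafCentroids helper below (both assumptions; the actual MapReduce-eCP code in [7] differs): each mapper loads the index-tree leaf centroids once, then assigns every incoming SIFT descriptor to its nearest leaf, with a flat scan standing in for the real tree descent.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: assigns each SIFT descriptor to its nearest eCP leaf cluster.
    public class AssignToClusterMapper
        extends Mapper<IntWritable, BytesWritable, IntWritable, BytesWritable> {

      private float[][] leaves;  // one 128-d centroid per leaf cluster

      // Hypothetical loader: the real system ships the index tree to every
      // node (e.g., via the DistributedCache) and reads it here.
      static float[][] loadLeafCentroids(Configuration conf) {
        return new float[0][];  // details omitted in this sketch
      }

      @Override
      protected void setup(Context ctx) {
        leaves = loadLeafCentroids(ctx.getConfiguration());
      }

      @Override
      protected void map(IntWritable imageId, BytesWritable desc, Context ctx)
          throws IOException, InterruptedException {
        byte[] d = desc.getBytes();  // one SIFT descriptor: 128 unsigned bytes
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < leaves.length; c++) {  // flat scan stands in for
          float dist = 0f;                         // the index-tree descent
          for (int i = 0; i < 128; i++) {
            float diff = (d[i] & 0xff) - leaves[c][i];
            dist += diff * diff;                   // squared L2 distance
          }
          if (dist < bestDist) { bestDist = dist; best = c; }
        }
        // Shuffle groups descriptors by leaf id; reducers then write each
        // cluster out as a contiguous part of the index on HDFS.
        ctx.write(new IntWritable(best), desc);
      }
    }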

Page 5: Scalable high-dimensional indexing with Hadoop

Our approach
● Hadoop used for both indexing and searching
● Our search scenario:
  ■ Searching for batches of images
    ● Thousands of images in one run
    ● Focus on throughput, not on response time for individual images
  ■ Use case: copyright violation detection (vote counting sketched below)
● Note: the indexed dataset can be searched on a single machine with adequate disk capacity if necessary
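A minimal sketch of the vote-counting reduce step in batch search (an assumed shape, not the authors' code): each key is a query image id, each value is the id of a database image matched by one of the query's descriptors, and the top-voted database image is reported per query.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: tallies, per query image, the votes cast for database images.
    public class VoteCountReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, Text> {

      @Override
      protected void reduce(IntWritable queryImage, Iterable<IntWritable> votes,
          Context ctx) throws IOException, InterruptedException {
        Map<Integer, Integer> tally = new HashMap<Integer, Integer>();
        for (IntWritable v : votes) {
          Integer n = tally.get(v.get());
          tally.put(v.get(), n == null ? 1 : n + 1);
        }
        int best = -1, bestVotes = -1;  // top-voted database image (ties: first seen)
        for (Map.Entry<Integer, Integer> e : tally.entrySet()) {
          if (e.getValue() > bestVotes) { bestVotes = e.getValue(); best = e.getKey(); }
        }
        ctx.write(queryImage, new Text(best + " (" + bestVotes + " votes)"));
      }
    }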

Page 6: Scalable high-dimensional indexing with Hadoop

Experimental setup
● Used the Grid5000 platform:
  ○ Nodes at the rennes site of Grid5000
    ■ Up to 110 nodes available
    ■ Node capacity/performance varied
      ● Heterogeneous, coming from three clusters
      ● From 8 to 24 cores per node
      ● From 24GB to 48GB RAM per node
● Hadoop ver. 1.0.1
  ○ (!) No changes to Hadoop internals
    ■ Pros: easy for others to migrate, try and compare
    ■ Cons: not top performance

Page 7: Scalable high-dimensional indexing with Hadoop

Experimental setup
● Over 100 million images (~30 billion SIFT descriptors)
  ○ Collected from the Web and provided by one of the partners in the Quaero project
    ■ One of the largest datasets reported in the literature
  ○ Images resized to 150px on the largest side
  ○ Worked with
    ■ The whole set (~4TB)
    ■ A 20-million-image subset (~1TB)
  ○ Used as the distractor dataset

Page 8: Scalable high-dimensional indexing with Hadoop

Experimental setup
● For evaluation of indexing quality:
  ○ Added to the distractor dataset:
    ■ INRIA Copydays (127 images)
  ○ Queried for
    ■ Copydays batch (3055 images = 127 original images and their associated variants, incl. strong distortions, e.g. print-crumple-scan)
    ■ 12k batch (12081 images = 245 random images from the dataset and their variants)
  ○ Checked whether the original images were returned as the top-voted search results

Page 9: Scalable high-dimensional indexing with Hadoop

Results: workflow overview
● Experiment on indexing & searching 1TB took 5-6 hours

Page 10: Scalable high-dimensional indexing with Hadoop

Results: indexing 1TB

Page 11: Scalable high-dimensional indexing with Hadoop

Results: indexing 4TB
● 4TB
● 100 nodes
● Used the tuned parameters
  ○ Except a change in #mappers/#reducers per node
    ■ To fit the bigger index tree (for 4TB) into RAM
    ■ 4 mappers / 2 reducers (see the config sketch below)
● Time: 507 min
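The slides do not show the exact settings; in Hadoop 1.x the per-node slot counts are fixed per TaskTracker, so a plausible mapred-site.xml fragment for the 4-mappers/2-reducers setup (an assumption, not the experiment's actual config file) would be:

    <!-- Plausible per-node slot settings (Hadoop 1.x property names);
         the actual experiment configuration may have differed. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>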

Page 12: Scalable high-dimensional indexing with Hadoop

Results: search quality

Page 13: Scalable high-dimensional indexing with Hadoop

Results: search scalability

Page 14: Scalable high-dimensional indexing with Hadoop

Results: search execution
● Searching the 12k batch over 1TB using 100 nodes

Page 15: Scalable high-dimensional indexing with Hadoop

Results: searching 4TB
● 4TB
● 87 nodes
● Copydays query batch (3k images)
  ○ Throughput: 460ms per image
● 12k query batch
  ○ Throughput: 210ms per image
● Bigger batches improve throughput only marginally
  ○ bigger batch -> bigger lookup table -> more RAM required per mapper -> fewer mappers per node
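Reading "throughput" as total batch wall time divided by batch size (an interpretation; the slide does not define it), the per-image figures translate into batch times as follows:

    // Back-of-the-envelope batch times implied by the per-image throughput
    // figures above (interpretation: throughput = wall time / batch size).
    public class BatchTimes {
      public static void main(String[] args) {
        System.out.printf("Copydays: 3055 x 0.460s  = %.0f min%n", 3055 * 0.460 / 60);
        System.out.printf("12k batch: 12081 x 0.210s = %.0f min%n", 12081 * 0.210 / 60);
        // ~23 min vs ~42 min: a batch ~4x bigger takes under 2x longer.
      }
    }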

Page 16: Scalable high-dimensional indexing with Hadoop

Observations & implications
● HDFS block size limits scalability
  ○ 1TB dataset => 1186 blocks of 1024MB size
  ○ Assuming 8-core nodes and the reported searching method: no scaling beyond 149 nodes (since 8x149 = 1192 map slots already cover all 1186 blocks; see the check below)
  ○ Solutions:
    ■ Smaller HDFS blocks, e.g., scaling up to 280 nodes with 512MB blocks
    ■ Re-visit the search process: e.g., partial loading of the lookup table
● Big data is here, but the resources to process it often are not
  ○ E.g., indexing & searching >10TB was not possible with the resources we had
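A quick sanity check of that ceiling, assuming one map task per HDFS block and one map slot per core (the slide's 8-core case):

    // Scaling ceiling: once there are more map slots than HDFS blocks,
    // extra nodes only add idle slots (assumes one map task per block).
    public class ScalingCeiling {
      public static void main(String[] args) {
        int blocks = 1186;     // 1TB dataset in 1024MB HDFS blocks
        int coresPerNode = 8;  // one map slot per core assumed
        int ceiling = (blocks + coresPerNode - 1) / coresPerNode;
        System.out.println("No scaling past " + ceiling + " nodes ("
            + ceiling * coresPerNode + " slots for " + blocks + " blocks)");
        // Prints 149 nodes (8 x 149 = 1192 slots), matching the slide;
        // halving the block size roughly doubles this ceiling.
      }
    }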

Page 17: Scalable high-dimensional indexing with Hadoop

Things to share
● Our methods/system can be applied to audio datasets
  ○ No major changes expected
  ○ Contact me if interested
● Code for the MapReduce-eCP algorithm available on request
  ○ Should run smoothly on your Hadoop cluster
  ○ Interested in comparisons
● Hadoop job history logs behind our experiments (not only those reported at CBMI) available on request
  ○ They describe the indexing/searching of our dataset, with details on map/reduce task execution
  ○ Insights on better analysis/visualization are welcome
  ○ Job logs for the CBMI'13 experiments: http://goo.gl/e06wE

Page 18: Scalable high-dimensional indexing with Hadoop

Future directions
● Deal with big batches of query images
  ○ ~200k query images
● Share auxiliary data (index tree, lookup table) among mappers
  ○ Multithreaded map tasks (see the sketch below)
● (environment-specific) Test scalability on more nodes
  ○ Use several sites of the Grid5000 infrastructure
    ■ rennes + nancy sites (up to 300 nodes) -- in progress
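One plausible way to realize the multithreaded option (an assumption, not the slides' stated design) is Hadoop's stock MultithreadedMapper, which runs several map threads inside one JVM so a single in-RAM copy of the index tree and lookup table, held in static fields, can serve all threads on a node; SearchMapper below is a hypothetical stand-in for the actual search mapper.

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class MultithreadedSearch {

      // Hypothetical stand-in for the real search mapper; auxiliary data held
      // in static fields would be loaded once per JVM and shared by threads.
      public static class SearchMapper
          extends Mapper<IntWritable, BytesWritable, IntWritable, IntWritable> {}

      public static void configure(Job job) {
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, SearchMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);  // e.g., one per core
      }
    }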

Page 19: Scalable high-dimensional indexing with Hadoop

Acknowledgements
● TEXMEX team, INRIA Rennes, http://www.irisa.fr/texmex/index_en.php
● Quaero project, http://www.quaero.org/
● Grid5000 infrastructure & its Rennes maintenance team, https://www.grid5000.fr

Page 20: Scalable high-dimensional indexing with Hadoop

Thank you!
Questions?