SEMWEB 1 MUFIN Basics MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic [email protected]

MUFIN Basics · SEMWEB 21 MUFIN Hardware Mapping z10M network, 500 peers, memory-based zBatch of 250 queries started from 10 peers 1 380 0,66 12810 169 69692 379 1472 169 3248 2 186

May 19, 2020



SEMWEB 1

MUFIN Basics

MUFIN team
Faculty of Informatics, Masaryk University
Brno, Czech Republic
[email protected]


SEMWEB 2

Search problem

[Figure: SEARCH at the center of three components: data & queries, infrastructure, index structure]


SEMWEB 3

The thesis (intellectual proposition)

Search systems are becoming more and more complex.
Future search systems will be born on the divergence of:

scale and determinism


SEMWEB 4

Trends in Scalability of Search

data volume - exponential growth
number of users - increasing fast
variety of data types - digital databases
multi-queries - lingual, feature, modal


SEMWEB 5

Trends in Determinism of Search

• Exact match ► Similarity
• Precise answer ► Approximate answer
• Unvaried answer ► Satisfactory answer (advice, recommendation)
• Fixed query ► Personalized, context aware, proximate
• Dedicated hardware ► Dynamic mapping, mobile devices, infrastructure services


SEMWEB 6

Search systems

Scalability
● data volume – exponential growth
● number of users (queries) – increasing
● variety of data types – digitization
● multi-lingual (feature, modal) queries

Determinism
exact match ► similarity
precise ► approximate
unvaried answer ► good answer; advice
fixed query ► personalized; context aware
fixed infrastructure ► dynamic mapping; mobile

[Figure: chart with axes "grade" (low to high) and "time", spanning well-established to cutting-edge research; labels: peer-to-peer, centralized, parallel, distributed, self-organized]


SEMWEB 7

The MUFIN Approach

[Figure: SEARCH at the center of three components: data & queries, infrastructure, index structure]

Scalability: P2P structure
Extensibility: metric space
  Minkowski distance
  Edit distance
  Jaccard’s coef.
  Mahalanobis distance
  Hausdorff distance
  etc.
Cloud computing: infrastructure as a service

MUFIN: MUlti-Feature Indexing Network


SEMWEB 8

EXTENSIBILITY
Metric Space: Abstraction of Similarity

Metric space: M = (D, d)
D – domain
d(x,y) – distance function

∀x,y,z ∈ D:
d(x,y) ≥ 0 - non-negativity
d(x,y) = 0 ⇔ x = y - identity
d(x,y) = d(y,x) - symmetry
d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality
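The four postulates can be spot-checked on a finite sample of objects. The sketch below is illustrative only: the helper name `check_metric`, the sample points and the two candidate functions are assumptions, not part of MUFIN. Passing the check on a sample does not prove that a function is a metric; a single violation, however, proves that it is not.

```python
from itertools import product

def check_metric(d, sample, tol=1e-12):
    """Return the postulate violations of d found on a finite sample."""
    violations = []
    for x, y in product(sample, repeat=2):
        if d(x, y) < 0:
            violations.append(("non-negativity", x, y))
        if (d(x, y) == 0) != (x == y):
            violations.append(("identity", x, y))
        if abs(d(x, y) - d(y, x)) > tol:
            violations.append(("symmetry", x, y))
    for x, y, z in product(sample, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violations.append(("triangle inequality", x, y, z))
    return violations

def city_block(x, y):
    """L1 distance: a metric."""
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_euclidean(x, y):
    """Squared L2 distance: breaks the triangle inequality."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [(0, 0), (1, 0), (2, 0), (0, 3)]
print(check_metric(city_block, points))             # no violations on this sample
print(len(check_metric(squared_euclidean, points))) # at least one violation
```

The squared Euclidean distance fails here because d((0,0),(2,0)) = 4 exceeds d((0,0),(1,0)) + d((1,0),(2,0)) = 2, which is why it cannot be plugged into a metric index as-is.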


SEMWEB 9

Why Can the Metric Approach be Useful

Many application areas:
  biology, security
  audio-visual, geo. search
  software copy detection
  data cleaning, integration, etc.

Query by example paradigm
  one query image contains a lot of information
  one image is worth 1000 words
  advantage for mobile devices – min. click


SEMWEB 10

Metric Search Grows in Popularity

Hanan Samet
Foundations of Multidimensional and Metric Data Structures
Morgan Kaufmann, 2006

P. Zezula, G. Amato, V. Dohnal, and M. Batko
Similarity Search: The Metric Space Approach
Springer, 2006


SEMWEB 11

Examples of Distance Functions

Lp – Minkowski distance of order p
L1 – city-block distance: L_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|
L2 – Euclidean distance: L_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
L∞ – infinity: L_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|

edit distance (for strings)
minimal number of insertions, deletions and substitutions
d(‘application’, ‘applet’) = 6

Jaccard’s coefficient (for sets A, B): d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
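The distance functions on this slide are short enough to write out in full. The following is a minimal sketch in plain Python; the function names are chosen here for illustration, and treating two empty sets as being at Jaccard distance 0 is an assumption made for the sketch.

```python
import math

def minkowski(x, y, p):
    """L_p distance between equal-length vectors; p = math.inf gives L_infinity."""
    if p == math.inf:
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(minkowski((0, 0), (3, 4), 1))           # city-block
print(minkowski((0, 0), (3, 4), 2))           # Euclidean
print(minkowski((0, 0), (3, 4), math.inf))    # maximum
print(edit_distance("application", "applet")) # 6, as on the slide
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```

The edit distance uses the standard two-row dynamic program, so memory stays proportional to the shorter dimension of the table rather than to the full product of the string lengths.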


SEMWEB 12

Examples of Distance Functions

Mahalanobis distance
  for vectors with correlated dimensions

Hausdorff distance
  for sets with elements related by another distance

Earth mover's distance
  primarily for histograms (sets of weighted features)

and many others
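As an illustration of a distance built on top of another distance, the Hausdorff distance between two finite sets can be sketched as follows. The function name and the toy data are assumptions; the element-level distance `d` is passed in as a parameter.

```python
def hausdorff(A, B, d):
    """Hausdorff distance between finite non-empty sets A and B under element distance d."""
    def directed(X, Y):
        # Distance of the element of X that lies farthest from its nearest neighbor in Y.
        return max(min(d(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))

def abs_diff(x, y):
    return abs(x - y)

print(hausdorff([0, 1], [0, 5], abs_diff))
```

In the example, every element of {0, 1} is within 1 of {0, 5}, but 5 is 4 away from its nearest neighbor in {0, 1}, so the symmetric result is 4.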


SEMWEB 13

Image MUFIN overlay
A demo on the CoPhIR 50M dataset (280-dim. vectors)

Five combined MPEG-7 global descriptors:
  Color Structure, max. dist.: 40, weight: 3
  Color Layout, max. dist.: 300, weight: 2
  Scalable Color, max. dist.: 3000, weight: 2
  Edge Histogram, max. dist.: 68, weight: 4
  Homogeneous Texture, max. dist.: 25, weight: 0.5
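One plausible way to combine the five descriptors into a single distance is a weighted sum of the per-descriptor distances, each normalized by its maximum. The slide lists only the maxima and weights, so the aggregation rule below, the dictionary keys and the function name are all assumptions made for illustration.

```python
# (max. distance, weight) per descriptor, as listed on the slide.
DESCRIPTORS = {
    "ColorStructure": (40.0, 3.0),
    "ColorLayout": (300.0, 2.0),
    "ScalableColor": (3000.0, 2.0),
    "EdgeHistogram": (68.0, 4.0),
    "HomogeneousTexture": (25.0, 0.5),
}

def combined_distance(partial):
    """Weighted sum of normalized partial distances (assumed aggregation rule).

    `partial` maps each descriptor name to the distance already computed
    by that descriptor's own distance function.
    """
    total = 0.0
    for name, (max_dist, weight) in DESCRIPTORS.items():
        total += weight * partial[name] / max_dist
    return total

identical = {name: 0.0 for name in DESCRIPTORS}
farthest = {name: max_dist for name, (max_dist, _) in DESCRIPTORS.items()}
print(combined_distance(identical))  # 0.0
print(combined_distance(farthest))   # sum of the weights
```

A non-negative weighted sum of metrics is itself a metric, so a combination of this form stays within the metric space model described earlier.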


SEMWEB 14

Face search
Face search demo – 6k images with people
  face detection – 10k detected faces
  face description – 64-dimensional vectors
  face comparison – advanced face descriptor (MPEG-7)

Based on publicly available software


SEMWEB 15

SCALABILITY
Structured P2P networks

Objectives
To scale to contemporary audio-visual data volumes and query execution throughput, i.e.:
  billions of objects
  online response time
  hundreds of queries per sec.

A peer
Contains metric objects, can issue/answer queries, and knows a few other peers


SEMWEB 16

Why structured P2P in MUFIN
Structured P2P networks employ a globally consistent protocol to ensure that any peer can efficiently route a search to some peer that has the desired data

Structured P2P networks are used in MUFIN for:
  no bottleneck, no central component
  multiple access points to the network
  distribution of workload – parallel query execution
  dynamic structure of peers – (controlled) resilience, join, leave
  mechanisms for fault tolerance, replication and load balancing


SEMWEB 17

P2P Architecture of MUFIN
• Native metric techniques: GHT*, VPT*
• Transformation techniques: MCAN, M-Chord (Skip-Graphs, Kademlia, etc.)


SEMWEB 18

P2P Architecture of MUFIN
Peers are not necessarily computers
A peer size determines a lower-bound on the query response time
Peer’s data can be searched by:
  Filtering
  M-tree
  D-index
  I-distance
  Etc.
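Of the local search options listed, filtering is the simplest to sketch. The version below uses precomputed object-to-pivot distances and the triangle inequality: if |d(q,p) - d(o,p)| > r for some pivot p, then d(q,o) > r and the object can be skipped without computing d(q,o). This is a generic pivot-filtering sketch, not MUFIN's implementation; all names and the toy data are assumptions.

```python
def build_pivot_table(objects, pivots, d):
    """Precompute the distance from every stored object to every pivot."""
    return [[d(o, p) for p in pivots] for o in objects]

def range_query(q, r, objects, pivots, table, d):
    """Return (objects within distance r of q, number of direct d(q, o) computations)."""
    q_dists = [d(q, p) for p in pivots]
    result, computed = [], 0
    for o, o_dists in zip(objects, table):
        # Triangle inequality: |d(q,p) - d(o,p)| is a lower bound on d(q,o).
        if any(abs(qd - od) > r for qd, od in zip(q_dists, o_dists)):
            continue  # safely discarded without computing d(q, o)
        computed += 1
        if d(q, o) <= r:
            result.append(o)
    return result, computed

def abs_diff(x, y):
    return abs(x - y)

objects = list(range(0, 100, 10))  # toy one-dimensional data
pivots = [0]
table = build_pivot_table(objects, pivots, abs_diff)
print(range_query(42, 5, objects, pivots, table, abs_diff))
```

Filtering never discards a qualifying object, so the answer is exact; only the number of expensive distance computations changes.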


SEMWEB 19

Scalability test
1M: 50 peers – memory based
10M: 500 peers – memory based
50M: 2000 peers – disk based

Effectiveness improves with data volume
Efficiency
  lower-bounded by the peer size (20k, 20k, 25k)
  does not change significantly
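The quoted peer sizes follow from dividing each collection size by its number of peers. A quick check, assuming the collection sizes are the round figures 1M, 10M and 50M and that objects are spread evenly across peers:

```python
# (label, number of objects, number of peers) for the three test networks.
configs = [
    ("1M", 1_000_000, 50),
    ("10M", 10_000_000, 500),
    ("50M", 50_000_000, 2000),
]

# Average number of objects stored per peer.
peer_sizes = {label: objects // peers for label, objects, peers in configs}
print(peer_sizes)
```

Because the per-peer load stays at 20k to 25k objects while the collection grows fifty-fold, the local search cost per peer, and with it the lower bound on response time, stays roughly constant.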


SEMWEB 20

Infrastructure as a Service
Why:

Performance tuning
  Query response time
  Query execution throughput

Performance adjustment
  Different performance requirements (day – night, weekend – working days)

Experimental trials
  Test an application
  Purchase new hardware

Availability - reliability


SEMWEB 21

MUFIN Hardware Mapping
10M network, 500 peers, memory-based
Batch of 250 queries started from 10 peers

        Parallel from 10 peers                          Sequential from 1 peer
CPUs    total [s]  queries/s  single query [ms]         total [s]  single query [ms]
                              avg     min    max                   avg    min    max
1       380        0.66       12810   169    69692      379        1472   169    3248
2       186        1.34       6654    165    24483      203        780    168    2320
4       94         2.66       3265    165    10355      122        468    162    1847
8       49         5.10       1787    181    5736       87         324    170    1806
16      27         9.26       958     184    2691       67         259    183    1605
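The queries/s figures can be re-derived from the batch size and the total times, which also gives the speedup per CPU count. The sketch assumes the reading of the table in which 380, 186, 94, 49 and 27 are the total seconds of the parallel batch for 1, 2, 4, 8 and 16 CPUs; the variable and function names are illustrative.

```python
BATCH = 250  # queries per batch, as stated on the slide

# CPUs -> total time [s] of the parallel batch (assumed column reading).
parallel_total_s = {1: 380, 2: 186, 4: 94, 8: 49, 16: 27}

def throughput(cpus):
    """Queries per second for the whole batch."""
    return BATCH / parallel_total_s[cpus]

def speedup(cpus):
    """Speedup of the batch relative to the single-CPU run."""
    return parallel_total_s[1] / parallel_total_s[cpus]

for cpus in parallel_total_s:
    print(cpus, round(throughput(cpus), 2), round(speedup(cpus), 2))
```

Under this reading the recomputed throughput matches the queries/s column to two decimals, and the speedup from 1 to 16 CPUs comes out at roughly 14.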


SEMWEB 22

MUFIN Overview

Peer-to-Peer Networks
  Multi-overlay structure
External index
Feature extraction
  features
insert, delete

Forms
  • range
  • k-nearest
  • complex

Strategies
  • precise
  • approximate
  • social

Web service

Universal
  • batch, telnet, GUI

Specialized
  • image web interface
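The two basic query forms named in the overview, range and k-nearest, can be defined by a linear scan over the data. The sketch below gives only these reference definitions for the precise strategy; it is not how MUFIN evaluates queries, and the names and toy data are assumptions.

```python
import heapq

def range_query(q, r, objects, d):
    """All objects within distance r of the query object q."""
    return [o for o in objects if d(q, o) <= r]

def knn_query(q, k, objects, d):
    """The k objects closest to q, nearest first."""
    return heapq.nsmallest(k, objects, key=lambda o: d(q, o))

def abs_diff(x, y):
    return abs(x - y)

data = [1, 4, 9, 17, 25]
print(range_query(10, 6, data, abs_diff))
print(knn_query(10, 2, data, abs_diff))
```

Any index structure, centralized or distributed, has to return the same answers as these scans when run with the precise strategy; approximate strategies trade some of that guarantee for speed.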


SEMWEB 23

MUFIN plugin
News web-sites contain images
  CNN, BBC, SEZNAM, iDNES
Photography collection of US National Parks
  TERRA GALLERIA
Image text search
  Google, Yahoo, Yandex, Ask, Seznam, Rajče, exalead


SEMWEB 24

Use of MUFIN in SAPIR Demos
  Caching – to locate cached queries
  Text+Image – to perform content similarity
  Video search – to perform content similarity
  Mobile interface – to perform content similarity

Some statistics
Permanent demo: coming soon