PARALLEL SORTED NEIGHBORHOOD BLOCKING WITH MAPREDUCE
Lars Kolb, Andreas Thor, Erhard Rahm
Database Group Leipzig, http://dbs.uni-leipzig.de
Kaiserslautern, BTW 2011




ENTITY RESOLUTION

• Detection of entities in one or more sources that refer to the same real-world object


ENTITY RESOLUTION (2)

• Runtime-intensive task: O(n²) entity comparisons

• Blocking:
  • Semantic grouping of similar entities into blocks
  • Based on blocking keys derived from entity attributes
  • Restrict entity comparisons to entities from the same block

• Parallelization
  • MapReduce
  • Exploitation of cloud infrastructures
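The effect of blocking on the number of comparisons can be illustrated with a minimal sketch. The toy keys below mirror the nine-entity example used in the deck's figures; the function name `candidate_pairs_with_blocking` is hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs_with_blocking(entities, blocking_key):
    """Group entities by blocking key and pair only entities of the same block."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    pairs = []
    for key in sorted(blocks):
        pairs.extend(combinations(blocks[key], 2))
    return pairs

# Toy data: entity name -> blocking key (illustrative values).
TOY_KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

all_pairs = list(combinations(TOY_KEYS, 2))            # naive O(n²) comparison
blocked_pairs = candidate_pairs_with_blocking(TOY_KEYS, TOY_KEYS.get)
print(len(all_pairs), len(blocked_pairs))              # 36 10
```

With nine entities, comparing everything needs 36 pairs, while the three blocks produce only 10 candidate pairs.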


SORTED NEIGHBORHOOD - RUNNING EXAMPLE (w=3)

• Determine blocking key for each entity and sort entities by blocking key
• Move window of fixed size w over sorted records and compare all entities within window
• All entities within a distance of w-1 are compared
• Complexity: O(n²) → O(n) + O(n*log n) + O(n*w)

[Figure: running example.
Input S: a, b, c, d, e, f, g, h, i
Key Generation + Sort by Key (K = blocking key, S = entity):
  K: 1 1 2 2 2 2 3 3 3
  S: a d b e f h c g i
Sliding Window: a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h, f-c, h-c, h-g, c-g, c-i, g-i]
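The sequential procedure described on this slide can be sketched as follows. This is an illustrative single-machine version, not the authors' implementation; the key assignment in `KEYS` is read off the figure, and ties between equal keys are resolved by input order here (Python's stable sort), which is an assumption.

```python
def sorted_neighborhood(entities, blocking_key, w):
    """Return the pairs compared by a sliding window of size w."""
    # Stable sort: entities with equal keys keep their input order.
    ordered = sorted(entities, key=blocking_key)
    pairs = []
    for i, entity in enumerate(ordered):
        # Compare with the (up to) w-1 predecessors in sort order.
        for other in ordered[max(0, i - (w - 1)):i]:
            pairs.append((other, entity))
    return pairs

# Blocking keys of the running example, read off the figure.
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

pairs = sorted_neighborhood("abcdefghi", KEYS.get, w=3)
print(len(pairs))                       # 15
print(" ".join("-".join(p) for p in pairs))
```

For n=9 and w=3 the window yields 15 comparisons instead of the 36 needed for all pairs.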


OUTLINE

• Motivation

• Sorted Neighborhood and SN with MapReduce
  • Challenge 1: Sorted Reduce Partitions (SRP)
  • Challenge 2: Comparison of Boundary Entities (JobSN/RepSN)

• Experimental Results

• Conclusions & Future Work


MAPREDUCE

• Computation expressed by two UDFs (user-defined functions)
  • Contain sequential code
  • Executed in parallel among multiple nodes
  • map: (key_in, value_in) → list(key_tmp, value_tmp)
  • reduce: (key_tmp, list(value_tmp)) → list(key_out, value_out)

• Computation relies on data partitioning and redistribution
  • Number of map tasks m and reduce tasks r
  • Task executed by some idle node in the cluster
  • UDF part partitions map output and distributes it to the r reduce tasks
  • Sorting of key-value pairs
  • Grouping of key-value pairs by key and invocation of reduce for each group
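To make the interplay of map, part, sort, group and reduce concrete, here is a single-process simulation. It is a sketch only and does not use Hadoop. The names `run_mapreduce`, `blocking_map`, `modulo_part` and `block_reduce` are hypothetical, and reduce tasks are numbered from 0 rather than 1.

```python
from collections import defaultdict
from itertools import groupby

def run_mapreduce(splits, map_fn, part_fn, reduce_fn, r):
    """Simulate map -> partition -> sort -> group -> reduce in one process."""
    partitions = defaultdict(list)
    for split in splits:                       # one map task per input split
        for key_in, value_in in split:
            for key_tmp, value_tmp in map_fn(key_in, value_in):
                partitions[part_fn(key_tmp, r)].append((key_tmp, value_tmp))
    outputs = {}
    for task in range(r):                      # one reduce task per partition
        pairs = sorted(partitions[task], key=lambda kv: kv[0])
        out = []
        for key_tmp, group in groupby(pairs, key=lambda kv: kv[0]):
            out.extend(reduce_fn(key_tmp, [value for _, value in group]))
        outputs[task] = out
    return outputs

KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

def blocking_map(record_id, entity):           # map: emit (blocking key, entity)
    return [(KEYS[entity], entity)]

def modulo_part(key, r):                       # part: "key modulo r"
    return key % r

def block_reduce(key, entities):               # reduce: here just list the block
    return [(key, entities)]

splits = [[(0, "a"), (1, "b"), (2, "c")],
          [(3, "d"), (4, "e"), (5, "f")],
          [(6, "g"), (7, "h"), (8, "i")]]
outputs = run_mapreduce(splits, blocking_map, modulo_part, block_reduce, r=2)
print(outputs)
```

In a real entity resolution job, `block_reduce` would run the matching step on each block instead of returning it unchanged.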


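ENTITY RESOLUTION WITH MAPREDUCE (m=3, r=2)

[Figure: data flow for the running example. The input S = a, ..., i is divided into three input splits (a, b, c), (d, e, f), (g, h, i). Map Step: Blocking. map1 emits (1, a), (2, b), (3, c); map2 emits (1, d), (2, e), (2, f); map3 emits (3, g), (2, h), (3, i). Partitioning by "key modulo r": one reduce task receives 1 a, 1 d, 3 c, 3 g, 3 i, the other receives 2 b, 2 e, 2 f, 2 h. Reduce Step: Matching. The reduce tasks output the match results M = a-d, c-i and M = b-f, e-h, which are merged into the output M = a-d, c-i, b-f, e-h.]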

• Map phase
  • Input data partitioned into m partitions
  • Each processed by one map task that calls map for each input record (“blocking”)
  • UDF part partitions map output and distributes it to the r reduce tasks

• Reduce phase
  • Sorting of key-value pairs by key
  • Grouping of key-value pairs by key
  • Invocation of reduce for each group (“matching”)

• Challenge 1: SN requires totally sorted list of entities
  • All entities assigned to reduce task R_i have smaller blocking key than all entities assigned to reduce task R_i+1
  • “Sorted reduce partitions” (SRP)
  • Must be ensured by part → range partitioning
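Why modulo partitioning is not enough for Sorted Neighborhood can be checked with a small sketch. The helper names and the split point in `range_part` are assumptions for illustration; reduce tasks are numbered from 0.

```python
def partition_entities(keys, part_fn, r):
    """Assign each entity to a reduce task and sort every task's input by key."""
    tasks = [[] for _ in range(r)]
    for entity, key in keys.items():
        tasks[part_fn(key, r)].append((key, entity))
    return [sorted(task) for task in tasks]

def has_sorted_reduce_partitions(tasks):
    """SRP holds if concatenating the task inputs gives a globally sorted key list."""
    concatenated = [key for task in tasks for key, _ in task]
    return concatenated == sorted(concatenated)

def modulo_part(key, r):
    return key % r

def range_part(key, r):
    # Hypothetical split point for r=2: keys <= 2 go to the first task.
    return 0 if key <= 2 else 1

KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

modulo_ok = has_sorted_reduce_partitions(partition_entities(KEYS, modulo_part, 2))
range_ok = has_sorted_reduce_partitions(partition_entities(KEYS, range_part, 2))
print(modulo_ok, range_ok)   # False True
```

With modulo partitioning, keys 1 and 3 share a reduce task while key 2 sits in the other, so no ordering of the tasks yields a total sort. Range partitioning keeps the order.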


SORTED NEIGHBORHOOD WITH MAPREDUCE – SRP

• map outputs composite key: partitionPrefix.blockKey
  • partitionPrefix(k) = 1 if k<=2, otherwise 2 (range partitioning)
• part(partitionPrefix.blockKey) = partitionPrefix
• Key-value pairs are sorted and grouped by composed key
• reduce:
    forEach(entity ∈ list(value_tmp)) {
      match(buffer, entity);   // match all buffered entities with entity
      buffer.append(entity);
      if (buffer.size() == w) buffer.removeFirst();
    }

[Figure: SRP data flow for the running example (m=3, r=2, w=3). Key Generation + Partition Prefix: map1 emits 1.1 a, 1.2 b, 2.3 c; map2 emits 1.1 d, 1.2 e, 1.2 f; map3 emits 2.3 g, 1.2 h, 2.3 i. Partitioning by partition prefix: reduce1 receives 1.1 a, 1.1 d, 1.2 b, 1.2 e, 1.2 f, 1.2 h; reduce2 receives 2.3 c, 2.3 g, 2.3 i. Sliding Window (+ Matching): reduce1 outputs B = a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h; reduce2 outputs B = c-g, c-i, g-i. The pairs f-c, h-c, h-g are marked with question marks because no reduce task generates them.]

• Challenge 2: Boundary Entities
  • Comparison of entities that are assigned to different reduce tasks
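The SRP scheme can be sketched in a single process as follows. The function names are hypothetical. One assumption: the window buffer lives for the whole reduce task (not just one key group), which is what the slide's pseudocode appears to imply.

```python
from collections import deque

def sliding_window_reduce(sorted_entities, w):
    """Sliding window with a buffer of at most w-1 entities (per reduce task)."""
    buffer = deque()
    pairs = []
    for entity in sorted_entities:
        pairs.extend((other, entity) for other in buffer)   # match(buffer, entity)
        buffer.append(entity)
        if len(buffer) == w:
            buffer.popleft()                                # removeFirst()
    return pairs

def partition_prefix(key):
    return 1 if key <= 2 else 2      # range partitioning as given on the slide

def srp(keys, w):
    """Composite keys (partitionPrefix, blockKey), then one window per partition."""
    composite = sorted((partition_prefix(k), k, e) for e, k in keys.items())
    result = {}
    for prefix in (1, 2):
        entities = [e for p, _, e in composite if p == prefix]
        result[prefix] = sliding_window_reduce(entities, w)
    return result

KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

result = srp(KEYS, w=3)
print(len(result[1]), len(result[2]))   # 9 3
```

The two partitions together yield 12 of the 15 pairs of the sequential run. The three pairs that cross the partition boundary (f-c, h-c, h-g) are missing.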


SORTED NEIGHBORHOOD WITH MAPREDUCE – JOBSN

• SN realization using two consecutive jobs
• Job1:
  • SRP + additional output of boundary entities
  • Keys of the additionally outputted entities are prefixed with an additional boundary component
• Job2:
  • SN for boundary entities
  • part(boundary.partitionIndex.blockKey) = boundary % r
  • Sort and group by composed key

[Figure: JobSN data flow for the running example. Job 1, Sliding Window (+ Matching) + Boundary Prefix: reduce1 receives 1.1 a, 1.1 d, 1.2 b, 1.2 e, 1.2 f, 1.2 h, outputs B = a-d, ..., f-h and additionally emits its boundary entities 1.2 f, 1.2 h as 1.1.2 f, 1.1.2 h; reduce2 receives 2.3 c, 2.3 g, 2.3 i, outputs B = c-g, c-i, g-i and additionally emits 2.3 c, 2.3 g as 1.2.3 c, 1.2.3 g. Job 2: map1 and map2 are identity functions; after partitioning by boundary prefix, reduce1 receives 1.1.2 f, 1.1.2 h, 1.2.3 c, 1.2.3 g and its Sliding Window (+ Matching) outputs B = f-c, h-c, h-g.]
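The two-job approach can be sketched as below. This is a single-process illustration under several assumptions not stated on the slide: w >= 2, every partition holds at least w-1 entities, boundary number i denotes the boundary between partitions i and i+1, and Job 2 discards pairs from the same partition because Job 1 already produced them. All function names are hypothetical.

```python
from collections import deque

KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}
W, R = 3, 2

def partition_prefix(key):
    return 1 if key <= 2 else 2          # range partitioning from the SRP slide

def window_pairs(entities, w):
    buffer, pairs = deque(), []
    for entity in entities:
        pairs.extend((other, entity) for other in buffer)
        buffer.append(entity)
        if len(buffer) == w:
            buffer.popleft()
    return pairs

def job1(keys, w, r):
    """SRP plus extra output of boundary entities as (boundary, partition, key, entity)."""
    records = sorted((partition_prefix(k), k, e) for e, k in keys.items())
    pairs, boundary_out = [], []
    for p in range(1, r + 1):
        part = [rec for rec in records if rec[0] == p]
        pairs += window_pairs([e for _, _, e in part], w)
        if p < r:    # last w-1 entities face the next partition
            boundary_out += [(p, p, k, e) for _, k, e in part[-(w - 1):]]
        if p > 1:    # first w-1 entities face the previous partition
            boundary_out += [(p - 1, p, k, e) for _, k, e in part[:w - 1]]
    return pairs, boundary_out

def job2(boundary_out, w):
    """SN over the entities of each boundary; keep only cross-partition pairs."""
    pairs = []
    for b in sorted({rec[0] for rec in boundary_out}):
        group = sorted(rec for rec in boundary_out if rec[0] == b)
        partition_of = {e: p for _, p, _, e in group}
        pairs += [(x, y) for x, y in window_pairs([e for *_, e in group], w)
                  if partition_of[x] != partition_of[y]]
    return pairs

pairs1, boundary = job1(KEYS, W, R)
pairs2 = job2(boundary, W)
print(len(pairs1), pairs2)
```

Job 1 contributes 12 pairs and Job 2 the 3 boundary pairs, which together match the 15 pairs of the sequential run. The sketch handles each boundary separately and does not model the distribution of boundaries over reduce tasks via `boundary % r`.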


SORTED NEIGHBORHOOD WITH MAPREDUCE - REPSN

[Figure: RepSN data flow for the running example (m=3, r=2, w=3). Key Generation + Partition Prefix + Boundary Prefix: map1 emits 1.1.1 a, 1.1.2 b, 2.2.3 c and the replicas 2.1.1 a, 2.1.2 b; map2 emits 1.1.1 d, 1.1.2 e, 1.1.2 f and the replicas 2.1.2 e, 2.1.2 f; map3 emits 2.2.3 g, 1.1.2 h, 2.2.3 i and the replica 2.1.2 h. Partitioning by boundary prefix: reduce1 receives 1.1.1 a, 1.1.1 d, 1.1.2 b, 1.1.2 e, 1.1.2 f, 1.1.2 h; reduce2 receives 2.1.1 a, 2.1.2 b, 2.1.2 e, 2.1.2 f, 2.1.2 h, 2.2.3 c, 2.2.3 g, 2.2.3 i. Sliding Window (+ Matching): reduce1 outputs B = a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h; reduce2 outputs B = f-c, h-c, h-g, c-g, c-i, g-i.]

• SN realization using data replication
  • Reduce task i>1 needs last w-1 entities of previous partition in front of its input
  • Potential boundary entities are replicated by the map tasks (two key-value pairs)
  • Replica of entity that is assigned to reduce task R_i is assigned to R_i+1

• Implementation
  • Map key prefixed with boundary component (like JobSN)
  • boundary = partitionPrefix+1 for replicated entities (boundary = partitionPrefix otherwise)
  • part(boundary.partitionPrefix.blockKey) = boundary
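The replication approach can be sketched in one process as follows. Assumptions made for illustration: w >= 2, each map task replicates the w-1 entities with the highest keys per partition, and a reduce task uses only the last w-1 replicas to seed its window without comparing replicas with each other. The names `rep_map` and `rep_reduce` are hypothetical.

```python
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}
W, R = 3, 2

def partition_prefix(key):
    return 1 if key <= 2 else 2          # range partitioning from the SRP slide

def rep_map(split, w, r):
    """Emit (boundary, partitionPrefix, blockKey, entity) plus boundary replicas."""
    out = sorted((partition_prefix(KEYS[e]), KEYS[e], e) for e in split)
    emitted = [(p, p, k, e) for p, k, e in out]          # boundary = partitionPrefix
    for p in range(1, r):                                # partitions with a successor
        part = [rec for rec in out if rec[0] == p]
        # Replicas of the w-1 largest entities: boundary = partitionPrefix + 1.
        emitted += [(p + 1, p, k, e) for _, k, e in part[-(w - 1):]]
    return emitted

def rep_reduce(task, records, w):
    """Sliding window in which replicas only seed the buffer."""
    ordered = sorted(records)            # replicas sort in front of original entities
    replicas = [e for _, p, _, e in ordered if p < task]
    buffer = replicas[-(w - 1):]         # keep the last w-1 of the previous partition
    pairs = []
    for _, p, _, entity in ordered:
        if p < task:
            continue                     # replicas are not matched again
        pairs += [(other, entity) for other in buffer]
        buffer = (buffer + [entity])[-(w - 1):]
    return pairs

splits = ["abc", "def", "ghi"]
emitted = [rec for split in splits for rec in rep_map(split, W, R)]
result = {t: rep_reduce(t, [rec for rec in emitted if rec[0] == t], W)
          for t in range(1, R + 1)}
print(len(emitted), len(result[1]), len(result[2]))   # 14 9 6
```

The map phase emits 14 key-value pairs (9 originals and 5 replicas). The second reduce task drops all replicas except f and h and produces the boundary pairs within the same job.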


EXPERIMENTAL RESULTS

• 1.4 million publication records, blocking by title.substring(2), w=1000
• 4 dual-core nodes, Hadoop 0.20.2

• Runtime reduction: 9h to 1.5h, a relative speedup of almost 6
• Runtimes of the implementations differ only slightly
• JobSN faster for small degree of parallelism
• RepSN completes faster beginning with m=r=4
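As a rough back-of-the-envelope check of these figures, the following sketch computes the number of window comparisons and the speedup. It assumes that "1.4 million" means exactly 1,400,000 records and uses the rounded runtimes from the slide, so the results are estimates rather than measurements.

```python
def all_pairs(n):
    """Comparisons without blocking: every entity with every other entity."""
    return n * (n - 1) // 2

def sn_pairs(n, w):
    """Comparisons made by a sliding window of size w over n sorted entities (n >= w)."""
    return (w - 1) * n - w * (w - 1) // 2

n, w = 1_400_000, 1000          # assumption: exactly 1.4 million records
print(sn_pairs(9, 3))           # 15, as in the running example
print(sn_pairs(n, w))           # 1398100500
print(all_pairs(n))             # 979999300000

speedup = 9 / 1.5               # rounded runtimes in hours from the slide
print(speedup)                  # 6.0
```

Under these assumptions the window reduces the number of comparisons by a factor of roughly 700. The rounded runtimes give a speedup of 6, consistent with the "almost 6" reported for the unrounded measurements.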


CONCLUSIONS

• Application of the MapReduce programming model for parallel execution of typical Entity Resolution workflows

• Realization of Sorted Neighborhood Blocking with MapReduce
  • Sorted reduce partitions
    • Range partitioning
  • Boundary entities
    • JobSN: generation of boundary correspondences by additional job
    • RepSN: SN realization within a single job using data replication in map phase

• Evaluation of the proposed approaches

• Future work
  • Load balancing mechanisms for handling skewed (blocking key) data
  • Multi-pass Blocking within single job


THANK YOU FOR YOUR ATTENTION