Efficient Duplicate Detection Over Massive Data Sets

Pradeeban Kathiravelu

INESC-ID Lisboa, Instituto Superior Tecnico, Universidade de Lisboa

Lisbon, Portugal

Data Quality – Presentation 4. April 21, 2015.

Pradeeban Kathiravelu (IST-ULisboa) Duplicate Detection 1 / 21

Dedoop: Efficient Deduplication with Hadoop

Introduction

Blocking

Grouping of entities that are “somehow similar”.

Comparisons restricted to entities from the same block.

Entity Resolution (ER, Object matching, deduplication)

Costly.

Traditional blocking approaches are not effective.



Motivation

Advantages of leveraging parallel and cloud environments.

Manual tuning of ER parameters is facilitated, as ER results can be quickly generated and evaluated.

⇓ Execution times for large data sets ⇒ Speed up common data management processes.



Dedoop

http://dbs.uni-leipzig.de/dedoop

MapReduce-based entity resolution of large datasets.

Pair-wise similarity computation [O(n^2)] executed in parallel.

Automatic transformation: Workflow definition ⇒ Executable MapReduce workflow.

Avoids unnecessary entity pair comparisons that result from the use of multiple blocking keys.
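One way to avoid comparing the same pair twice under multiple blocking keys is to compare a pair only in the block of its smallest common key. The sketch below illustrates that idea in plain Python; the `blocking_keys` function and the sample entities are hypothetical, and this is not claimed to be Dedoop's exact implementation.

```python
from itertools import combinations


def blocking_keys(entity):
    """Hypothetical blocking keys: first letter of the name and the zip code."""
    return {"name:" + entity["name"][0].lower(), "zip:" + entity["zip"]}


def candidate_pairs(entities):
    """Yield every co-blocked pair exactly once.

    A pair that shares several blocking keys is emitted only in the block
    of its smallest common key, so it is never compared twice.
    """
    blocks = {}
    for eid, entity in entities.items():
        for key in blocking_keys(entity):
            blocks.setdefault(key, []).append(eid)

    for key in sorted(blocks):
        for a, b in combinations(sorted(blocks[key]), 2):
            common = blocking_keys(entities[a]) & blocking_keys(entities[b])
            if key == min(common):
                yield (a, b)


# Illustrative data: entities 1 and 2 share both a name key and a zip key.
entities = {
    1: {"name": "Anna", "zip": "1000"},
    2: {"name": "Ana", "zip": "1000"},
    3: {"name": "Bob", "zip": "1000"},
}
pairs = list(candidate_pairs(entities))
```

Without the smallest-common-key check, the pair (1, 2) would be generated in both the name block and the zip block.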




Features

Several load balancing strategies

In combination with its blocking techniques.

To achieve balanced workloads across all employed nodes of the cluster.
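To illustrate why block sizes matter for load balancing, the sketch below assigns whole blocks to reducers with a generic greedy heuristic (largest block first, to the least-loaded reducer). This is a simplified stand-in and not one of Dedoop's own strategies; the block sizes and the number of reducers are made-up values.

```python
import heapq


def pair_count(block_size):
    """Number of pairwise comparisons inside a block of the given size."""
    return block_size * (block_size - 1) // 2


def assign_blocks(block_sizes, num_reducers):
    """Greedily assign blocks, largest first, to the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer index)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    for key, size in sorted(block_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(key)
        heapq.heappush(heap, (load + pair_count(size), reducer))
    loads = {
        r: sum(pair_count(block_sizes[k]) for k in keys)
        for r, keys in assignment.items()
    }
    return assignment, loads


# Hypothetical block sizes (number of entities per blocking key).
block_sizes = {"a": 6, "b": 4, "c": 4, "d": 3, "e": 2}
assignment, loads = assign_blocks(block_sizes, 2)
```

Because the load of a block grows quadratically with its size, one very large block can still dominate a reducer under this heuristic, which is the kind of skew that more elaborate strategies have to address.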



User Interface

Users easily specify advanced ER workflows in a web browser.

Choose from a rich toolset of common ER components.

Blocking techniques.

Similarity functions.

Machine learning for automatically building match classifiers.

Visualization of the ER results and the workload of all cluster nodes.



Solution Architecture

Map determines blocking keys for each entity and outputs (blocking key, entity) pairs.

Reduce compares entities that belong to the same block.
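The map and reduce roles described above can be sketched in plain Python, with the shuffle step simulated by a dictionary. The blocking key (first three characters of the title), the similarity function (`difflib.SequenceMatcher`), the threshold, and the sample records are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def map_phase(entity_id, entity):
    """Emit one (blocking key, entity) pair; the key choice is hypothetical."""
    yield entity["title"][:3].lower(), (entity_id, entity)


def reduce_phase(block_key, entities, threshold=0.8):
    """Compare all entities of one block and emit the matching ID pairs."""
    for (id_a, a), (id_b, b) in combinations(entities, 2):
        score = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
        if score >= threshold:
            yield id_a, id_b, score


def run(records):
    """Simulate map, shuffle (group by key), and reduce on a single machine."""
    groups = defaultdict(list)
    for entity_id, entity in records.items():
        for key, value in map_phase(entity_id, entity):
            groups[key].append(value)
    matches = []
    for key in sorted(groups):
        matches.extend(reduce_phase(key, groups[key]))
    return matches


records = {
    1: {"title": "Canon PowerShot A95"},
    2: {"title": "Canon Powershot A-95"},
    3: {"title": "Nikon Coolpix 4500"},
}
matches = run(records)
matched_ids = [(a, b) for a, b, _ in matches]
```

Only the two Canon records land in the same block, so the Nikon record is never compared against them.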


MapDupReducer: Detecting Near Duplicates ..

Near Duplicate Detection (NDD)

Multi-Processor Systems are more effective.

MapReduce Platform.

Ease of use.

High efficiency.



System Architecture

Non-trivial generalization of the PPJoin algorithm into the MapReduce framework.

Redesigning the position and prefix filtering.

Document signature filtering to further reduce the candidate size.



Evaluation

Data sets.

MEDLINE documents.

Finding plagiarized documents.

18.5 million records.

BING.

Web pages with an aggregated size of 2 TB.

Hotspot.

High update frequency.

Altering the arguments.

Different number of map() and reduce() params.


Efficient Similarity Joins for Near Duplicate Detection

Similarity Definitions


Efficient Similarity Join Algorithms

Efficient similarity join algorithms by exploiting the ordering of tokens in the records.

Positional filtering and suffix filtering are complementary to the existing prefix filtering technique.
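As background for the filters named above, here is a minimal sketch of prefix filtering alone for a Jaccard threshold: tokens are sorted by a global order (rarest first), and two records can only reach the threshold if their prefixes of length |x| - ceil(t * |x|) + 1 share a token. Positional and suffix filtering are omitted, and the records and threshold are made-up values.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return (candidate pairs, verified pairs) using prefix filtering only."""
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in records.values() for tok in toks)

    def ordered(toks):
        return sorted(toks, key=lambda t: (freq[t], t))

    index = defaultdict(set)  # prefix token -> ids of records seen so far
    candidates = set()
    for rid in sorted(records):
        toks = ordered(records[rid])
        # Small epsilon guards against floating-point overshoot in ceil().
        prefix_len = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        for tok in toks[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].add(rid)

    verified = [
        (a, b) for a, b in sorted(candidates)
        if jaccard(records[a], records[b]) >= threshold
    ]
    return candidates, verified


records = {
    1: {"a", "b", "c", "d"},
    2: {"a", "b", "c", "e"},
    3: {"f", "g", "h", "a"},
}
candidates, verified = prefix_filter_join(records, threshold=0.6)
```

Record 3 shares the frequent token "a" with the others, but "a" falls outside every prefix, so no candidate pair involving record 3 is generated.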

The commonly used strategy depends on the size of the document.

Text documents: Edit distance and Jaccard similarity.

Edit distance: Minimum number of edits required to transform one string into another.

An edit is an insertion, deletion, or substitution of a single character.

Web documents: Jaccard or overlap similarity on small or fixed-size sketches.
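The measures named above can be written down compactly. The following is a minimal sketch using the standard textbook definitions (set-based Jaccard and overlap, and the dynamic-programming edit distance); the sample inputs are illustrative only.

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def overlap(a, b):
    """Overlap similarity: number of tokens the two sets share."""
    return len(set(a) & set(b))


def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to transform s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


x = "near duplicate detection".split()
y = "near duplicate objects".split()
sim = jaccard(x, y)
shared = overlap(x, y)
dist = edit_distance("kitten", "sitting")
```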

The near duplicate object detection problem is a generalization of the well-known nearest neighbor problem.


Efficient Parallel Set-Similarity Joins Using MapReduce

Introduction

Efficiently perform set-similarity joins in parallel using the popular MapReduce framework.

A 3-stage approach for end-to-end set-similarity joins.

Efficiently partition the data across nodes.

Balance the workload.

Reduce the need for replication.



MapReduce



Parallel Set-Similarity Joins Stages

1 Token Ordering: Computes data statistics in order to generate good signatures.

The techniques in later stages utilize these statistics.

2 RID-Pair Generation: Extracts the record IDs (“RID”) and the join-attribute value from each record.

Distributes the RID and join-attribute value pairs.

The pairs sharing a signature go to at least one common reducer.

Reducers compute the similarity of the join-attribute values and output RID pairs of similar records.

3 Record Join: Generates actual pairs of joined records.

It uses the list of RID pairs from the second stage and the original data to build the pairs of similar records.
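The three stages above can be sketched on a single machine, with dictionaries standing in for the shuffle between map and reduce. This is an illustrative approximation only: prefix tokens are assumed as the signatures, Jaccard as the similarity function, and the function names, records, and threshold are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tokens(record, attr):
    return set(record[attr].lower().split())


def stage1_token_ordering(records, attr):
    """Stage 1: count token frequencies and rank tokens, rarest first."""
    counts = Counter(tok for rec in records.values() for tok in tokens(rec, attr))
    ranked = sorted(counts, key=lambda t: (counts[t], t))
    return {tok: rank for rank, tok in enumerate(ranked)}


def stage2_rid_pairs(records, attr, order, threshold):
    """Stage 2: route (RID, join value) by signature, verify per group."""
    groups = defaultdict(list)
    for rid, rec in records.items():  # "map": one copy per prefix token
        toks = sorted(tokens(rec, attr), key=order.__getitem__)
        prefix_len = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        for tok in toks[:prefix_len]:
            groups[tok].append((rid, frozenset(toks)))
    pairs = set()  # a set removes pairs found by more than one "reducer"
    for members in groups.values():  # "reduce": compare within a group
        for (ra, ta), (rb, tb) in combinations(members, 2):
            if len(ta & tb) / len(ta | tb) >= threshold:
                pairs.add((min(ra, rb), max(ra, rb)))
    return sorted(pairs)


def stage3_record_join(records, rid_pairs):
    """Stage 3: replace each RID pair with the full original records."""
    return [(records[a], records[b]) for a, b in rid_pairs]


records = {
    1: {"title": "efficient parallel set similarity joins", "year": 2010},
    2: {"title": "efficient parallel set similarity join", "year": 2010},
    3: {"title": "near duplicate detection", "year": 2011},
}
order = stage1_token_ordering(records, "title")
rid_pairs = stage2_rid_pairs(records, "title", order, threshold=0.6)
joined = stage3_record_join(records, rid_pairs)
```

Stage 2 passes around only RIDs and join-attribute values; the remaining fields (here `year`) are fetched in stage 3, which keeps the intermediate data small.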



Token Ordering



Handling Insufficient Memory



Speedup



Scalability


Conclusion

MapReduce frameworks offer an effective platform for near duplicate detection.

Distributed execution frameworks can be leveraged for scalable data cleaning.

Efficient partitioning for data that cannot fit in the main memory.

Software-Defined Networking and later advances in networking can lead to better data solutions.


References

Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: efficient deduplication with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878-1881.

Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 495-506). ACM.

Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R. (2010, June). MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1119-1122). ACM.

Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.

Thank you!
