Efficient Duplicate Detection Over Massive Data Sets
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 4. April 21, 2015.
Efficient Parallel Set-Similarity Joins Using MapReduce
Parallel Set-Similarity Joins Stages
1 Token Ordering: Computes data statistics in order to generate good signatures.
The techniques in later stages utilize these statistics.
2 RID-Pair Generation: Extracts the record IDs (“RID”) and the join-attribute value from each record.
Distributes the RID and join-attribute value pairs.
The pairs sharing a signature go to at least one common reducer.
Reducers compute the similarity of the join-attribute values and output RID pairs of similar records.
3 Record Join: Generates actual pairs of joined records.
It uses the list of RID pairs from the second stage and the original data to build the pairs of similar records.
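The three stages above can be sketched in a single process, with a dictionary of groups standing in for the reducers. This is a minimal illustration, not the implementation from the cited work: it assumes word tokens, prefix tokens as signatures, and Jaccard as the similarity function, none of which the slide specifies. The function names, the sample records, and the 0.5 threshold are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(value):
    """Split a join-attribute value into a set of lowercase word tokens."""
    return set(value.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def token_ordering(records):
    """Stage 1: count token frequencies and rank tokens from rarest to most common."""
    counts = Counter(tok for _, value in records for tok in tokenize(value))
    ordered = sorted(counts, key=lambda tok: (counts[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered)}


def rid_pair_generation(records, rank, threshold):
    """Stage 2: route (RID, tokens) by signature, then verify candidates per group."""
    groups = defaultdict(list)  # signature -> [(rid, tokens)]; one group per "reducer"
    for rid, value in records:
        tokens = tokenize(value)
        ordered = sorted(tokens, key=rank.__getitem__)
        # Assumed signature scheme (prefix filter for Jaccard): two sets with
        # similarity >= threshold share a token in their first
        # |x| - ceil(threshold * |x|) + 1 tokens under the global ordering.
        prefix_len = len(ordered) - math.ceil(threshold * len(ordered)) + 1
        for signature in ordered[:prefix_len]:
            groups[signature].append((rid, tokens))

    pairs = {}  # (smaller RID, larger RID) -> similarity; dict removes repeats
    for members in groups.values():
        for (rid_a, toks_a), (rid_b, toks_b) in combinations(members, 2):
            sim = jaccard(toks_a, toks_b)
            if sim >= threshold:
                pairs[tuple(sorted((rid_a, rid_b)))] = sim
    return pairs


def record_join(records, rid_pairs):
    """Stage 3: look up the original records for each similar RID pair."""
    by_rid = dict(records)
    return [(by_rid[a], by_rid[b], sim) for (a, b), sim in sorted(rid_pairs.items())]


# Hypothetical sample data: (RID, join-attribute value).
records = [
    (1, "efficient parallel set similarity joins"),
    (2, "efficient parallel set similarity join"),
    (3, "duplicate detection over massive data"),
]

rank = token_ordering(records)
pairs = rid_pair_generation(records, rank, threshold=0.5)
joined = record_join(records, pairs)
for left, right, sim in joined:
    print(f"{sim:.2f}  {left!r} ~ {right!r}")
```

Ordering tokens from rarest to most common keeps the signature groups small, since rare tokens are shared by few records. A pair that shares several signatures is verified once per shared group, so the dictionary keyed on the sorted RID pair removes the repeats.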
Kolb, L., Thor, A., & Rahm, E. (2012). Dedoop: efficient deduplication with Hadoop. Proceedings of the VLDB Endowment, 5(12), 1878-1881.
Vernica, R., Carey, M. J., & Li, C. (2010, June). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 495-506). ACM.
Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., ... & Li, R. (2010, June). MapDupReducer: detecting near duplicates over massive datasets. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1119-1122). ACM.
Xiao, C., Wang, W., Lin, X., Yu, J. X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), 15.