SEARCH4SIMILARS at scale

Mar 22, 2017

Page 1: Search4similars

SEARCH4SIMILARS

at scale

Page 2: Search4similars

What do you mean by similar?

■ Jaccard distance

■ Cosine distance

■ Lots of others
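To make the two named measures concrete, here is a minimal sketch of both. The function names, the tokenization by whitespace, and the sample sentences are assumptions made for illustration, not something taken from the slides.

```python
import math
from collections import Counter


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_distance(u: Counter, v: Counter) -> float:
    """1 - cos(angle) between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for an all-zero vector
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical sample documents, split on whitespace.
doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the rug".split()

print(jaccard_distance(set(doc_a), set(doc_b)))          # works on sets of words
print(cosine_distance(Counter(doc_a), Counter(doc_b)))   # works on word counts
```

Jaccard ignores how often a word occurs, while cosine takes the counts into account, so the two can rank the same pair of documents differently.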

Page 3: Search4similars

Deduplication / Plagiarism LSH

[Figure: pairwise comparison matrix with items A to G as both rows and columns]

All you need is to compare each object with every other one: O(n²).

Your cap: compare only similar items.
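One common way to "compare only similar items" is MinHash signatures plus LSH banding: documents that agree on at least one whole band of their signature land in the same bucket, and only bucket-mates are compared. The sketch below is an illustration under assumptions, not the deck's implementation: character 3-shingles, seeded MD5 as the hash family, and the `bands`/`rows` values are all choices made here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles of a string (text must have at least k characters)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(items: set, num_hashes: int) -> list:
    """One minimum per seeded hash function (seeded MD5, chosen for determinism)."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items)
        for seed in range(num_hashes)
    ]


def candidate_pairs(docs: dict, bands: int = 8, rows: int = 2) -> set:
    """Pairs of document ids whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical documents for illustration.
docs = {
    "A": "locality sensitive hashing at scale",
    "B": "locality sensitive hashing at scale",  # exact duplicate of A
    "C": "zzz qqq xxx www",                      # shares no shingles with A
}
print(candidate_pairs(docs))
```

Only the candidate pairs then need an exact similarity check. More rows per band makes the filter stricter, more bands makes it more permissive.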

Page 4: Search4similars

LSH Applications

■ Near-duplicate detection
■ Hierarchical clustering
■ Genome-wide association study
■ Image similarity identification
■ VisualRank
■ Gene expression similarity identification
■ Audio similarity identification
■ Nearest neighbor search
■ Audio fingerprint
■ Digital video fingerprinting

Page 5: Search4similars

LSH is a dimensionality reduction technique

■ Batch algorithm
■ The word “the” should not count the same as the word “bozo” when we compare two documents
– LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ Hard to analyze
■ If you add new documents, you can’t find similar ones in real time
– some online-related work for restricted cases (http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL11.pdf)
■ LSH will treat “cool project” and “cool room” as more similar than “cool room” and “cold hall”
■ A good fit for finding very similar objects; not optimal for finding objects that are only somewhat similar
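The “cool project” versus “cold hall” point can be checked by hand on the underlying measure that LSH approximates. The sketch below assumes Jaccard similarity over word sets; the function name is hypothetical.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two non-empty phrases."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


# The two pairs share one word and no words, respectively.
lexical = jaccard_similarity("cool project", "cool room")
semantic = jaccard_similarity("cool room", "cold hall")
print(lexical, semantic)
```

The first pair shares a word, so it scores higher, even though the second pair is closer in meaning. A surface measure sees tokens, not sense.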

Page 6: Search4similars

Search 4 sense

■ Bayes' theorem
■ Bayesian statistics
■ Conjugate prior
■ Probabilistic graphical models
■ Topic modeling
■ pLSA / LDA

Page 7: Search4similars

Bayes' theorem

P(A | B) = P(B | A) P(A) / P(B)

where A and B are events and P(B) ≠ 0.

■ P(A) and P(B) are the probabilities of A and B without regard to each other.
■ P(A | B), a conditional probability, is the probability of observing event A given that B is true.
■ P(B | A) is the probability of observing event B given that A is true.
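A small numeric sketch of Bayes' theorem, P(A | B) = P(B | A) P(A) / P(B), with P(B) expanded by the law of total probability. The scenario and all three input probabilities are hypothetical numbers chosen for illustration.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """P(A | B) via Bayes' theorem, with P(B) from the law of total probability."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: A = "document is a duplicate", B = "detector fires".
p = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(p)
```

With a rare event A, even a fairly accurate detector yields a modest P(A | B), because most firings come from the much larger non-A population.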

Page 8: Search4similars

Bayesian vs Frequentist statistics

■ Coin tossing
– the coin landed heads 4 times out of 5
■ Conjugate prior
■ Exponential family
■ Sufficient statistic
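The coin example can be worked through both ways. The frequentist estimate is the observed frequency. The Bayesian estimate below assumes a Beta prior, which is conjugate to the binomial likelihood, so the posterior is again a Beta distribution. The uniform Beta(1, 1) prior is an assumption made for this sketch.

```python
from fractions import Fraction

# Sufficient statistic for the coin: the counts of heads and tails.
heads, tails = 4, 1

# Frequentist point estimate: the maximum-likelihood value of P(heads).
mle = Fraction(heads, heads + tails)

# Bayesian estimate with a Beta(alpha, beta) prior. Conjugacy means the
# posterior is Beta(alpha + heads, beta + tails).
alpha, beta = 1, 1  # uniform prior: an assumption made for this sketch
post_alpha, post_beta = alpha + heads, beta + tails
posterior_mean = Fraction(post_alpha, post_alpha + post_beta)

print(mle, posterior_mean)
```

The posterior mean is pulled from 4/5 toward the prior mean of 1/2, and the pull shrinks as more tosses are observed.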

Page 9: Search4similars

Probabilistic Graphical Models

Page 10: Search4similars

Topic modeling

Page 11: Search4similars

Topic modeling assumptions

■ Word order within a document does not matter (bag of words)
■ The most common words do not characterize a topic
■ A document collection can be represented as a set of document-word pairs
■ Each topic can be described by an unknown distribution over words
■ Independence assumption
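The bag-of-words and document-word-pair assumptions can be sketched as a table of counts keyed by (document, word). The document ids and texts below are hypothetical.

```python
from collections import Counter

# Hypothetical two-document collection.
docs = {
    "d1": "cool project cool room",
    "d2": "cold hall",
}

# counts[(d, w)] = number of times word w occurs in document d.
# Word order inside each document is discarded.
counts = Counter((d, w) for d, text in docs.items() for w in text.split())

# Document lengths, then the empirical estimate of p(w | d).
doc_len = Counter()
for (d, _), n in counts.items():
    doc_len[d] += n
p_w_given_d = {(d, w): n / doc_len[d] for (d, w), n in counts.items()}

print(counts)
print(p_w_given_d)
```

These counts are the only input a topic model of this kind sees. Any reordering of the words in a document produces the same table.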

Page 12: Search4similars

Probabilistic latent semantic analysis (pLSA)
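The slide's own formula did not survive in the transcript. As a reminder, pLSA is usually stated as the mixture below. The notation (d for documents, w for words, t for topics in a set T, n_{dw} for the count of word w in document d) is an assumption that follows common presentations, not necessarily the deck's.

```latex
% Each document is a mixture of topics; each topic is a distribution over words.
% This uses the conditional independence assumption p(w \mid d, t) = p(w \mid t).
p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d)

% The parameters are fitted by maximizing the log-likelihood of the counts,
% typically with the EM algorithm:
\mathcal{L} = \sum_{d} \sum_{w} n_{dw} \ln \sum_{t \in T} p(w \mid t)\, p(t \mid d)
```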

Page 13: Search4similars

LDA

■ Almost the same as pLSA, but with a Dirichlet distribution as the prior
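To give a feel for what a Dirichlet prior over topic mixtures looks like, here is a minimal sketch that draws samples by normalizing independent Gamma draws. The number of topics, the concentration values, and the seed are assumptions made for illustration. This shows the prior only, not LDA inference.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws

# Hypothetical setup: 3 topics with a symmetric prior.
theta_sparse = sample_dirichlet([0.5, 0.5, 0.5], rng)     # small alpha
theta_smooth = sample_dirichlet([10.0, 10.0, 10.0], rng)  # large alpha

print(theta_sparse)
print(theta_smooth)
```

Small concentration values tend to put most of a document's mass on a few topics, while large values tend toward even mixtures.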

Page 14: Search4similars

Links

Mining Massive Datasets
■ http://infolab.stanford.edu/~ullman/mmds/book.pdf
■ https://ru.coursera.org/course/mmds
■ http://www.mmds.org/

K. Vorontsov. Machine Learning
■ https://www.youtube.com/watch?v=H7hlSz4WWhQ
■ https://www.youtube.com/watch?v=EOmv7fakk5E
■ http://www.machinelearning.ru/wiki/images/2/22/Voron-2013-ptm.pdf

D. Vetrov. Bayes Statistics
■ https://compscicenter.ru/courses/bayes-course/2015-summer/

D. Koller. Probabilistic Graphical Models
■ https://ru.coursera.org/course/pgm

■ https://en.wikipedia.org/wiki/Jaccard_index
■ https://en.wikipedia.org/wiki/Cosine_similarity
■ https://en.wikipedia.org/wiki/MinHash
■ https://en.wikipedia.org/wiki/Locality-sensitive_hashing
■ LSH for Cosine Distance (http://arxiv.org/pdf/1110.1328.pdf)
■ https://en.wikipedia.org/wiki/Bayesian_statistics
■ https://en.wikipedia.org/wiki/Conjugate_prior
■ https://en.wikipedia.org/wiki/Sufficient_statistic
■ https://en.wikipedia.org/wiki/Graphical_model
■ https://en.wikipedia.org/wiki/Topic_model
■ https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
■ https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

Page 15: Search4similars

Repository & Chats announcements

■ Github: https://github.com/scalalab3
– https://github.com/scalalab3/chatbot-engine
– https://github.com/scalalab3/logs-service
– https://github.com/scalalab3/lyrics-engine
■ Gitter: https://gitter.im/scalalab3/all
– https://gitter.im/scalalab3/lyrics-engine
– https://gitter.im/scalalab3/logs-service
– http://gitter.im/scalalab3/chatbot-engine