SVD Example: Users-to-Movies (SVD-Applications.pdf, 18/09/2017)
18/09/2017
Singular Value Decomposition
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ – example: Users to Movies
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Rows of A are users (the first four are SciFi fans, the last three Romance fans); columns are the movies Matrix, Alien, Serenity, Casablanca, Amelie. The shared inner dimension of U, Σ and Vᵀ indexes the “concepts”, a.k.a. latent dimensions or latent factors.

A =
  1 1 1 0 0
  3 3 3 0 0
  4 4 4 0 0
  5 5 5 0 0
  0 2 0 4 4
  0 0 0 5 5
  0 1 0 2 2

U =
   0.13  0.02 -0.01
   0.41  0.07 -0.03
   0.55  0.09 -0.04
   0.68  0.11 -0.05
   0.15 -0.59  0.65
   0.07 -0.73 -0.67
   0.07 -0.29  0.32

Σ =
  12.4  0    0
   0    9.5  0
   0    0    1.3

Vᵀ =
   0.56  0.59  0.56  0.09  0.09
   0.12 -0.02  0.12 -0.69 -0.69
   0.40 -0.80  0.40  0.09  0.09
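The factorization above can be reproduced numerically. A minimal sketch (my own illustration, not from the slides), assuming NumPy and the ratings matrix as shown; signs of the singular vectors may differ from the slide:

```python
import numpy as np

# ratings matrix A: rows = users, columns = movies
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

# thin SVD: U is 7x5, s holds the singular values, Vt is 5x5
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(s, 1))               # leading values near 12.4, 9.5, 1.3
print(np.allclose((U * s) @ Vt, A)) # U @ diag(s) @ Vt reconstructs A
```

Only three singular values are (essentially) nonzero, matching the three concepts on the slide.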
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ – example: Users to Movies

The first concept (first column of U, first diagonal entry of Σ, first row of Vᵀ) is the SciFi-concept; the second is the Romance-concept. (Same A = U Σ Vᵀ factorization as above.)
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ – example:

U is the “user-to-concept” similarity matrix: its first column scores each user against the SciFi-concept, its second against the Romance-concept. (Factorization as above.)
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ – example:

The diagonal entries of Σ give the “strength” of each concept: 12.4 for the SciFi-concept, 9.5 for the Romance-concept. (Factorization as above.)
SVD – Example: Users-to-Movies
• A = U Σ Vᵀ – example:

V is the “movie-to-concept” similarity matrix: the first row of Vᵀ scores each movie against the SciFi-concept. (Factorization as above.)
SVD - Interpretation #1
‘movies’, ‘users’ and ‘concepts’:
• U: user-to-concept similarity matrix
• V: movie-to-concept similarity matrix
• Σ: its diagonal elements give the ‘strength’ of each concept
Case study: How to query?
• Q: Find users that like ‘Matrix’
• A: Map the query into ‘concept space’ – how?

(The slide repeats the A = U Σ Vᵀ factorization shown above.)
Case study: How to query?
• Q: Find users that like ‘Matrix’
• A: Map the query into ‘concept space’ – how?

q = [5 0 0 0 0]   (ratings for Matrix, Alien, Serenity, Casablanca, Amelie)

Project into concept space: take the inner product of q with each ‘concept’ vector vᵢ. (The slide plots q against the concept axes v₁ and v₂.)
Case study: How to query?
• Q: Find users that like ‘Matrix’
• A: Map the query into ‘concept space’ – how?

(Same setup as above; the plot additionally marks the projection q·v₁ onto the first concept axis.)
Case study: How to query?
Compactly, we have: q_concept = q V
E.g.:

q = [5 0 0 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie)

V (movie-to-concept similarities, first two concepts):
   0.56  0.12
   0.59 -0.02
   0.56  0.12
   0.09 -0.69
   0.09 -0.69

q V = [2.8 0.6]   (2.8 on the SciFi-concept, 0.6 on the Romance-concept)
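The projection above can be sketched in code (a hedged illustration, assuming NumPy and the ratings matrix from the earlier slides; numpy's singular vectors may carry the opposite sign, so magnitudes are compared):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                       # columns v_1, v_2, ... are the concept vectors

q = np.array([5, 0, 0, 0, 0], dtype=float)   # a user who rated only 'Matrix'
q_concept = q @ V[:, :2]       # inner products with the first two concepts
print(np.abs(np.round(q_concept, 1)))        # magnitudes ~ [2.8, 0.6]
```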
Case study: How to query?
• How would the user d that rated (‘Alien’, ‘Serenity’) be handled? d_concept = d V
E.g.:

d = [0 4 5 0 0]   (Matrix, Alien, Serenity, Casablanca, Amelie)

V (movie-to-concept similarities, first two concepts):
   0.56  0.12
   0.59 -0.02
   0.56  0.12
   0.09 -0.69
   0.09 -0.69

d V = [5.2 0.4]   (strongly on the SciFi-concept)
Case study: How to query?
• Observation: user d, who rated (‘Alien’, ‘Serenity’), will be similar to user q, who rated (‘Matrix’), although d and q have zero ratings in common!

q = [5 0 0 0 0]  → q_concept = [2.8 0.6]
d = [0 4 5 0 0]  → d_concept = [5.2 0.4]

Zero ratings in common, yet similarity ≠ 0 in concept space.
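The observation can be checked numerically: q and d have zero cosine similarity in rating space but are nearly parallel in concept space. A sketch assuming NumPy (cosine similarity is my choice of similarity measure here):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V2 = Vt[:2].T                  # first two concept vectors as columns

q = np.array([5, 0, 0, 0, 0], dtype=float)   # rated only 'Matrix'
d = np.array([0, 4, 5, 0, 0], dtype=float)   # rated 'Alien' and 'Serenity'

def cosine(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine(q, d))            # 0.0 -- no ratings in common
print(cosine(q @ V2, d @ V2))  # close to 1: both are SciFi users
```

The sign ambiguity of the singular vectors does not matter here, since the same V is applied to both users.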
Centering the Data
• In some applications it is useful to center the data before applying the SVD
  – Centering consists of subtracting from each row of A the mean of the rows of the matrix (the centroid of the points represented by the rows of A)

Centering the Data
• Applying the SVD without centering yields the subspace (a hyperplane through the origin) that minimizes the sum of squared perpendicular distances to the data points of A
• Applying the SVD after centering yields the affine space that minimizes the sum of squared perpendicular distances to the data points of A
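A minimal sketch of centering before the SVD (my own illustration on synthetic data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5)) + 10.0   # synthetic data far from the origin

centroid = A.mean(axis=0)              # mean of the rows of A
Ac = A - centroid                      # centered data
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)

# rows of Vt span the best-fit subspace through the centroid;
# the best-fit affine space is: centroid + span of the top rows of Vt
print(np.allclose(Ac.mean(axis=0), 0))   # rows now average to zero
```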
Centering the Data

Data Compression
Scenario
• 130 images of handwritten ‘3’s
  – each represented as a 16×16 matrix
  – each representation needs 16×16 floats ⇒ the images live in R²⁵⁶

Data Compression
• A model with two directions
• In total there are 16×16 = 256 possible directions
• The 12 leading principal directions account for 63% of the variance, and 50 of them for 90%
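The 63%/90% figures come from the actual digit collection, which is not reproduced here; the computation itself can be sketched on synthetic stand-in data (a hypothetical illustration, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(130, 256))        # 130 synthetic "images" in R^256
Xc = X - X.mean(axis=0)                # center the collection first

s = np.linalg.svd(Xc, compute_uv=False)  # singular values only

explained = s**2 / np.sum(s**2)        # variance fraction per direction
cumulative = np.cumsum(explained)
print(cumulative[11], cumulative[49])  # fractions explained by 12 and 50 directions
```

On the real digit data these two numbers would come out near 0.63 and 0.90.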
Data Compression
(Circled points in the left-hand plot)

Data Compression
• The previous figure shows the collection of ‘3’ digits projected onto the space of the first two principal directions
• Any interpretation for the components?

Data Compression
• In that projection:
  – first component: horizontal motion of the lower part of the ‘3’
  – second component: thickness

Data Compression
Approximate Similarity
Input. A matrix A representing a collection of n documents over a vocabulary of d terms.
Output. Given a set of queries (new documents) q(1),...,q(p), efficiently compute the similarity (inner product) between each document and each query.
Approximate Similarity
Approach 1
• The vector of similarities between A and a query q is given by Aq.
• It can be computed in O(nd) time.
Approximate Similarity
Approach 2
• Use A_k instead of A
  – We have A_k = σ₁u₁v₁ᵀ + … + σ_k u_k v_kᵀ, where uᵢ is a vector in Rⁿ and vᵢ is a vector in R^d
  – Hence each term σᵢuᵢvᵢᵀq can be computed in O(n + d):
    • compute vᵢᵀq in O(d), then σᵢuᵢ(vᵢᵀq) in O(n)
  – Total complexity: O(k(n + d))
  – Advantageous if k ≪ min{n, d}
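A sketch of Approach 2 under stated assumptions (random data standing in for a real document collection):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 500, 300, 10
A = rng.normal(size=(n, d))            # n documents over d terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]  # top-k factors of A_k

q = rng.normal(size=d)                 # a query document
# sum_i sigma_i u_i (v_i^T q): k inner products in O(d) each, then
# k scaled vectors in O(n) each -> O(k(n + d)) instead of O(nd)
approx = Uk @ (sk * (Vtk @ q))
exact = A @ q                          # the O(nd) baseline
print(np.linalg.norm(approx - exact))  # residual from the discarded factors
```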
Approximate Similarity
Approach 2
• What is the error of using A_k?
  – The maximum over q of |Aq − A_k q| = |(A − A_k)q|
  – The error can be unbounded if q is unconstrained
Approximate Similarity
Def. The spectral norm ||A||₂ of A equals max{ |Ax| : |x| ≤ 1 }
• ||A − A_k||₂ measures the largest possible error over vectors of norm at most 1
• It can be shown (Theorem 3.9) that
  ||A − A_k||₂ ≤ ||A − B||₂
  for every matrix B of rank at most k
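As a numerical sanity check of the discussion above (my own sketch, assuming NumPy): for the truncated SVD the spectral-norm error ||A − A_k||₂ equals the (k+1)-th singular value.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = (U[:, :k] * s[:k]) @ Vt[:k]       # rank-k truncation A_k
err = np.linalg.norm(A - Ak, ord=2)    # spectral norm ||A - A_k||_2
print(err - s[k])                      # ~0: the error equals sigma_{k+1}
```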
Document Ranking
• A collection of n documents over a vocabulary of d terms
• We can represent the collection by a matrix A with n rows and d columns
How can we rank the documents of the collection by their intrinsic importance?
Document Ranking
Ranking
1. Compute the principal direction v₁ of the collection (the best one-dimensional representation of the collection under the Frobenius norm)
2. Project each document onto the direction v₁ and sort from largest to smallest (the coordinates of u₁)
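The two steps above can be sketched on a tiny hypothetical term-count matrix (the documents and counts are made up for illustration):

```python
import numpy as np

# hypothetical collection: 4 documents x 5 terms
A = np.array([[3, 2, 0, 1, 0],
              [1, 1, 0, 0, 0],
              [4, 3, 1, 2, 1],
              [0, 0, 2, 0, 3]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                       # principal direction of the collection
scores = A @ v1                  # = sigma_1 * (coordinates of u_1)
if scores.sum() < 0:             # fix the sign ambiguity of v1
    scores = -scores
ranking = np.argsort(-scores)    # most "important" document first
print(ranking)
```

Here document 2, whose counts dominate the shared vocabulary direction, comes out on top.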
HITS: Hubs and Authorities
Hubs and Authorities
• HITS (Hypertext-Induced Topic Selection)
  – A measure of importance of pages or documents, similar to PageRank
  – Proposed at around the same time as PageRank (’98)
• Goal: say we want to find good newspapers
  – Don’t just find newspapers. Find “experts” – people who link in a coordinated way to good newspapers
• Idea: links as votes
  – A page is more important if it has more links
• In-coming links? Out-going links?
Finding newspapers
• Hubs and Authorities: each page has 2 scores
  – Quality as an expert (hub): total sum of the votes of the authorities it points to
  – Quality as content (authority): total sum of the votes coming from experts
• Principle of repeated improvement

(Figure: example authority scores – NYT: 10, WSJ: 9, CNN: 8, Ebay: 3, Yahoo: 3)
Hubs and Authorities
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
   – newspaper home pages
   – course home pages
   – home pages of auto manufacturers
2. Hubs are pages that link to authorities
   – lists of newspapers
   – course bulletins
   – lists of US auto manufacturers
Counting in-links: Authority
Each page starts with hub score 1; authorities collect their votes.
(Note: this is an idealized example – in reality the graph is not bipartite and each page has both a hub and an authority score.)
Counting in-links: Authority
Each page starts with hub score 1; authorities collect their votes – e.g. NYT’s authority score is the sum of the hub scores of the nodes pointing to NYT.
(Idealized example, as above.)
Expert Quality: Hub
Hubs collect authority scores: each node sums the authority scores of the nodes it points to.
(Idealized example, as above.)
Reweighting
Authorities again collect the hub scores.
(Idealized example, as above.)
Mutually Recursive Definition
• A good hub links to many good authorities
• A good authority is linked from many good hubs
• Model using two scores for each node:
  – hub score and authority score
  – represented as vectors 𝒉 and 𝒂
Hubs and Authorities                                        [Kleinberg ’98]
• Each page i has 2 scores:
  – Authority score: aᵢ
  – Hub score: hᵢ

HITS algorithm:
• Initialize: aⱼ(0) = 1/√N, hⱼ(0) = 1/√N
• Then keep iterating until convergence:
  – ∀i: Authority: aᵢ(t+1) = Σ_{j→i} hⱼ(t)
  – ∀i: Hub: hᵢ(t+1) = Σ_{i→j} aⱼ(t)
  – ∀i: Normalize: Σᵢ (aᵢ(t+1))² = 1, Σⱼ (hⱼ(t+1))² = 1

That is, a page’s authority score aᵢ = Σ_{j→i} hⱼ sums the hub scores of the pages j₁, j₂, … that link to i, and its hub score hᵢ = Σ_{i→j} aⱼ sums the authority scores of the pages it links to.
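The iteration above can be sketched on a tiny hypothetical 4-page link graph (the graph is made up; this variant updates h from the freshly computed a, which converges to the same fixed point as the synchronous update):

```python
import numpy as np

# L[i, j] = 1 if page i links to page j (hypothetical link graph)
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

N = L.shape[0]
a = np.ones(N) / np.sqrt(N)      # a_j(0) = 1/sqrt(N)
h = np.ones(N) / np.sqrt(N)      # h_j(0) = 1/sqrt(N)

for _ in range(100):             # iterate until (near) convergence
    a = L.T @ h                  # authority: sum of hub scores pointing in
    a /= np.linalg.norm(a)       # normalize so the sum of squares is 1
    h = L @ a                    # hub: sum of authority scores pointed to
    h /= np.linalg.norm(h)

print(np.round(a, 2), np.round(h, 2))
```

Page 2, which receives links from three of the four pages, ends up with the highest authority score; the scores converge to the principal singular vectors of L.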
Hubs and Authorities
• HITS converges to a single stable point