Page 1
Tuning Topical Queries through Context Vocabulary Enrichment:
a Corpus-Based Approach
Carlos M Lorenzetti, Ana G Maguitman
Universidad Nacional del Sur, Av. L.N. Alem 1253, Bahía Blanca - Argentina
Grupo de Investigación en Recuperación de Información y Gestión del Conocimiento
Laboratorio de Investigación y Desarrollo en Inteligencia Artificial
CONICET AGENCIA
Page 3
Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based Approach
Lorenzetti Carlos M, Maguitman Ana G
1st QSI - OTM – Monterrey, Mexico
Context–Based Search
Java?
Page 4
Context–Based Search
Java?
• Animals
• Computers
• Consumables
• Entertainment
• Geography
• Flora
• Ships
Page 5
Context–Based Search
Context
• Articles
• Newspapers
• Others
Page 6
Context–Based Search
Java?
Context
• Articles
• Newspapers
• Others
Geography
Page 7
Query tuning
• Step 1: Query
• Step 2: Initial set of results
• Step 3: Relevance assessment
  – Supervised feedback
  – Unsupervised feedback
  – Semi-supervised feedback
• Step 4: Better representation
• Step 5: Revised set of results
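The five steps can be sketched as a small feedback loop. This is only an illustration of the unsupervised variant: the toy corpus, the overlap-based scoring and the names `search` and `expand` are assumptions made for the example, not the method presented in these slides.

```python
from collections import Counter

# Toy corpus: hypothetical documents, for illustration only.
CORPUS = {
    "d1": "java virtual machine runs java bytecode",
    "d2": "java island coffee province travel",
    "d3": "jvm java virtual machine bytecode jdk",
    "d4": "espresso coffee machine",
}

def search(query_terms, corpus, top_n=2):
    """Steps 1-2: score each document by the number of query-term
    occurrences it contains and return the ids of the best matches."""
    scores = {}
    for doc_id, text in corpus.items():
        counts = Counter(text.split())
        scores[doc_id] = sum(counts[term] for term in query_terms)
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return [doc_id for doc_id in ranked if scores[doc_id] > 0][:top_n]

def expand(query_terms, feedback_ids, corpus, extra_terms=2):
    """Steps 3-4: assume the top results are relevant (unsupervised
    feedback) and add their most frequent new terms to the query."""
    pool = Counter()
    for doc_id in feedback_ids:
        pool.update(t for t in corpus[doc_id].split() if t not in query_terms)
    best = sorted(pool, key=lambda term: (-pool[term], term))[:extra_terms]
    return list(query_terms) + best

initial_query = ["java", "virtual"]
first_results = search(initial_query, CORPUS)                 # Step 2
revised_query = expand(initial_query, first_results, CORPUS)  # Steps 3-4
revised_results = search(revised_query, CORPUS)               # Step 5
print(revised_query)
```

In supervised feedback the `feedback_ids` would come from a user's judgments instead of the top of the ranking.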
Page 10
Different Role of Terms
• Descriptors: terms that appear often in documents related to the given topic
  What is this topic about? (Recall)
• Discriminators: terms that appear only in documents related to the given topic
  What terms are useful to seek similar information? (Precision)
Page 11
Descriptors and Discriminators Computation: an example
Page 12
Descriptors and Discriminators
Topic: Java Virtual Machine
[Figure: term cloud for the topic: Java, Language, Applets, Code, NetBeans, Computers, JVM, Ruby, Programming, JDK, Virtual, Machine]
Page 13
Descriptors and Discriminators
Topic: Java Virtual Machine
[Figure: term cloud for the topic (Java, Language, Applets, Code, NetBeans, Computers, JVM, Ruby, Programming, JDK, Virtual, Machine) with the good descriptors marked]
Page 14
Descriptors and Discriminators
Topic: Java Virtual Machine
[Figure: term cloud for the topic (Java, Language, Applets, Code, NetBeans, Computers, JVM, Ruby, Programming, JDK, Virtual, Machine) with the good discriminators marked]
Page 15
Document Descriptors and Discriminators
Topic: Java Virtual Machine
H[i, j] = k, where k is the number of occurrences of term t_j in document d_i
Documents: (1) espressotec.com, (2) netbeans.org, (3) sun.com, (4) wikitravel.org

Term          Initial Context   (1)   (2)   (3)   (4)
java                 4           2     5     5     2
machine              2           0     2     3     6
virtual              1           0     1     1     0
language             1           1     1     2     0
programming          3           0     2     2     0
coffee               0           3     0     0     3
island               0           2     0     0     4
province             0           1     0     0     4
jvm                  0           0     1     2     0
jdk                  0           0     3     3     0
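As a minimal sketch, assuming each document has already been split into lowercase tokens, a count matrix of this kind can be filled by counting occurrences. The token lists and the name `build_matrix` are made up for illustration; only the meaning of H[i, j] comes from the slide.

```python
from collections import Counter

def build_matrix(documents, vocabulary):
    """Return H where H[i][j] is the number of occurrences of
    vocabulary[j] in documents[i]."""
    matrix = []
    for tokens in documents:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return matrix

# Hypothetical token lists, not the pages used on the slide.
docs = [
    ["java", "virtual", "machine", "java"],
    ["coffee", "island", "java"],
]
vocab = ["java", "machine", "coffee"]
H = build_matrix(docs, vocab)
print(H)  # [[2, 1, 0], [1, 0, 1]]
```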
Page 16
Document Descriptors
Topic: Java Virtual Machine
Descriptive power of a term in a document:
\[ \lambda(d_i, t_j) = \frac{H[i,j]}{\sqrt{\sum_{k=0}^{n-1} (H[i,k])^2}} \]

Term          Initial Context   \lambda(d_0, t_j)
java                 4          0.718
machine              2          0.359
virtual              1          0.180
language             1          0.180
programming          3          0.539
coffee               0          0.000
island               0          0.000
province             0          0.000
jvm                  0          0.000
jdk                  0          0.000
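The descriptive power values listed for the initial context can be reproduced with a few lines of Python. This is a sketch of the formula only: the counts are the initial context counts from the slide, while the names `TERMS`, `D0` and `descriptive_power` are chosen here for illustration.

```python
import math

TERMS = ["java", "machine", "virtual", "language", "programming",
         "coffee", "island", "province", "jvm", "jdk"]
# Occurrences of each term in the initial context d0 (from the slide).
D0 = [4, 2, 1, 1, 3, 0, 0, 0, 0, 0]

def descriptive_power(row):
    """lambda(d_i, t_j) = H[i,j] / sqrt(sum_k H[i,k]^2) for one document row."""
    norm = math.sqrt(sum(count * count for count in row))
    if norm == 0:
        return [0.0] * len(row)
    return [count / norm for count in row]

lam_d0 = dict(zip(TERMS, descriptive_power(D0)))
for term, value in lam_d0.items():
    print(f"{term:12s} {value:.3f}")
```

The normalization makes the row a unit vector, so the squared values sum to 1 for any non-empty document.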
Page 17
Document Discriminators
Topic: Java Virtual Machine
Discriminating power of a term in a document:
\[ \delta(t_i, d_j) = \frac{s(H^T[i,j])}{\sqrt{\sum_{k=0}^{m-1} s(H^T[i,k])}} \]
where s(k) = 1 if k > 0 and s(k) = 0 otherwise.

Term          Initial Context   \delta(t_i, d_0)
java                 4          0.447
machine              2          0.500
virtual              1          0.577
language             1          0.500
programming          3          0.577
coffee               0          0.000
island               0          0.000
province             0          0.000
jvm                  0          0.000
jdk                  0          0.000
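The discriminating power values for the initial context can be checked with the sketch below. It assumes that s is the sign function (1 for a positive count, 0 otherwise), which is consistent with the listed values, and that the document columns are in the order read from the example matrix. The names `HT`, `s` and `discriminating_power` are illustrative.

```python
import math

# Term-by-document counts from the example. Columns: d0 (initial context),
# then documents (1)-(4); the column order is an assumption of this sketch.
HT = {
    "java":        [4, 2, 5, 5, 2],
    "machine":     [2, 0, 2, 3, 6],
    "virtual":     [1, 0, 1, 1, 0],
    "language":    [1, 1, 1, 2, 0],
    "programming": [3, 0, 2, 2, 0],
    "coffee":      [0, 3, 0, 0, 3],
    "island":      [0, 2, 0, 0, 4],
    "province":    [0, 1, 0, 0, 4],
    "jvm":         [0, 0, 1, 2, 0],
    "jdk":         [0, 0, 3, 3, 0],
}

def s(count):
    """Sign function: 1 if the term occurs at all, 0 otherwise."""
    return 1 if count > 0 else 0

def discriminating_power(term_row):
    """delta(t_i, d_j) for every document j, given the counts of one term."""
    doc_frequency = sum(s(count) for count in term_row)
    if doc_frequency == 0:
        return [0.0] * len(term_row)
    return [s(count) / math.sqrt(doc_frequency) for count in term_row]

# Index 0 is the initial context d0.
delta_d0 = {term: discriminating_power(row)[0] for term, row in HT.items()}
for term, value in delta_d0.items():
    print(f"{term:12s} {value:.3f}")
```

A term that occurs in fewer documents gets a larger value in each of them, which is why virtual (3 documents) scores above java (all 5).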
Page 18
Document comparison criteria
Document similarity (cosine similarity):
\[ \sigma(d_i, d_j) = \sum_{k=0}^{n-1} \left( \lambda(d_i, t_k) \cdot \lambda(d_j, t_k) \right) \]
[Figure: documents d1 and d2 drawn as vectors in a term space with axes K1, K2, K3]
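A small sketch of this similarity on the running example follows. The count matrix is the one from the Java Virtual Machine example, with rows ordered d0, (1), (2), (3), (4); that row order and the helper names `normalize` and `similarity` are assumptions of this sketch.

```python
import math

# Columns: java, machine, virtual, language, programming,
#          coffee, island, province, jvm, jdk
H = [
    [4, 2, 1, 1, 3, 0, 0, 0, 0, 0],  # d0: initial context
    [2, 0, 0, 1, 0, 3, 2, 1, 0, 0],  # (1)
    [5, 2, 1, 1, 2, 0, 0, 0, 1, 3],  # (2)
    [5, 3, 1, 2, 2, 0, 0, 0, 2, 3],  # (3)
    [2, 6, 0, 0, 0, 3, 4, 4, 0, 0],  # (4)
]

def normalize(row):
    """Descriptive power vector of a document (unit-length count vector)."""
    norm = math.sqrt(sum(count * count for count in row))
    return [count / norm for count in row] if norm else [0.0] * len(row)

def similarity(row_a, row_b):
    """sigma(d_i, d_j): dot product of the two descriptive power vectors."""
    return sum(x * y for x, y in zip(normalize(row_a), normalize(row_b)))

sims_to_context = [similarity(H[0], H[k]) for k in range(1, len(H))]
for k, value in enumerate(sims_to_context, start=1):
    print(f"sigma(d0, d{k}) = {value:.3f}")
```

The two documents that share the programming vocabulary with the context come out far more similar to it than the other two.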
Page 19
Topic Descriptors
Topic: Java Virtual Machine
Term descriptive power in the topic of a document:
\[ \Lambda(d_i, t_j) = \frac{\sum_{k=0, k \neq i}^{m-1} \left( \sigma(d_i, d_k) \cdot \lambda(d_k, t_j)^2 \right)}{\sum_{k=0, k \neq i}^{m-1} \sigma(d_i, d_k)} \]

Term          Initial Context   \Lambda(d_0, t_j)
java                 4          0.385
machine              2          0.158
jdk                  0          0.124
coffee               0          0.089
island               0          0.064
programming          3          0.055
language             1          0.040
province             0          0.040
jvm                  0          0.032
virtual              1          0.014
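The topic-level descriptive power can be recomputed from the example counts. The sketch below reproduces the ten listed values, with java and machine on top. The row order of the documents and the helper names are assumptions; the result for d0 does not depend on the order of the other documents.

```python
import math

TERMS = ["java", "machine", "virtual", "language", "programming",
         "coffee", "island", "province", "jvm", "jdk"]
H = [
    [4, 2, 1, 1, 3, 0, 0, 0, 0, 0],  # d0: initial context
    [2, 0, 0, 1, 0, 3, 2, 1, 0, 0],  # (1)
    [5, 2, 1, 1, 2, 0, 0, 0, 1, 3],  # (2)
    [5, 3, 1, 2, 2, 0, 0, 0, 2, 3],  # (3)
    [2, 6, 0, 0, 0, 3, 4, 4, 0, 0],  # (4)
]

def normalize(row):
    """lambda(d, .): the count vector scaled to unit length."""
    norm = math.sqrt(sum(count * count for count in row))
    return [count / norm for count in row] if norm else [0.0] * len(row)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topic_descriptive_power(matrix, i):
    """Lambda(d_i, t_j) for every term j: similarity-weighted average of
    lambda(d_k, t_j)^2 over all documents k other than i."""
    lam = [normalize(row) for row in matrix]
    others = [k for k in range(len(matrix)) if k != i]
    sigma = {k: dot(lam[i], lam[k]) for k in others}
    total = sum(sigma.values())
    if total == 0:
        return [0.0] * len(matrix[0])
    return [sum(sigma[k] * lam[k][j] ** 2 for k in others) / total
            for j in range(len(matrix[0]))]

big_lambda = dict(zip(TERMS, topic_descriptive_power(H, 0)))
for term, value in sorted(big_lambda.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {value:.3f}")
```

Terms absent from the initial context, such as jdk, can still receive a noticeable weight because they are frequent in documents similar to the context.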
Page 20
Topic Discriminators
Topic: Java Virtual Machine
Term discriminating power in the topic of a document:
\[ \Delta(t_i, d_j) = \sum_{k=0, k \neq j}^{m-1} \left( \delta(t_i, d_k)^2 \cdot \sigma(d_k, d_j) \right) \]

Term          Initial Context   \Delta(t_i, d_0)
jvm                  0          0.848
jdk                  0          0.848
virtual              1          0.566
programming          3          0.566
machine              2          0.524
language             1          0.517
java                 4          0.493
coffee               0          0.385
island               0          0.385
province             0          0.385
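The topic-level discriminating power can be recomputed in the same way. The sketch below assumes s is the sign function, so that the squared document-level discriminating power of a term is 1 divided by the number of documents containing it, and it reproduces the listed values (jvm and jdk on top). Row order and helper names are assumptions of this sketch.

```python
import math

TERMS = ["java", "machine", "virtual", "language", "programming",
         "coffee", "island", "province", "jvm", "jdk"]
H = [
    [4, 2, 1, 1, 3, 0, 0, 0, 0, 0],  # d0: initial context
    [2, 0, 0, 1, 0, 3, 2, 1, 0, 0],  # (1)
    [5, 2, 1, 1, 2, 0, 0, 0, 1, 3],  # (2)
    [5, 3, 1, 2, 2, 0, 0, 0, 2, 3],  # (3)
    [2, 6, 0, 0, 0, 3, 4, 4, 0, 0],  # (4)
]

def normalize(row):
    """lambda(d, .): the count vector scaled to unit length."""
    norm = math.sqrt(sum(count * count for count in row))
    return [count / norm for count in row] if norm else [0.0] * len(row)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topic_discriminating_power(matrix, j):
    """Delta(t_i, d_j) for every term i: sum over documents k != j of
    delta(t_i, d_k)^2 * sigma(d_k, d_j)."""
    lam = [normalize(row) for row in matrix]
    result = []
    for t in range(len(matrix[0])):
        doc_frequency = sum(1 for row in matrix if row[t] > 0)
        total = 0.0
        for k in range(len(matrix)):
            if k == j or matrix[k][t] == 0:
                continue  # delta(t, d_k) is 0 when the term is absent from d_k
            # delta(t, d_k)^2 equals 1 / doc_frequency when the term occurs
            total += dot(lam[k], lam[j]) / doc_frequency
        result.append(total)
    return result

big_delta = dict(zip(TERMS, topic_discriminating_power(H, 0)))
for term, value in sorted(big_delta.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {value:.3f}")
```

jvm and jdk never occur in the initial context, yet they rank first because they appear only in the documents most similar to it.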
Page 21
Proposed Algorithm
[Figure: cycle with steps numbered 1 to 4. Elements: Context; Terms w1, w2, ..., wm; Roulette; queries query 01, query 02, query 03, ..., query n; results result 01, result 02, result 03, ..., result n; DESCRIPTORS with weights w1 0.5, w2 0.25, ..., wm 0.1; DISCRIMINATORS with weights w1 0.4, w2 0.37, ..., wm 0.01]
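The Roulette element suggests weight-proportional term selection when queries are formed. The sketch below shows a generic roulette-wheel selection without repetition. The term names, the weights, the query size and the function names are hypothetical; the slides do not specify how the roulette is implemented.

```python
import random

def roulette_pick(weights, rng):
    """Pick one term with probability proportional to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    spin = rng.random() * total
    running = 0.0
    for term, weight in weights.items():
        running += weight
        if spin < running:
            return term
    return term  # only reached through floating-point round-off

def form_query(weights, size, rng):
    """Draw up to `size` distinct terms, each by one spin of the wheel."""
    remaining = {t: w for t, w in weights.items() if w > 0}
    query = []
    while remaining and len(query) < size:
        term = roulette_pick(remaining, rng)
        query.append(term)
        del remaining[term]
    return query

# Hypothetical term weights, for illustration only.
term_weights = {"java": 0.5, "machine": 0.25, "jvm": 0.15, "jdk": 0.1, "ruby": 0.0}
rng = random.Random(0)  # fixed seed so that runs are repeatable
queries = [form_query(term_weights, 3, rng) for _ in range(4)]
for query in queries:
    print(" ".join(query))
```

Heavier terms show up in more queries, while lighter ones still get an occasional chance, which keeps some exploration in the vocabulary.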
Page 23
Evaluation
Local Index
• DMOZ – ODP Project
  – ~ 500 Topics
  – 3rd level of the hierarchy
  – At least 100 URLs
  – English language
  – > 350,000 pages
• Initial Context
• vs. Bo1 & Baseline
  – Novelty-Driven Similarity
  – Precision
  – Semantic Precision
[Figure: fragment of the ODP hierarchy. Top; 1st level: Home, Science, Arts; 2nd level: Cooking, Family; 3rd level: Childcare]
Page 24
Evaluation – N Similarity
[Figure: novelty-driven similarity (y axis, 0 to 1) against iteration (x axis, 0 to 180) for the topic Top/Computers/Open_Source/Software; series: Maximum, Average, Minimum; annotations: Context update, Query formulation and retrieval process]

N Similarity   Mean     95% CI
1st            0.0661   [0.0618; 0.0704]
best           0.5970   [0.5866; 0.6073]
Page 25
Evaluation – N Similarity

N Similarity   Mean    95% CI
Bo1-DFR        0.075   [0.0710; 0.0803]
Incremental    0.597   [0.5866; 0.6073]
Baseline       0.087   [0.0822; 0.0924]
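For readers who want to build a comparison of this kind, the sketch below computes a mean with a 95% confidence interval and checks whether two intervals overlap. The normal approximation (1.96 standard errors) and the sample scores are assumptions; the slides do not state how the intervals were obtained. Only the three intervals at the end are taken from the slide.

```python
import math
import statistics

def mean_ci95(samples):
    """Mean and normal-approximation 95% confidence interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)

def overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical per-topic scores, for illustration only.
scores = [0.55, 0.61, 0.58, 0.63, 0.60]
mean, interval = mean_ci95(scores)
print(f"mean = {mean:.3f}, 95% CI = [{interval[0]:.4f}; {interval[1]:.4f}]")

# Intervals reported on the slide.
incremental = (0.5866, 0.6073)
baseline = (0.0822, 0.0924)
bo1_dfr = (0.0710, 0.0803)
print(overlap(incremental, baseline))  # False
print(overlap(baseline, bo1_dfr))      # False
```

Non-overlapping intervals are a simple sign that the difference between two means is unlikely to be due to sampling noise alone.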
Page 26
Evaluation – Precision
[Figure: chart with shares Incremental (66.96%), Bo1-DFR (24.33%), Baseline (8.7%)]

Precision     Mean    95% CI
Bo1-DFR       0.307   [0.2859; 0.3298]
Incremental   0.354   [0.3325; 0.3764]
Baseline      0.266   [0.2461; 0.2863]
Page 27
Evaluation – Semantic Precision
[Figure: chart with shares Incremental (65.18%), Bo1-DFR (27.90%), Baseline (6.92%)]

Semantic Precision   Mean    95% CI
Bo1-DFR              0.590   [0.5750; 0.6066]
Incremental          0.622   [0.6068; 0.6372]
Baseline             0.553   [0.5383; 0.5679]
Page 28
Conclusions
• We presented an intelligent IR approach for learning context-specific terms.
  – It takes advantage of the user context.
• We evaluated the approach and showed the effectiveness of incremental methods.
Future Work
  – Investigate different parameters.
  – Develop methods to learn and adjust parameters.
  – Run additional tests using other IR metrics.
Page 29
Thank you!
CONICET AGENCIA
Laboratorio de Investigación y Desarrollo en Inteligencia Artificial
lidia.cs.uns.edu.ar
Universidad Nacional del Sur, Bahía Blanca
www.uns.edu.ar