One Size Fits All? A Simple Technique to Perform Several NLP Tasks
Daniel Gayo-Avello (University of Oviedo)
Feb 03, 2016
Introduction

• blindLight is a modified vector model with applications to several NLP tasks:
  – Automatic summarization,
  – categorization,
  – clustering, and
  – information retrieval.
Vector Model vs. blindLight Model

Vector Model
• Document → D-dimensional vector of terms.
• D → number of distinct terms within the whole collection of documents.
• Terms → words/stems/character n-grams.
• Term weights → function of "in-document" term frequency, "in-collection" term frequency, and document length.
• Association measures → symmetric: Dice, Jaccard, Cosine, …
Issues
• Document vectors ≠ document representations…
• … but document representations with regards to the whole collection.
• Curse of dimensionality → feature reduction.
• Feature reduction when using n-grams as terms → ad hoc thresholds.
blindLight Model
• Different documents → different-length vectors.
• D? No collection → no D → no vector space!
• Terms → just character n-grams.
• Term weights → in-document n-gram significance (a function of just the document's term frequencies).
• Similarity measure → asymmetric (in fact, two association measures). A kind of light pairwise alignment (A vs. B ≠ B vs. A).
Advantages
• Document vectors = unique document representations…
• Suitable for ever-growing document sets.
• Bilingual IR is trivial.
• Highly tunable by linearly combining the two association measures.

Issues
• Not tuned yet, so…
• … poor performance with broad topics.
What's n-gram significance?
• Can we know how important an n-gram is within just one document, without regard to any external collection?
• A similar problem: extracting multiword items from text (e.g. European Union, Mickey Mouse, Cross Language Evaluation Forum).
• Solution by Ferreira da Silva and Pereira Lopes:
  – Several statistical measures generalized to be applied to arbitrary-length word n-grams.
  – A new measure: Symmetrical Conditional Probability (SCP), which outperforms the others.
• So, our proposal to answer the first question: if SCP reveals the most significant multiword items within just one document, it can be applied to rank the character n-grams of a document according to their significances.
• Equations for SCP:
• (w1…wn) is an n-gram. Suppose we use quad-grams and take (igni) from the text "What's n-gram significance". Its prefix/suffix splits are:
  – (w1…w1) / (w2…w4) = (i) / (gni)
  – (w1…w2) / (w3…w4) = (ig) / (ni)
  – (w1…w3) / (w4…w4) = (ign) / (i)
  – For instance, p((w1…w1)) = p((i)) would be computed from the relative frequency of appearance within the document of n-grams starting with i (e.g. (igni), (ific), or (ican)).
  – In turn, p((w4…w4)) = p((i)) would be computed from the relative frequency of appearance within the document of n-grams ending with i (e.g. (m_si), (igni), or (nifi)).
What's n-gram significance? (cont.)
  SCP_f((w1…wn)) = p((w1…wn))² / Avp

  Avp = (1 / (n−1)) · Σ_{i=1…n−1} p((w1…wi)) · p((wi+1…wn))
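Under the reconstruction above, the in-document SCP ranking of character quad-grams can be sketched in Python. The exact probability estimates used by blindLight may differ; treat the details here (spaces mapped to `_`, the 1/(n−1) averaging over the three splits) as assumptions:

```python
from collections import Counter

def quad_grams(text):
    """Split a text into overlapping character 4-grams (spaces kept as '_')."""
    text = text.replace(" ", "_")
    return [text[i:i + 4] for i in range(len(text) - 3)]

def scp_significances(text):
    """Rank the 4-grams of a single document by Symmetrical Conditional
    Probability, using only in-document frequencies (no external collection)."""
    grams = quad_grams(text)
    total = len(grams)
    freq = Counter(grams)
    # p of a prefix (w1..wi): relative frequency of 4-grams starting with it;
    # p of a suffix (wi+1..w4): relative frequency of 4-grams ending with it.
    prefix, suffix = Counter(), Counter()
    for g, f in freq.items():
        for i in range(1, 4):
            prefix[g[:i]] += f
            suffix[g[i:]] += f
    scores = {}
    for g, f in freq.items():
        p = f / total
        # Avp averages p(prefix)·p(suffix) over the three splits of a 4-gram.
        avp = sum((prefix[g[:i]] / total) * (suffix[g[i:]] / total)
                  for i in range(1, 4)) / 3
        scores[g] = p * p / avp
    return scores
```

Sorting the returned dictionary by descending score yields the significance-ranked vector for the document.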
What's n-gram significance? (cont.)
• The current implementation of blindLight uses quad-grams because…
  – they provide better results than tri-grams, and
  – their significances are computed faster than those of n-grams with n ≥ 5.
• How would it work when mixing n-grams of different lengths within the same document vector? An interesting question to address in the future…
• Two example blindLight document vectors:
  – Q document: Cuando despertó, el dinosaurio todavía estaba allí. ("When it awoke, the dinosaur was still there.")
  – T document: Quando acordou, o dinossauro ainda estava lá. (the Portuguese translation of the same sentence)
  – Q vector (45 elements): {(Cuan, 2.49), (l_di, 2.39), (stab, 2.39), ..., (saur, 2.31), (desp, 2.31), ..., (ando, 2.01), (avía, 1.95), (_all, 1.92)}
  – T vector (39 elements): {(va_l, 2.55), (rdou, 2.32), (stav, 2.32), ..., (saur, 2.24), (noss, 2.18), ..., (auro, 1.91), (ando, 1.88), (do_a, 1.77)}
• How can such vectors be numerically compared?
• Some equations:
Comparing blindLight doc vectors
  Q = {(k_Q1, w_Q1), (k_Q2, w_Q2), …, (k_Qm, w_Qm)}
  T = {(k_T1, w_T1), (k_T2, w_T2), …, (k_Tn, w_Tn)}          (document vectors)

  S_Q = Σ_{i=1…m} w_Qi
  S_T = Σ_{i=1…n} w_Ti                                       (document total significance)

  Q Ω T = {(k_x, w_x) | k_x = k_Qi = k_Tj, w_x = min(w_Qi, w_Tj),
           1 ≤ i ≤ m, 1 ≤ j ≤ n}                             (intersected document vector)

  S_QΩT = Σ w_x                                              (intersected document vector total significance)

  Π = S_QΩT / S_Q
  Ρ = S_QΩT / S_T                                            (Pi and Rho, "asymmetric" similarity measures)
Comparing blindLight doc vectors (cont.)
• The dinosaur is still here…

  Q doc vector (S_Q = 97.52):
    Cuan 2.49, l_di 2.39, stab 2.39, …, saur 2.31, desp 2.31, …, ando 2.01, avía 1.95, _all 1.92

  T doc vector (S_T = 81.92):
    va_l 2.55, rdou 2.32, stav 2.32, …, saur 2.24, noss 2.18, …, auro 1.91, ando 1.88, do_a 1.77

  Q Ω T vector (S_QΩT = 20.48):
    saur 2.24, inos 2.18, uand 2.12, _est 2.09, dino 2.02, _din 2.02, esta 2.01, ndo_ 1.98, a_es 1.94, ando 1.88

  Pi  = S_QΩT / S_Q = 20.48 / 97.52 = 0.21
  Rho = S_QΩT / S_T = 20.48 / 81.92 = 0.25
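The Ω-intersection and the Π/Ρ pair can be sketched as below. Taking min() as the intersected weight is inferred from the saur and ando values in the worked example (2.24 and 1.88, the smaller of each pair):

```python
def omega_intersect(q, t):
    """Omega-intersection: n-grams present in both vectors keep the smaller
    of their two significances (as saur and ando do in the example)."""
    return {k: min(w, t[k]) for k, w in q.items() if k in t}

def pi_rho(q, t):
    """Asymmetric similarity pair: Pi = S_QOT / S_Q, Rho = S_QOT / S_T."""
    s_qt = sum(omega_intersect(q, t).values())
    return s_qt / sum(q.values()), s_qt / sum(t.values())
```

Note the asymmetry: swapping the arguments swaps Pi and Rho, since the intersection itself is symmetric but the two normalizing totals differ.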
Clustering case study: "Genetic" classification of languages
• The relation between ancestor and descendant languages is usually called a genetic relationship.
• Such relationships are displayed as a tree of language families.
• The comparative method looks for regular (i.e. systematic) correspondences in the lexicon, allowing linguists to propose hypotheses about genetic relationships.
• Languages are subject not only to systematic changes but also to random ones, so the comparative method is "sensitive to noise", especially when studying languages that diverged more than 10,000 years ago.
• Joseph H. Greenberg developed the so-called "mass lexical comparison" method, which compares large samples of equivalent words across languages.
• Our experiment is quite similar to this mass comparison method and to the work done by Stephen Huffman using the Acquaintance technique.
Clustering case study: "Genetic" classification of languages (cont.)
• Two different kinds of linguistic data:
  – Orthographic version of the first three chapters of the Book of Genesis.
  – Phonetic transcriptions of "The North Wind and the Sun".
• The similarity measure used to compare document vectors was 0.5·Π + 0.5·Ρ.
• The clustering algorithm was similar to Jarvis-Patrick.
• Both resulting trees are coherent with each other and consistent with linguistic theories.
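The slides only say the algorithm was *similar* to Jarvis-Patrick, so as a hedged illustration, here is a plain shared-nearest-neighbour Jarvis-Patrick scheme; `k` and `kmin` are illustrative parameters, not the ones actually used:

```python
def jarvis_patrick(items, sim, k=4, kmin=2):
    """Jarvis-Patrick sketch: two items are linked when each appears in the
    other's k-NN list and they share at least kmin of those k neighbours;
    clusters are the connected components of the resulting link graph."""
    n = len(items)
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sim(items[i], items[j]), reverse=True)
        knn.append(set(order[:k]))
    # union-find over the mutual-neighbour link graph
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if i in knn[j] and j in knn[i] and len(knn[i] & knn[j]) >= kmin:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())
```

With `sim` set to a Π/Ρ combination over language document vectors, the connected components would correspond to language families.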
[Dendrogram: clustering using phonetic data — Catalan, French, Portuguese, Galician, Spanish, English, Dutch, German, Swedish]

[Dendrogram: clustering using orthographic data — Faroese, Swedish, Danish, Norwegian, English, Dutch, German, Catalan, French, Italian, Portuguese, Spanish, Basque, Finnish]
Categorization case study: Language identification
• Categorization using blindLight is straightforward:
  – Each category vector is compared with the document;
  – the greater the similarity, the more likely the membership.
• Using the previous experiment's results, the category vectors below were built to develop a language identifier. Many of them are "artificial", obtained by intersecting several language vectors.
• The language identifier's operation is simple. Suppose an English sample of text:
  – It is compared against Basque, Finnish, Italic, northGermanic, and westGermanic.
  – The most likely category is westGermanic, so…
  – …it is compared against Dutch-German and English.
  – The most likely is English, which is a final category.
Category tree:
• Basque
• Finnish
• Italic: Catalan-French (Catalan, French), Italian, Portuguese-Spanish (Portuguese, Spanish)
• northGermanic: Danish-Swedish (Danish, Swedish), Faroese, Norwegian
• westGermanic: Dutch-German (Dutch, German), English
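The hierarchical walk described above can be sketched as follows. `TREE`, `identify`, and the `similarity` callback are illustrative names for this sketch, not the actual implementation:

```python
# Category tree mirroring the slide; names absent from TREE are final (leaf)
# categories such as English or Basque.
TREE = {
    "root": ["Basque", "Finnish", "Italic", "northGermanic", "westGermanic"],
    "Italic": ["Catalan-French", "Italian", "Portuguese-Spanish"],
    "Catalan-French": ["Catalan", "French"],
    "Portuguese-Spanish": ["Portuguese", "Spanish"],
    "northGermanic": ["Danish-Swedish", "Faroese", "Norwegian"],
    "Danish-Swedish": ["Danish", "Swedish"],
    "westGermanic": ["Dutch-German", "English"],
    "Dutch-German": ["Dutch", "German"],
}

def identify(sample_vector, category_vectors, similarity):
    """At each level pick the most similar category and descend until a
    final (leaf) category is reached."""
    node = "root"
    while node in TREE:
        node = max(TREE[node],
                   key=lambda c: similarity(sample_vector, category_vectors[c]))
    return node
```

For an English sample the walk visits root → westGermanic → English, exactly the sequence described on the slide.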
Categorization case study: Language identification (cont.)
• Preliminary results using 1,500 posts from:
  – soc.culture.basque
  – soc.culture.catalan
  – soc.culture.french
  – soc.culture.galiza (Galician is not "known" by the identifier).
  – soc.culture.german
• Posts were submitted in raw form, including the whole header, to check "noise tolerance".
• It was found that actual samples of around 200 characters can be identified in spite of lengthy headers (500 to 900 characters).
• Results for Galician:
  – As with the rest of the groups: plenty of spam (i.e. English posts).
  – Most of the posts were written in Spanish.
  – Posts actually written in Galician: 63% identified as Portuguese, 37% as Spanish. Graceful degradation?
• Results for other languages:
Newsgroup            Languages found in the sample posts            Target language   Accuracy
soc.culture.basque   Spanish 96.87%, Basque 2.19%, English 0.94%    Basque            100%
soc.culture.catalan  Catalan 51.63%, Spanish 48.37%                 Catalan           98.44%
soc.culture.french   English 73.85%, French 25.23%, German 0.92%    French            97.56%
soc.culture.german   German 50.35%, English 48.94%, French 0.71%    German            97.18%
Information Retrieval using blindLight

• Π (Pi) and Ρ (Rho) can be linearly combined into different association measures to perform IR.
• Just two tested up to now: Π alone and "piro" (which performs slightly better).
• IR with blindLight is pretty easy:
  1. For each document within the dataset, a 4-gram vector is computed and stored.
  2. When a query is submitted to the system:
     a) A 4-gram vector (Q) is computed for the query text.
     b) For each doc vector (T):
        i. Q and T are Ω-intersected, obtaining Π and Ρ values.
        ii. Π and Ρ are combined into a unique association measure (e.g. piro).
     c) A reverse-ordered list of documents is built and returned to answer the query.
• Features and issues:
  – No indexing phase. Documents can be added at any moment.
  – Comparing each query with every document is not really feasible with big data sets.
Note on piro: Ρ, and thus Π·Ρ, values are negligible when compared to Π. The norm function scales Π·Ρ values into the range of Π values.
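The retrieval loop can be sketched as below. Since the slides only say that norm scales Π·Ρ into the range of Π, the geometric-mean stand-in used here for `piro` is purely hypothetical:

```python
def retrieve(query_text, doc_vectors, make_vector, pi_rho, rank="piro"):
    """blindLight-style retrieval sketch: no index is built; the query vector
    is Omega-compared against every stored document vector and documents are
    returned by decreasing association score."""
    q = make_vector(query_text)
    scored = []
    for doc_id, t in doc_vectors.items():
        pi, rho = pi_rho(q, t)
        # Hypothetical norm: the geometric mean pulls Pi*Rho back up into
        # the range of Pi (the slides do not give the actual formula).
        score = pi if rank == "pi" else pi + (pi * rho) ** 0.5
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

Because every query is compared with every document, the loop is linear in collection size, which is exactly the scalability issue the slide points out.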
Bilingual IR with blindLight
• INGREDIENTS: Two aligned parallel corpora. Languages S(ource) and T(arget).
• METHOD:
  – Take the original query written in natural language S (queryS).
  – Chop the original query into chunks of 1, 2, …, L words.
  – Find in the S corpus the sentences containing each of these chunks. Start with the longest chunks and, once you have found sentences for a chunk, delete its subchunks.
  – Replace each of these S sentences with its equivalent T sentence.
  – Compute an n-gram vector for every T sentence and Ω-intersect all the vectors for each chunk.
  – Mix all the Ω-intersected n-gram vectors into a unique query vector (queryT).
  – Voilà! You have obtained a vector for a hypothetical queryT without having translated queryS.
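The recipe above can be sketched as follows. The "mix" rule (keeping the maximum weight when chunks contribute the same n-gram) and the substring-based subchunk test are assumptions of this sketch:

```python
def pseudo_translate_query(query_s, corpus_s, corpus_t, make_vector, max_len=3):
    """Bilingual query-vector sketch over an aligned corpus, where
    corpus_s[i] is the source sentence aligned with corpus_t[i]."""
    words = query_s.lower().split()
    chunks = []
    for L in range(max_len, 0, -1):              # longest chunks first
        for i in range(len(words) - L + 1):
            chunks.append(" ".join(words[i:i + L]))
    covered, query_t = set(), {}
    for chunk in chunks:
        if any(chunk in c for c in covered):     # a superchunk already matched
            continue
        hits = [corpus_t[i] for i, s in enumerate(corpus_s)
                if chunk in s.lower()]
        if not hits:
            continue
        covered.add(chunk)
        # Omega-intersect the vectors of all aligned target sentences:
        # keep only n-grams common to every hit, with their minimum weight.
        vec = make_vector(hits[0])
        for h in hits[1:]:
            hv = make_vector(h)
            vec = {k: min(w, hv[k]) for k, w in vec.items() if k in hv}
        # mix into the target-language query vector (assumed: max weight)
        for k, w in vec.items():
            query_t[k] = max(query_t.get(k, 0), w)
    return query_t
```

The intersection step is what filters out sentence-specific noise such as n-grams from "international" or "European", leaving mostly the n-grams of the chunk's actual translation.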
For instance, using EuroParl:

Original query (Spanish): Encontrar documentos en los que se habla de las discusiones sobre la reforma de instituciones financieras y, en particular, del Banco Mundial y del FMI durante la cumbre de los G7 que se celebró en Halifax en 1995. ("Find documents discussing the debates on the reform of financial institutions and, in particular, of the World Bank and the IMF during the G7 summit held in Halifax in 1995.")

Chunks: Encontrar / Encontrar documentos / Encontrar documentos en... / instituciones / instituciones financieras / instituciones financieras y...

Matching S sentences:
(1315) …mantiene excelentes relaciones con las instituciones financieras internacionales.
(5865) …el fortalecimiento de las instituciones financieras internacionales…
(6145) La Comisión deberá estudiar un mecanismo transparente para que las instituciones financieras europeas…

Aligned T sentences:
(1315) …has excellent relationships with the international financial institutions…
(5865) …strengthening international financial institutions…
(6145) The Commission will have to look at a transparent mechanism so that the European financial institutions…

instituciones financieras → {al_i, anci, atio, cial, _fin, fina, ial_, inan_, _ins, inst, ions, itut, l_in, nanc, ncia, nsti, stit, tion, titu, tuti, utio}
[Figure annotations on the resulting n-gram set: nice translated n-grams; nice un-translated n-grams; not-really-nice un-translated n-grams; definitely-not-nice "noise" n-grams]
• We have compared n-gram vectors for pseudo-translations with vectors for actual translations (source: Spanish, target: English):
  – 38.59% of the n-grams within pseudo-translated vectors are also present in actual-translation vectors.
  – 28.31% of the n-grams within actual-translation vectors are present in pseudo-translated ones.
• A promising technique, but thorough work is still required.
Information Retrieval Results
• Experiments with small collections:
  – CACM (3,204 docs and 64 queries).
  – CISI (1,460 docs and 112 queries).
  – Results similar to those achieved by several systems, but not as good as those reached by SMART, for instance.
• CLEF 2004 results:
  – Monolingual IR within Russian documents: 72 documents found out of 123 relevant ones, average precision 0.14.
  – Bilingual IR using Spanish to query English docs: 145 documents found out of 375 relevant ones, average precision 0.06.
• However, blindLight does not apply:
  – Stop-word removal.
  – Stemming.
  – Query term weighting.
• Problems arise especially with broad topics.
[Interpolated precision-recall graphs: CACM (pi ranking), CACM (piro ranking), CISI (pi ranking), CISI (piro ranking)]
Conclusions
• Genetic classification of languages (clustering) using blindLight:
  – Coherent results for both orthographic and phonetic input.
  – Results are also consistent with linguistic theories.
  – Results were useful to develop language identifiers.
• Language identification (categorization) using blindLight:
  – Accuracy higher than 97%.
  – Information-to-noise ratio around 2/7.
• Information retrieval performance must be improved; however, it is:
  – Language independent.
  – Straightforward for bilingual IR.
• To sum up, blindLight is an extremely simple technique which appears to be flexible enough to be applied to a wide range of NLP tasks, showing adequate performance in all of them.
One Size Fits All? A Simple Technique to Perform Several NLP Tasks
Daniel Gayo-Avello (University of Oviedo)
Thank you!