One Size Fits All? A Simple Technique to Perform Several NLP Tasks
Daniel Gayo-Avello (University of Oviedo)
Feb 03, 2016
Introduction

• blindLight is a modified vector model with applications to several NLP tasks:
  – Automatic summarization,
  – categorization,
  – clustering, and
  – information retrieval.
Vector Model vs. blindLight Model

Vector Model
• Document → D-dimensional vector of terms.
• D → number of distinct terms within the whole collection of documents.
• Terms → words/stems/character n-grams.
• Term weights → function of "in-document" term frequency, "in-collection" term frequency, and document length.
• Association measures → symmetric: Dice, Jaccard, Cosine, …
Issues
• Document vectors ≠ document representations…
• … but document representations with regards to the whole collection.
• Curse of dimensionality → feature reduction.
• Feature reduction when using n-grams as terms → ad hoc thresholds.
blindLight Model
• Different documents → different-length vectors.
• D? No collection → no D → no vector space!
• Terms → just character n-grams.
• Term weights → in-document n-gram significance (a function of just the document's term frequencies).
• Similarity measure → asymmetric (in fact, two association measures). A kind of light pairwise alignment (A vs. B ≠ B vs. A).
Advantages
• Document vectors = unique document representations…
• Suitable for ever-growing document sets.
• Bilingual IR is trivial.
• Highly tunable by linearly combining the two association measures.

Issues
• Not tuned yet, so…
• … poor performance with broad topics.
What's n-gram significance?
• Can we know how important an n-gram is within just one document, without regard to any external collection?
• A similar problem: extracting multiword items from text (e.g. European Union, Mickey Mouse, Cross Language Evaluation Forum).
• Solution by Ferreira da Silva and Pereira Lopes:
  – Several statistical measures generalized to be applied to arbitrary-length word n-grams.
  – A new measure: Symmetrical Conditional Probability (SCP), which outperforms the others.
• So, our proposal to answer the first question: if SCP reveals the most significant multiword items within just one document, it can be applied to rank the character n-grams of a document according to their significances.
• Equations for SCP:
• (w1…wn) is an n-gram. Suppose we use quad-grams and take (igni) from the text "What's n-gram significance". Its prefix/suffix splits are:
  – (w1…w1) / (w2…w4) = (i) / (gni)
  – (w1…w2) / (w3…w4) = (ig) / (ni)
  – (w1…w3) / (w4…w4) = (ign) / (i)
  – For instance, p((w1…w1)) = p((i)) would be computed from the relative frequency of appearance within the document of n-grams starting with i (e.g. (igni), (ific), or (ican)).
  – In turn, p((w4…w4)) = p((i)) would be computed from the relative frequency of appearance within the document of n-grams ending with i (e.g. (m_si), (igni), or (nifi)).
What's n-gram significance? (cont.)
  SCP_f((w1…wn)) = p((w1…wn))² / Avp

  Avp = (1 / (n−1)) · Σ_{i=1…n−1} p((w1…wi)) · p((wi+1…wn))
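Under the reconstruction above, the in-document SCP ranking of character quad-grams can be sketched in Python. The exact probability estimates used by blindLight may differ; treat the details here (spaces mapped to `_`, the 1/(n−1) averaging over the three splits) as assumptions:

```python
from collections import Counter

def quad_grams(text):
    """Split a text into overlapping character 4-grams (spaces kept as '_')."""
    text = text.replace(" ", "_")
    return [text[i:i + 4] for i in range(len(text) - 3)]

def scp_significances(text):
    """Rank the 4-grams of a single document by Symmetrical Conditional
    Probability, using only in-document frequencies (no external collection)."""
    grams = quad_grams(text)
    total = len(grams)
    freq = Counter(grams)
    # p of a prefix (w1..wi): relative frequency of 4-grams starting with it;
    # p of a suffix (wi+1..w4): relative frequency of 4-grams ending with it.
    prefix, suffix = Counter(), Counter()
    for g, f in freq.items():
        for i in range(1, 4):
            prefix[g[:i]] += f
            suffix[g[i:]] += f
    scores = {}
    for g, f in freq.items():
        p = f / total
        # Avp averages p(prefix)·p(suffix) over the three splits of a 4-gram.
        avp = sum((prefix[g[:i]] / total) * (suffix[g[i:]] / total)
                  for i in range(1, 4)) / 3
        scores[g] = p * p / avp
    return scores
```

Sorting the returned dictionary by descending score yields the significance-ranked vector for the document.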
What's n-gram significance? (cont.)
• The current implementation of blindLight uses quad-grams because…
  – they provide better results than tri-grams, and
  – their significances are computed faster than those of n-grams with n ≥ 5.
• How would it work when mixing n-grams of different lengths within the same document vector? An interesting question to address in the future…
• Two example blindLight document vectors:
  – Q document: Cuando despertó, el dinosaurio todavía estaba allí. ("When it awoke, the dinosaur was still there.")
  – T document: Quando acordou, o dinossauro ainda estava lá. (the Portuguese translation of the same sentence)
  – Q vector (45 elements): {(Cuan, 2.49), (l_di, 2.39), (stab, 2.39), ..., (saur, 2.31), (desp, 2.31), ..., (ando, 2.01), (avía, 1.95), (_all, 1.92)}
  – T vector (39 elements): {(va_l, 2.55), (rdou, 2.32), (stav, 2.32), ..., (saur, 2.24), (noss, 2.18), ..., (auro, 1.91), (ando, 1.88), (do_a, 1.77)}
• How can such vectors be numerically compared?
• Some equations:
Comparing blindLight doc vectors
  Q = {(k_Q1, w_Q1), (k_Q2, w_Q2), …, (k_Qm, w_Qm)}
  T = {(k_T1, w_T1), (k_T2, w_T2), …, (k_Tn, w_Tn)}          (document vectors)

  S_Q = Σ_{i=1…m} w_Qi
  S_T = Σ_{i=1…n} w_Ti                                       (document total significance)

  Q Ω T = {(k_x, w_x) | k_x = k_Qi = k_Tj, w_x = min(w_Qi, w_Tj),
           1 ≤ i ≤ m, 1 ≤ j ≤ n}                             (intersected document vector)

  S_QΩT = Σ w_x                                              (intersected document vector total significance)

  Π = S_QΩT / S_Q
  Ρ = S_QΩT / S_T                                            (Pi and Rho, "asymmetric" similarity measures)
Comparing blindLight doc vectors (cont.)
• The dinosaur is still here…

  Q doc vector (S_Q = 97.52):
    Cuan 2.49, l_di 2.39, stab 2.39, …, saur 2.31, desp 2.31, …, ando 2.01, avía 1.95, _all 1.92

  T doc vector (S_T = 81.92):
    va_l 2.55, rdou 2.32, stav 2.32, …, saur 2.24, noss 2.18, …, auro 1.91, ando 1.88, do_a 1.77

  Q Ω T vector (S_QΩT = 20.48):
    saur 2.24, inos 2.18, uand 2.12, _est 2.09, dino 2.02, _din 2.02, esta 2.01, ndo_ 1.98, a_es 1.94, ando 1.88

  Pi  = S_QΩT / S_Q = 20.48 / 97.52 = 0.21
  Rho = S_QΩT / S_T = 20.48 / 81.92 = 0.25
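The Ω-intersection and the Π/Ρ pair can be sketched as below. Taking min() as the intersected weight is inferred from the saur and ando values in the worked example (2.24 and 1.88, the smaller of each pair):

```python
def omega_intersect(q, t):
    """Omega-intersection: n-grams present in both vectors keep the smaller
    of their two significances (as saur and ando do in the example)."""
    return {k: min(w, t[k]) for k, w in q.items() if k in t}

def pi_rho(q, t):
    """Asymmetric similarity pair: Pi = S_QOT / S_Q, Rho = S_QOT / S_T."""
    s_qt = sum(omega_intersect(q, t).values())
    return s_qt / sum(q.values()), s_qt / sum(t.values())
```

Note the asymmetry: swapping the arguments swaps Pi and Rho, since the intersection itself is symmetric but the two normalizing totals differ.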
Clustering case study: "Genetic" classification of languages
• The relation between ancestor and descendant languages is usually called a genetic relationship.
• Such relationships are displayed as a tree of language families.
• The comparative method looks for regular (i.e. systematic) correspondences in the lexicon, allowing linguists to propose hypotheses about genetic relationships.
• Languages are subject not only to systematic changes but also to random ones, so the comparative method is "sensitive to noise", especially when studying languages that diverged more than 10,000 years ago.
• Joseph H. Greenberg developed the so-called "mass lexical comparison" method, which compares large samples of equivalent words across languages.
• Our experiment is quite similar to this mass comparison method and to the work done by Stephen Huffman using the Acquaintance technique.
Clustering case study: "Genetic" classification of languages (cont.)
• Two different kinds of linguistic data:
  – Orthographic version of the first three chapters of the Book of Genesis.
  – Phonetic transcriptions of "The North Wind and the Sun".
• The similarity measure used to compare document vectors was 0.5·Π + 0.5·Ρ.
• The clustering algorithm was similar to Jarvis-Patrick.
• Both resulting trees are coherent with each other and consistent with linguistic theories.
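The slides only say the algorithm was *similar* to Jarvis-Patrick, so as a hedged illustration, here is a plain shared-nearest-neighbour Jarvis-Patrick scheme; `k` and `kmin` are illustrative parameters, not the ones actually used:

```python
def jarvis_patrick(items, sim, k=4, kmin=2):
    """Jarvis-Patrick sketch: two items are linked when each appears in the
    other's k-NN list and they share at least kmin of those k neighbours;
    clusters are the connected components of the resulting link graph."""
    n = len(items)
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sim(items[i], items[j]), reverse=True)
        knn.append(set(order[:k]))
    # union-find over the mutual-neighbour link graph
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if i in knn[j] and j in knn[i] and len(knn[i] & knn[j]) >= kmin:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())
```

With `sim` set to a Π/Ρ combination over language document vectors, the connected components would correspond to language families.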
[Dendrogram: clustering using phonetic data — Catalan, French, Portuguese, Galician, Spanish, English, Dutch, German, Swedish]

[Dendrogram: clustering using orthographic data — Faroese, Swedish, Danish, Norwegian, English, Dutch, German, Catalan, French, Italian, Portuguese, Spanish, Basque, Finnish]
Categorization case study: Language identification
• Categorization using blindLight is straightforward:
  – Each category vector is compared with the document;
  – the greater the similarity, the more likely the membership.
• Using the previous experiment's results, the category vectors below were built to develop a language identifier. Many of them are "artificial", obtained by intersecting several language vectors.
• The language identifier's operation is simple. Suppose an English sample of text:
  – It is compared against Basque, Finnish, Italic, northGermanic, and westGermanic.
  – The most likely category is westGermanic, so…
  – …it is compared against Dutch-German and English.
  – The most likely is English, which is a final category.
Category tree:
• Basque
• Finnish
• Italic: Catalan-French (Catalan, French), Italian, Portuguese-Spanish (Portuguese, Spanish)
• northGermanic: Danish-Swedish (Danish, Swedish), Faroese, Norwegian
• westGermanic: Dutch-German (Dutch, German), English
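The hierarchical walk described above can be sketched as follows. `TREE`, `identify`, and the `similarity` callback are illustrative names for this sketch, not the actual implementation:

```python
# Category tree mirroring the slide; names absent from TREE are final (leaf)
# categories such as English or Basque.
TREE = {
    "root": ["Basque", "Finnish", "Italic", "northGermanic", "westGermanic"],
    "Italic": ["Catalan-French", "Italian", "Portuguese-Spanish"],
    "Catalan-French": ["Catalan", "French"],
    "Portuguese-Spanish": ["Portuguese", "Spanish"],
    "northGermanic": ["Danish-Swedish", "Faroese", "Norwegian"],
    "Danish-Swedish": ["Danish", "Swedish"],
    "westGermanic": ["Dutch-German", "English"],
    "Dutch-German": ["Dutch", "German"],
}

def identify(sample_vector, category_vectors, similarity):
    """At each level pick the most similar category and descend until a
    final (leaf) category is reached."""
    node = "root"
    while node in TREE:
        node = max(TREE[node],
                   key=lambda c: similarity(sample_vector, category_vectors[c]))
    return node
```

For an English sample the walk visits root → westGermanic → English, exactly the sequence described on the slide.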
Categorization case study: Language identification (cont.)
• Preliminary results using 1,500 posts from:
  – soc.culture.basque
  – soc.culture.catalan
  – soc.culture.french
  – soc.culture.galiza (Galician is not "known" by the identifier).
  – soc.culture.german
• Posts were submitted in raw form, including the whole header, to check "noise tolerance".
• It was found that actual samples of around 200 characters can be identified in spite of lengthy headers (500 to 900 characters).
• Results for Galician:
  – As with the rest of the groups: plenty of spam (i.e. English posts).
  – Most of the posts were written in Spanish.
  – Posts actually written in Galician: 63% identified as Portuguese, 37% as Spanish. Graceful degradation?
• Results for other languages:
Newsgroup            Languages found in the sample posts            Target language   Accuracy
soc.culture.basque   Spanish 96.87%, Basque 2.19%, English 0.94%    Basque            100%
soc.culture.catalan  Catalan 51.63%, Spanish 48.37%                 Catalan           98.44%
soc.culture.french   English 73.85%, French 25.23%, German 0.92%    French            97.56%
soc.culture.german   German 50.35%, English 48.94%, French 0.71%    German            97.18%
Information Retrieval using blindLight

• Π (Pi) and Ρ (Rho) can be linearly combined into different association measures to perform IR.
• Just two tested up to now: Π alone and "piro" (which performs slightly better).
• IR with blindLight is pretty easy:
  1. For each document within the dataset, a 4-gram vector is computed and stored.
  2. When a query is submitted to the system:
     a) A 4-gram vector (Q) is computed for the query text.
     b) For each doc vector (T):
        i. Q and T are Ω-intersected, obtaining Π and Ρ values.
        ii. Π and Ρ are combined into a unique association measure (e.g. piro).
     c) A reverse-ordered list of documents is built and returned to answer the query.
• Features and issues:
  – No indexing phase. Documents can be added at any moment.
  – Comparing each query with every document is not really feasible with big data sets.
Note on piro: Ρ, and thus Π·Ρ, values are negligible when compared to Π. The norm function scales Π·Ρ values into the range of Π values.
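The retrieval loop can be sketched as below. Since the slides only say that norm scales Π·Ρ into the range of Π, the geometric-mean stand-in used here for `piro` is purely hypothetical:

```python
def retrieve(query_text, doc_vectors, make_vector, pi_rho, rank="piro"):
    """blindLight-style retrieval sketch: no index is built; the query vector
    is Omega-compared against every stored document vector and documents are
    returned by decreasing association score."""
    q = make_vector(query_text)
    scored = []
    for doc_id, t in doc_vectors.items():
        pi, rho = pi_rho(q, t)
        # Hypothetical norm: the geometric mean pulls Pi*Rho back up into
        # the range of Pi (the slides do not give the actual formula).
        score = pi if rank == "pi" else pi + (pi * rho) ** 0.5
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

Because every query is compared with every document, the loop is linear in collection size, which is exactly the scalability issue the slide points out.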
Bilingual IR with blindLight
• INGREDIENTS: Two aligned parallel corpora. Languages S(ource) and T(arget).
• METHOD:
  – Take the original query written in natural language S (queryS).
  – Chop the original query into chunks of 1, 2, …, L words.
  – Find in the S corpus the sentences containing each of these chunks. Start with the longest chunks and, once you have found sentences for a chunk, delete its subchunks.
  – Replace each of these S sentences with its equivalent T sentence.
  – Compute an n-gram vector for every T sentence and Ω-intersect all the vectors for each chunk.
  – Mix all the Ω-intersected n-gram vectors into a unique query vector (queryT).
  – Voilà! You have obtained a vector for a hypothetical queryT without having translated queryS.
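The recipe above can be sketched as follows. The "mix" rule (keeping the maximum weight when chunks contribute the same n-gram) and the substring-based subchunk test are assumptions of this sketch:

```python
def pseudo_translate_query(query_s, corpus_s, corpus_t, make_vector, max_len=3):
    """Bilingual query-vector sketch over an aligned corpus, where
    corpus_s[i] is the source sentence aligned with corpus_t[i]."""
    words = query_s.lower().split()
    chunks = []
    for L in range(max_len, 0, -1):              # longest chunks first
        for i in range(len(words) - L + 1):
            chunks.append(" ".join(words[i:i + L]))
    covered, query_t = set(), {}
    for chunk in chunks:
        if any(chunk in c for c in covered):     # a superchunk already matched
            continue
        hits = [corpus_t[i] for i, s in enumerate(corpus_s)
                if chunk in s.lower()]
        if not hits:
            continue
        covered.add(chunk)
        # Omega-intersect the vectors of all aligned target sentences:
        # keep only n-grams common to every hit, with their minimum weight.
        vec = make_vector(hits[0])
        for h in hits[1:]:
            hv = make_vector(h)
            vec = {k: min(w, hv[k]) for k, w in vec.items() if k in hv}
        # mix into the target-language query vector (assumed: max weight)
        for k, w in vec.items():
            query_t[k] = max(query_t.get(k, 0), w)
    return query_t
```

The intersection step is what filters out sentence-specific noise such as n-grams from "international" or "European", leaving mostly the n-grams of the chunk's actual translation.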
For instance, using EuroParl:

Original query (Spanish): Encontrar documentos en los que se habla de las discusiones sobre la reforma de instituciones financieras y, en particular, del Banco Mundial y del FMI durante la cumbre de los G7 que se celebró en Halifax en 1995. ("Find documents discussing the debates on the reform of financial institutions and, in particular, of the World Bank and the IMF during the G7 summit held in Halifax in 1995.")

Chunks: Encontrar / Encontrar documentos / Encontrar documentos en... / instituciones / instituciones financieras / instituciones financieras y...

Matching S sentences:
(1315) …mantiene excelentes relaciones con las instituciones financieras internacionales.
(5865) …el fortalecimiento de las instituciones financieras internacionales…
(6145) La Comisión deberá estudiar un mecanismo transparente para que las instituciones financieras europeas…

Aligned T sentences:
(1315) …has excellent relationships with the international financial institutions…
(5865) …strengthening international financial institutions…
(6145) The Commission will have to look at a transparent mechanism so that the European financial institutions…

instituciones financieras → {al_i, anci, atio, cial, _fin, fina, ial_, inan_, _ins, inst, ions, itut, l_in, nanc, ncia, nsti, stit, tion, titu, tuti, utio}
[Figure annotations on the resulting n-gram set: nice translated n-grams; nice un-translated n-grams; not-really-nice un-translated n-grams; definitely-not-nice "noise" n-grams]
• We have compared n-gram vectors for pseudo-translations with vectors for actual translations (source: Spanish, target: English):
  – 38.59% of the n-grams within pseudo-translated vectors are also present in actual-translation vectors.
  – 28.31% of the n-grams within actual-translation vectors are present in pseudo-translated ones.
• A promising technique, but thorough work is still required.
Information Retrieval Results
• Experiments with small collections:
  – CACM (3,204 docs and 64 queries).
  – CISI (1,460 docs and 112 queries).
  – Results similar to those achieved by several systems, but not as good as those reached by SMART, for instance.
• CLEF 2004 results:
  – Monolingual IR within Russian documents: 72 documents found out of 123 relevant ones, average precision 0.14.
  – Bilingual IR using Spanish to query English docs: 145 documents found out of 375 relevant ones, average precision 0.06.
• However, blindLight does not apply:
  – Stop-word removal.
  – Stemming.
  – Query term weighting.
• Problems arise especially with broad topics.
[Interpolated precision-recall graphs: CACM (pi ranking), CACM (piro ranking), CISI (pi ranking), CISI (piro ranking)]
Conclusions
• Genetic classification of languages (clustering) using blindLight:
  – Coherent results for both orthographic and phonetic input.
  – Results are also consistent with linguistic theories.
  – Results were useful to develop language identifiers.
• Language identification (categorization) using blindLight:
  – Accuracy higher than 97%.
  – Information-to-noise ratio around 2/7.
• Information retrieval performance must be improved; however, it is:
  – Language independent.
  – Straightforward for bilingual IR.
• To sum up, blindLight is an extremely simple technique which appears to be flexible enough to be applied to a wide range of NLP tasks, showing adequate performance in all of them.
One Size Fits All? A Simple Technique to Perform Several NLP Tasks
Daniel Gayo-Avello (University of Oviedo)
Thank you!