Using association rules for query reformulation

Ismaïl Biskri¹, Louis Rompré²

¹ Department of Mathematics and Computer Science, University of Quebec at Trois-Rivieres, Canada
² Department of Computer Science, University of Quebec at Montreal, Canada

ABSTRACT

In this paper we present research on the combination of two data mining methods: text classification and maximal association rules. Text classification has long been the focus of many researchers. However, its results take the form of lists of words (classes) that users often do not know what to do with. The use of maximal association rules brings a number of advantages: (i) the detection of dependencies and correlations between the relevant units of information (words) of different classes, and (ii) the extraction of hidden, often relevant, knowledge from a large volume of data. We show how this combination can improve the process of information retrieval.

INTRODUCTION

The ever-increasing penetration of the internet and the growing size of electronic documents have made information retrieval a major scientific discipline in computer science, all while access to relevant information has become difficult: the informational tide is occasionally reduced to nothing more than noise.

Information retrieval consists of selecting, from a document database, the documents or segments of text likely to respond to the needs of a user. This operation is carried out with numerical tools that are sometimes combined with linguistic tools, in order to refine the granularity of the results from certain points of view (Desclés, Djioua, 2009), with logical tools in a question-answer format, or with tools proper to the Semantic Web. However, we knowingly omit a presentation of the contributions of these linguistic, logical, and Semantic Web methods, so as not to weigh down this chapter, since we are primarily interested in the numerical side.

Formally, three main elements stand out in information retrieval:

(i) The group of documents.
(ii) The information needs of the users.
(iii) The relevance of the documents or segments of text that an information retrieval system returns given the needs expressed by the user.

The last two aspects necessarily rely on the user. Not only does the user define their needs, but they also validate the relevance of the documents returned. To express their needs, a user formulates a query that often (but not always) takes the form of keywords submitted to an information retrieval system based on a Boolean, vector, or probabilistic model (Boughanem, Savoy, 2008).


However, it is often difficult for a user to find keywords that express their exact needs. In many cases, the user is confronted, on the one hand, by a lack of knowledge of the subject of their information search and, on the other hand, by results that may be biased, as is the case with search engines on the Web. Retrieving relevant documents on the first search is thus almost impossible. There is therefore a need to reformulate the query, either by using completely different keywords or by expanding the initial query with new keywords (El Amrani & al., 2004). In the case of expanding the query, two variants are possible:

(i) The first is manual. The user chooses terms judged relevant in documents that are themselves judged relevant, in order to strengthen the query. This strategy is simple and the least computationally costly. However, it does not give a general view of the group of documents returned by the retrieval system: their number is too large for a human to review. Quite often, the user only consults, and only judges, the first few documents.

(ii) The second is semi-automatic. The terms added to the initial query are chosen by the user from a thesaurus (which may be constructed manually) or from similarity classes of documents and co-occurrences of terms, obtained by applying a classification to the group of documents returned for the initial query, as in clustering engines. A process of classifying textual data from web sites can help the user of a search engine to better identify the target site or to better formulate a query. Indeed, the lexical units that co-occur with the keywords submitted to the search engine can provide more details concerning the documents to which access is desired. However, the interpretation of similarity classes is a nontrivial exercise. Similarity classes are usually presented as lists of words that occur together; these lists are often very large and their vocabulary is very noisy.

In this chapter we show how maximal association rules can improve the semi-automatic reformulation of a query in order to access target documents more quickly.

MAXIMAL ASSOCIATION RULES

A brief survey of the literature on data mining (Amir & Aumann, 2005) teaches us that association rules represent regularities in the co-occurrence of data (in the general sense of the term) in transactions, regardless of their nature. Data that regularly appear together are structured in so-called association rules. An association rule is expressed as X → Y and is read as follows: each time that X is encountered in a transaction, so is Y. There are also ways to measure the quality of these association rules: the measure of Support and the measure of Confidence.

The concept of an association rule emerges mainly in the late 1960s (Hajek & al., 1966) with the introduction of the concepts of support and confidence. Interest in this concept was revived in the 1990s through the work of Agrawal (Agrawal et al., 1993; Agrawal & Srikant, 1994) on the extraction of association rules in a database containing business transactions.

Currently, work is being done on how best to judge the relevance of association rules and the quality of their interpretation (Vaillant & Meyer, 2006; Lallich & Teytaud, 2003; Cherfi & Toussaint, 2002), as well as on their integration into information retrieval systems (Diop & Lo, 2007) and into classification processes for text mining (Cherfi & Napoli, 2005; Serp & al., 2008).

To illustrate association rules, consider the principal elements of the following example:

Three transactions that regroup co-occurring data: T1: {A, 1, K}; T2: {M, L, 2}; T3: {A, 1, 2}
Two sets that categorize the data: E1: {A, M, K, L}; E2: {1, 2}
X and Y, two separate sets of information units: X: {A}; Y: {1}, with X ⊆ E1 and Y ⊆ E2.


For a transaction Ti and a set of information units X, it is said that Ti supports X if X ⊆ Ti. The Support of X, noted S(X), is the number of transactions Ti such that X ⊆ Ti. In the case of transactions T1, T2 and T3, S(X) = S(A) = 2.

The Support of the association rule X → Y, noted S(X → Y), is the number of transactions that contain both X and Y. In our example, S(X → Y) = S(A → 1) = 2.

The Confidence of the association rule X → Y, noted C(X → Y), is the support of the rule divided by the Support of X: C(X → Y) = S(X → Y)/S(X). In our example, C(X → Y) = C(A → 1) = 1.
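As a quick check, the Support and Confidence of the running example can be computed with a short sketch (Python; the transaction and set names mirror the example above):

```python
# Transactions from the running example.
T = [{"A", "1", "K"}, {"M", "L", "2"}, {"A", "1", "2"}]

def support(itemset, transactions):
    """S(X): number of transactions containing every element of X."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(X, Y, transactions):
    """C(X -> Y) = S(X -> Y) / S(X), with S(X -> Y) = S(X u Y)."""
    return support(X | Y, transactions) / support(X, transactions)

X, Y = {"A"}, {"1"}
print(support(X, T))        # S(A) = 2
print(support(X | Y, T))    # S(A -> 1) = 2
print(confidence(X, Y, T))  # C(A -> 1) = 1.0
```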

Despite their potential, association rules cannot be established for less frequent associations; certain associations are ignored simply because they are infrequent. For example, if the word printer often appears with the word paper and less frequently with the word ink, it is very probable that the association between printer and paper will be retained to the detriment of the association between printer, paper and ink. In fact, the confidence criterion of the relationship between printer, paper and ink would be too low.

Maximal association rules, noted X →max Y, compensate for this limitation. They obey the following general principle: each time that X appears alone, Y also appears. X is said to appear alone in a transaction Ti if and only if, for the category set Ej with X ⊆ Ej, we have Ti ∩ Ej = X. In this case, X is maximal in Ti with regards to Ej, and Ti M-supports X. The M-support of X, noted Smax(X), is the number of transactions Ti that M-support X.

In the transaction T1, X is not alone with regards to E1, since T1 ∩ E1 = {A, K}. On the other hand, in the transaction T3, X is alone, since T3 ∩ E1 = {A}.

The M-support of the maximal association X →max Y, noted Smax(X →max Y), is the number of transactions that M-support X and support Y. In our example, only the transaction T3 M-supports X, while T1 and T3 support Y. Consequently, Smax(A →max 1) = 1.

The M-confidence, noted Cmax(X →max Y), is the number of transactions that M-support X →max Y relative to the set of transactions that M-support X →max E2. The M-confidence of the rule X →max Y is thus calculated by the formula Cmax(X →max Y) = Smax(X →max Y)/Smax(X →max E2). For the association A →max 1, the M-confidence is found to be equal to 0.5.

Finally, it should be noted that minimum thresholds must be defined for the M-support of a maximal association, as well as for its M-confidence.
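The maximal measures for the same example can be sketched in the same way (Python). One caveat: the denominator Smax(X →max E2) is evaluated here as the number of transactions containing X, which is the reading that reproduces the worked values above (Smax = 1, Cmax = 0.5):

```python
# Running example: transactions and the category set E1.
T = [{"A", "1", "K"}, {"M", "L", "2"}, {"A", "1", "2"}]
E1 = {"A", "M", "K", "L"}

def m_supports(t, X, Ej):
    """Ti M-supports X iff X is alone in Ti with regards to Ej: Ti ∩ Ej = X."""
    return t & Ej == X

def m_support(X, Y, Ej, transactions):
    """Smax(X ->max Y): transactions that M-support X and support Y."""
    return sum(1 for t in transactions if m_supports(t, X, Ej) and Y <= t)

def m_confidence(X, Y, Ej, transactions):
    # Denominator: transactions containing X (matches the worked examples).
    return m_support(X, Y, Ej, transactions) / sum(1 for t in transactions if X <= t)

X, Y = {"A"}, {"1"}
print(m_support(X, Y, E1, T))     # Smax(A ->max 1) = 1 (only T3)
print(m_confidence(X, Y, E1, T))  # Cmax(A ->max 1) = 0.5
```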

REFORMULATING A QUERY FOR A SEARCH ENGINE

We now describe the main elements of our project, which is currently in progress. Although the project is still in its early stages, we have already designed and developed several components. Let us now look at the overall processing strategy, which is organised in four main phases.


Phase #1: An original query is formulated and submitted to a standard search engine, such as Google. This original query is most probably sub-optimal, for the reasons mentioned just above. But that is exactly our starting point: we only expect the user to have a relatively good idea of what she is looking for, not too vague but not too precise either. In fact, we expect this original query to subsume the exact information she is hoping to find on the Web. In a subsequent step, the user will be given the opportunity to reconsider her query in the light of the search results returned for her original query. These search results are typically Web sites containing textual information. Of course, the number of Web sites returned by the search engine may be very large; the user can set (in an arbitrary manner in the present state of our work) the maximum number of Web sites to consider.

Phase #2: We consider that each Web site represents a text segment (a domain of information). A set of Web sites thus forms a set of text segments which, taken altogether, can be regarded as a corpus. We then submit this corpus to the GRAMEXCO software (see next section), which helps us identify segments sharing lexical regularities. Assuming that such similarities correspond to content similarities between the Web sites, related Web sites will tend to be grouped in the same classes. In other words, the classes produced by GRAMEXCO will tend to contain Web pages about the same topic and, by the same token, will identify the lexical units that tend to co-occur within these topics. And this is where we have a first gain with respect to the original query: GRAMEXCO's results provide a list of candidate query terms that are related to the user's original query and that will help her formulate a new, more precise query. The classes of Web pages obtained at the end of this phase are considered as contextual information, relative to the user's original query, that will allow her to reformulate her query (if necessary).

Phase #3: Words in classes of co-occurring words can carry a more selective or complementary sense than the keywords used in the original query. For instance, such words will be more selective when the original keywords are polysemous or have multiple usages in different domains; they will be complementary when they broaden the subset of the WWW implied by the original query. Now, which of these new words should the user pick to reformulate her query? It is at this stage that the process of extracting association rules comes in, the aim being to offer the user a tool to select the words that most likely associate (according to the M-support and the M-confidence) with the keywords of the original query. The user will not have to go through all the classes of words, which is a non-trivial task.

At that point the user can formulate an updated query from which she will obtain, through the processing already presented in phases #1, #2 and #3, another set of classes containing similar Web pages, co-occurring words and maximal association rules. Again, at the end of this step, the user may discover new words that could guide her towards a more precise reformulation of her query. It is entirely up to the user to determine when the iterative query reformulation process ends.
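The iterative process described above can be summarized as a control-flow sketch. This is only an illustration of the loop; every callable in it is a hypothetical stand-in (for the search engine, GRAMEXCO, the rule extractor, and the user's judgment), not part of the actual system:

```python
# A minimal control-flow sketch of the iterative reformulation process
# (phases #1-#3). search, classify, extract_rules, satisfied and
# pick_words are all hypothetical stand-ins.
def reformulate(query, search, classify, extract_rules, satisfied, pick_words,
                max_rounds=5):
    for _ in range(max_rounds):
        pages = search(query)                  # phase #1: submit the query
        classes = classify(pages)              # phase #2: similarity classes
        rules = extract_rules(classes, query)  # phase #3: maximal rules
        if satisfied(pages):                   # the user judges the results
            return query
        query = query + pick_words(rules)      # the user adds associated words
    return query

# Toy run with trivial stubs: one round of expansion, then satisfaction.
q = reformulate(["risk"],
                search=lambda q: q,
                classify=lambda p: p,
                extract_rules=lambda c, q: ["management"],
                satisfied=lambda pages: len(pages) > 1,
                pick_words=lambda r: r)
print(q)  # ['risk', 'management']
```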


Figure 1: From the user’s original query to a satisfying query

IDENTIFICATION OF MAXIMAL ASSOCIATION RULES IN SIMILARITY CLASSES

GRAMEXCO (n-GRAMs in the EXtraction of knowledge (COnnaissance)) is our software tool, developed for the numerical classification of multimedia documents (Rompré et al., 2008), particularly text documents. The numerical classification takes place by way of a numerical classifier. The unit of information considered in GRAMEXCO is the character n-gram, the value of n being configurable.

The main objective is to provide the same processing chain regardless of the corpus language, with easily legible layouts in the presentation of the results. Recall that the use of character n-grams is not


recent. It was first used in work by Damashek (1995) on text analysis and work by Greffenstette (1995) on language identification. Interest in n-grams has today been extended to the domains of images (Laouamer et al., 2006) and musicology, particularly in locating refrains (Patel & Mundur, 2005). A character n-gram is defined here as a sequence of n characters: bigrams for n=2, trigrams for n=3, quadrigrams for n=4, etc. For example, in the word informatique the trigrams are: inf, nfo, for, orm, rma, mat, ati, tiq, iqu, que. We justify our choice of the character n-gram as the unit of information as follows: (i) cutting text into sequences of n consecutive characters is possible in most languages, and any approach must be adaptable to several languages because of the "multilingual" nature of the Web; (ii) it provides the necessary tolerance for a certain ratio of deformation or flexion of lexical units. The functioning of GRAMEXCO is not entirely automatic: the user chooses certain parameters according to their own objectives. GRAMEXCO takes a raw (non-indexed) text in UTF format as input. There are then three main steps in which the user can customize certain processes.
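Extracting character n-grams is straightforward; a minimal sketch reproducing the informatique example:

```python
def char_ngrams(word, n):
    """Character n-grams: every sequence of n consecutive characters
    (bigrams for n=2, trigrams for n=3, quadrigrams for n=4, ...)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("informatique", 3))
# ['inf', 'nfo', 'for', 'orm', 'rma', 'mat', 'ati', 'tiq', 'iqu', 'que']
```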

1. The first step consists of building a list of information units and of information domains (the parts of text to be compared for similarity). These two operations, carried out simultaneously, yield an output matrix giving the frequency of appearance of each information unit in each information domain. The information units may be bigrams, trigrams, quadrigrams, etc. Information domains are obtained through text segmentation, which may be done in words, phrases, paragraphs, documents, web sites, or simply in sections of text delimited by a character or a string of characters. The choice of the size of the n-gram and of the type of textual segment is determined by the user according to the goals of their analysis.
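The first step can be sketched as follows (Python; the toy strings stand in for information domains and are illustrative only):

```python
from collections import Counter

def ngram_matrix(segments, n):
    """Step 1: frequency of each character n-gram (information unit)
    in each segment (information domain)."""
    rows = [Counter(s[i:i + n] for i in range(len(s) - n + 1))
            for s in segments]
    vocab = sorted(set().union(*rows))  # all n-grams observed
    return vocab, [[row[g] for g in vocab] for row in rows]

vocab, M = ngram_matrix(["abab", "abc"], 2)
print(vocab)  # ['ab', 'ba', 'bc']
print(M)      # [[2, 1, 0], [1, 0, 1]]
```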

2. The second step consists of reducing the size of the matrix. This operation is indispensable given the important cost in resources that an overly large matrix would represent. During this step, the list of n-grams undergoes a trimming that corresponds to:
the elimination of n-grams whose frequency is below one threshold or above another threshold;
the elimination of specific n-grams selected from a list (for example, n-grams containing spaces or non-alphabetic characters);
the elimination of certain n-grams considered functional, such as suffixes.
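This trimming can be sketched on a toy matrix (Python; the thresholds and the elimination list correspond to the user-set parameters described above):

```python
def trim_ngrams(vocab, M, fmin, fmax, unwanted=()):
    """Step 2: keep only n-grams whose total frequency lies in [fmin, fmax],
    that are purely alphabetic (no spaces or other characters), and that
    are not on an explicit elimination list (e.g. functional n-grams)."""
    keep = [j for j, g in enumerate(vocab)
            if fmin <= sum(row[j] for row in M) <= fmax
            and g.isalpha() and g not in unwanted]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in M]

vocab, M = ["ab", "b c", "zz"], [[2, 1, 0], [1, 0, 5]]
vocab2, M2 = trim_ngrams(vocab, M, fmin=2, fmax=4)
print(vocab2, M2)  # ['ab'] [[2], [1]]
```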

3. In the third step, the classification process takes place. The classifier used here is the ART neural network (Meunier et al., 1997). The choice of classifier is not dictated by particular performance reasons, since performance comparison is not our objective; we could just as easily have chosen another classifier, which would admittedly have yielded different results. Such variations continue to be the focus of research, such as that presented in Turenne (2000).

At the end of this step, segments considered similar by the classifier are regrouped into similarity classes. Furthermore, the lexicon of these segments forms the vocabulary of the classes to which they belong.

Whether the goal is lexical disambiguation, the search for "conceptual" relationships, information retrieval, etc., the interpretation of similarity classes is not a trivial task. Similarity classes are generally presented as lists of co-occurring words; these lists are quite frequently lengthy and, despite organizational attempts, their vocabulary remains rather noisy.

The extraction of maximal association rules proves to be one of the most interesting processes for discovering the lexical associations relevant to making an informed decision. The classes obtained at the end of the classification operation will be the transactions of the process that extracts the maximal association rules. Finally, in order for the process to be carried out, it must be supervised by the user, who must first determine the word for which the most probable associations will be found.

To illustrate this step, let us posit the following scenario that will allow us to discover maximal association rules X →max Y based on the results of a classification.

The input of the classification is a text whose vocabulary represents a category set E1: {x, a, b, c, d, e, f}. The classification outputs classes with their respective lexicons: C1: {x, a, b, c}, C2: {a, c, d}, C3: {x, e, f, d}.

If the classes represent the transactions, the vocabulary of the input text represents a set E1 for categorizing the textual data (the vocabulary), from which the set X is chosen.

This being established, the extraction of maximal association rules is carried out in three steps:

1st step: choice of the set X. The user chooses, from a list of the elements of E1, the lexicon that will represent X. Let us assume for explanatory purposes that X = {x}.

2nd step: identification of the set Y and of the set E2. The identification of the category set E2, of which Y will be a subset, largely depends on the set X selected and on the classes of which X is a subset. In our illustration, X is included in C1 and in C3. Y may therefore be a subset either of {a, b, c} or of {e, f, d}. In other words, Y may represent one of the following subsets: {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}, {e}, {f}, {d}, {e, f}, {e, d}, {f, d}, {e, f, d}.

The measures of M-support and M-confidence will be calculated for these different possible values of Y. An iterative process would allow all of these possibilities to be tested. We may, however, limit the number of iterations in order to avoid a prohibitive computational cost, for example by fixing (via a parameter) the cardinality of the subset Y.

Let us suppose that Y = {a, c}. In order to construct E2, the respective categories of the elements a and c must first be established; these are obtained by uniting the classes that contain a (or c, respectively). Consequently, E2 = category(Y) = category({a, c}) is obtained by intersecting category(a) with category(c). Thus:

category(a) = {a, b, c} ∪ {a, c, d} = {a, b, c, d}

and

category(c) = {a, b, c} ∪ {a, c, d} = {a, b, c, d}

therefore:

E2 = category(Y) = category({a, c}) = category(a) ∩ category(c) = {a, b, c, d}

3rd step: once the sets E1, E2, X and Y, as well as the transactions, have been clearly identified, the measures may be calculated. Consider the association x →max {a, c}. Using the classes C1: {x, a, b, c}, C2: {a, c, d}, C3: {x, e, f, d} as transactions, and E2 = {a, b, c, d}, the M-support equals 1, since only class C1 contains X = {x} and Y = {a, c}, and the M-confidence equals 0.5, since two classes contain X while only one contains both X and Y.
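The three steps above can be checked with a short sketch (Python). Two reading choices are made explicit in comments: category(w) excludes X, and the measures are computed by the simple containment counts used in the worked example:

```python
# Similarity classes used as transactions (from the scenario above).
C = [{"x", "a", "b", "c"}, {"a", "c", "d"}, {"x", "e", "f", "d"}]
X, Y = {"x"}, {"a", "c"}

def category(w):
    # Union of the lexicons of the classes containing w, with X removed
    # (as in the worked example, where x does not appear in category(a)).
    return set().union(*(c for c in C if w in c)) - X

E2 = set.intersection(*(category(w) for w in Y))
print(sorted(E2))  # ['a', 'b', 'c', 'd']

# Counts as in the worked example: only C1 contains both X and Y,
# while two classes (C1 and C3) contain X.
smax = sum(1 for c in C if X <= c and Y <= c)
cmax = smax / sum(1 for c in C if X <= c)
print(smax, cmax)  # 1 0.5
```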


EXPERIMENTS

The whole of the theory presented here was implemented in C#. The results of the analyses are stored in XML databases. In the short term, we hope to graft on a more practical visualisation module that would permit the one-step capture of the set of associations of a given lexical unit.

The following experiments were applied to four corpora (three of them extracted from web sites). Two corpora are in French and two are in Arabic. The first corpus is a collection of interviews with directors of small and medium Quebecois businesses, conducted to learn about their perspectives on the notion of risk. The second corpus addresses the history of the reign of King Hassan II of Morocco. The third corpus (in Arabic) addresses the Organisation of the Petroleum Exporting Countries (OPEC). Finally, the fourth corpus (in Arabic) summarizes the biography of the American President, Barack Obama. The domains are sufficiently different to draw conclusions on the efficacy of the methodology. Note: we limit ourselves to showing just the maximal associations and the scores of each association (M-support and M-confidence). We assume that the reader is sufficiently familiar with classification methods, so we do not show the similarity classes.

1st experiment: The corpus, as mentioned above, addresses the perspectives of directors of small and medium Quebecois businesses with regards to the notion of risk. One of the constraints during the interviews was the obligation put on the directors to use the word risk whenever they deemed it necessary. In our experiments, this aspect is crucial, since we need to know which words are associated with risk in the discourse of the directors.

Thus, despite the presence of noisy data such as Pause and X, which were intentionally inserted into the text for ethical reasons (X represents the names of the people questioned) and to represent silences (Pause), interesting results were still obtained. For example:

Risk →max Project is an association found in 10 classes (M-support = 10) with a confidence of 100%.
Risk →max Management, Project is an association found in 7 classes (M-support = 7) with a confidence of 70%. In other words, 30% of the time, the word Risk is found in classes where Management and Project do not occur together.
Risk →max Management is an association found in 7 classes (M-support = 7) with a confidence of 70%.
Risk →max Product is an association found in 5 classes (M-support = 5) with a confidence of 50%.

The following table summarizes the results obtained:


X = Risk

Y | M-Support | M-Confidence
Decision, Product | 2 | 20%
Year | 2 | 20%
Markets, Price | 2 | 20%
Science | 3 | 30%
Interview, Studies | 3 | 30%
Function | 4 | 40%
Manner, Level | 5 | 50%
Product | 5 | 50%
Question | 6 | 60%
Interview, Risk | 6 | 60%
Level, X | 7 | 70%
Management | 7 | 70%
Management, Project | 7 | 70%
Project, Risks | 8 | 80%
X | 10 | 100%
Pause | 10 | 100%
Project, X | 10 | 100%
Pause, X | 10 | 100%
Project | 10 | 100%

Table 1: Results of the 1st Experiment

2nd experiment: For the second experiment, we chose a short 4-page text about the reign of King Hassan II. For this experiment, we intentionally chose to set the cardinality of the set Y equal to 1. For X = {Hassan}, we obtained the results summarized in Table 2.

Note that, for example, the association Hassan →max II is very strong: its confidence is 100%. Likewise for the associations Hassan →max Morocco and Hassan →max King; although their confidence is only 61.54%, this is sufficiently high to consider the two associations as maximal.

X = Hassan

Y | M-Support | M-Confidence
Doctor | 1 | 7.69 %
Professor | 1 | 7.69 %
Spain | 1 | 7.69 %
Tunisia | 1 | 7.69 %
Spanish | 2 | 15.38 %
Journalist | 3 | 23.08 %
History | 3 | 23.08 %
Prepare | 3 | 23.08 %
Title | 4 | 30.77 %
France | 5 | 38.46 %
Politics | 6 | 46.15 %
Year | 7 | 53.85 %
King | 8 | 61.54 %
Morocco | 8 | 61.54 %
II | 13 | 100 %

Table 2: Results of the 2nd Experiment


3rd experiment: For the third experiment, we chose an Arabic text regarding the Organisation of the Petroleum Exporting Countries (OPEC), the goal being to evaluate the validity of the method for the Arabic language. For the purposes of the experiment, we chose X = {OPEC}. The following table provides a summary of the results (a translation of the Arabic words is provided):

X = OPEC

Y | M-Support | M-Confidence
Mechanisms | 1 | 9.09 %
Paris, Countries | 1 | 9.09 %
Creation, prices | 2 | 18.18 %
Petroleum | 3 | 27.27 %
Countries, members | 3 | 27.27 %
Prices | 3 | 27.27 %
Organisation, prices | 3 | 27.27 %
Creation | 3 | 27.27 %
Members | 4 | 36.36 %
Summit | 4 | 36.36 %
World | 4 | 36.36 %
Organisation, country | 4 | 36.36 %
Organisation | 6 | 54.55 %
Countries | 7 | 63.64 %
In | 9 | 81.82 %

Table 3: Results of the 3rd Experiment

The results obtained indeed show the tight relationship between the acronym OPEC and the two words Organisation and Countries. However, there is an association with relatively high M-support and M-confidence that relates OPEC to the function word in. We consider this association to be noise, which could be eliminated by adding a post-process that suppresses associations with function words.

4th experiment: The corpus studied here is a short biography of President Barack Obama. The text is written in Arabic. Upon reading the following table, it can be noted that, in the text, Obama is strongly associated (M-confidence = 100%) with Barack, even if the M-support is only 3. In terms of high M-confidence values, Obama is also strongly associated with the word pairs origins, African and states, united. However, there is a weak association of Obama with the function words like and of, with an M-confidence of 66.67%. Once more, this type of noise can be eliminated with the addition of a post-process that suppresses the undesired associations.

X = Obama

Y | M-Support | M-Confidence
candidate, last | 1 | 33.33 %
arms | 1 | 33.33 %
president, life | 1 | 33.33 %
Washington, American | 1 | 33.33 %
like | 2 | 66.67 %
of | 2 | 66.67 %
states, united | 2 | 66.67 %
origins, African | 2 | 66.67 %
Barack | 3 | 100.00 %

Table 4: Results of the 4th Experiment


In general, the results of our experiments seem interesting. The configuration of the classification results

seems, in fact, to discourage users who found themselves helpless in the face of “voluminous word lists”.

The downstream use of the numerical classification of an extraction process of maximal association rules

may help to better read the results of a classification.

In each experiment, the main topic of each document is represented in the extracted association rules,

since the first keyword used. We can conclude, in general, that the maximum extraction rules capture all

the main topics of the documents.

Maximal association rules are clues that can help the user to reformulate his query. The M-support and

the M-confidence indicate lexical proximity in the documents, but also in the language used and in the

areas covered by the textual content of the documents.

Initially, the query is limited to a single keyword: risk, Hassan, OPEC, or Obama. To improve the results, the user can now reformulate the query by adding new keywords from those associated with it, taking into account the M-support and the M-confidence. Of course, the associations, M-support, and M-confidence are only clues. Most importantly, the user no longer has to go through all possible classes.
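The reformulation step described above can be sketched as follows. The rule rows mirror Table 4, and the stopword filter stands in for the noise-suppressing post-process mentioned earlier; the function name and ranking policy (M-confidence first, then M-support) are assumptions for illustration.

```python
# Hypothetical sketch: expand a one-keyword query using rule scores
# such as those in Table 4. Each row is (associated_words, M-support,
# M-confidence in percent). Ranking policy is an assumption.

rules = [
    (("candidate", "last"), 1, 33.33),
    (("barack",), 3, 100.00),
    (("states", "united"), 2, 66.67),
    (("origins", "african"), 2, 66.67),
    (("like",), 2, 66.67),  # function-word noise, filtered below
    (("of",), 2, 66.67),
]

STOPWORDS = {"like", "of"}  # simple post-process suppressing noisy associations

def reformulate(query, rules, top_n=3):
    """Return the query extended with the best-scoring associated words."""
    candidates = [
        (conf, sup, words) for words, sup, conf in rules
        if not set(words) & STOPWORDS
    ]
    candidates.sort(reverse=True)  # highest M-confidence, then M-support
    extra = [w for _, _, words in candidates[:top_n] for w in words]
    return [query] + extra

print(reformulate("obama", rules))
# ['obama', 'barack', 'states', 'united', 'origins', 'african']
```

The user remains free to override these suggestions; the scores only order the candidates so that the user never has to scan every class.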

CONCLUSION

Information retrieval is a relatively mature discipline. Much work has been presented to the scientific community (TREC, SIGIR, etc.). However, the difficulties encountered in this work and its computational costs make continued research in this domain necessary.

In this chapter, we hope to have highlighted certain difficulties encountered in the information retrieval

process. The goal was not to create an exhaustive list of these difficulties, but rather to demonstrate that

possible elegant, user-oriented solutions exist. These solutions must be adapted to the information

retrieval contexts: searching the Web, large documents, multilingualism, new users, etc.

Textual classification allows the identification of similar documents (in the case of the Web, these documents are web pages). It also highlights lexical co-occurrences, in particular terms that co-occur with the keywords of a query. A user may then consider these terms to better tailor their query to their needs. However, the size of the vocabularies makes the user's task an arduous one.

The process of extracting maximal association rules allows us to identify co-occurrences within classes and to score them according to their relevance. These associations are clues placed at the users' disposal, allowing them to reformulate their queries in an informed manner.

REFERENCES

Agrawal, R., Srikant, R., (1994). Fast algorithms for mining association rules in large databases. In

Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International

Conference on Very Large Data Bases. Santiago, Chile.

Agrawal, R., Imielinski, T., Swami, A., (1993). Mining association rules between sets of items in large

databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Washington.

Amir, A., Aumann, Y., (2005). Maximal association rules: a tool for mining associations in text. Kluwer Academic Publishers, Hingham.

Boughanem, M., Savoy, J., (2008). Recherche d’information : états des lieux et perspectives. Éditions

Hermès/Lavoisier, Paris.


Cherfi, H., Napoli, A. (2005). Deux méthodologies de classification de règles d'association pour la fouille

de textes. Revue des nouvelles technologies de l'information

Cherfi, H., Toussaint, Y. (2002). Adéquation d'indices statistiques à l'interprétation de règles d'association. Actes des 6èmes Journées internationales d'Analyse statistique des Données Textuelles. Saint-Malo.

Desclés, J.P., Djioua, B. (2009). La recherche d'information par accès aux contenus sémantiques. In Desclés, J.P., Le Priol, F. (Eds.), Annotations automatiques et recherche d'informations. Hermès, Traité IC2, série Cognition et Traitement de l'information.

Diop, C. T., Lo, M. (2007). Intégration de règles d’association pour améliorer la recherche d’informations

XML. Actes de la Quatrième conférence francophone en Recherche d'Information et Applications. Saint-

Étienne.

Damashek, M., (1995). Gauging Similarity with n-Grams: Language-Independent Categorization of Text. Science, 267, 843-848.

El Amrani, M. Y., Delisle, S., Biskri, I., (2004). @GEWEB : Agents personnels d'aide à la recherche sur le Web. In Proceedings of the international conference TALN'04. Rabat, Morocco.

Grefenstette, G., (1995). Comparing Two Language Identification Schemes. Actes des 3èmes Journées internationales d'Analyse statistique des Données Textuelles. Rome.

Hajek, P., Havel, I., Chytil, M., (1966). The GUHA method of automatic hypotheses determination. In

Computing.

Lallich, S., Teytaud O. (2003). Évaluation et validation de l'intérêt des règles d'association. Revue des

nouvelles Technologies de l'information.

Laouamer, L., Biskri, I., Houmadi, B., (2005). Towards an Automatic Classification of Images:

Approach by the N-Grams. In Proceedings of WNSCI 2005. Orlando.

Meunier, J.G., Biskri, I., Nault, G., Nyongwa, M., (1997). Exploration de classifieurs connexionnistes

pour l'analyse terminologique. Actes de la conférence Recherche d'Informations Assistée par Ordinateur.

Montréal.

Patel, N., Mundur, P., (2005). An N-gram based approach to finding the repeating patterns in musical. In

Proceedings of Euro/IMSA 2005. Grindelwald.

Rompré, L., Biskri, I., Meunier, F., (2008). Text Classification: A Preferred Tool for Audio File Classification. In Proceedings of the 6th ACS/IEEE International Conference on Computer Systems and Applications. Doha.

Turenne, N. (2000). Apprentissage statistique pour l’extraction de concepts à partir de textes

(Application au filtrage d’informations textuelles). Thèse de doctorat en informatique, Université Louis-

Pasteur, Strasbourg, France.

Vaillant, B., Meyer P. (2006). Mesurer l’intérêt des règles d’association. Revue des Nouvelles

Technologies de l’Information (Extraction et gestion des connaissances: État et perspectives).


Key terms and definitions

Textual classification: A formal method that groups similar documents into similarity classes. It also captures patterns of co-occurrence of units of information in texts.

Unit of information: A unit of information allows a text to be represented as a vector. In this paper, the selected unit of information is the character n-gram.

N-grams of characters: An n-gram of characters is a sequence of n successive characters.

Maximal association rules: A maximal association rule is an association between two distinct sets of words which, according to a specific score, regularly co-occur.

To reformulate a query: The action of introducing new keywords into an information retrieval process.

GRAMEXCO: Our software for the classification of textual documents.

Multilingualism: Multilingualism is the ability of a computational method to deal with several different languages.