Using association rules for query
reformulation
Ismaïl Biskri¹, Louis Rompré²
1Department of Mathematics and Computer Science
University of Quebec at Trois-Rivieres, Canada
2Department of Computer Science
University of Quebec at Montreal, Canada
ABSTRACT In this paper we present research on the combination of two data mining methods: text
classification and maximal association rules. Text classification has long been the focus of many
researchers. However, its results take the form of lists of words (classes) that users often do not know
what to do with. The use of maximal association rules offers several advantages: (i) the detection of
dependencies and correlations between the relevant units of information (words) of different classes;
(ii) the extraction of hidden, often relevant, knowledge from a large volume of data. We will show how
this combination can improve the process of information retrieval.
INTRODUCTION
The ever-increasing penetration of the internet and the growing volume of electronic documents have
made information retrieval a major scientific discipline in computer science, even as access to relevant
information has become difficult: the informational tide is occasionally reduced to nothing more than
noise.
Information retrieval consists of selecting the documents or segments of text likely to respond to the
needs of a user from a document database. This operation is carried out by way of digital tools that are
sometimes associated with linguistic tools in order to refine the granularity of the results given certain
points of view (Desclés, Djioua, 2009), with logical tools in question-answer formats, or with tools
proper to the Semantic Web. However, we knowingly omit a presentation of the contributions of these
linguistic and logical methods and of the Semantic Web, so as not to weigh down this chapter, since we
are primarily interested in the numerical side.
Formally, there are three main elements that stand out with regards to information retrieval:
(i) The group of documents.
(ii) The information needs of the users.
(iii) The relevance of the documents or segments of text that an information retrieval system returns
given the needs expressed by the user.
The last two aspects necessarily rely on the user. Not only does the user define their needs, but they also
validate the relevance of the documents returned. To express their needs, a user formulates a query that
often (but not always) takes the form of key words submitted to an information retrieval system based
either on a Boolean model, a vector model, or a probabilistic model (Boughanem, Savoy, 2008).
However, it is often difficult for a user to find key words that allow them to express their exact needs. In
many cases, the user is confronted by a lack of knowledge on the subject of interest in their information
search on the one hand, and on the other hand, by results that may be biased, as is the case with search
engines on the Web. Thus, retrieving relevant documents from the first search is almost impossible.
Therefore, there is a need to carry out a reformulation of the query either by using completely different
key words, or by expanding the initial query with the addition of new key words (El Amrani et al., 2004).
In the case of expanding the query, two variants are possible:
(i) The first is manual. The user chooses terms judged relevant from the documents that are themselves
judged relevant, in order to strengthen the query. This strategy is simple and computationally the
cheapest. However, it does not give a general view of the set of documents returned by the retrieval
system: their number is too large for a human to review them all. Quite often, the user only
consults, and therefore only judges, the first few documents.
(ii) The second is semi-automatic. The terms added to the initial query are chosen by the user from a
thesaurus (which may be constructed manually) or from similarity classes of documents and co-
occurrences of terms obtained following a classification applied to a group of documents,
obtained following the initial request as in clustering engines. A process of classifying textual
data from web sites can help the user of a search engine to better identify the target site or to
better formulate a query. Indeed, the lexical units which co-occur with the keywords submitted to
the search engine can provide more details concerning the documents to which access is desired.
However, the interpretation of similarity classes is a nontrivial exercise. The classes of similarity
are usually presented as lists of words that occur together. These lists are often very large and
their vocabulary is very noisy.
In this chapter we will show how maximal association rules can improve the semi-automatic
reformulation of a query in order to access target documents more quickly.
MAXIMAL ASSOCIATION RULES
A brief survey of the literature on data mining (Amir & Aumann, 2005) teaches us that association rules
allow for a representation of regularities in the co-occurrence of data (in the general sense of the term) in
transactions, regardless of their nature. Thus, data that regularly appear together are structured in
so-called association rules. An association rule is expressed as X → Y, read as follows: each time that
X is encountered in a transaction, so is Y. There are also two measures of the quality of an association
rule: Support and Confidence.
The concept of association rule emerged in the late 1960s (Hajek et al., 1966) with the introduction of
the concepts of support and confidence. Interest in this concept was revived in the 1990s through the
work of Agrawal (Agrawal et al., 1993; Agrawal & Srikant, 1994) on the extraction of association rules
from databases containing business transactions.
Currently, work is being done on how best to judge the relevance of association rules, as well as the
quality of their interpretation (Vaillant & Meyer, 2006 ; Lallich & Teytaud, 2003 ; Cherfi & Toussaint,
2002), and their integration into information retrieval systems (Diop & Lo, 2007) and into classification
processes for text mining (Cherfi & Napoli, 2005; Serp et al., 2008).
To illustrate association rules, consider the principal elements of the following example:
Three transactions grouping co-occurring data: T1 = {A, 1, K}; T2 = {M, L, 2}; T3 = {A, 1, 2}
Two sets categorizing the data: E1 = {A, M, K, L}; E2 = {1, 2}
X and Y, two disjoint sets of information units: X = {A}; Y = {1}, with X ⊆ E1 and Y ⊆ E2.
For a transaction Ti and a set of information units X, we say that Ti supports X if X ⊆ Ti. The Support
of X, noted S(X), is the number of transactions Ti such that X ⊆ Ti. For the transactions T1, T2 and T3,
S(X) = S({A}) = 2.
The Support of the association rule X → Y, noted S(X → Y), is the number of transactions that contain
both X and Y. In our example, S(X → Y) = S({A} → {1}) = 2.
The Confidence of the association rule X → Y, noted C(X → Y), is the support of the rule divided by the
Support of X: C(X → Y) = S(X → Y)/S(X). In our example, C(X → Y) = C({A} → {1}) = 1.
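These definitions can be checked in a few lines of code. The following Python sketch simply mirrors the example above; it is an illustration, not an implementation from this chapter:

```python
# Transactions from the example; items are represented as strings.
T = [{"A", "1", "K"}, {"M", "L", "2"}, {"A", "1", "2"}]

def support(itemset, transactions):
    """S(X): number of transactions Ti such that X is included in Ti."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    """C(X -> Y) = S(X -> Y) / S(X)."""
    return support(x | y, transactions) / support(x, transactions)

X, Y = {"A"}, {"1"}
print(support(X, T))        # S(A) = 2
print(support(X | Y, T))    # S(A -> 1) = 2
print(confidence(X, Y, T))  # C(A -> 1) = 1.0
```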
Despite their potential, association rules cannot be established in the case of less frequent associations.
Thus, certain associations are ignored since they are not frequent. For example, if the word printer often
appears with the word paper and less frequently with the word ink, it is very probable that the association
between printer and paper will be retained to the detriment of the association between printer, paper and
ink. In fact, the confidence criterion associated with the relationship between printer, paper and ink
would be too low.
Maximal association rules, noted X →max Y, compensate for this limitation. They obey the following
general principle: each time that X appears alone, Y also appears. X is said to appear alone in a
transaction Ti if and only if, for a category set Ej with X ⊆ Ej, Ti ∩ Ej = X. In this case, X is
maximal in Ti with regard to Ej, and Ti M-supports X. The M-support of X, noted Smax(X), is the number
of transactions Ti that M-support X.
In the transaction T1, X is not alone with regard to E1, since T1 ∩ E1 = {A, K}. On the other hand, in
the transaction T3, X is alone, since T3 ∩ E1 = {A}.
The M-support of the maximal association X →max Y, noted Smax(X →max Y), is the number of
transactions that M-support X and support Y.
In our example, only the transaction T3 M-supports X, while T1 and T3 support Y. Consequently,
Smax({A} →max {1}) = 1.
The M-confidence, noted Cmax(X →max Y), is the number of transactions that M-support X →max Y
relative to the set of transactions that M-support X →max E2. The M-confidence of the rule
X →max Y is thus calculated by the formula Cmax(X →max Y) = Smax(X →max Y)/Smax(X →max E2).
For the association {A} →max {1}, the M-Confidence is equal to 0.5.
Finally, it should be noted that we must define the minimum thresholds for the M-support of a maximal
association, as well as for its M-Confidence.
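On the same running example, the maximal measures can be sketched as follows. Note that the denominator Smax(X →max E2) is somewhat ambiguous as stated; the reading below (transactions that support X and contain at least one element of E2) is an assumption on our part, chosen because it reproduces the value of 0.5 given above:

```python
T = [{"A", "1", "K"}, {"M", "L", "2"}, {"A", "1", "2"}]
E1, E2 = {"A", "M", "K", "L"}, {"1", "2"}
X, Y = {"A"}, {"1"}

def m_supports(t, x, e):
    """Ti M-supports X w.r.t. the category set E iff Ti ∩ E == X (X is alone)."""
    return t & e == x

# Smax(X ->max Y): transactions that M-support X and also support Y.
smax_rule = sum(1 for t in T if m_supports(t, X, E1) and Y <= t)

# Assumed denominator: transactions supporting X that intersect E2.
denom = sum(1 for t in T if X <= t and t & E2)

print(smax_rule)          # 1 (only T3: T3 ∩ E1 == {A})
print(smax_rule / denom)  # 0.5
```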
REFORMULATING A QUERY FOR A SEARCH ENGINE
We now describe the main elements of our project, which is currently in progress. Although the project is
still in its early stages, we have already designed and developed several components. Let us now look at
the overall processing strategy, which is organised in four main phases.
Phase #1: An original query is formulated and submitted to a standard search engine, such as Google.
This original query is most probably sub-optimal, because of the reasons we mentioned just above. But
that is exactly our starting point: we only expect the user to have a relatively good idea of what she is
looking for, not too vague but not too precise either. In fact, we expect this original query to subsume the
exact information she is hoping to find on the Web. In a subsequent step, the user will be given the
opportunity to reconsider her query in the light of the search results returned for her original query. These
search results are typically Web sites containing textual information. Of course, there is the possibility
that the number of Web sites returned by the search engine will be very large. The user can set (in an
arbitrary manner in the present state of our work) the maximum number of web sites to consider.
Phase #2 : We consider that each Web site represents a text segment (domain of information). Thus, a set
of Web sites forms a set of text segments which, taken altogether, can be looked at as a corpus. We then
submit this corpus to the GRAMEXCO software (see next section) that will help us identify segments
sharing lexical regularities. Assuming that such similarities correspond to content similarities between the
Web sites, then related web sites will tend to be grouped in the same classes. In other words, the classes
produced by GRAMEXCO will tend to contain web pages about the same topic and, by the same
token, will identify the lexical units that tend to co-occur within these topics. And this is where we have a
first gain with respect to the original query: GRAMEXCO’s results will provide a list of candidate query
terms that are related to the user’s original query and that will help her formulate a new, more precise query.
The classes of Web pages obtained at the end of this phase are considered as contextual information,
relative to the user’s original query, that will allow her to reformulate her query (if necessary).
Phase #3: Words in classes of co-occurring words can act as carriers of more selective or complementary
meaning when compared with the keywords used in the original query. For instance, such words will be
more selective when the original keywords are polysemous or have multiple usages in different domains;
they will be complementary when they broaden the subset of the WWW
implied by the original query. Now, which of these new words should the user pick to reformulate her
query? It is at this stage that the process of extracting association rules comes in, the aim being to
offer the user a tool to select the words that associate most strongly (according to the M-support and
the M-confidence) with the keywords of the original query. The user thus does not have to go through all
the classes of words, which is a non-trivial task.
At that point the user can formulate an updated query from which she will obtain, through the processing
already presented in phases #1, #2 and #3, another set of classes containing similar Web pages, co-
occurring words and maximal association rules. Again, at the end of this step, the user may discover new
words that could guide her in a more precise reformulation of her query. It is entirely up to the user to
determine when the iterative query reformulation process ends.
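The iterative process described across these phases can be summarized as a simple control loop. This is only a structural sketch; the callables (search, classify, extract_rules, ask_user) are hypothetical placeholders, not actual components of our system:

```python
def reformulation_loop(original_query, search, classify, extract_rules, ask_user):
    """Iterate until the user judges the query satisfying.

    search:        query -> list of Web pages                  (phase 1)
    classify:      pages -> classes of co-occurring words      (phase 2)
    extract_rules: (classes, query) -> maximal association rules (phase 3)
    ask_user:      (query, rules) -> new query, or None to stop
    """
    query = original_query
    while True:
        pages = search(query)
        classes = classify(pages)
        rules = extract_rules(classes, query)
        new_query = ask_user(query, rules)
        if new_query is None:  # the user is satisfied with the current results
            return query, pages
        query = new_query
```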
Figure 1: From the user’s original query to a satisfying query (the iterative loop through the Web search
engine, corpus construction, numerical classification, and identification of maximal association rules in
classes of co-occurring words)
IDENTIFICATION OF MAXIMAL ASSOCIATION RULES IN SIMILARITY CLASSES
GRAMEXCO (n-GRAMs in the EXtraction of knowledge (COnnaissance)) is our software tool that has
been developed for the numerical classification of multimedia documents (Rompré et al., 2008),
particularly text documents. The numerical classification takes place by way of a numerical classifier.
The unit of information considered in GRAMEXCO is the n-gram of characters, the value of n being
configurable.
The main objective is to provide the same processing chain, regardless of the corpus language, but with
easily legible layouts in the presentation of the results. Recall that the use of n-grams of characters is not
recent. It was first used in work by Damashek (1995) on text analysis and work by Greffenstette (1995)
on language identification. The interest in n-grams today has been extended to the domains of images
(Laouamer et al., 2005), and musicology, particularly in locating refrains (Patel & Mundur, 2005). A
character n-gram is defined here as a sequence of n characters: bigrams for n=2, trigrams for n=3,
quadrigrams for n=4, etc. For example, in the word informatique the trigrams are: inf, nfo, for, orm, rma,
mat, ati, tiq, iqu, que. We justify our choice of character n-grams as the unit of information as follows:
(i) cutting text into sequences of n consecutive characters is possible in most languages, and any
approach must be adaptable to several languages given the multilingual nature of the Web; (ii) n-grams
tolerate a certain ratio of deformation or inflexion of lexical units. The functioning of GRAMEXCO is not
entirely automatic: the choice of certain parameters is made by the user according to their own
objectives. GRAMEXCO takes a raw (non-indexed) text in UTF encoding as input. Three main steps follow,
in which the user can customize certain processes.
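Extracting character n-grams is a one-line sliding window. The sketch below reproduces the trigram example for the word informatique:

```python
def char_ngrams(text, n):
    """All sequences of n consecutive characters of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("informatique", 3))
# ['inf', 'nfo', 'for', 'orm', 'rma', 'mat', 'ati', 'tiq', 'iqu', 'que']
```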
1. The first step consists of building the list of information units and the information domains (the
parts of text to be compared for similarity). These two operations, carried out simultaneously, produce
an output matrix giving the frequency of appearance of each information unit in each information
domain. The information units may be bigrams, trigrams, quadrigrams, etc. The information domains are
obtained by segmenting the text into words, sentences, paragraphs, documents, web sites, or simply
sections of text delimited by a character or a string of characters. The size of the n-gram and the
type of textual segment are chosen by the user according to the goals of their analysis.
2. The second step consists of reducing the size of the matrix. This operation is indispensable given the
important cost in resources that an overly large matrix would represent.
Thus, during this step, the list of n-grams undergoes a trimming that consists of:
- the elimination of n-grams whose frequency is below one threshold or above another;
- the elimination of specific n-grams selected from a list (for example, n-grams containing spaces or
non-alphabetic characters);
- the elimination of certain n-grams considered functional, such as suffixes.
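The trimming criteria above can be sketched as a single filter over the n-gram frequency list; the thresholds and the stop-list are user parameters, and the values used here are arbitrary illustrations:

```python
def trim_ngrams(freq, low, high, stoplist=frozenset()):
    """Keep only the n-grams that survive the three elimination criteria."""
    kept = {}
    for ngram, f in freq.items():
        if f < low or f > high:   # frequency below one threshold or above another
            continue
        if ngram in stoplist:     # explicitly excluded n-grams (e.g. suffixes)
            continue
        if " " in ngram or not any(c.isalpha() for c in ngram):
            continue              # n-grams with spaces or without letters
        kept[ngram] = f
    return kept

freqs = {"inf": 12, "q u": 9, "123": 7, "que": 15, "zzz": 1}
print(trim_ngrams(freqs, low=2, high=30, stoplist={"que"}))  # {'inf': 12}
```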
3. In the third step, the classification process takes place. The classifier used here is the neural network
ART (Meunier et al., 1997). The choice of classifier is not dictated by particular performance reasons
since this is not our objective. We could have just as easily chosen another classifier that would have
admittedly yielded different results. Such variations continue to be the focus of research such as was
presented in Turenne (2000).
At the end of this step, segments considered as similar by the classifier are regrouped into similarity
classes. Furthermore, the lexicon of these segments forms the vocabulary of the classes to which they
belong.
Whether the goal is lexical disambiguation, or searching for “conceptual” relationships, or information
retrieval, etc., the interpretation of similarity classes is not a trivial task. Similarity classes are generally
presented as lists of words that co-occur. These lists are quite frequently lengthy and despite
organizational attempts, their vocabulary remains rather noisy.
The extraction of maximal association rules proves particularly interesting here, since it permits the
discovery of lexical associations relevant to making an informed decision. The classes obtained at the
end of the classification operation serve as the transactions of the extraction process. Finally, the
process must be supervised by the user, who first determines the word for which the most probable
associations will be sought.
To illustrate this step, let us posit the following scenario, which will allow us to discover maximal
association rules X →max Y based on the results of a classification.
The input of the classification is a text in which the vocabulary represents a category set E1: {x, a, b, c, d,
e, f}. The classification outputs classes with their respective lexicon: C1 : {x, a, b, c}, C2 : {a, c, d}, C3 :
{x, e, f, d}.
If the classes represent the transactions, the vocabulary of the input text represents a set E1 for
categorizing the textual data (the vocabulary) in which set X is chosen.
This being established, the extraction process of maximal association rules is carried out in three steps:
1st step: choice of the set X. The user chooses, from the list of elements of E1, the lexical items that
will represent X. Let us assume, for explanatory purposes, that X = {x}.
2nd step: identification of the sets Y and E2. The identification of the category set E2, of which Y will
be a subset, depends largely on the chosen set X and on the classes of which X is a subset.
In our illustration, X is included in C1 and in C3. Y may therefore be a subset either of {a, b, c} or of
{e, f, d}. In other words, Y may be one of the following subsets: {a}, {b}, {c}, {a, b}, {a, c}, {b, c},
{a, b, c}, {e}, {f}, {d}, {e, f}, {e, d}, {f, d}, {e, f, d}.
The measures of M-Support and of M-Confidence will be calculated with regards to these different
possible values of Y. An iterative process would allow for testing the set of these possibilities. We may,
however, limit the number of iterations in order to avoid an overly prohibitive computational cost, for
example, by fixing (via parameter) the cardinality of subset Y.
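Enumerating the candidate sets Y, optionally capped by a maximum cardinality, can be sketched with itertools; on the classes of this illustration it yields the 14 subsets listed above:

```python
from itertools import combinations

classes = [{"x", "a", "b", "c"}, {"a", "c", "d"}, {"x", "e", "f", "d"}]
X = {"x"}

def candidate_ys(classes, x, max_card=None):
    """Non-empty subsets Y of (class - X), for every class that contains X."""
    candidates = set()
    for cl in classes:
        if x <= cl:
            rest = sorted(cl - x)
            # Optional cap on |Y| to bound the computational cost.
            top = max_card if max_card is not None else len(rest)
            for k in range(1, top + 1):
                candidates.update(frozenset(c) for c in combinations(rest, k))
    return candidates

print(len(candidate_ys(classes, X)))     # 14 candidate subsets
print(len(candidate_ys(classes, X, 1)))  # 6 when |Y| is fixed to 1
```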
Let us suppose that Y = {a, c}. In order to construct E2, the respective categories of the elements a and
c must first be established. The category of an element is obtained by uniting the classes that contain
it. E2 = category(Y) = category({a, c}) is then obtained by intersecting category(a) with category(c).
Thus:
category(a) = {a, b, c} ∪ {a, c, d} = {a, b, c, d}
and
category(c) = {a, b, c} ∪ {a, c, d} = {a, b, c, d}
therefore:
E2 = category(Y) = category({a, c}) = category(a) ∩ category(c) = {a, b, c, d}
3rd step: once the sets E1, E2, X and Y, as well as the transactions, have been clearly identified, the
calculation of the measures may be made.
Consider the association x →max {a, c}. Using the classes C1: {x, a, b, c}, C2: {a, c, d}, C3: {x, e, f, d}
as transactions, and E2 = {a, b, c, d}, the M-support equals 1, since only the class C1 contains X = {x}
and Y = {a, c}, and the M-confidence equals 0.5, since two classes contain X while only one contains both
X and Y.
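The worked example can be replayed in code. Two assumptions are made explicit here: the category of an element is taken as the union of the classes containing it, minus the elements of X itself (chosen so as to reproduce E2 = {a, b, c, d}), and the measures are counted exactly as in the text (classes containing X and Y, over classes containing X):

```python
classes = [{"x", "a", "b", "c"}, {"a", "c", "d"}, {"x", "e", "f", "d"}]
X, Y = {"x"}, {"a", "c"}

def category(item, classes, exclude):
    """Union of the classes containing `item`, minus the chosen set X."""
    out = set()
    for cl in classes:
        if item in cl:
            out |= cl - exclude
    return out

E2 = category("a", classes, X) & category("c", classes, X)
print(sorted(E2))  # ['a', 'b', 'c', 'd']

m_support = sum(1 for cl in classes if (X | Y) <= cl)  # only C1 contains x, a, c
n_with_x = sum(1 for cl in classes if X <= cl)         # C1 and C3 contain x
print(m_support, m_support / n_with_x)  # 1 0.5
```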
EXPERIMENTS
The whole of the theory presented here was implemented in C#. The results of the analyses are stored in
XML databases. In the short term, we hope to add a more practical visualisation module that would permit
capturing, in one step, the set of associations of a given lexical unit.
The following experiments were applied to four corpora (three of them are extracted from web sites).
Two corpora are in French and two are in Arabic. The first corpus is a collection of interviews with
directors of small and medium Quebecois businesses in order to learn about their perspectives on the
notion of risk. The second corpus addresses the history of the reign of King Hassan II of Morocco. The
third corpus (in Arabic) addresses the Organisation of the Petroleum Exporting Countries (OPEC).
Finally, the fourth and final corpus (in Arabic) summarizes the biography of the American President,
Barack Obama. The domains are sufficiently different to draw conclusions on the efficacy of the
methodology. Note that we limit ourselves to showing only the maximal associations and their scores
(M-support and M-confidence); we assume the reader is sufficiently familiar with classification methods,
so we do not show the similarity classes.
1st experiment: the corpus, as mentioned above, addresses the perspective of directors of small and
medium Quebecois businesses with regards to the notion of risk. One of the constraints during the
interviews was the obligation put on the directors to use the word risk when they deemed it necessary. In
our experiments, this aspect is crucial since we need to know which words are associated to risk in the
discourse of the directors.
Thus, despite the presence of noisy data such as, for example, Pause and X, which were intentionally
inserted into the text for ethical reasons (X represents the name of people who were questioned) and to
represent silences (Pause), interesting results were still obtained. For example:
- Risk →max Project is an association found in 10 classes (M-support = 10) with a confidence of 100%.
- Risk →max Management, Project is an association found in 7 classes (M-support = 7) with a confidence
of 70%. In other words, 30% of the time, the word Risk is found in classes where Management and Project
do not occur together.
- Risk →max Management is an association found in 7 classes (M-support = 7) with a confidence of 70%.
- Risk →max Product is an association found in 5 classes (M-support = 5) with a confidence of 50%.
The following table summarizes the results obtained:
X Y M-Support M-Confidence
Risk Decision, Product 2 20%
Year 2 20%
Markets, Price 2 20%
Science 3 30%
Interview, Studies 3 30%
Function 4 40%
Manner, Level 5 50%
Product 5 50%
Question 6 60%
Interview, Risk 6 60%
Level, X 7 70%
Management 7 70%
Management, Project 7 70%
Project, Risks 8 80%
X 10 100%
Pause 10 100%
Project, X 10 100%
Pause, X 10 100%
Project 10 100%
Table 1 : Results of the 1st Experiment
2nd experiment: For the second experiment, we chose a short 4-page text about the reign of King Hassan
II. For this experiment, we intentionally fixed the cardinality of the set Y at 1. For X = {Hassan}, we
obtained the results summarized in Table 2.
Note that, for example, the association Hassan →max II is very strong: its confidence is 100%. Likewise
for the associations Hassan →max Morocco and Hassan →max King; although their confidence is only
61.54%, this is sufficiently high to consider the two associations as maximal.
X Y M-Support M-Confidence
Hassan
Doctor 1 7.69 %
Professor 1 7.69 %
Spain 1 7.69 %
Tunisia 1 7.69 %
Spanish 2 15.38 %
Journalist 3 23.08 %
History 3 23.08 %
Prepare 3 23.08 %
Title 4 30.77 %
France 5 38.46 %
Politics 6 46.15 %
Year 7 53.85 %
King 8 61.54 %
Morocco 8 61.54 %
II 13 100 %
Table 2: Results of the 2nd Experiment
3rd experiment: For the third experiment, we chose an Arabic text regarding the Organisation of the
Petroleum Exporting Countries (OPEC), the goal being to evaluate the validity of the method with
regards to the Arabic language. For the purposes of the experiment, we chose X = {OPEC}. The
following table provides a summary of the results (a translation of the Arabic words is provided):
X Y M-Support M-Confidence
OPEC
Mechanisms 1 9.09 %
Paris, Countries 1 9.09 %
Creation, prices 2 18.18 %
Petroleum 3 27.27 %
Countries, members 3 27.27 %
Prices 3 27.27 %
Organisation, prices 3 27.27 %
Creation 3 27.27 %
Members 4 36.36 %
Summit 4 36.36 %
World 4 36.36 %
Organisation, country 4 36.36 %
Organisation 6 54.55 %
Countries 7 63.64 %
In 9 81.82 %
Table 3: Results of the 3rd Experiment
The results obtained indeed show the tight relationship between the acronym OPEC and the two words
Organisation and Countries. However, there is an association with a relatively high M-support and M-
confidence that relates OPEC to the function word in. We consider this association as being noise that
may be eliminated if a post-process is added to suppress associations with function words.
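The post-process suggested here amounts to filtering out rules whose right-hand side consists only of function words. A minimal sketch follows; the stop list is an illustrative stand-in, as a real one would be language-specific (Arabic, French, etc.):

```python
STOP_WORDS = {"in", "of", "like"}  # illustrative; a real list is language-specific

def filter_rules(rules, stop_words=STOP_WORDS):
    """Drop rules (y_terms, m_support, m_confidence) whose Y is all stop words."""
    return [r for r in rules
            if not set(t.lower() for t in r[0]) <= stop_words]

rules = [(("Organisation",), 6, 0.5455),
         (("Countries",), 7, 0.6364),
         (("in",), 9, 0.8182)]
print(filter_rules(rules))  # the ("in",) association is suppressed
```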
4th experiment: The corpus studied here is a short biography of President Barack Obama. The text is
written in Arabic. Upon reading the following table, it can be noted that in the text, Obama is strongly
associated (M-confidence = 100%) to Barack even if the M-support is only 3. It is also noted that in terms
of important values for M-confidence, Obama is strongly associated to the word pairs origins, African
and states, united. However, there is a weak association of Obama with the function words like and of
with an M-confidence of 66.67%. Once more, this type of noise can be eliminated with the addition of a
post-process that would suppress the undesired associations.
X Y M-Support M-Confidence
Obama candidate, last 1 33.33 %
arms 1 33.33 %
president life 1 33.33 %
Washington, American 1 33.33 %
like 2 66.67 %
of 2 66.67 %
states, united 2 66.67 %
origins, African 2 66.67 %
Barack 3 100.00 %
Table 4 : Results of the 4th Experiment
In general, the results of our experiments seem interesting. The raw output of a classification tends,
in fact, to discourage users, who find themselves helpless in the face of voluminous word lists. Using an
extraction process for maximal association rules downstream of the numerical classification can help
users better read the results of a classification.
In each experiment, the main topic of each document is represented in the extracted association rules,
starting from the first keyword used. We can conclude, in general, that the extracted maximal association
rules capture the main topics of the documents.
Maximal association rules are clues that can help the user to reformulate her query. The M-support and
the M-confidence indicate lexical proximity in the documents, but also in the language used and in the
areas covered by the textual content of the documents.
Initially, each query is limited to a single keyword: risk, Hassan, OPEC, or Obama. To improve the
results, the user can reformulate her queries by adding new keywords from among those associated with the
initial one, taking into account the M-support and the M-confidence. Of course, the associations, the
M-support and the M-confidence are only clues. What matters most is that the user no longer has to go
through all the possible classes.
CONCLUSION
Information Retrieval is a relatively mature discipline. Much work has been presented to the scientific
community (TREC, SIGIR, etc.). However, the difficulties in carrying out this work and the
computational costs make it necessary to continue research in this domain.
In this chapter, we hope to have highlighted certain difficulties encountered in the information retrieval
process. The goal was not to create an exhaustive list of these difficulties, but rather to demonstrate that
possible elegant, user-oriented solutions exist. These solutions must be adapted to the information
retrieval contexts: searching the Web, large documents, multilingualism, new users, etc.
Textual classification allows the identification of similar documents (where in the case of the internet,
documents are web pages). It also allows us to highlight lexical co-occurrences, in particular, terms that
co-occur with key words in a query. A user may then consider these terms to better tailor their query to
their needs. However, the size of vocabularies makes the user’s task an arduous one.
The process of extracting maximal association rules allows us not only to identify co-occurrences within
classes, but also to score them according to their relevance. These associations are clues at the disposal
of users, who may thus reformulate their queries in an informed manner.
REFERENCES
Agrawal, R., Srikant, R., (1994). Fast algorithms for mining association rules in large databases. In
Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th International
Conference on Very Large Data Bases. Santiago, Chile.
Agrawal, R., Imielinski, T., Swami, A., (1993). Mining association rules between sets of items in large
databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
Washington.
Amir, A., Aumann, Y., (2005). Maximal association rules: a tool for mining association in text. Kluwer
Academic Publishers Hingham.
Boughanem, M., Savoy, J., (2008). Recherche d’information : états des lieux et perspectives. Éditions
Hermès/Lavoisier, Paris.
Cherfi, H., Napoli, A. (2005). Deux méthodologies de classification de règles d'association pour la fouille
de textes. Revue des nouvelles technologies de l'information.
Cherfi, H., Toussaint Y. (2002). Adéquation d’indices statistiques à l’interprétation de règles
d’association. Actes des 6èmes Journées internationales d'Analyse statistique des Données Textuelles.
Saint-Malo.
Desclés, J.P., Djioua B. (2009). La recherche d'information par accès aux contenus sémantiques,
"Annotations automatiques et recherche d'informations ", Eds. dir. Desclés Jean-Pierre, Le Priol
Florence, Hermes - Traite IC2 -- serie Cognition et Traitement de l'information.
Diop, C. T., Lo, M. (2007). Intégration de règles d’association pour améliorer la recherche d’informations
XML. Actes de la Quatrième conférence francophone en Recherche d'Information et Applications. Saint-
Étienne.
Damashek, M., (1995). Gauging Similarity with n-Grams : Language-Independent Categorization Of
Text. Science, 267, 843-848.
El Amrani, M. Y., Delisle, S., Biskri, I., (2004). @GEWEB : Agents personnels d'aide à la recherche sur
le Web. In Proceedings of the international conference TALN'04. Rabat, Morocco.
Greffenstette, G., (1995). Comparing Two Language Identification Schemes. Actes des 3èmes Journées
internationales d'Analyse statistique des Données Textuelles. Rome.
Hajek, P., Havel, I., Chytil, M., (1966). The GUHA method of automatic hypotheses determination. In
Computing.
Lallich, S., Teytaud O. (2003). Évaluation et validation de l'intérêt des règles d'association. Revue des
nouvelles Technologies de l'information.
Laouamer, L., Biskri, I., Houmadi, B., (2005). Towards an Automatic Classification of Images:
Approach by the N-Grams. In Proceedings of WNSCI 2005. Orlando.
Meunier, J.G., Biskri, I., Nault, G., Nyongwa, M., (1997). Exploration de classifieurs connexionnistes
pour l'analyse terminologique. Actes de la conférence Recherche d'Informations Assistée par Ordinateur.
Montréal.
Patel, N., Mundur, P., (2005). An N-gram based approach to finding the repeating patterns in musical. In
Proceedings of Euro/IMSA 2005. Grindelwald.
Rompré, L. Biskri, I., Meunier, F., (2008). Text Classification: A Preferred Tool for Audio File
Classification. In Proceedings of the 6th ACS/IEEE International Conference on Computer Systems and
Applications. Doha.
Turenne, N. (2000). Apprentissage statistique pour l’extraction de concepts à partir de textes
(Application au filtrage d’informations textuelles). Thèse de doctorat en informatique, Université Louis-
Pasteur, Strasbourg, France.
Vaillant, B., Meyer P. (2006). Mesurer l’intérêt des règles d’association. Revue des Nouvelles
Technologies de l’Information (Extraction et gestion des connaissances: État et perspectives).
Key terms and definitions
Textual classification: Textual classification is a formal method that groups similar documents together
into similarity classes. It also captures patterns of co-occurrence of units of information in texts.
Unit of information: A unit of information makes it possible to represent a text as a vector. In this
chapter, the unit of information selected is the character n-gram.
N-grams of characters: An n-gram of characters is a sequence of n successive characters.
Maximal association rules: A maximal association rule is an association between two distinct sets of
words which, according to a specific score, co-occur regularly.
To reformulate a query: The action of introducing new keywords into an information retrieval process.
GRAMEXCO: Our software for the classification of textual documents.
Multilingualism: Multilingualism is the ability of a computational method to deal with several different
languages.