Top Banner
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009
15

1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

Jan 02, 2016

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

1

University of Palestine

Topics In CISITBS 3202

Ms. Eman Alajrami

2nd Semester 2008-2009

Page 2: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

2

Chapter 7Text Operations

Part2

Page 3: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

3

Elimination of StopwordsWords with high frequency are not good

discriminators. (in fact a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval ). They are frequently referred to as stopwords filtered out.

Examples:Articles: a, an, the,…Prepositions: on , in ,over,…Conjunctions: and ,or.

Page 4: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

4

Elimination of Stopwords (derived from Brown corpus): 425 words:

a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...

Page 5: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

5

Elimination of StopwordsElimination of stopwords reduces the size

of the indexing structure (a bout 40% compression in the size of the indexing structure ).

Unfortunately, sometimes elimination of stopwords could eliminate words that have a profound impact on the retrieved documents. Ex:’to be or not to be’ be (is only left).

Solution full text index.

Page 6: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

6

StemmingA stem is the portion of the word which is

left after the removal of the affixes (prefixes and suffixes).

They are thought to be useful for improving retrieval performance (reduce the variants of the same root to a common concept)

connect connected, connecting, connection, connections.

Page 7: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

7

StemmingFrakes distinguish 4 types of stemming

strategies:Affix removalTable lookup: simple, but needs dataSuccessor variety.N-gram.

Page 8: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

8

Stemming• Affix removal: intuitive, simple and can be

implemented efficiently • Table lookup: looking for the stem of a word in a

table( simple ,but needs data for the whole language and considerable storage space).

• Successor variety: based on the determination of the morpheme boundaries , uses knowledge from structural Linguistic (complex, expensive maintenance).

• N-gram: based on the identification of digrams and trigrams, and it is more clustering procedure than a stemming one .(no data, but imprecise).

Page 9: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

9

Stemming

Term Stemengineering engineerengineered engineerengineer engineer

Table lookup

Page 10: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

10

Successor VarietyDefinition (successor variety of a

string)the number of different characters that follow it in words in some body of text.

Page 11: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

11

Successor Variety (Continued)Idea

The successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, i.e., the successor variety will sharply increase.

ExampleTest word: READABLECorpus: ABLE, BEATABLE, FIXABLE, READ,READS

READABLE, READING, RED, ROPE, RIPEPrefix Successor Variety LettersR 3 E, O, IRE 2 A, DREA 1 DREAD 3 A, I, SREADA 1 BREADAB 1 LREADABL 1 EREADABLE 1 blank

Page 12: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

12

Affix Removal Stemmersprocedure

Remove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem. E.g., Porter’s algorithm (Eng Lang.) Porter algorithm. Martin Porter. Ready code in the

web. Substitution rules: sses s, s stresses stress.

Page 13: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

13

Affix Removal StemmersExample: plural forms

If a word ends in “ies” but not “eies” or “aies”then “ies” --> “y” (surgeries--> surgery).

If a word ends in “es” but not “aes”, “ees”, or “oes” then

“es” --> “e” (houses house).If a word ends in “s”, but not “us” or “ss” then “s” --

> NULL. (doors--> door).

Page 14: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

14

Index Terms SelectionA sentence in natural language text is

usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.

Most of the semantics meaning is carried by the noun words .so it is a promising strategy to use the nouns in the text (by eliminating the others like verbs ,etc,..).

Page 15: 1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester 2008-2009.

15

Index Terms SelectionSince it is common to combine two or three

nouns in a single component (ex. Computer science) it makes sense to cluster nouns which appear nearby in the text into a single indexing component (concept).

Thus instead of simply using nouns as index terms, we adopt noun groups)a set of nouns whose syntactic distance in the text and does not exceed a predefined threshold (for instance,3)(