Data Structure. Two segments of data structure –Storage –Retrieval.

Data Structure

• Two segments of data structure

– Storage

– Retrieval

[Figure: system diagram with the labels Item normalization, Document File Creation, Document Manager, Document Search Manager, Original document file, and Processing token search file]

– Stemming

– Inverted file system

– N-gram

– PAT trees and arrays

– Signature

– Hypertext

Inverted file system

• The most common data structure

• Minimizes secondary storage access when multiple search terms are used

• Three components: the document file, the inversion lists (posting files), and the dictionary

• Stores an inversion of the documents: for each word, the list of documents that contain it

N-gram

• A fixed-length consecutive series of n characters

• Algorithmically based upon a fixed number of characters

• The searchable data structure is transformed into overlapping n-grams, which are then used to create the searchable database (Fig. 4.7)

• Does not involve semantics or concepts

N-gram

• The symbol # represents the inter-word symbol (Fig. 4.7)

– Blank, period, semicolon, colon, etc.

• N-grams are word fragments

• Uses

– Spelling error detection and correction (Fig. 4.8)

– Text compression

• Can ignore words and treat the input as continuous data
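The transformation into overlapping n-grams can be sketched as follows. This is a minimal illustration, not the scheme of any particular system: the function name `ngrams`, the default of n = 3, and the choice to treat every non-alphanumeric character as an inter-word symbol are assumptions made for the example.

```python
import re

def ngrams(text: str, n: int = 3, interword: str = "#") -> list[str]:
    """Return the overlapping n-grams of each word, padded with the inter-word symbol."""
    # Assumption: blanks, periods, semicolons, colons, etc. all act as separators.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", text.lower()) if w]
    grams = []
    for word in words:
        padded = interword + word + interword
        # Slide a window of width n over the padded word, one character at a time.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(ngrams("sea colony"))
# ['#se', 'sea', 'ea#', '#co', 'col', 'olo', 'lon', 'ony', 'ny#']
```

Padding each word with # keeps n-grams from spanning two words, which is the role the inter-word symbol plays in the bullets above.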

N-gram

• False hits can occur when the inter-word symbol # is not used

• The longer the n-gram, the less likely the error

• Problems

– Increased size of the inversion lists

– No semantic meaning or concept relationships

• Can achieve high recall

PAT trees

• PATRICIA trees

– Practical Algorithm To Retrieve Information Coded In Alphanumerics

• Each position in the input string is the anchor point for a substring that starts at that point and includes all text up to the end of the input

• These substrings are termed sistrings (semi-infinite strings) (Figures 4.9 to 4.11)

• Best for string searching, but not widely used commercially
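To make the idea of sistrings concrete, here is a small sketch. It does not build a PAT tree; it swaps in the simpler PAT array (the sistring start positions kept in sorted order) and searches it with a binary search. The function names and the sample string are assumptions for illustration only.

```python
from bisect import bisect_left

def sistrings(text: str) -> list[str]:
    """One sistring per input position: the text from that position to the end."""
    return [text[i:] for i in range(len(text))]

def pat_array(text: str) -> list[int]:
    """Start positions of the sistrings, ordered by the sistring they anchor."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text: str, array: list[int], pattern: str) -> list[int]:
    """Return every position where `pattern` occurs, using the sorted sistrings."""
    # Truncating sorted sistrings to the pattern length keeps them sorted.
    keys = [text[i:i + len(pattern)] for i in array]
    k = bisect_left(keys, pattern)
    hits = []
    while k < len(keys) and keys[k] == pattern:
        hits.append(array[k])
        k += 1
    return sorted(hits)

text = "banana"
print(sistrings(text))                    # ['banana', 'anana', 'nana', 'ana', 'na', 'a']
print(find(text, pat_array(text), "ana")) # [1, 3]
```

Because every substring of the input is a prefix of some sistring, a prefix search over the sorted sistrings answers arbitrary string searches, which is why the structure suits string searching.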

Signature

• Provides a fast test to eliminate the majority of items that are not related to a query

• Search is a linear scan of the compressed version of the items

• Coding is based upon the words in the item

• Words are mapped onto a word signature

– A fixed-length code with a fixed number of bits set to 1

– The bit positions set to 1 are determined by the hash function

– Word signatures are ORed together to create the signature of an item (Fig. 4.13)

• Words in the query are mapped to their word signatures

• Search is via template matching
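The signature scheme described above might be sketched like this. The code length of 16 bits, the 3 bits set per word, and the use of SHA-256 as the hash function are all assumptions chosen for the example; the slides do not specify them.

```python
import hashlib

def word_signature(word: str, code_length: int = 16, bits_set: int = 3) -> int:
    """Map a word to a fixed-length code with exactly `bits_set` bits set to 1."""
    if bits_set > code_length:
        raise ValueError("bits_set cannot exceed code_length")
    signature, counter = 0, 0
    # Keep hashing (word, counter) until enough distinct bit positions are set.
    while bin(signature).count("1") < bits_set:
        digest = hashlib.sha256(f"{word.lower()}:{counter}".encode()).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % code_length)
        counter += 1
    return signature

def item_signature(words, code_length: int = 16, bits_set: int = 3) -> int:
    """OR the word signatures together to form the signature of an item."""
    signature = 0
    for word in words:
        signature |= word_signature(word, code_length, bits_set)
    return signature

def may_contain(item_sig: int, query_words, code_length: int = 16, bits_set: int = 3) -> bool:
    """Template match: every bit set in the query must also be set in the item."""
    template = item_signature(query_words, code_length, bits_set)
    return (item_sig & template) == template

item = item_signature(["computer", "science", "graduate"])
print(may_contain(item, ["computer"]))  # True: a word in the item always matches
```

A match only means the item may contain the word; items that pass the test still have to be checked, because ORing can set the same bits that an absent word would set.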

Signature

• A longer code length reduces the probability of a collision when hashing the words (two different words hashing to the same value)

• Fewer bits per code reduce the chance that a code word pattern is present in the final signature block even though the word is not actually in the item

Hypertext (HTML and XML)

• Allows one item to reference another item via an embedded pointer

• Node: a separate item

• Link: a reference pointer

– The referenced item can be of the same or a different data type than the original

• Navigation: a way of managing loosely structured information

• Issue: linkage integrity

– Links are not updated when the items they point to are removed or deleted
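The linkage integrity issue can be illustrated with a small check for links whose target no longer exists. The representation used here (a dictionary from a node identifier to the list of identifiers it links to) and the name `dangling_links` are assumptions for the sketch, not part of HTML or XML.

```python
def dangling_links(nodes: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (source, target) pairs whose target is not an existing node."""
    return [
        (source, target)
        for source, links in nodes.items()
        for target in links
        if target not in nodes
    ]

# Hypothetical collection: "page_c" was deleted, but "page_a" still points to it.
collection = {
    "page_a": ["page_b", "page_c"],
    "page_b": ["page_a"],
}
print(dangling_links(collection))  # [('page_a', 'page_c')]
```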

Hypertext (HTML and XML)

• Dynamic HTML

– A combination of the latest HTML tags and options, style sheets, and programming

– Enables the creation of Web pages that are animated and responsive to user interaction

• Dynamic HTML Object Model

– An object-oriented view of a Web page and its elements

– Cascading style sheets

– Programming addressing the page elements with dynamic fonts

DOCUMENTS | DICTIONARY | INVERSION LISTS

Doc #1: computer, bit, byte | bit (2) | bit - 1, 3

• Inversion list

– Can also store weights

– Words with special characteristics, e.g. dates

• Searching

– Locate the inversion lists for the search terms

– Apply the appropriate logic between the lists

– The final hit list of items is the result
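The search steps above can be sketched in a few lines. Only Doc #1 and the inversion list for "bit" come from the example in the slides; the contents of documents 2 and 3, and the function names, are hypothetical and chosen to be consistent with it.

```python
from collections import defaultdict

def build_inversion_lists(documents: dict[int, str]) -> dict[str, list[int]]:
    """For each word, record the sorted list of document numbers containing it."""
    lists = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            lists[word].add(doc_id)
    return {word: sorted(ids) for word, ids in lists.items()}

def search_and(lists: dict[str, list[int]], terms: list[str]) -> list[int]:
    """Locate each term's inversion list and intersect them (AND logic)."""
    hits = None
    for term in terms:
        ids = set(lists.get(term.lower(), []))
        hits = ids if hits is None else hits & ids
    return sorted(hits or [])

def search_or(lists: dict[str, list[int]], terms: list[str]) -> list[int]:
    """Locate each term's inversion list and merge them (OR logic)."""
    hits = set()
    for term in terms:
        hits |= set(lists.get(term.lower(), []))
    return sorted(hits)

documents = {
    1: "computer bit byte",    # from the example in the slides
    2: "memory byte",          # hypothetical
    3: "computer bit memory",  # hypothetical
}
lists = build_inversion_lists(documents)
dictionary = {word: len(ids) for word, ids in lists.items()}  # word -> number of items

print(lists["bit"], dictionary["bit"])        # [1, 3] 2
print(search_and(lists, ["computer", "bit"])) # [1, 3]
```

Only the inversion lists for the query terms are read, which is how the structure keeps secondary storage access low when several search terms are combined.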

B-trees

• Example: a B-tree of order m has the following properties

• A root node with between 2 and 2m keys

• All other internal nodes have between m and 2m keys

• All keys are kept in order from smaller to larger

• All leaves are at the same level or differ by at most one level

Inversion list structures

• Provide optimum performance in searching large databases

• Minimization of data flow

• Involve only directly related data

• Good for storing concepts and their relationships

• Each list represents a concept

• A concordance of all of the items containing the concept

• Location of the concepts

• Do not solely work for natural language processing

Stemming algorithm

• Goal: to improve performance and require fewer system resources by reducing the number of unique words that a system has to contain

• Currently reviewed for potential improvements in recall, with an associated decline in precision

• Trade-off: increased overhead for processing tokens vs. reduced search-time overhead for processing query terms with trailing ‘don’t cares’ to include all of the variants

• Stemming creates one large index for the stem, whereas term masking requires merging (ORing) the indexes of every term that matches the search term

Stemming algorithm

• Conflation: the mapping of multiple morphological variants to a single representation (the stem)

• Stem: carries the meaning of the concept associated with the word

• Affixes (endings) introduce subtle modifications to the concept or are used for syntactical purposes

• Languages have grammars that define usage, but they also evolve based on human usage

• Exceptions and inconsistent variants exist, so exception look-up tables are required in addition to the normal reduction rules

Stemming algorithm

• Compression: are there savings in storage and processing?

• Savings: the dictionary; weighted positional information is still required in the stem inversion list and the unstemmed inversion list

• Size of the inversion lists

• Compression does not significantly reduce storage requirements (small vs. large collections)

Stemming algorithm

• Does stemming improve recall?

– Yes, as long as a semantically consistent stem can be identified for a set of words (stemming is a generalization process)

• Does stemming improve precision?

– Only if the expansion guarantees that every item retrieved by the expansion is relevant

Stemming algorithm

• The system must recognize the word before stemming

• Proper names and acronyms: no stemming is applied, since they have no common core concept

• Problem for natural language processing systems: loss of information needed for aggregate levels of processing

– E.g. tenses are needed to determine a particular concept

– Time is important in natural language processing

Stemming algorithm

• Removal of suffixes and prefixes

• Table look-up: requires a large data structure (e.g. RetrievalWare, because of its large thesaurus/concept network)

• Successor stemming: determines prefix overlap as the length of a stem is increased



Stemming algorithm

• Porter stemming algorithm

• Dictionary look-up stemmers

• Successor stemmers

Porter Stemming algorithm

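• Based upon a set of conditions on the stem, suffix, and prefix, and the associated actions given the condition

• The measure m of a stem is a function of sequences of vowels followed by a consonant. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form $[C](VC)^{m}[V]$

• The initial C and the final V are optional, and m is the number of times VC repeats

• Conditions: *<X> (the stem ends with the letter X), *v* (the stem contains a vowel), *d (the stem ends in a double consonant), *o (the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y)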
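The measure m can be computed by collapsing a stem into its pattern of consonant and vowel runs and counting the VC pairs. The sketch below assumes Porter's usual convention that y counts as a vowel when it follows a consonant; the function names are illustrative.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if the letter at position i is a consonant under Porter's convention."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel, else a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        # Collapse runs: consecutive letters of the same kind form one C or V.
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("VC")

for word in ["tree", "by", "trouble", "oats", "troubles", "private"]:
    print(word, measure(word))
# tree 0, by 0, trouble 1, oats 1, troubles 2, private 2
```

Many of the algorithm's rules apply only when m exceeds a threshold, which keeps short stems from being stripped further.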

Dictionary Look-Up Stemmers

• Apply simple stemming rules, those with the fewest exceptions (e.g. plurals)

• The original term or its stemmed version is looked up in a dictionary and replaced by the stem that best represents it

• E.g. Kstem: a morphological analyzer that conflates word variants to a root form and avoids collapsing words with different meanings into the same root

• Six major data files: dictionary of words, supplemental list of words, exception list for words that should retain an “e” at the end, direct conflation, country nationality, and proper nouns

Successor Stemmers

• Based upon the length of prefixes that optimally stem expansions of additional suffixes

• Based upon an analogy with structural linguistics, which investigated word and morpheme boundaries based upon the distribution of phonemes

• E.g. bag, barn, bring, both, box, bottle (Fig. 4.2)
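The quantity these stemmers rely on is the successor variety of a prefix: the number of distinct letters that follow the prefix across the words of the corpus. A minimal sketch using the six example words, with the simplifying assumption that the end of a word is not counted as a successor:

```python
def successor_variety(prefix: str, corpus: list[str]) -> int:
    """Number of distinct letters that follow `prefix` in the corpus words."""
    return len({
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    })

corpus = ["bag", "barn", "bring", "both", "box", "bottle"]

print(successor_variety("b", corpus))   # 3: a, r, o
print(successor_variety("bo", corpus))  # 2: t, x
print([successor_variety("bottle"[:i], corpus) for i in range(1, 7)])
# [3, 2, 2, 1, 1, 0]
```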

Successor Stemmers

• Methods: cut-off, peak and plateau, complete word, and entropy

• Cut-off method: a cut-off value defines the stem length; the value varies for each possible set of words

• Peak and plateau method: a segment break is made after a character whose successor variety exceeds that of both the character immediately preceding it and the character immediately following it (no cut-off value needed)

• Complete word method: breaks on the boundaries of complete words (no cut-off value needed)

• Entropy method: uses the distribution of successor variety letters

• Figure 4.3
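The peak and plateau rule can be sketched as follows. The corpus and the test word are illustrative assumptions, not taken from the slides, and the end of a word is not counted as a successor.

```python
def successor_variety(prefix: str, corpus: list[str]) -> int:
    """Number of distinct letters that follow `prefix` in the corpus words."""
    return len({
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    })

def peak_and_plateau(word: str, corpus: list[str]) -> list[str]:
    """Split `word` after each character whose successor variety is a local peak."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    # A break needs both neighbours, so the first and last characters are skipped.
    breaks = [
        k + 1
        for k in range(1, len(sv) - 1)
        if sv[k] > sv[k - 1] and sv[k] > sv[k + 1]
    ]
    segments, start = [], 0
    for b in breaks:
        segments.append(word[start:b])
        start = b
    segments.append(word[start:])
    return segments

# Hypothetical corpus chosen so that "read" has several continuations.
corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]
print(peak_and_plateau("readable", corpus))  # ['read', 'able']
```

The successor varieties for the prefixes of "readable" are 3, 2, 1, 3, 1, 1, 1, 0, so the only interior peak falls after "read", and that first segment would be taken as the stem.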

Stemming Algorithm

• Stemming had a positive effect on recall in one study, an effect not proven in many other studies, but it can reduce precision; the loss can be minimized via ranking of items, categorization of terms, and selective exclusion of some terms from stemming

• Stemming is dependent upon the nature of the vocabulary

• Performance measure: error rate relative to truncation, i.e. the distance from the origin to the coordinate of the stemmer being evaluated vs. the distance from the origin to the worst-case intersection of the line generated by pure truncation (Fig. 4.4)

• Measures the ability to partition terms that are semantically and morphologically related to each other into “concept groups”

• Understemming index: concept groups with multiple stems

• Overstemming index: the same stem is found in multiple groups
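The two kinds of error can be counted directly from a set of concept groups. The sketch below is a simplified count, not the normalized indices used in the literature; the concept groups and the four-character truncation stemmer are hypothetical and serve only to exercise the code.

```python
from typing import Callable

def stemming_errors(groups: list[set[str]], stem: Callable[[str], str]) -> tuple[int, int]:
    """Return (understemmed groups, overstemmed stems) for a stemmer."""
    group_stems = [{stem(word) for word in group} for group in groups]
    # Understemming: a concept group whose words map to more than one stem.
    under = sum(1 for stems in group_stems if len(stems) > 1)
    # Overstemming: a stem that shows up in more than one concept group.
    all_stems = set().union(*group_stems)
    over = sum(
        1 for s in all_stems
        if sum(s in stems for stems in group_stems) > 1
    )
    return under, over

# Hypothetical concept groups and a pure-truncation "stemmer".
groups = [
    {"compute", "computer", "computing"},
    {"compile", "compiler"},
    {"run", "running", "ran"},
]
truncate = lambda word: word[:4]
print(stemming_errors(groups, truncate))  # (1, 1)
```

Here truncation merges the first two groups under "comp" (overstemming) and splits the third into "run", "runn", and "ran" (understemming).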
