Data Structure
Page 1: Data Structure. Two segments of data structure –Storage –Retrieval.

Data Structure

Page 2

• Two segments of data structure

– Storage

– Retrieval

Page 3

[Figure: item normalization, document file creation, document manager, document search manager, original document file, processing token search file]

Page 4

– Stemming

– Inverted file system

– N-gram

– PAT trees and arrays

– Signature

– Hypertext

Page 5

Inverted file system

• Most common data structure

• Minimizes secondary storage access

– When using multiple search terms

• Components: document file, inversion lists (posting files), dictionary

• Stores an inversion of the documents: for each word, the list of documents that contain it

Page 6

N-gram

• Fixed-length consecutive series of "n" characters

• Algorithmically based upon a fixed number of characters

• The searchable data is transformed into overlapping n-grams, which are used to create the searchable database (Fig. 4.7)

• Does not involve semantics (concepts)

Page 7

N-gram

• The symbol # represents the interword symbol (Fig. 4.7)

– Blank, period, semicolon, colon, etc.

• Word fragments

• Uses

– Spelling error detection and correction (Fig. 4.8)

– Text compression

• Ignores words and treats the input as continuous data
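
The n-gram transformation above can be sketched in a few lines. This is a minimal illustration, not the method behind Fig. 4.7: the function name `word_ngrams`, the default n = 3, and the choice to pad each word separately with # (so no n-gram spans two words) are assumptions made for the example.

```python
import re

def word_ngrams(text, n=3, interword="#"):
    """Return the overlapping n-grams of each word, padded with the interword symbol."""
    grams = []
    # Assumption: any run of non-alphanumeric characters (blank, period,
    # semicolon, colon, ...) acts as an interword symbol.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = interword + word + interword
        # Slide a window of width n across the padded word.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

trigrams = word_ngrams("sea colony")
print(trigrams)
```

Because punctuation and blanks all collapse to #, `word_ngrams("Sea; colony.")` yields the same trigrams as `word_ngrams("sea colony")`.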

Page 8

N-gram

• False hits can occur when the interword symbol # is not used

• The longer the n-gram, the less likely the error

• Problems

– Increased size of inversion lists

– No semantic meaning or concept relationships

• Can achieve high recall

Page 9

PAT trees

• PATRICIA trees

– Practical Algorithm To Retrieve Information Coded In Alphanumerics

• Each position in the input string is the anchor point for a substring that starts at that point and includes all text up to the end of the input

• These substrings are termed sistrings (semi-infinite strings) (Figs. 4.9 - 4.11)

• Best for string searching but not widely used commercially
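
A minimal sketch of sistrings and a PAT array (the sorted list of sistring start positions) may help here. It does not build the PATRICIA tree itself; the function names and the sample string "banana" are assumptions for illustration, and the search rebuilds the sistring list on every call for brevity rather than efficiency.

```python
from bisect import bisect_left

def sistrings(text):
    """Every position anchors a substring running to the end of the input."""
    return [text[i:] for i in range(len(text))]

def build_pat_array(text):
    """Start positions of all sistrings, ordered lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_prefix(text, pat_array, prefix):
    """Return the sorted start positions at which `prefix` occurs in `text`."""
    ordered = [text[i:] for i in pat_array]
    # Binary search for the first sistring that is >= prefix.
    pos = bisect_left(ordered, prefix)
    hits = []
    # All sistrings sharing the prefix are adjacent in sorted order.
    while pos < len(ordered) and ordered[pos].startswith(prefix):
        hits.append(pat_array[pos])
        pos += 1
    return sorted(hits)

text = "banana"
pat_array = build_pat_array(text)
print(sistrings(text))
print(find_prefix(text, pat_array, "an"))
```

String search becomes a prefix search over the sorted sistrings, which is why the structure suits string searching.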

Page 10

Signature

• Provides a fast test to eliminate the majority of items that are not related to a query

• A linear scan of the compressed version of items

• Coding is based upon the words in the item

• Words are mapped onto a word signature

– A fixed-length code with a fixed number of bits set to 1

– The bit positions set to 1 are determined by a hash function

– Word signatures are ORed together to create the signature of an item

– Fig. 4.13

• Words in the query are mapped to their word signatures

• Search via template matching

Page 11

Signature

• A longer code length reduces the probability of collisions, where different words hash to the same signature

• Setting fewer bits per word signature reduces the chance that a word's bit pattern appears in the final signature block although the word is not actually in the item
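
The signature scheme (hash each word to a fixed number of bits, OR the word signatures, then template-match the query) can be sketched as follows. The code length of 64, the 3 bits per word, and the use of SHA-256 as the hash function are assumptions chosen for the example, not values from the slides.

```python
import hashlib

CODE_LENGTH = 64    # assumed signature width in bits
BITS_PER_WORD = 3   # assumed number of bits set to 1 per word

def word_signature(word):
    """Map a word to a fixed-length code with exactly BITS_PER_WORD bits set."""
    signature, counter, bits_set = 0, 0, 0
    while bits_set < BITS_PER_WORD:
        # Derive successive bit positions from a hash of the word.
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        position = int.from_bytes(digest[:4], "big") % CODE_LENGTH
        if not signature & (1 << position):
            signature |= 1 << position
            bits_set += 1
        counter += 1
    return signature

def item_signature(words):
    """OR the word signatures together to form the signature of an item."""
    signature = 0
    for word in words:
        signature |= word_signature(word.lower())
    return signature

def may_contain(item_sig, query_words):
    """Template match: every bit of the query signature must be set in the item."""
    query_sig = item_signature(query_words)
    return item_sig & query_sig == query_sig

item = item_signature(["computer", "science", "graduate"])
print(may_contain(item, ["computer"]))
```

A match only means the item may contain the query words, since ORed bit patterns can produce false hits; a non-match guarantees that at least one query word is absent.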

Page 12

Hypertext (HTML and XML)

• Allows one item to reference another item via an embedded pointer

• Node: a separate item

• Link: a reference pointer

– The referenced item may be of the same or a different data type than the original

• Navigation: a way of managing loosely structured information

• Issue

– Linkage integrity (links are not updated when the referenced items are removed or deleted)
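
The linkage integrity issue can be illustrated with a toy model of nodes and links. The dictionary layout, the node names, and the helper `find_broken_links` are hypothetical and only serve the example.

```python
def find_broken_links(nodes):
    """Return (source, target) pairs whose target node no longer exists.

    `nodes` maps each node id to the list of node ids it links to.
    """
    return [
        (source, target)
        for source, targets in nodes.items()
        for target in targets
        if target not in nodes
    ]

# Three items; "a" and "b" both hold an embedded pointer to "c".
nodes = {"a": ["b", "c"], "b": ["c"], "c": []}
print(find_broken_links(nodes))

# Deleting "c" does not update the links that reference it.
del nodes["c"]
print(find_broken_links(nodes))
```

After the deletion, the links from "a" and "b" dangle, which is exactly the integrity problem named above.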

Page 13

Hypertext (HTML and XML)

• Dynamic HTML

– Combination of the latest HTML tags and options, style sheets, and programming

– Enables the creation of Web pages that are animated and responsive to user interaction

• Dynamic HTML Object Model

– Object-oriented view of a Web page and its elements

– Cascading style sheets

– Programming that addresses the page elements, with dynamic fonts

Page 14

DOCUMENTS                      DICTIONARY   INVERSION LISTS

Doc #1: computer, bit, byte    bit (2)      bit - 1, 3

Page 15

• Inversion list

– Weights

– Words with special characteristics, e.g. dates

• Searching

– Locate the inversion lists for the search terms

– Apply the appropriate logic between the lists

– The final list of items is the result (the hit list)
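
The search steps above (locate the inversion lists, apply logic between them, return the final list) can be sketched as follows. Doc 1 follows the table on Page 14; documents 2 to 4 and all function names are assumptions added so that the dictionary entry bit (2) and the inversion list bit - 1, 3 come out as shown there.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Build the dictionary (word -> item count) and the inversion lists."""
    inversion_lists = defaultdict(list)
    for doc_id in sorted(documents):
        for word in sorted(set(documents[doc_id])):
            inversion_lists[word].append(doc_id)
    dictionary = {word: len(ids) for word, ids in inversion_lists.items()}
    return dictionary, dict(inversion_lists)

def search_and(inversion_lists, terms):
    """Items that contain every search term (intersection of the lists)."""
    lists = [set(inversion_lists.get(term, [])) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []

def search_or(inversion_lists, terms):
    """Items that contain at least one search term (union of the lists)."""
    lists = [set(inversion_lists.get(term, [])) for term in terms]
    return sorted(set().union(*lists))

# Doc 1 is from the slide; documents 2-4 are assumed for illustration.
documents = {
    1: ["computer", "bit", "byte"],
    2: ["memory", "byte"],
    3: ["computer", "bit", "memory"],
    4: ["byte", "computer"],
}
dictionary, inversion_lists = build_inverted_file(documents)
print(dictionary["bit"], inversion_lists["bit"])
print(search_and(inversion_lists, ["computer", "bit"]))
```

Only the inversion lists of the query terms are touched, which is how the structure keeps secondary storage access low for multi-term searches.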

Page 16

B-trees

• e.g. a B-tree of order m

• A root node with between 2 and 2m keys

• All other internal nodes have between m and 2m keys

• All keys are kept in order from smaller to larger

• All leaves are at the same level or differ by at most one level

Page 17

Inversion list structures

• Provide optimum performance in searching large databases

• Minimization of data flow

• Involve only directly related data

• Good for storing concepts and their relationships

• Each list represents a concept

• A concordance of all of the items containing the concept

• Location of the concepts

• Do not, on their own, suffice for natural language processing

Page 18

Stemming algorithm

• Goal: to improve performance and require fewer system resources by reducing the number of unique words that a system has to contain

• Currently reviewed for potential improvements in recall and the associated decline in precision

• Trade-off: increased overhead for processing tokens vs. reduced search-time overhead for processing query terms with trailing "don't cares" to include all the variants

• Stemming creates one large index for the stem, whereas term masking requires merging (ORing) the indexes of the matching terms

Page 19

Stemming algorithm

• Conflation: mapping multiple morphological variants to a single representation (stem)

• Stem: carries the meaning of the concept associated with the word

• Affixes (endings) introduce subtle modifications to the concept or are used for syntactic purposes

• Languages: have grammars that define usage, but also evolve based upon human usage

• Exceptions and inconsistent variants exist, so exception look-up tables are required in addition to the normal reduction rules
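
As a toy illustration of conflation with reduction rules plus an exception look-up table: the rule list, the exception entries, and the minimum stem length of 3 below are all invented for the example and are far cruder than any real stemmer.

```python
# Hypothetical exception table for irregular variants.
EXCEPTIONS = {"mice": "mouse", "ran": "run", "children": "child"}

# Hypothetical reduction rules: (suffix, replacement), tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def conflate(word):
    """Map a word variant to a single representation (stem)."""
    word = word.lower()
    # Exceptions are checked before the normal reduction rules.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

for variant in ["walking", "walked", "walks", "ponies", "mice"]:
    print(variant, "->", conflate(variant))
```

The three variants of "walk" conflate to one stem through the rules, while "mice" can only be handled by the exception table.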

Page 20

Stemming algorithm

• Compression: savings in storage and processing?

• Savings: in the dictionary; weighted positional information is required in the stem inversion list and the unstemmed inversion list

• Size of the inversion lists

• Compression does not significantly reduce storage requirements (small vs. large collections)

Page 21

Stemming algorithm

• Improve recall?

– As long as a semantically consistent stem can be identified for a set of words (stemming is a generalization process)

• Improve precision?

– Only if the expansion guarantees that every item retrieved by the expansion is relevant

Page 22

Stemming algorithm

• The system must recognize the word before stemming

• Proper names and acronyms: no stemming is applied, since they have no common core concept

• Problem for natural language processing systems: loss of information needed for aggregate levels of processing

– e.g. tenses needed to determine a particular concept

– Time: important in natural language processing

Page 23

Stemming algorithm

• Removal of suffixes and prefixes

• Table look-up: requires a large data structure (e.g. RetrievalWare, due to its large thesaurus/concept network)

• Successor stemming: determines prefix overlap as the length of a stem is increased

Page 24

Stemming algorithm

• Porter stemming algorithm

• Dictionary look-up stemmers

• Successor stemmers

Page 25

Porter stemming algorithm

• Based upon a set of conditions on the stem, suffix and prefix, and associated actions given the condition

• The measure (m) of a stem is a function of sequences of vowels followed by a consonant. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form $[C](VC)^{m}[V]$

• The initial C and the final V are optional, and m is the number of times VC repeats

• Conditions on the stem:

– *<X>: the stem ends with the letter X

– *v*: the stem contains a vowel

– *d: the stem ends in a double consonant

– *o: the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y
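
The measure m can be computed by collapsing a stem into alternating runs of consonants (C) and vowels (V) and counting the VC pairs. The sketch below assumes Porter's usual convention that y counts as a vowel only when it follows a consonant; the function names are chosen for the example.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's convention."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    stem = stem.lower()
    collapsed = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        # Collapse runs so that "CCVVCCV" becomes "CVCV".
        if not collapsed or collapsed[-1] != kind:
            collapsed += kind
    return collapsed.count("VC")

for word in ["tree", "trouble", "troubles"]:
    print(word, measure(word))
```

For example, "trouble" collapses to CVCV and has m = 1, while "troubles" collapses to CVCVC and has m = 2.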

Page 26

Dictionary look-up stemmers

• Simple stemming rules with the fewest exceptions (e.g. plurals) are applied

• The original term or its stemmed version is looked up in a dictionary and replaced by the stem that best represents it

• e.g. Kstem: a morphological analyzer that conflates word variants to a root form and avoids collapsing words with different meanings into the same root

• Six major data files, including: dictionary of words, supplemental list of words, exception list for words that should retain an "e" at the end, direct conflation, country nationality

Page 27

Successor stemmers

• Based upon the length of prefixes that optimally stem expansions of additional suffixes

• Based upon an analogy with structural linguistics, which investigated word and morpheme boundaries using the distribution of phonemes

• e.g. bag, barn, bring, both, box, bottle (Fig. 4.2)

Page 28

Successor stemmers

• Methods: cut-off, peak and plateau, complete word, and entropy

• Cut-off method: a cut-off value is selected to define the stem length; the value varies for each possible set of words

• Peak and plateau: a segment break is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it (no cut-off value needed)

• Complete word method: break on boundaries of complete words (no cut-off value needed)

• Entropy method: uses the distribution of successor variety letters

• Figure 4.3
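
Successor variety and the peak-and-plateau rule can be sketched as below. The first corpus is the word list from the previous slide; the second corpus and the test word "readable" are assumptions added only because the slide's corpus produces no peak. End-of-word is not counted as a successor here, which is a simplifying assumption.

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the corpus words."""
    return len({
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    })

def successor_varieties(word, corpus):
    """Successor variety of every prefix of `word`, shortest prefix first."""
    return [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]

def peak_and_plateau(word, corpus):
    """Break after each character whose variety exceeds both neighbours."""
    sv = successor_varieties(word, corpus)
    segments, start = [], 0
    for i in range(1, len(sv) - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

slide_corpus = ["bag", "barn", "bring", "both", "box", "bottle"]
print(successor_varieties("boxer", slide_corpus))

# Assumed corpus, chosen so that a peak appears after "read".
demo_corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
               "reading", "reads", "red", "rope", "ripe"]
print(peak_and_plateau("readable", demo_corpus))
```

With the slide's corpus, "b" is followed by a, r and o (variety 3) and "bo" by t and x (variety 2). With the assumed corpus, the variety rises to 3 after "read" and drops on both sides, so the word is segmented into "read" and "able".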

Page 29

Stemming algorithm

• Stemming improved recall in one study, but the effect was not confirmed in many other studies; it can reduce precision, which can be minimized by ranking items, categorizing terms, and selectively excluding some terms from stemming

• Stemming is dependent upon the nature of the vocabulary

• Performance measure: error rate relative to truncation (the distance from the origin to the coordinate of the stemmer being evaluated vs. the distance from the origin to the worst-case intersection of the line generated by pure truncation), Fig. 4.4

• Measures the ability to partition terms that are semantically and morphologically related to each other into "concept groups"

• Understemming index: concept groups that contain multiple stems

• Overstemming index: the same stem is found in multiple groups
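
The two error types can be counted directly from a set of concept groups. This is a simplified counting sketch, not the formal indices (which are computed differently); the concept groups and the four-character truncation stemmer are invented for the example.

```python
def stemming_error_counts(concept_groups, stem):
    """Count understemmed groups and overstemmed stems.

    Understemming: a concept group whose words yield more than one stem.
    Overstemming: a stem that appears in more than one concept group.
    """
    understemmed = 0
    groups_per_stem = {}
    for index, group in enumerate(concept_groups):
        stems = {stem(word) for word in group}
        if len(stems) > 1:
            understemmed += 1
        for s in stems:
            groups_per_stem.setdefault(s, set()).add(index)
    overstemmed = sum(1 for groups in groups_per_stem.values() if len(groups) > 1)
    return understemmed, overstemmed

def truncate4(word):
    """Hypothetical stemmer: pure truncation to four characters."""
    return word[:4]

concept_groups = [
    ["divide", "division"],
    ["divine", "divinity"],
    ["run", "running"],
]
print(stemming_error_counts(concept_groups, truncate4))
```

Here truncation merges the unrelated "divide" and "divine" groups under the stem "divi" (overstemming) and splits "run" from "running" (understemming).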
