ISCSLP’02 L. F. Chien
Information Retrieval Techniques for Spoken Language Processing
Lee-Feng Chien (簡立峰)
Institute of Information Science, Academia Sinica, Taiwan
International Symposium on Chinese Spoken Language Processing (ISCSLP 2002), Taipei, Taiwan, August 23-24, 2002
ISCA Archive: http://www.isca-speech.org/archive
Outline
IR vs. SLP
Conventional IR Techniques
Web IR Techniques
Web Mining Techniques
Term Clustering through Web Mining
Anchor Text Mining
I. IR vs. SLP
Information Retrieval
A research field whose long-term goal is to explore techniques for storing, classifying, extracting, indexing, and browsing information, so that it can be retrieved from unstructured collections such as text documents
Different Research Aspects
Text IR: text indexing, searching, presentation, user study
Web IR: crawling, page ranking, distributed search, scalability, multilingual and multicultural content
Multimedia IR: retrieving multimedia content such as speech, audio, music, image, video
Intelligent IR: advanced language and information processing topics such as question answering, cross-language retrieval, information tracking, information extraction, summarization, speech interaction, etc.
Chinese Information Retrieval
Language issues: word segmentation, term extraction, parsing, linguistic resources, font display, conversion between Simplified Chinese and Traditional Chinese, etc.
Culture/geographical issues: Chinese pages come from all over the world, language identification is required, preferred topics differ, etc.
Chinese people issues: user behaviors
Web Users and Pages (3 years ago)
Area         Users   Web Pages   Time
World-wide   150M    800M        7/99
China        4M      2.5~3M      7/99
Taiwan       4M      3M          7/99

ASR or dictation machine: lexicon, corpus, and language model
Voice portal: search via spoken queries
Speech retrieval: indexing & searching
Topic detection & tracking: document classification & clustering
IR vs. SLP
[Figure: diagram relating Natural Language Processing, Speech Recognition, and Information Retrieval; labels: Search Engine, Dictation Machine, Parser, IR via Voice, Q&A, Speech IR]
Research Paradigm
[Figure: diagram with the labels Search Engine, Dictation Machine, Parser, IR via Voice, and Web Mining]
Anchor Text Mining for Query Translation (Lu, ICDM’01)
Cross-Language Web Search: a Web search service that lets users query in one language and search documents that are written or indexed in another language.
II. Conventional IR Techniques
The Vector Space Model
Measures the closeness between a query and a document.
Queries and documents are represented as n-dimensional vectors.
Each dimension corresponds to a word/term.
Advantages: conceptual simplicity, and the use of spatial proximity as a stand-in for semantic proximity.
Vector Similarity
d = The man said that a space age man appeared
d’ = Those men appeared to say their age
Vector Similarity (Cont.)
cosine measure or normalized correlation coefficient
Euclidean Distance:
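As a minimal sketch of the two measures, the example below applies the standard definitions of cosine similarity and Euclidean distance to the two sentences d and d’ from the previous slide. It assumes raw lowercase word counts with whitespace tokenization and no stemming or stopword removal; the function names are illustrative and not taken from the talk.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def euclidean(u, v):
    """Euclidean distance between two sparse count vectors."""
    return math.sqrt(sum((u[t] - v[t]) ** 2 for t in u.keys() | v.keys()))

d = term_vector("The man said that a space age man appeared")
d_prime = term_vector("Those men appeared to say their age")

print(round(cosine(d, d_prime), 3))
print(round(euclidean(d, d_prime), 3))
```

Without stemming, only "appeared" and "age" are shared, so "man"/"men" and "said"/"say" contribute nothing to the similarity; this is one reason word-form normalization matters in the vector space model.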
Term Weighting
Quantities used:
tf_{i,j} (term frequency): number of occurrences of w_i in d_j
df_i (document frequency): number of documents in which w_i occurs
cf_i (collection frequency): total number of occurrences of w_i in the collection
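The three quantities can be computed in a single pass over a collection. The sketch below assumes whitespace tokenization and a tiny made-up collection; `term_statistics` is an illustrative name, not something from the talk.

```python
from collections import Counter

def term_statistics(docs):
    """Return (tf, df, cf): per-document term counts, document frequency, collection frequency."""
    tf = [Counter(doc.lower().split()) for doc in docs]  # tf[j][w] = occurrences of w in document j
    df = Counter()
    cf = Counter()
    for counts in tf:
        df.update(counts.keys())  # each document counts once per distinct term
        cf.update(counts)         # every occurrence counts
    return tf, df, cf

# Hypothetical three-document collection.
docs = ["web search engine", "web mining of web logs", "speech retrieval"]
tf, df, cf = term_statistics(docs)

print(tf[1]["web"], df["web"], cf["web"])
```

Here "web" occurs twice in the second document, appears in two documents, and occurs three times overall, which shows how df and cf can diverge for the same term.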
Inverted File for Keyword Matching
Google’s Index File Structure
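As a rough illustration of keyword matching with an inverted file, the sketch below maps each term to the list of documents containing it and answers a Boolean AND query by intersecting posting lists. It is a generic textbook structure under simplifying assumptions (whitespace tokens, in-memory dictionaries) and says nothing about how Google's index files are actually laid out.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return ids of documents that contain every keyword in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical documents.
docs = {
    1: "speech retrieval with spoken queries",
    2: "text retrieval and indexing",
    3: "spoken language processing",
}
index = build_inverted_file(docs)
print(search_all(index, "spoken retrieval"))
```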
Chinese IR & Indexing Unit Selection
Character-based indexing and search
• speed/space problems
• incorrect matching due to the free combination of characters, e.g., 電腦科學 (computer science)
Word-based indexing and search
• a lexicon is both a prerequisite and a limitation
• requires unknown-word identification and disambiguation of word segmentation
Csmart’s approach (Chien, SIGIR’95)
• signature-based, with feature grouping (unigram and bigram characters)
• two-stage search and fuzzy search; suited to files that are not too large and to applications that demand fuzzy search
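To make the incorrect-matching problem concrete, the sketch below indexes overlapping character bigrams and shows a query matching a document in which the same characters merely happen to be adjacent. This is a plain bigram index written for illustration; it is not the signature-based Csmart method, and the second document is a made-up example.

```python
from collections import defaultdict

def char_bigrams(text):
    """Overlapping character bigrams of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_bigram_index(docs):
    """Map each character bigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index[gram].add(doc_id)
    return index

def search(index, query):
    """Documents that contain every bigram of the query."""
    grams = char_bigrams(query)
    if not grams:
        return set()
    hits = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        hits &= index.get(gram, set())
    return hits

# 1: "computer science"; 2: "neurologist" (hypothetical document).
docs = {1: "電腦科學", 2: "腦科醫師"}
index = build_bigram_index(docs)

# The query 腦科 ("neurology") also hits document 1, because 腦 and 科
# are adjacent inside 電腦科學 even though they belong to different words.
print(search(index, "腦科"))
```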
Chinese Track in TREC-5
Berkeley (A. Chen, SIGIR’97)
Indexing units
• uses a dictionary to segment texts
• dictionary obtained from the public domain, with 91,000 words and phrases
• stopword list with 444 entries
Searching algorithm, segmentation method
• 0.461 average precision for manual queries, 0.32 for the automatic run
CUNY (Kwok, SIGIR’97)
Indexing units
• uses 2,000 words to segment texts initially
• uses a learning strategy to extend the word entries to 15,000 in the end
Searching algorithm, segmentation method
• 0.40 word-based; 0.42 word- and character-based
Text Indexing
Indexing Chinese texts
Indexing units
• single character, bi-character, word, term, string
Problems
• new word identification and word segmentation
Indexing English texts
Indexing units
• word, term, string (few)
Problems
• stemming, capitalization, hyphenation, word sense disambiguation, typing errors
Structure for indexing
• inverted file, PAT array
Term extraction and term clustering are the key techniques required
Term Extraction
A term is a meaningful and representative unit for information retrieval, e.g., a name, location, proper noun, or topic
Terms are derived, and most are excluded from common dictionaries
Term extraction can reduce word sense ambiguity in text retrieval and remedy weaknesses of word-based approaches, e.g.:
• computer network (linking, net, mesh)
• Bank America, current theory
Term extraction is the first step toward concept-based IR
Chinese Term Extraction
• PAT-tree-based Approach
• Poster Presentation Award by ACM SIGIR’98
Context Dependency
Association
Left Context Dependency (LCD)
Right Context Dependency (RCD)
Table 4. The detailed results for incremental term extraction when the threshold value was larger than 2 in the significance analysis; the results were obtained from a total of 1,872 political news abstracts published in July 1997.
Incremental Term Extraction (Cont.)
S(Y)    A       B       C     Precision (B/A)   Recall
>1.5    2,291   1,374   297   0.60              0.53
>2      1,455   1,135   258   0.78              0.44
>2.5    723     593     172   0.82              0.23
>3      214     184     66    0.86              0.07
A: total extracted terms; B: number of correct terms extracted; C: number of correct terms outside the dictionary
Table 3. The testing results for incremental term extraction using different threshold values in the significance analysis; the results were obtained from a total of 1,872 political news abstracts published in July 1997.
How to deal with low-frequency terms?
Characteristics of the Chinese Language
Semantics
Form
Phonetics
Word segmentation:
• 一台大電腦 (a large computer)
• 一台大電腦 (台大 can also be read as National Taiwan University)
• 參加一台大電腦會議 (attend a computer science meeting at National Taiwan University)
New word identification:
• 台大電腦公司 (Taida Computer Inc.)
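The segmentation ambiguity in 一台大電腦 can be reproduced with a small sketch that enumerates every way of splitting a string into lexicon words. The lexicon below is a hypothetical five-entry word list chosen for this example; a real segmenter would also need a way to rank the candidates and to handle words missing from the lexicon, such as 台大電腦公司.

```python
def segmentations(text, lexicon):
    """Enumerate all ways to split text into words found in the lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

# Hypothetical lexicon: "one", "one (machine)", "NTU", "large", "computer".
lexicon = {"一", "一台", "台大", "大", "電腦"}

for candidate in segmentations("一台大電腦", lexicon):
    print("/".join(candidate))
```

Both 一/台大/電腦 and 一台/大/電腦 are valid against this lexicon, so the lexicon alone cannot decide between the two readings.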
III. Web IR Techniques
Spectrum of Web Search
Types of content
• Text, e.g., Web text, documents, news
• Audio, e.g., music, speech, sounds, broadcast news
• Image, e.g., pictures, photos, graphics
• Video, e.g., films, clips
• Formatted data, e.g., products
Scopes of content
• General or specific
Languages
Scalability
• Personal, content site, intranet, Internet
• Thousands, millions, or billions of documents, users, and queries
Interface
• Web-based, WAP-based, voice-based
An Analytical Model
[Figure: analytical model with three spaces, User Space (X: X1, X2, ...), Index Space, and Document Space (Y: Y1, Y2, ...); users have an information need and seek, authors are tied to information use; labels: short query, subject terms, real names]
Different Web Search Models
Yahoo: manual recommendation in index space
Altavista, Inktomi: full-text pattern matching in document space
Google: citation information in document space
Realname: manual real-name retrieval in user space
DirectHit: collaborative analysis in user space
AskJeeves: Q&A (or FAQ search) in specific domains
Hypertext on the Web
[Figure: the Institute of Information Science page (http://www.iis.sinica.edu.tw) and the information sources around it: local content, hyperlink references, sibling information, and Web usage information (query and click stream); node labels include Internal Affairs, People, IIS, CS&IE NTU, Institute of Information Science, Academia Sinica, Research Institutions, and SE]
Basic Architecture of a Spider-based Web Search Engine
[Figure: Web, spider, index, search engines (SE), log, and browser; annotations: 3B pages, 50M queries/day, quality results, spam, freshness, scalable (e.g., 20K PCs in Google)]
Crawling
[Figure: indexed page, out links, out-link traversal, duplication, authority, authorized pages]
Indexed Features & Page Ranking
[Figure: an indexed page with the page title "Academia Sinica", an abstract, and the anchor texts "Highest Government Research Institution in Taiwan" and "Chien’s Lab"; ranking labels: popularity, authority]
Distributed Search
[Figure: query, query processor, multiple search engines (SE), document delivery]
Facts and Problems I
Query
• short-query problem
• 50% are personal and company names
• Boolean and natural-language queries are few
Browsing
• users seldom go on to the 2nd page
• precision is more important than recall
Robot
• low coverage, dead links, garbage sites and pages
Facts and Problems II - Relevancy
Who judges retrieval relevancy?
Users
• What do users want? What do they input?
• Short query or NLQ?
• HFQ, LFQ, or MFQ?
Search engines
• Technology limitations?
• How many indexed pages? Millions or billions of pages?
• Ranking algorithm?
Quality vs. Quantity
Source: The Direct Hit Technology: A White Paper (http://system.directhit.com/whitepaper.html)
Facts and Problems III - Speed
What determines retrieval speed?
Users
• Where are you, and how good is the bandwidth?
• Dial-up or T1? Cache or proxy?
• What is the query?
• HFQ, LFQ, or MFQ?
Search engines
• How far away and how good is the infrastructure?
• Document delivery speed is the key
Important Issues
Web user study
• User behavior analysis
• Query log mining
Content indexing
• Language identification
• Information conversion
• Unified indexing
• Term extraction
• Term clustering

Feature Extraction
• Use co-occurring seed terms extracted from the top retrieved pages
Term Vector
• Each query term is assigned a term vector
• The vector records the co-occurring feature terms and their frequency values in the retrieved documents
Term Similarity
• tf*idf-based cosine measure
Hierarchical Term Clustering
• Cluster popular query terms in the log into initial categories
• Query terms with similar features are grouped into clusters
Feature Extraction
Use co-occurring seed terms extracted from the top retrieved pages
Top pages retrieved for the query term "nude":
• Creative Nude Photography Network -- Fine Art Nude and ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...
• Nude Places ... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...
• A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...
Co-occurring feature terms for "nude":
term                 tf/df
photography          2/2
art                  3/2
...                  ...
naked                1/1
erotic photography   3/2
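The tf/df bookkeeping in the table can be sketched as follows: for one query term, count how often each feature term occurs in the retrieved snippets (tf) and in how many snippets it occurs (df). The snippets and feature terms here are made up, and the sketch handles single-word features only; the multi-word feature "erotic photography" on the slide would need phrase matching, which is left out.

```python
import re
from collections import Counter

def term_vector(snippets, feature_terms):
    """Return {feature: (tf, df)} over the snippets retrieved for one query term."""
    tf = Counter()
    df = Counter()
    features = set(feature_terms)
    for snippet in snippets:
        counts = Counter(w for w in re.findall(r"[a-z]+", snippet.lower()) if w in features)
        tf.update(counts)         # total occurrences
        df.update(counts.keys())  # one per snippet containing the feature
    return {f: (tf[f], df[f]) for f in feature_terms if tf[f]}

# Hypothetical snippets retrieved for the query term "jazz".
snippets = [
    "Jazz music festival: live jazz and blues concerts",
    "Download jazz music and concert videos",
    "History of jazz",
]
features = ["music", "concert", "blues", "opera"]
print(term_vector(snippets, features))
```

Because no stemming is applied, "concerts" in the first snippet does not count toward the feature "concert".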
Term Weighting
Extraction of Basic Feature Terms
Performance of different features: randomly selected, high-frequency, and seed terms
• Popular queries not affected by ephemeral trends, e.g., “movie”, “basketball”, “mutual fund”, etc.
• More expressive and distinguishable in describing a particular category
• Two logs were compared, and 9,709 overlapping top query terms were extracted as feature terms
Term Similarity
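A tf*idf-based cosine measure between query-term vectors might look like the sketch below. The weighting tf * log(N / df) is one common variant and is an assumption here, as is counting df over the query terms' vectors; the vectors themselves are made-up data.

```python
import math

def tfidf_vectors(raw_vectors):
    """Reweight raw feature counts by tf * log(N / df), with df counted over query terms."""
    n = len(raw_vectors)
    df = {}
    for features in raw_vectors.values():
        for f in features:
            df[f] = df.get(f, 0) + 1
    return {
        term: {f: tf * math.log(n / df[f]) for f, tf in features.items()}
        for term, features in raw_vectors.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical co-occurrence counts for three query terms.
raw = {
    "jazz": {"music": 3, "concert": 1},
    "opera": {"music": 2, "theater": 2},
    "nba": {"basketball": 4, "game": 2},
}
vectors = tfidf_vectors(raw)
print(cosine(vectors["jazz"], vectors["opera"]) > cosine(vectors["jazz"], vectors["nba"]))
```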
Hierarchical Term Clustering
Agglomerative hierarchical clustering (AHC)
1. Compute the similarity between all pairs of clusters
• Estimate the similarity between all pairs of terms that compose the two clusters
• Use the lowest term similarity value as the cluster similarity value
2. Merge the most similar (closest) two clusters
• Complete-linkage method
3. Update the cluster vector of the new cluster
4. Repeat steps 2 and 3 until only a single cluster remains
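The steps above can be sketched as a small complete-linkage loop. This version works directly from pairwise term similarities, so it recomputes cluster similarities on each pass instead of maintaining a cluster vector; the terms and similarity values are hypothetical.

```python
from itertools import combinations

def cluster_similarity(a, b, sim):
    """Complete linkage: the lowest term-to-term similarity between two clusters."""
    return min(sim(x, y) for x in a for y in b)

def agglomerate(terms, sim):
    """Merge the closest pair of clusters until one remains; return the merge history."""
    clusters = [(t,) for t in terms]
    history = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_similarity(pair[0], pair[1], sim))
        history.append((a, b, cluster_similarity(a, b, sim)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history

# Hypothetical pairwise term similarities, for illustration only.
PAIR_SIM = {
    frozenset(("jazz", "blues")): 0.5,
    frozenset(("jazz", "nba")): 0.3,
    frozenset(("blues", "nba")): 0.1,
}

def sim(x, y):
    return PAIR_SIM[frozenset((x, y))]

history = agglomerate(["jazz", "blues", "nba"], sim)
for left, right, score in history:
    print(left, right, score)
```

With complete linkage, the merged cluster (jazz, blues) is only as close to "nba" as its least similar member, which tends to keep clusters compact.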
Cross-Language Web Search: a Web search service that lets users query in one language and search documents that are written or indexed in another language.
Observation of Anchor Text
[Figure: anchor texts pointing to www.yahoo.com and www.yahoo.com.tw, with in-link counts of 187 and 21; anchor-text fragments include "Yahoo", "雅虎" (Yahoo), "- in USA", "Taiwan -", "台灣" (Taiwan), and "搜尋引擎" (search engine); labels: source query, translation candidates, anchor-text set, co-occurrence, page authority]
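Probabilistic Inference Model
Asymmetric model:
$$P(T_t \mid T_s) = \frac{P(T_s \cap T_t)}{P(T_s)}$$
Symmetric model with link information:
$$
\begin{aligned}
P(T_s \leftrightarrow T_t) &= \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)} \\
&= \frac{\sum_{i} P(T_s \cap T_t \mid U_i)\, P(U_i)}{\sum_{i} P(T_s \cup T_t \mid U_i)\, P(U_i)} \\
&\approx \frac{\sum_{i} P(T_s \mid U_i)\, P(T_t \mid U_i)\, P(U_i)}{\sum_{i} \left[ P(T_s \mid U_i) + P(T_t \mid U_i) - P(T_s \mid U_i)\, P(T_t \mid U_i) \right] P(U_i)}
\end{aligned}
$$
where
$$P(U_i) = \frac{\#\text{ of in-links of } U_i}{\text{total } \#\text{ of in-links}}$$
Components: co-occurrence ($P(T_s \mid U_i)$, $P(T_t \mid U_i)$) and page authority ($P(U_i)$; cf. PageRank)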
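A minimal sketch of the symmetric model with link information follows. It assumes each page U_i is described by its in-link count and the anchor texts pointing to it, and it estimates P(T | U_i) as the fraction of U_i's anchor texts that contain T; that estimator and the sample pages are assumptions made for this example, not details taken from the slides.

```python
def similarity(source, target, pages):
    """Approximate P(Ts <-> Tt) by summing over pages weighted by in-link share."""
    total_in_links = sum(page["in_links"] for page in pages)
    numerator = denominator = 0.0
    for page in pages:
        p_u = page["in_links"] / total_in_links  # page authority P(U_i)
        anchors = page["anchors"]
        p_s = sum(source in a for a in anchors) / len(anchors)  # P(Ts | U_i)
        p_t = sum(target in a for a in anchors) / len(anchors)  # P(Tt | U_i)
        numerator += p_s * p_t * p_u
        denominator += (p_s + p_t - p_s * p_t) * p_u
    return numerator / denominator if denominator else 0.0

# Hypothetical anchor-text sets for two pages.
pages = [
    {"in_links": 3, "anchors": ["Yahoo", "雅虎", "Yahoo 雅虎"]},
    {"in_links": 1, "anchors": ["Yahoo", "search engine"]},
]

for candidate in ["雅虎", "search engine"]:
    print(candidate, round(similarity("Yahoo", candidate, pages), 3))
```

The candidate that co-occurs with the source term on the more heavily linked page scores higher, which is the intended effect of weighting by P(U_i).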
Effects of Different Models
Models compared:
• MA: asymmetric model
• MAL: asymmetric model with link information
• MS: symmetric model
• MSL: symmetric model with link information
Type of Model   Top-1   Top-10
MA              41%     81%
MAL             44%     83%
MS              51%     84%
MSL*            53%     85%
* Training data: 109,416 anchor-text sets from 1,980,816 pages
* Test query set: 622 English terms from the 1,230 most popular English queries
Performance
Top-n inclusion rates obtained with three different approaches:
Approach                                             Top-1   Top-2   Top-3   Top-4   Top-5
Anchor Text Mining                                   57.0%   68.6%   74.3%   77.9%   80.1%
Dictionary Lookup                                    30.5%   30.5%   30.5%   30.5%   30.5%
Combined Anchor Text Mining and Dictionary Lookup    74.0%   82.9%   86.7%   88.9%   90.2%
References
Overview
Chien, L. F., Pu, H. T., “Important Issues on Chinese Information Retrieval”, Computational Linguistics and Chinese Language Processing, August 1996, pp. 205-221.
Lua, K. T., “Chinese Information Processing – Past, Present and Future”, IRAL’98.
Text Retrieval
Chien, L. F., “Fast and Quasi-Natural Language Search for Gigabytes of Chinese Texts”, ACM SIGIR’95.
Huang, X. and Robertson, “Okapi Chinese Text Retrieval Experiments at TREC-6”.
Nie, J., “On Chinese Text Retrieval”, SIGIR’96.
Kwok, K. L., “Comparing Representations in Chinese Information Retrieval”, SIGIR’97.
Chen, A., “Chinese Text Retrieval without Using a Dictionary”, SIGIR’97.
Text Classification
Lam, W., “Using a Generalized Instance Set for Automatic Text Categorization”, SIGIR’98, pp. 81-89.
Natural Language Processing
Chen, K. J. et al., “Word Identification for Mandarin Chinese Sentences”, COLING’92.
Sproat, R. and Shih, C., “A Statistical Method for Finding Word Boundaries in Chinese Text”, CPCOL, 1989, pp. 240-271.
Wu, Zimin and Tseng, G., “Chinese Text Segmentation for Text Retrieval: Achievements and Problems”, JASIS, 1994, pp. 532-542.
Term Extraction
Gao, J., et al., “Toward a Unified Approach to Statistical Language Modeling for Chinese”, ACM Trans. on Asian Language Information Processing, 2002, pp. 3-33.
References (Cont.)
Term Extraction
Chang, J. S. and Su, K. Y., “An Unsupervised Iterative Method for Chinese New Lexicon Extraction”, International Journal of Computational Linguistics & Chinese Language Processing, August 1997, pp. 97-147.
Chien, L. F., “PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval”, SIGIR’97, pp. 50-58.
Lai, Y. S. and Wu, C. H., “Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology”, ACM Trans. on Asian Language Information Processing, 2002, pp. 34-64.
Web IR
Pu, H. T., “Understanding Chinese Users' Information Behaviors through Analysis of Web Search Term Logs”, Journal of Computers (電腦學刊), December 2000, pp. 75-82.
Speech Retrieval
Wang, H. M., “Experiments in Syllable-based Retrieval of Broadcast News Speech in Mandarin Chinese”, Speech Communication, 2000, pp. 49-60.
Li, X. and Croft, B., “Evaluating Question-Answering Techniques in Chinese”, 2001.
Information Tracking and Detection
Chen, H. H. and Ku, L. W., “An NLP & IR Approach to Topic Detection”, Topic Detection and Tracking: Event-based Information Organization, Chapter 12, ed. by Allan, J., Kluwer Academic Publishers, 2002, pp. 243-264.
Cross-Language IR
Kwok, K. L., “Evaluation of an English-Chinese Cross-lingual Retrieval Experiment”, TREC’97.
References
Web Mining
Kosala, R., & Blockeel, H. (2000). Web Mining Research: A Survey. SIGKDD Explorations, 2(1), 1-15.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: discovery and application of usage patterns from web data. SIGKDD Explorations, 1, 12-23.
J. Srivastava & R. Cooley, Mining web data for e-commerce: concepts & applications, PKDD’01.
Conferences & Workshops: KDD 2001, PKDD 2001, WebKDD 1999, WebKDD 2000, WebKDD 2001
Web Content Mining
D. Mladenic et al., Text Mining: What if your data is made of words, PKDD’01.
M. Hearst, Untangling Text Data Mining, ACL’99.
(Chang et al., 2001), Chapter 6
Chakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. SIGKDD Explorations, 1(2), 1-11.
Web Structure Mining
(Chang et al., 2001), Chapter 7.3
(Chakrabarti, 2000), see above
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web.
References (Cont.)
Web Usage Mining
(Srivastava et al., 2000), see above
Spiliopoulou, M. (2000). Web usage mining for site evaluation: Making a site better fit its users. Special Section of the Communications of ACM on "Personalization Technologies with Data Mining", 43(8), 127-134.
Cooley, R. (2000). Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. University of Minnesota.
Borges, J. L. (2000). A Data Mining Model to Capture User Web Navigation Patterns. Department of Computer Science, University College London, London University.
For more references, see http://www.wiwi.hu-berlin.de/~berendt/lehre/2001w/wmi/literature.html
References (Cont.)
Text and Web page categorization
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. SIGMOD’98, pp. 307-318, 1998.
J. M. Pierre. Practical issues for automated categorization of Web sites. ECDL 2000 Workshop on the Semantic Web, 2000.
C. Y. Quek. Classification of World Wide Web Documents. Senior Honors Thesis, School of Computer Science, CMU, May 1997.
Y. Yang and X. Liu. A re-examination of text categorization methods. SIGIR’99, pp. 42-49, 1999.
Web page classification applications
C. Chekuri, M. H. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. WWW’97.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. AAAI’98, pp. 509-516, 1998.
M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. VLDB 2000, pp. 527-534, 2000.
Link and context analysis
G. Attardi, A. Gulli, and F. Sebastiani. Automatic web page categorization by link and context analysis. Proceedings of THAI’99, European Symposium on Telematics, Hypermedia and Artificial Intelligence, pp. 105-119, 1999.
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. WWW’98.
J. Dean and M. R. Henzinger. Finding related pages in the world wide web. WWW’99, pp. 389-401, 1999.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Proceedings of the 9th annual ACM SIAM Symposium on Discrete Algorithms, pp. 668-677, 1998.
References (Works in Academia Sinica)
1. S. L. Chuang, L. F. Chien, “Automatic Subject Categorization of Query Terms for Web Information Retrieval”, accepted by Decision Support System, 2002.
2. Lee-Feng Chien, et al., “Incremental Extraction of Domain-Specific Terms from Online Text Collections”, Recent Advances in Computational Terminology, ed. by D. Bourigault et al., 2001.
3. Lee-Feng Chien, “PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval”, special issue on “Information Retrieval with Asian Languages”, Information Processing and Management, Elsevier Press, 1999.
4. W. H. Lu, L. F. Chien, H. J. Lee, “Mining Anchor Texts for Translation of Web Queries”, accepted by ACM Trans. on Asian Language Information Processing, 2002.
5. W. H. Lu, L. F. Chien, S. J. Lee, “Web Anchor Text Mining for Translation of Web Queries”, IEEE Conference on Data Mining, Nov., San Jose, 2001.
6. C. K. Huang, L. F. Chien, Y. J. Oyang, “Interactive Web Multimedia Search Using Query-Session-Based Query Expansion”, The 2001 Pacific Conference on Multimedia (PCM2001), Oct., Beijing.
7. C. K. Huang, Y. J. Oyang, L. F. Chien, “A Contextual Term Suggestion Mechanism for Interactive Search”, The First Web Intelligence Conference (WI’2001), Japan.
8. Lee-Feng Chien, “PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval”, The 1997 ACM SIGIR Conference, Philadelphia, USA, pp. 50-58 (SIGIR’97).