1 Alexander Gelbukh Moscow, Russia
Mar 27, 2015
1
Alexander Gelbukh, Moscow, Russia
2
Mexico
3
Computing Research Center (CIC), Mexico
4
Chung-Ang University, Korea
Electronic Commerce and Internet Application Lab
5
Special Topics in Computer Science
The Art of Information Retrieval
Alexander Gelbukh
www.Gelbukh.com
6
Information Retrieval
In a huge amount of poorly structured information, find the information that you need when you don’t know exactly what you need or can’t explain it
The Web
User information need
Ranking
10
Importance
Knowledge: the main treasure of man
Web: Repository? Cemetery of information!
Natural language and multimedia information
o Poorly structured, badly written
Corporate and organizational document bases
o Senate speeches: Mexico
o Medical data collections
o Corporate memory. Microsoft knowledge base
Future: data explosion, increasing importance
11
Perspectives
Corporations: corporate databases
Organizations: document bases
Government
o European Union multilingual problem
o The same in Asia
Academia
o Lots of open research topics
o Web topics
o Computational Linguistics topics
o Intelligent technologies, AI
12
Textbook
http://sunsite.dcc.uchile.cl/irbook/
13
Contents
1. Introduction
2. Modeling
3. Retrieval Evaluation
4. Query Languages
5. Query Operations
6. Text and Multimedia Languages and Properties
7. Text Operations
8. Indexing and Searching
9. Parallel and Distributed IR
10. User Interfaces and Visualization
11. Multimedia IR: Models and Languages
12. Multimedia IR: Indexing and Searching
13. Searching the Web
14. Libraries and Bibliographical Systems
15. Digital Libraries
14
Calendar
1. September 18 Chapter 1 Introduction
2. September 25 Chapter 2 Modeling
3. October 2 Chapter 3 Retrieval Evaluation
4. October 9 Chapter 4 Query Languages
5. October 16 Chapter 5 Query Operations
October 23 – midterm exam
6. October 30 Chapter 6 Text and Multimedia Languages...
7. November 6 Chapter 7 Text Operations
8. November 13 Chapter 8 Indexing and Searching
9. November 20 Chapter 10 User Interfaces and Visualization
10. November 27 Chapter 13 Searching the Web
11. December 4 Chapter 14 Libraries and Bibliographical Systems
12. December 11 Chapter 15 Digital Libraries
December – final exam
15
Class structure
Main course: Information Retrieval
Discussion of previous chapter. Questions
I briefly present a new chapter
Research seminar: Natural Language Processing
Discussion of previous paper. Questions.
o Identification of possible research topics
Presentation of a new paper or current work
Discussion and questions
Goal: publications!
16
Natural Language Processing Research Seminar
17
What CL is about
Computers to process natural language text
“Understand”
Generate
Search
Organize
Translate
…
Useful in IR
18
Methods
No: text as a stream of letters
o Brute force statistics
o Simplified heuristics (ex.: Porter)
Yes: attention to language rules
o Linguistically motivated approaches
o Knowledge-based approaches
o Corpus-based approaches
19
What IR is about
Classical IR: find words? Concepts!
Question answering
Summarization
Clustering
…
Take language seriously
20
Text representations for IR
Represent the retrieval features
o Strings → stems (lexemes), synsets, phrases.
o Women → woman, lady, female
o Old men and women → old woman
Structured representation of text
o Network of related events and entities
o Enables logical inference
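The first group of bullets (strings → lexemes → synsets) can be sketched roughly as follows. This is only an illustration: the `LEMMAS` and `SYNSETS` tables are tiny hand-made assumptions standing in for a real morphological dictionary and a resource such as WordNet, and the phrase case ("old men and women → old woman") is not covered because it needs parsing.

```python
# Hypothetical toy resources; a real system would use a morphological
# dictionary and a thesaurus such as WordNet instead.
LEMMAS = {"women": "woman", "men": "man", "ladies": "lady"}
SYNSETS = [{"woman", "lady", "female"}]


def normalize(token):
    """Map a surface string to its lexeme, then to a synset representative."""
    lemma = LEMMAS.get(token, token)
    for synset in SYNSETS:
        if lemma in synset:
            # Any fixed member works as the synset's identifier;
            # here the alphabetically first one is used.
            return min(synset)
    return lemma


def features(text):
    """Retrieval features of a text: the set of normalized tokens."""
    return {normalize(token) for token in text.lower().split()}


def matches(query, document):
    """True if every query feature also occurs among the document features."""
    return features(query) <= features(document)


print(matches("female", "Old men and women"))  # plain string matching would miss this
```

With string features the query "female" and the text "Old men and women" share nothing; with synset features they match.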
21
CL tasks useful in IR
Morphology (stemming)
POS / Word sense disambiguation
Word relatedness
Anaphora resolution
Parsing and semantics (phrase search)
Synonymic rephrasing
Translation
etc…
Each one a whole science in itself
22
Morphology
Q: pig T: piggish
Simple: stemming
o piggish → pig-
Lexeme: set of word forms
o same stem can give different words
o pigment → not pig; piny → pine, not pin
Dictionary/corpus-based methods
o Learning; dictionary management
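A minimal sketch of why plain stemming is not enough, using the examples from this slide. The suffix list below is a made-up toy rule set (it is not the Porter algorithm), and the `LEXEMES` dictionary is a hypothetical hand-built one that groups forms the way the slide does.

```python
# Toy suffix stripper: an illustrative assumption, NOT the Porter stemmer.
SUFFIXES = ("gish", "ment", "y")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lexeme dictionary: each lexeme lists the forms assigned to it.
LEXEMES = {
    "pig": {"pig", "pigs", "piggish"},
    "pigment": {"pigment", "pigments"},
    "pine": {"pine", "pines", "piny"},
}
FORM_TO_LEXEME = {form: lexeme for lexeme, forms in LEXEMES.items() for form in forms}


def lexeme_of(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return FORM_TO_LEXEME.get(word, word)


for word in ("piggish", "pigment", "piny"):
    print(word, naive_stem(word), lexeme_of(word))
```

The stripper sends both "piggish" and "pigment" to "pig" and "piny" to "pin", while the dictionary lookup keeps "pigment" apart and maps "piny" to "pine". The price is building and maintaining the dictionary, which is the "dictionary management" point above.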
23
Part of Speech Disambiguation
Q: oil well T: He did very well
Q: what is an are? T: They are nice
Important for English, Chinese. Less important for other language types
Perhaps not so helpful directly, but is necessary for most other tasks
Usually statistical / heuristic methods
24
Word Sense Disambiguation
Q: bank account T: on the beautiful banks of Han river ...
bill: document, banknote, law, ax, peak, Gates...
Very frequent, almost any word in text
Statistical & dictionary methods
International competitions
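One well-known dictionary method is the simplified Lesk idea: pick the sense whose dictionary gloss shares the most words with the context. The sketch below assumes a two-sense toy dictionary for "bank" with glosses written for this example (they are not taken from any real dictionary), and a small ad hoc stopword list.

```python
# Hypothetical mini-dictionary: sense label -> gloss (written for this example).
SENSES = {
    "bank": {
        "financial institution": "institution that keeps money in an account for customers",
        "river side": "sloping land along the side of a river or lake",
    }
}
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "that", "or", "along"}


def content_words(text):
    """Lowercased whitespace tokens without stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense whose gloss overlaps most with the context words.

    On a tie, the sense listed first in the dictionary wins.
    """
    context_words = content_words(context)
    sense, _gloss = max(
        SENSES[word].items(),
        key=lambda item: len(content_words(item[1]) & context_words),
    )
    return sense


print(simplified_lesk("bank", "bank account"))
print(simplified_lesk("bank", "on the beautiful banks of Han river"))
```

The query context shares "account" with the first gloss and the text context shares "river" with the second, so the query and the text get different senses and should not match.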
25
Word relatedness
Q: female T: woman (women)
o Synonyms. Subtypes/super-types
o Dictionaries. WordNet. Similarity. Lesk.
Q: Korea T: Seoul
o Other linguistic relationships (e.g., part)
o Real-world relationships (facts)
Q: Clinton T: Lewinsky
o Statistical co-occurrence (MI)
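The last bullet, statistical co-occurrence measured by mutual information, can be illustrated with pointwise mutual information over document co-occurrence: PMI(a, b) = log2(P(a, b) / (P(a) P(b))). The four-document collection below is invented for the example, and counting co-occurrence per document (rather than per sentence or window) is an assumption.

```python
import math
from collections import Counter
from itertools import combinations


def make_pmi(documents):
    """Build a PMI function from document-level co-occurrence counts."""
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for document in documents:
        words = set(document.lower().split())
        word_df.update(words)
        pair_df.update(frozenset(pair) for pair in combinations(sorted(words), 2))

    def pmi(a, b):
        joint = pair_df[frozenset((a, b))]
        if joint == 0:
            return float("-inf")  # the words never co-occur
        # P(a,b) / (P(a) P(b)) = joint * N / (df(a) * df(b))
        return math.log2(joint * n_docs / (word_df[a] * word_df[b]))

    return pmi


# Invented toy collection, for illustration only.
docs = [
    "clinton lewinsky scandal",
    "clinton lewinsky hearing",
    "clinton budget",
    "korea budget",
]
pmi = make_pmi(docs)
print(pmi("clinton", "lewinsky"), pmi("clinton", "budget"))
```

In this toy collection "clinton" scores higher with "lewinsky" than with "budget", although no dictionary relates the two names.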
26
Anaphora resolution
Q: Awards of Prof. Han T: Prof. Han said... He did... IBM awarded him...
o Frequency
o Phrases, co-occurrence, summarization, inference, translation
Heuristic (Mitkov) and knowledge-based methods
Other types of co-reference
27
Parsing, semantics
Q: Awards of Prof. Han
T1: Prof. Han among many other prizes has several IBM awards
T2: Mr. Kang has an award Prof. Han does not know of
Understanding of text
o Rich structured representation
Better phrase search; question answering, summarization, ...
28
Synonymic rephrasing, reasoning
Q: experienced computer scientists
T: Prof. Han has been programming for many years and was awarded an IBM award
Requires good syntactic and semantic analysis
Knowledge-based methods
29
Multilingual access
Q: 요구르트 T: We sell excellent yoghurt. Продаем йогурт. Se vende rico yogur.
o Search multilingual collections
Europe: dozens of official languages of EU
o If you don’t know how to say it in English
Dictionaries, bilingual corpora, ...
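A rough sketch of the dictionary-based route, using the example from this slide: the query term is expanded with its translations and then matched against documents in any language. The one-entry `TRANSLATIONS` table is a hypothetical stand-in for a real multilingual dictionary, and matching is on exact word forms only (no morphology).

```python
import re

# Hypothetical one-entry multilingual dictionary built from the slide's example.
TRANSLATIONS = {
    "요구르트": {"en": "yoghurt", "ru": "йогурт", "es": "yogur"},
}


def expand_query(term):
    """The query term plus all of its known translations, lowercased."""
    variants = {term} | set(TRANSLATIONS.get(term, {}).values())
    return {variant.lower() for variant in variants}


def tokenize(text):
    """Lowercased word tokens; \\w matches Unicode letters for Python 3 str."""
    return set(re.findall(r"\w+", text.lower()))


def search(term, documents):
    """Documents containing the term or any of its translations."""
    variants = expand_query(term)
    return [document for document in documents if variants & tokenize(document)]


docs = [
    "We sell excellent yoghurt.",
    "Продаем йогурт.",
    "Se vende rico yogur.",
    "We sell bread.",
]
print(search("요구르트", docs))
```

The Korean query retrieves the English, Russian and Spanish texts, none of which contains the query string itself.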
30
Tasks are entangled
Many CL tasks require other tasks
o Morphology → syntax → semantics
Many CL tasks form circles
o parsing ← WSD ← parsing
o I see a wild cat with a telescope (tripod?)
Can be done quick-and-dirty (?)
o Fighting for last %s
o Zipf law: 20% of men drink 80% of beer
31
Tools and infrastructure
Analysis tools
o Tasks, methods
Dictionaries and grammars
o Types, structure
o Automatic acquisition
Corpora
o Corpora analysis tools and methods
32
Possible tasks
WSD to help IR
Clustering + summarization in IR results
Anaphora and coreference resolution to help IR
Multilingual IR
Applications to Korean
... a lot of others
33
Reading
Textbooks
o Manning & Schütze, Allen, Jurafsky, Hausser, ...
CICLing proceedings
Computational Linguistics
Google, ResearchIndex
34
Questions
Who expects to publish?
Who will make a presentation at the next seminar?
35
Thank you!
Till September 18