TDT4215 - Introduction
TDT4215 Web-intelligence
Main topics:
• Information Retrieval
• Large textual document collections
• Text mining
• NLP for document analysis
• Ontologies for document management
• Examples from Clinical Decision Support
How to extract knowledge from large document collections?
Lectures and Exercises
Lectures
• Øystein Nytrø
• Guests:
  – Laura Slaughter from Oslo University Hospital
  – A leading guru on clinical ontologies and decision support (TBA)
• Mondays 10.15-13.00 in F3 (that’s right, three hours!)
Exercises
• PhD student Nafiseh Shabib
• Tuesdays 16.15-18.00 in F4
All relevant information will be published at http://www.idi.ntnu.no/emner/tdt4215/
Curriculum
• Baeza-Yates & Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 2011. (selected chapters)
• Manning, Raghavan and Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. (selected chapters, available for download)
• Compendium from IDI (selected book chapters and papers)
• Details are published at the homepage of the course
Assessment
• Group project: 25% of grade
  – Groups of 3-5 people
  – Discuss a particular theoretical topic
  – Develop an information retrieval / text mining application
  – Evaluate the application
  – To be carried out in the first half of the term (25th Feb – 7th Apr)
  – Nafiseh Shabib is responsible for the group project
• Individual written examination: 75% of grade
  – 4th of June
  – 4-hour written examination (discussions, calculations, no programming)
  – Based on everything we will learn in the course
Course Characteristics
• Experimental science:
  – No clear answers or theories
  – Lots of formulas (that are hard to justify)
  – Reappearance of logics & reasoning in web context
• Relevance:
  – Concerns real-world problems
  – A basis for knowledge management applications: search engines, document management systems, publication systems, digital libraries, enterprise business applications, business/web intelligence systems, semantic interoperation/integration software, etc.
• Multi-disciplinary:
  – Combines techniques from several other sciences: statistics, linguistics, conceptual modeling, artificial intelligence, knowledge representation, query processing and databases, etc.
Projects and Exercises
• One mandatory project:
  – Practice in setting up an application
  – How to evaluate the quality of IR/TM applications?
  – How to extract knowledge from specific types of text? Which techniques for which types of text?
• Exercises:
  – Examples from lectures
  – Understand how formulas are used in practice
  – Be comfortable with “unproven theories”
  – Representative of examination questions
• Exercises are important!
Lecture Plan (1)
Lecture Plan (2)
Lecture Plan (3)
From Documents to Knowledge
• Document collections
• Knowledge and documents
• Document retrieval
• Text Mining
• Ontologies
80% of organizational data is textual with no proper structure!
Overall approach
[Figure labels: Information Retrieval (retrieve document); Text Mining (discover knowledge); Ontology; Text; Knowledge elicitation; Knowledge representation; Morpho-syntax; Semantics; Existing; New]
Document Collections
• Domain-dependent or domain-independent
• Structured or non-structured text
• Formatted or non-formatted documents
• Textual or multimedia documents
• Monolingual and multilingual document collections
• Centralized or non-centralized document management
• Confidential or non-confidential
• Controlled or free addition of documents
• Stable or non-stable collections
[Figure labels: User; Information system; Document collection]
Case 1: SAP at STATOIL
• SAP used for major internal business processes
• Named user accounts: 29,000
Text Mining Part I
• Text mining = Linguistic analysis?
• Task: Analyze linguistic or statistical content of single documents
  – Transform document or add information to document
  – Tagging, lemmatization, NP recognition, etc.
• Example: Lemmatization for document retrieval
<html>
<body>
The professor’s assistant reads two papers...
</body>
</html>

<html>
<body>
The professor’s <lem>professor</lem> assistant reads <lem>read</lem> two papers <lem>paper</lem>...
</body>
</html>
Index document
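The lemmatization step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LEMMAS` lexicon, the `annotate` and `index_terms` functions and the word pattern are hypothetical names made up for this sketch, and a real system would use a proper morphological analyser instead of a hand-written lookup table.

```python
import re

# Hypothetical toy lexicon mapping inflected forms to lemmas (assumption:
# a real lemmatizer would derive these from a morphological analyser).
LEMMAS = {"professor's": "professor", "reads": "read", "papers": "paper"}

# A word is a run of letters, optionally followed by an apostrophe
# (straight or typographic) and more letters, e.g. "professor's".
WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)?")


def _lookup(word):
    """Return the lemma for a word, or None if the lexicon has no entry."""
    key = word.lower().replace("\u2019", "'")
    return LEMMAS.get(key)


def annotate(text):
    """Add a <lem> tag after every word whose lemma differs from the word."""
    def repl(match):
        word = match.group(0)
        lemma = _lookup(word)
        return f"{word} <lem>{lemma}</lem>" if lemma else word
    return WORD.sub(repl, text)


def index_terms(text):
    """Return the lemmatized, lower-cased terms that would go into the index."""
    terms = []
    for word in WORD.findall(text):
        terms.append(_lookup(word) or word.lower())
    return terms


if __name__ == "__main__":
    sentence = "The professor\u2019s assistant reads two papers..."
    print(annotate(sentence))
    print(index_terms(sentence))
```

Indexing the lemmas rather than the surface forms means that a query for "paper" also retrieves documents that only contain "papers".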
Text Mining Example 1
• Marmot (from UMass)
  – Sentences are separated and segmented into noun phrases, verb phrases, and prepositional phrases
  – Recognizes dates and duration phrases
  – Scopes conjunctions and disjunctions

David Brown, University for Industry visits the OU
John Dominque
Wed, 15 Oct 1997
David Brown, the Chairman of the University for Industry Design and Implementation Advisory Group and Chairman of Motorola, visited the OU as part of a fact finding exercise, prior to drafting his initial 100 Days Report to HM Government. David was accompanied by Jeanette Pugh, Josh Hillman and Nick Pearce.
Vargas-Vera et al.: Knowledge Extraction by using an Ontology-based Annotation tool
SUBJ (1) : DAVID BROWN %COMMA% UNIVERSITY
PP (2) : FOR INDUSTRY
VB (3) : VISITS
OBJ1 (4) : THE OU
PUNC (5) : %PERIOD%
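One of the listed capabilities, recognizing dates, can be illustrated with a small pattern matcher. This is a sketch under assumptions and not how Marmot works internally: the `find_dates` function, the `MONTHS` table and the single supported format (the "Wed, 15 Oct 1997" style from the sample text) are choices made for this example.

```python
import re
from datetime import date

# Abbreviated English month names mapped to month numbers.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Optional weekday, then day, abbreviated month and four-digit year.
DATE_RE = re.compile(
    r"\b(?:(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s+)?"
    r"(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b"
)


def find_dates(text):
    """Return every date of the form '[Wed, ]15 Oct 1997' found in text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS[month], int(day)))
        except ValueError:
            # Matches the pattern but is not a real calendar date; skip it.
            continue
    return found


if __name__ == "__main__":
    sample = "John Dominque Wed, 15 Oct 1997 David Brown, the Chairman"
    print(find_dates(sample))
```

A full recognizer would also need numeric formats, duration phrases and relative expressions such as "next week", which a single regular expression does not cover.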
Text Mining Part II
• Text mining = knowledge discovery (in text)?
• Task: Discover or derive new information from large document collections
  – find patterns across datasets/documents
  – separate signal from noise
  – statistical (and linguistic) approach
• Techniques:
  – Concept extraction
  – Ontology construction
  – TOC construction
  – Clustering
  – Text categorization
  – Subtechniques: information extraction, text analysis
[Figure: several copies of the sample news item ("David Brown, University for Industry visits the OU") feeding into "Knowledge"]
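The statistical approach to separating signal from noise across a collection can be illustrated with TF-IDF weighting, one common way to pick out terms that are prominent in a single document relative to the rest of the collection. The sketch below is an assumption-laden illustration, not the method used in the course: the `prominent_terms` function, the whitespace tokenization and the three toy documents are all invented for this example.

```python
import math
from collections import Counter


def prominent_terms(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document.

    TF is the relative frequency of a term in the document; IDF is
    log(N / df), where df is the number of documents containing the term.
    Ties are broken alphabetically so the output is deterministic.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result


if __name__ == "__main__":
    collection = [
        "the patient record",
        "the patient journal",
        "the search engine",
    ]
    for doc, terms in zip(collection, prominent_terms(collection, k=2)):
        print(doc, "->", terms)
```

A term such as "the" occurs in every document, so its IDF is zero and it never ranks as prominent; that is the "noise" being filtered out.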
Text Mining Example 2
• Document collection from X
• What is the content?
• Prominent terms:
• Terms used together in text
  – Journalforskriften: