Text Technologies - University of Edinburgh
Text Technologies Victor Lavrenko vlavrenk@inf
Monday 16th September 2013
What you will learn
• How to build a search engine
  – which search results to rank at the top
  – how to do it fast and on a massive scale
• How to evaluate a search algorithm
  – is system A really better than system B?
• How to work with text
  – do two web pages discuss the same topic?
  – handle misspellings, morphology, synonyms
  – build algorithms for languages you don’t know
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
• IR – core technology for text processing
• widely used in NLP/DBMS applications
• driving force behind web technologies
• Effectiveness
  – need to find relevant documents
  – needle in a haystack
  – very different from relational DBs (SQL)
• Efficiency
  – need to find them quickly
  – vast quantities of data (100 billion pages)
  – thousands of queries per second
  – data constantly changes, need to keep up
  – compared with other NLP areas, IR is very fast
• “documents” has a very wide meaning:
  – web-pages, emails, word/pdf/excel, news
  – photos, videos, musical pieces, code
  – answers to questions
  – product descriptions, advertisements
  – may be in a different language
  – may not have words at all (e.g. DNA)
• IR: match A against a large set of Bs
  – problem arises in many different domains
• get the data into the system
  – acquire the data from crawling, feeds, etc.
  – store the originals (if needed)
  – transform to a bag of words (BOW) and “index”
• satisfy users’ requests
  – assist user in formulating query
  – retrieve a set of results
  – help user browse / re-formulate
  – log user’s actions, adjust retrieval model
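The two halves above (get data in, then answer queries) can be sketched in a few lines of Python. This is only a toy illustration, not the method taught in the course: the tokenizer, the three sample documents, and the scoring rule (summed term frequency) are all assumptions made for the example, and the names `tokenize`, `build_index`, and `search` are hypothetical.

```python
from collections import Counter, defaultdict

PUNCT = ".,!?\"'()"

def tokenize(text):
    # Crude tokenizer (an assumption): split on whitespace,
    # strip surrounding punctuation, lowercase.
    tokens = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]

def build_index(docs):
    """Turn each document into a bag of words and build an
    inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, query):
    """Rank documents by the summed frequency of the query terms
    (a deliberately simple stand-in for a real retrieval model)."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return [doc_id for doc_id, _ in scores.most_common()]

# Made-up sample collection, for illustration only.
docs = {
    "d1": "Microsoft shares dip after antitrust ruling",
    "d2": "Judge issues findings in antitrust case",
    "d3": "How to build a search engine",
}
index = build_index(docs)
results = search(index, "antitrust shares")
print(results)  # d1 matches both terms, d2 only one
```

The index is built once and only the postings for the query terms are touched at query time, which is roughly why lookup stays fast as the collection grows.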
Market Event Detection
News:
Software giant Microsoft saw its shares dip a few percentage points this morning after U.S. District Judge Thomas Penfield Jackson issued his "findings of fact" in the government's ongoing antitrust case against the Seattle wealth-creation machine...
word        P ( word | MSFT↓ )   P ( word )
shares      0.071                0.074
antitrust   0.044                0.009
judge       0.039                0.006
trading     0.029                0.032
against     0.027                0.025
Jackson     0.025                0.001
P ( MSFT↓ | Jackson ) = P ( Jackson | MSFT↓ ) P ( MSFT↓ ) / P ( Jackson )
Words like Jackson and antitrust are more likely in the stories preceding the plunge.
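A quick way to see this from the numbers on the slide is to compare each word's probability in pre-drop stories with its overall probability. By Bayes' rule, the ratio P(w | MSFT↓) / P(w) is exactly the factor by which seeing the word scales the prior P(MSFT↓). The sketch below uses only the probabilities quoted above; the helper names `lift` and `posterior` are hypothetical, and the prior is left as a parameter because the slide does not give a value for P(MSFT↓).

```python
# Probabilities copied from the slide.
p_word_given_drop = {
    "shares": 0.071, "antitrust": 0.044, "judge": 0.039,
    "trading": 0.029, "against": 0.027, "Jackson": 0.025,
}
p_word = {
    "shares": 0.074, "antitrust": 0.009, "judge": 0.006,
    "trading": 0.032, "against": 0.025, "Jackson": 0.001,
}

def lift(word):
    # P(w | MSFT↓) / P(w): values above 1 mean the word is more
    # likely than usual in stories preceding a drop.
    return p_word_given_drop[word] / p_word[word]

def posterior(word, prior):
    # Bayes' rule: P(MSFT↓ | w) = P(w | MSFT↓) P(MSFT↓) / P(w).
    # `prior` is P(MSFT↓); the slide gives no value, so the
    # caller must supply one.
    return lift(word) * prior

ranked = sorted(p_word, key=lift, reverse=True)
for word in ranked:
    print(f"{word:10s} {lift(word):6.2f}")
```

"Jackson", "judge" and "antitrust" come out well above 1, while "shares" and "trading" sit slightly below 1: they are frequent in financial news regardless of what the stock does next.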
Course Information
• Lectures: Mon/Thu 12-1pm in 7 George Sq, F.21
• Coursework due 4pm Mondays in weeks 4,6,8,10
• Practical labs: TBD in weeks 2,4,6,8
  – sign up: TBD
  – TA: Dominik Wurzer
• Textbook: “Search Engines: Information Retrieval in Practice”
  – very useful but not required (attend all lectures)
• Submitted work must be your own
  – ok to discuss assignments with each other
  – not ok to share code / data / figures
  – suggestion: talk, don’t write anything down
• Always cite your sources
  – including web, fellow students, other courses
  – exception: lectures from this course
  – when in doubt: cite