Language Technology I © 2004 Hans Uszkoreit. Language Technology 2005/06. Hans Uszkoreit, Universität des Saarlandes and German Research Center for Artificial Intelligence (DFKI)

Jan 11, 2016

Malcolm Hensley
Transcript
Page 1:

Language Technology I © 2004 Hans Uszkoreit

Language Technology

2005/06

Hans Uszkoreit

Universität des Saarlandes

and

German Research Center for Artificial Intelligence (DFKI)

Page 2:

Overview

What is Language Technology

Some Selected Technologies

Some Selected Applications

Information Extraction

Cross-Linguistic Information Retrieval

Email Management

Language Checking

Page 3:

© 2000 Hans Uszkoreit

Motivations

[Diagram: CL, with the three motivations engineering, cognition, linguistics]

Page 4:

Motivations

models of grammar

language technology applications

models of human language processing

engineering, cognition, linguistics

Page 5:

What is a Technology

Technology: methods and techniques that together enable some application.

In real-life usage of the word there is a continuum between methods and applications.

method/technique: finite state transduction

component technology: tokenizer

technology: named entity recognition

high precision text indexing

application: concept based search engine

Page 6:

Types of Technologies

Communication partners: humans and machines (technology), humans and humans, humans and infostructure

Modes and media for input and output: text, speech, pictures, gestures

Synchronicity: synchronous vs. asynchronous

Situatedness: sensitivity to context, location, time, plans

Type of linguality: monolingual, multilingual, translingual

Type of processing: categorization, summarization, extraction, understanding, translating, responding

Level of linguistic description: phonology, morphology, syntax, semantics, pragmatics

Page 7:

Language Technologies

speech technologies

text technologies

knowledge technologies

multimedia & multimodality technologies

language technologies

Page 8:

Language Technologies

LANGUAGE TECHNOLOGIES

Page 9:

Language Technologies

Text Technologies

LANGUAGE TECHNOLOGIES

Page 10:

LANGUAGE TECHNOLOGIES

Language Technologies

Speech Technologies

Text Technologies

Page 11:

Language Technologies

Speech Technologies

Text Technologies

gathering

indexing

categorization

clustering

summarization

LANGUAGE TECHNOLOGIES

Page 12:

Language Technologies

Speech Technologies

Text Technologies

text understanding

text translation

information extraction

report generation

LANGUAGE TECHNOLOGIES

Page 13:

Language Technologies

Speech Technologies

Text Technologies

Voice Recognition

Speech Verification

Speech Recognition

Voice Modelling

Speech Synthesis

Speaker Identification

Language Identification

LANGUAGE TECHNOLOGIES

Page 14:

Language Technologies

Speech Technologies

Text Technologies

Speech Generation

Speech Understanding

Spoken Dialogue Systems

Speech Translation Systems

LANGUAGE TECHNOLOGIES

Page 15:

Language Technologies

Speech Technologies

Text Technologies

language understanding

language generation

dialogue modelling

machine translation

LANGUAGE TECHNOLOGIES

Page 16:

Speech recognition

Spoken language is recognized and transformed into text as in dictation systems, into commands as in robot control systems, or into some other internal representation.

Page 17:

Speech Synthesis

(also Speech Generation)

Utterances in spoken language are produced from text (text-to-speech systems) or from internal representations of words or sentences (concept-to-speech systems).

Page 18:

Text Categorization

This technology assigns texts to categories. Texts may belong to more than one category, and categories may contain other categories. Filtering is a special case of categorization with just two categories.
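
The three properties named on this slide (several categories per text, nested categories, filtering as the two-category case) can be illustrated with a minimal keyword-based sketch. The keyword sets, the category hierarchy and the function names are all assumptions made for this illustration; a deployed categorizer would normally be trained on data.

```python
from typing import Dict, Set

# Hypothetical keyword sets per category (illustration only).
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "sports": {"match", "goal", "team"},
    "business": {"merger", "shares", "company"},
}

# Hypothetical hierarchy: each category maps to the category that contains it.
PARENT: Dict[str, str] = {"sports": "news", "business": "news"}


def categorize(text: str) -> Set[str]:
    """Return every category whose keywords occur in the text, plus its ancestors."""
    words = set(text.lower().split())
    found = {cat for cat, keys in CATEGORY_KEYWORDS.items() if words & keys}
    # Walk up the hierarchy so that containing categories are included too.
    for cat in list(found):
        while cat in PARENT:
            cat = PARENT[cat]
            found.add(cat)
    return found


def passes_filter(text: str, wanted: str) -> bool:
    """Filtering: a two-way decision between 'wanted' and everything else."""
    return wanted in categorize(text)
```

A text mentioning both a company and a team ends up in both leaf categories and in the containing category, while `passes_filter` reduces the same decision to accept or reject.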

Page 19:

Text Summarization

The most relevant portions of a text are extracted as a summary. The task depends on the required length of the summaries. Summarization is harder if the summary has to be specific to a certain query.
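
Extraction of the most relevant portions can be sketched as sentence scoring. The version below is a rough illustration, not the method used in any particular system: it assumes average word frequency as the relevance measure, a naive sentence splitter, and an arbitrary bonus weight of 10 for query words. The function name and parameters are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional


def summarize(text: str, max_sentences: int, query: Optional[str] = None) -> List[str]:
    """Pick the highest-scoring sentences, returned in their original order."""
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    query_words = set(re.findall(r"\w+", query.lower())) if query else set()

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            return 0.0
        base = sum(freq[w] for w in words) / len(words)     # average word frequency
        bonus = sum(1 for w in words if w in query_words)   # query-specific boost
        return base + 10.0 * bonus

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])                   # restore document order
    return [sentences[i] for i in keep]
```

The `max_sentences` argument stands for the required summary length, and passing a `query` shifts the selection towards sentences that mention the query terms.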

Page 20:

Text Indexing

As a precondition for document retrieval, texts are stored in an indexed database. Usually a text is indexed for all word forms or, after lemmatization, for all lemmas. Sometimes indexing is combined with categorization and summarization.
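
The difference between indexing word forms and indexing lemmas can be shown with a small inverted index. This is a sketch under stated assumptions: the lemma table is a toy stand-in for a real lemmatizer, and `build_index` and `lemmatize` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Toy lemma table (an assumption; a real system would use a morphological analyser).
LEMMAS = {"texts": "text", "indexed": "index", "are": "be", "is": "be"}


def lemmatize(word: str) -> str:
    """Map a word form to its lemma, falling back to the form itself."""
    return LEMMAS.get(word, word)


def build_index(
    docs: List[str], normalize: Callable[[str], str] = lambda w: w
) -> Dict[str, Set[int]]:
    """Map each (normalized) word to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for word in doc.lower().split():
            index[normalize(word)].add(doc_id)
    return index
```

With the default normalizer the index distinguishes "text" from "texts"; passing `lemmatize` merges them under one entry.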

Page 21:

Text Retrieval

Texts that best match a given query or document are retrieved from a database. The candidate documents are ordered with respect to their expected relevance.

Indexing, categorization, summarization and retrieval are often subsumed under the term information retrieval.

Page 22:

Information Extraction

Relevant pieces of information are discovered and marked for extraction. The extracted pieces can be: the topic, named entities such as company, place or person names, simple relations such as prices, destinations, functions etc. or complex relations describing accidents, company mergers or football matches.
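
As a rough illustration of a simple relation such as a price, the sketch below pulls (item, amount, currency) triples out of running text with a single regular expression. The pattern, the "X costs N euros" phrasing it expects and the name `extract_prices` are assumptions chosen for this example; practical extraction systems use far richer patterns or trained models.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a capitalized name, the verb "costs", a number and a currency.
PRICE = re.compile(
    r"(?P<item>[A-Z][\w-]*(?: [A-Z][\w-]*)*) costs "
    r"(?P<amount>\d+(?:\.\d+)?) (?P<currency>euros|dollars)"
)


def extract_prices(text: str) -> List[Tuple[str, float, str]]:
    """Return every (item, amount, currency) relation found in the text."""
    return [
        (m.group("item"), float(m.group("amount")), m.group("currency"))
        for m in PRICE.finditer(text)
    ]
```

The capitalized sequence doubles as a crude named-entity guess, so the sketch touches both the entity and the relation level mentioned on the slide.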

Page 23:

Data Fusion and Text Data Mining

Extracted pieces of information from several sources are combined in one database. Previously undetected relationships may be discovered.

Page 24:

Question Answering

Natural language queries are used to access information in a database. The database may be a base of structured data or a repository of digital texts in which certain parts have been marked as potential answers.

Page 25:

Report Generation

A report in natural language is produced that describes the essential contents or changes of a database. The report can contain accumulated numbers, maxima, minima and the most drastic changes.
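
The report contents listed on this slide (accumulated numbers, maxima, minima, most drastic changes) can be sketched with a template-based generator over two snapshots of a table. The snapshot format, the sentence templates and the name `generate_report` are assumptions for illustration only.

```python
from typing import Dict


def generate_report(before: Dict[str, float], after: Dict[str, float]) -> str:
    """Describe totals, extremes and the largest change between two snapshots."""
    total = sum(after.values())                       # accumulated number
    top = max(after, key=after.get)                   # maximum
    bottom = min(after, key=after.get)                # minimum
    changes = {k: after[k] - before.get(k, 0.0) for k in after}
    biggest = max(changes, key=lambda k: abs(changes[k]))  # most drastic change
    direction = "rose" if changes[biggest] >= 0 else "fell"
    return (
        f"Total: {total:g}. "
        f"Highest: {top} ({after[top]:g}). "
        f"Lowest: {bottom} ({after[bottom]:g}). "
        f"Largest change: {biggest} {direction} by {abs(changes[biggest]):g}."
    )
```

Fixed templates like these are the simplest form of generation; the keys of the two dictionaries stand in for database rows.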

Page 26:

Spoken Dialogue Systems

The system can carry out a dialogue with a human user in which the user can solicit information or conduct purchases, reservations or other transactions.

Page 27:

Translation Technologies

Technologies that translate texts or assist human translators. Automatic translation is called machine translation. Translation memories use large amounts of texts together with existing translations for efficient look-up of possible translations for words, phrases and sentences.
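
The look-up step of a translation memory can be sketched with fuzzy string matching from Python's standard `difflib` module. The three stored segment pairs, the 0.7 similarity cutoff and the name `lookup` are assumptions for this example; production tools index far larger memories and use their own match scores.

```python
import difflib
from typing import Dict, Optional, Tuple

# Hypothetical memory of previously translated segments (English -> German).
MEMORY: Dict[str, str] = {
    "Good morning": "Guten Morgen",
    "Thank you very much": "Vielen Dank",
    "Where is the station?": "Wo ist der Bahnhof?",
}


def lookup(segment: str, cutoff: float = 0.7) -> Optional[Tuple[str, str]]:
    """Return the closest stored source segment and its translation, if any."""
    matches = difflib.get_close_matches(segment, list(MEMORY), n=1, cutoff=cutoff)
    if not matches:
        return None
    return matches[0], MEMORY[matches[0]]
```

Returning the matched source segment alongside the translation lets a human translator judge how far the stored pair deviates from the new sentence.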

Page 28:

Formal and Computational Methods

Generic CS Methods

Programming languages, algorithms for generic data types, and software engineering methods for structuring and organizing software development and quality assurance.

Specialized Algorithms

Dedicated algorithms have been designed for parsing, generation and translation, for morphological and syntactic processing with finite state automata/transducers and many other tasks.

Nondiscrete Mathematical Methods

Statistical techniques have become especially successful in speech processing, information retrieval, and the automatic acquisition of language models. Other methods in this class are neural networks and powerful techniques for optimization and search.

Page 29:

Linguistic Methods and Resources

Logical and Linguistic Formalisms

For deep linguistic processing, constraint based grammar formalisms are employed. Complex formalisms have been developed for the representation of semantic content and knowledge.

Linguistic Knowledge

Linguistic knowledge resources for many languages are utilized: dictionaries, morphological and syntactic grammars, rules for semantic interpretation, pronunciation and intonation.

Corpora and Corpus Tools

Large application-specific or generic collections of spoken and written language are exploited for the acquisition and testing of statistical or rule-based language models.

Page 30:

Methods from Cognitive Science (Psychology)

Models of Cognitive Systems and their Components

The interaction of perception, knowledge, reasoning and action including communication is modelled in cognitive psychology. Such models can be consulted or employed for the design of language processing systems. Formalized models of components such as memory, reasoning and auditory perception are also often utilized for models of language processing.

Empirical Methods from Experimental Psychology

Since cognitive psychology investigates the intelligent behavior of human organisms, many methods have been developed for the observation and empirical analysis of language production and comprehension. Such methods can be extremely useful for building computer models of human language processing (examples: "Wizard of Oz Experiments" and measurements of syntactic and semantic processing complexity).

Page 31:

State of the Art

Correct recognition of word categories (part-of-speech tagging): 95%-98%

recognition of names of people, companies, places, products (named-entity recognition): 85%-98%

statistical recognition of major phrases (HMM chunk parsing): 95%

parsing of newspaper texts by statistically trained parsers (probabilistic context free parsing): 91%

deep parsing of newspaper texts (HPSG or LFG parsing with large lexicon): 40%-60%

Page 32:

Maturity of Speech Technologies

Voice Control Systems

Dictation Systems

Text-to-Speech Systems

Machine Initiative Spoken Dialogue Systems

Identification and Verification Systems

Spoken Information Access

Mixed Initiative Spoken Dialogue Systems

Speech Translation Systems

Legend: deployed, on the market / mature or close to maturity / research prototypes in R&D

Page 33:

Maturity of Text Technologies

Spell Checkers

Machine-Assisted Human Translation

Translation Memories

Indicative Machine Translation

Grammar Checkers

Information Extraction

Human Assisted Machine Translation

Report Generation

High Quality Text Translation

Text Generation Systems

Legend: deployed, on the market / mature or close to maturity / research prototypes in R&D

Page 34:

Maturity of IM Technologies

Word-Based Information Retrieval

Summarization by Simple Condensation

Simple Statistical Categorization

Simple Automatic Hyperlinking

Cross-Lingual Information Retrieval

Automatic Hyperlinking With Disambiguation

Simple Information Extraction (Unary, Binary Relations)

Complex Information Extraction (Ternary+ Relations)

Dense Associative Hyperlinking

Concept-Based Information Retrieval

Text Understanding

Legend: deployed, on the market / mature or close to maturity / research prototypes in R&D

Page 35:

MEGATRENDS

ambient computing, ubiquitous computing, situated computing

pervasive computing, disappearing computers

global infostructure, collective memory

collective knowledge, learning organizations

meta-knowledge repositories

personalization, adaptation

learning

ubiquitous access

Page 36:

Vector Space Model

Imagine a vector whose length is equal to the number of content words of the language: $v = (w_1, w_2, \ldots, w_n)$

A document is represented as a vector

$d = (t_1, t_2, \ldots, t_n)$

where $t_i$ represents the number of occurrences of word $w_i$ in the document.

A query is represented as a vector as well:

$q = (t_1, t_2, \ldots, t_n)$

The similarity between two vectors is expressed by the cosine of the angle between them.
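
The representation on this slide can be sketched in a few lines: count how often each vocabulary word occurs in a text, then compare two count vectors by the cosine of their angle, i.e. the dot product divided by the product of the vector lengths. The four-word vocabulary in the usage note and the function names `to_vector` and `cosine` are assumptions for illustration.

```python
import math
from typing import List, Sequence


def to_vector(text: str, vocabulary: Sequence[str]) -> List[int]:
    """Component i is the number of occurrences of vocabulary[i] in the text."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, with the vocabulary `["language", "technology", "speech", "text"]`, the document "language technology language" becomes `[2, 1, 0, 0]` and the query "language technology" becomes `[1, 1, 0, 0]`. A document that shares no vocabulary word with the query scores 0.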

Page 37:

Classification Methods

• kNN (k nearest neighbours)
• simple neural networks
• hierarchically organized neural network built up from a number of independent self-organizing maps
• Kohonen type self-organizing maps
• support vector machines
• genetic programming
• naive Bayes classifier
• hierarchical Bayesian clustering
• Bayesian network classifier
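
Of the methods listed, k nearest neighbours is the easiest to sketch on top of term-count vectors: rank the labelled training vectors by cosine similarity to the new vector and take a majority vote among the top k. The training data format and the names `cosine` and `knn_classify` are assumptions for illustration, and the choice of cosine as the similarity measure is one option among several.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(
    query: Sequence[float],
    training: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> str:
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The method needs no training phase beyond storing the labelled vectors, which is why it is often used as a baseline for text categorization.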