Statistische Methoden in der Computerlinguistik Statistical Methods in Computational Linguistics 1. Course Overview Jonas Kuhn Universität Potsdam, 2007
Page 1: ppt

Statistische Methoden in der Computerlinguistik
Statistical Methods in Computational Linguistics

1. Course Overview

Jonas Kuhn

Universität Potsdam, 2007

Page 2: ppt

Outline

Course Overview & Introduction
Some Python Programming

Page 3: ppt

Course Overview

Simple Python Programming
Basic Probability Theory
N-Gram Language Modeling
Basic Information Theory: Entropy
Data Sparseness & Smoothing Techniques
Machine Learning Paradigms
Part-of-Speech Tagging with Statistical and ML Techniques
Probabilistic Grammars & Parsing
Statistical Machine Translation

Page 4: ppt

The Status of Statistical Methods

Eric Brill and Raymond J. Mooney (1997): An Overview of Empirical Natural Language Processing. In: AI Magazine, 18(4): Winter 1997, 13-24.

The linguistic knowledge-acquisition problem
  Rationalist methods
  Empirical or corpus-based methods

Page 5: ppt

Rationalist methods

Page 6: ppt

Empirical or corpus-based methods

Page 7: ppt

History of NLP

1950s: empirical and statistical analyses of natural language (compare: behaviorism in psychology; Skinner)
Mid-1950s: Chomsky’s program
  Observational and explanatory adequacy
  Arguments against learnability of language from data; innateness hypothesis
Rationalist methods in AI research in NLP
  Hand coding of rules
Starting in early 1980s
  Some work on induction of lexical and syntactic information from text
  Empirical methods in speech recognition (hidden Markov models; HMMs)

Page 8: ppt

History of NLP

Late 1980s/1990s: Statistical techniques in various areas of NLP
  POS tagging
  Machine translation
  Probabilistic context-free grammars
  Word sense disambiguation
  Anaphora resolution

Page 9: ppt

Reasons for the Resurgence of Empiricism

Empirical methods offer potential solutions to several related, long-standing problems in NLP:
(1) Acquisition, automatically identifying and coding all the necessary knowledge
(2) Coverage, accounting for all the phenomena in a given domain or application
(3) Robustness, accommodating real data that contain noise and aspects not accounted for by the underlying model
(4) Extensibility, easily extending or porting a system to a new set of data or a new task or domain

Page 10: ppt

Reasons for the Resurgence of Empiricism

Additional factors:
(1) computing resources, the availability of relatively inexpensive workstations with sufficient processing and memory resources to analyze large amounts of data
(2) data resources, the development and availability of large corpora of linguistic and lexical data for training and testing systems
(3) emphasis on applications and evaluation, industrial and government focus on the development of practical systems that are experimentally evaluated on real data

Page 11: ppt

Categories of Empirical Methods (1)

Probabilistic methods
Symbolic learning methods
Neural network/connectionist methods

Page 12: ppt

Categories of Empirical Methods (2)

Different dimension: type of training data
  Supervised learning
    Annotated text
  Unsupervised learning
  Indirect feedback

Important: combination of rationalist and empirical methods

Page 13: ppt

An Interdisciplinary Field

[Diagram of related fields and topics. Labels: Computational Neuroscience, Computer Science, Linguistics, Mathematics, Electrical Engineering, Artificial Intelligence, Computational Linguistics, Philosophy, Algorithms & Data Structures, Search Algorithms, Machine Learning, Neural Networks, Natural Language Parsing, Grammar Formalisms, Complexity Theory, Formal Language Theory, Probability Theory, Information Theory, Pattern/Speech Recognition, Information Retrieval, Clustering, Corpus Linguistics, Empirical Sciences, Statistics, Psycholinguistics, Statistical NLP]

Page 14: ppt

Practical Aspects

We will use Python for small programming exercises
  http://www.python.org/
NLTK library (in Python) – Natural Language Toolkit
  http://nltk.sourceforge.net/
(probably) WEKA for small Machine Learning experiments
  http://www.cs.waikato.ac.nz/ml/weka/

Page 15: ppt

Python

Tutorial introduction in an NLP context:
  http://nltk.sourceforge.net/docs.html
  Chapter 2: Programming

Page 16: ppt

Python: Key Features

Simple yet powerful, shallow learning curve
Object-oriented: encapsulation, re-use
Scripting language, facilitates interactive exploration
Excellent functionality for processing linguistic data
Extensive standard library, incl. graphics, web, numerical processing
Downloaded for free from http://www.python.org/

Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP

Page 17: ppt

Python example

import sys

# Print every word on standard input that ends in "ing".
for line in sys.stdin.readlines():
    for word in line.split():
        if word.endswith('ing'):
            print(word)

1. whitespace: nesting lines of code; scope

2. object-oriented: attributes, methods (e.g. line)

3. readable

Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP
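The loop on this slide reads from standard input, which makes it awkward to try out interactively. The following is a minimal sketch of the same logic as a function over an in-memory string. The function name `words_ending_in`, its `suffix` parameter, and the sample sentence are illustrative assumptions and are not part of the original slide.

```python
def words_ending_in(text, suffix="ing"):
    """Return the whitespace-separated words in text that end with suffix."""
    matches = []
    for line in text.splitlines():
        for word in line.split():
            # str.endswith is the same method call used on the slide.
            if word.endswith(suffix):
                matches.append(word)
    return matches


if __name__ == "__main__":
    # Hypothetical sample input, two lines of text.
    sample = "He was singing and dancing\nwhile the king kept reading"
    print(words_ending_in(sample))
```

Note that a word such as "king" is matched as well, because the test looks only at the final characters of the string and knows nothing about morphology.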

Page 18: ppt

Comparison with Perl

while (<>) {
    foreach my $word (split) {
        if ($word =~ /ing$/) {
            print "$word\n";
        }
    }
}

1. syntax is obscure: what are: <> $ my split ?

2. “it is quite easy in Perl to write programs that simply look like raving gibberish, even to experienced Perl programmers” (Hammond, Perl Programming for Linguists, 2003:47)

3. large programs difficult to maintain, reuse

Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP

Page 19: ppt

What NLTK adds to Python

NLTK defines a basic infrastructure that can be used to build NLP programs in Python. It provides:
  Basic classes for representing data relevant to natural language processing
  Standard interfaces for performing tasks, such as tokenization, tagging, and parsing
  Standard implementations for each task, which can be combined to solve complex problems
  Extensive documentation, including tutorials and reference documentation

Slide taken from: Bird/Loper/Klein: NLTK – Introduction to NLP

Page 20: ppt

Installing Python and NLTK

1. Install Python, Numeric

2. Install NLTK-Lite, NLTK-Lite-Corpora

3. Set environment variable NLTK_LITE_CORPORA

For detailed instructions, see: http://nltk.sourceforge.net/install.html
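After step 3 it can be useful to confirm that the environment variable is actually visible to Python. The sketch below uses only the standard `os` module. The helper name `check_corpora_variable` is a hypothetical illustration, and the sketch assumes only that the variable should name an existing directory; it does not check the directory's contents.

```python
import os


def check_corpora_variable(environ=None):
    """Report whether NLTK_LITE_CORPORA is set and names an existing directory."""
    env = os.environ if environ is None else environ
    path = env.get("NLTK_LITE_CORPORA")
    if not path:
        return "NLTK_LITE_CORPORA is not set"
    if not os.path.isdir(path):
        return "NLTK_LITE_CORPORA is set, but %r is not a directory" % path
    return "NLTK_LITE_CORPORA points to %s" % path


if __name__ == "__main__":
    print(check_corpora_variable())
```

Passing a dictionary as `environ` allows the check to be tried without changing the real environment.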

Page 21: ppt

Running Project Idea

Language Identification
  In what language is a given text document?

First ideas?

(Using simple text processing techniques)
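One possible first idea, offered here only as a sketch and not as the course's intended solution, is to count how many very frequent function words of each candidate language occur in the text and pick the language with the highest count. The word lists below are tiny, hand-picked assumptions for illustration; the names `STOPWORDS` and `identify_language` are hypothetical.

```python
# Small, hand-picked lists of frequent function words (illustrative only).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "un", "une", "des"},
}


def identify_language(text):
    """Guess the language of text by counting known function words.

    Returns the language name with the most hits, or None if no
    listed word occurs in the text at all.
    """
    words = text.lower().split()
    scores = {}
    for language, stopwords in STOPWORDS.items():
        scores[language] = sum(1 for word in words if word in stopwords)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    for sample in ("Das ist nicht die Antwort",
                   "The cat is in the garden",
                   "Le chat est dans la maison"):
        print(sample, "->", identify_language(sample))
```

This approach fails on very short texts and ignores punctuation attached to words; statistics over characters or character sequences are a natural next step.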