LING/C SC/PSYC 438/538 Lecture 1 Sandiway Fong

Dec 25, 2015


Philippa Marsh

LING/C SC/PSYC 438/538

Lecture 1
Sandiway Fong

Syllabus
Details:
• 538: introductory level, no formal pre-requisites
• 438: LING 388 or familiarity with one or more of the following: formal languages, syntax, data structures, or compilers
• Instructor: Sandiway Fong, Depts. of Linguistics and Computer Science
• Office: Douglass 311 (ph. 626-6567)
• Hours:
  – by appt. or walk-in
  – after class (best if you have quick Qs)
• Email: [email protected]
• Meet: Tuesday/Thursdays in AME S314, 2-3:15pm
• No class on
  – November 11th (Veterans Day)
  – November 27th (Thanksgiving)

Syllabus
• Course objectives:
  – introduction to computational linguistics
  – survey a range of topics
  – introduction to programming
• Expected learning outcomes:
  – acquire ability to write short programs
  – familiarity with basic concepts, techniques and applications
  – be equipped to take more advanced classes in computational linguistics, e.g. 581 (Spring)

Syllabus
• Grading
  – 438
    • homeworks 100%
    • note: all homeworks are required
  – 538
    • homeworks 75%
    • (homeworks will be a superset of the exercises for 438)
    • chapter presentation 25%
• Homework submissions
  – email only
  – [email protected]
  – by midnight of due date
  – typically: one week
  – (homeworks will be presented in class)

Syllabus
• Homeworks
  – you may discuss questions with other students
  – however, you must write it up yourself (in your own words)
  – cite (web) references and your classmates (in the case of discussion)
  – Student Code of Academic Integrity: plagiarism etc.
    • http://deanofstudents.arizona.edu/codeofacademicintegrity
• Revisions to the syllabus
  – “the information contained in the course syllabus, other than the grade and absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor.”

Syllabus
• Absences
  – tell me ahead of time so we can make special arrangements
  – I expect you to attend lectures (though attendance will not be taken)
• Required text
  – Speech and Language Processing, Jurafsky & Martin, 2nd edition, Prentice Hall 2008
• Special equipment
  – none
  – all software required for the course is freely available off the net
• Classroom etiquette
  – ask questions
  – use your own laptop or lab computer
• Topics (16 weeks)
  – Programming Language: Perl
  – Regular Expressions
  – Automata (Finite State)
  – Transducers (Finite State)
  – Programming Language: Prolog (definite clause grammars)
  – Part of Speech Tagging
  – Stemming (Morphology)
  – Edit Distance (Spelling)
  – Grammars (Regular, Context-free)
  – Parsing (Syntax trees, algorithms)
  – N-grams (Probability, Smoothing)
  – and more …

Course website
• Download lecture slides from my homepage
  – http://dingo.sbs.arizona.edu/~sandiway/#courses
  – available from class time (and afterwards, look for corrections/updates)
  – in .pptx (animations) and .pdf formats


Miss a lecture?
• Available for review:
  – linked via course homepage to http://ua.lecturecast.arizona.edu/
  – access to low-res video, laptop screen, slides, index (searchable)

Textbook (J&M)
2008 (2nd edition)
Nearly 1000 pages (maybe more than a full year’s worth…)
25 chapters
Divided into 5 parts
I. Words
II. Speech – not this course
III. Syntax
IV. Semantics and Pragmatics
V. Applications

Book chapters
• 1. Introduction
• 1.1. Knowledge in Speech and Language Processing
• 1.2. Ambiguity
• 1.3. Models and Algorithms
• 1.4. Language, Thought, and Understanding
• 1.5. The State of the Art
• 1.6. Some Brief History
• 1.6.1. Foundational Insights: 1940s and 1950s
• 1.6.2. The Two Camps: 1957–1970
• 1.6.3. Four Paradigms: 1970–1983
• 1.6.4. Empiricism and Finite-State Models Redux: 1983–1993
• 1.6.5. The Field Comes Together: 1994–1999
• 1.6.6. The Rise of Machine Learning: 2000–2008
• 1.6.7. On Multiple Discoveries
• 1.6.8. A Final Brief Note on Psychology
• 1.7. Summary
• Bibliographical and Historical Notes
• I. Words
• 2. Regular Expressions and Automata
• 2.1. Regular Expressions
• 2.1.1. Basic Regular Expression Patterns
• 2.1.2. Disjunction, Grouping, and Precedence
• 2.1.3. A Simple Example
• 2.1.4. A More Complex Example
• 2.1.5. Advanced Operators
• 2.1.6. Regular Expression Substitution, Memory, and ELIZA
• 2.2. Finite-State Automata
• 2.2.1. Use of an FSA to Recognize Sheeptalk
• 2.2.2. Formal Languages
• 2.2.3. Another Example
• 2.2.4. Non-Deterministic FSAs
• 2.2.5. Use of an NFSA to Accept Strings
• 2.2.6. Recognition as Search
• 2.2.7. Relation of Deterministic and Non-Deterministic Automata
• 2.3. Regular Languages and FSAs
• 2.4. Summary
• Bibliographical and Historical Notes
• Exercises
• 3. Words and Transducers
• 3.1. Survey of (Mostly) English Morphology
• 3.1.1. Inflectional Morphology
• 3.1.2. Derivational Morphology
• 3.1.3. Cliticization
• 3.1.4. Non-Concatenative Morphology
• 3.1.5. Agreement
• 3.2. Finite-State Morphological Parsing
• 3.3. Construction of a Finite-State Lexicon
• 3.4. Finite-State Transducers
• 3.4.1. Sequential Transducers and Determinism
• 3.5. FSTs for Morphological Parsing
• 3.6. Transducers and Orthographic Rules
• 3.7. The Combination of an FST Lexicon and Rules
• 3.8. Lexicon-Free FSTs: The Porter Stemmer
• 3.9. Word and Sentence Tokenization
• 3.9.1. Segmentation in Chinese
• 3.10. Detection and Correction of Spelling Errors
• 3.11. Minimum Edit Distance
• 3.12. Human Morphological Processing
• 3.13. Summary
• Bibliographical and Historical Notes
• Exercises
• 4. N-Grams
• 4.1. Word Counting in Corpora
• 4.2. Simple (Unsmoothed) N-Grams
• 4.3. Training and Test Sets
• 4.3.1. N-Gram Sensitivity to the Training Corpus
• 4.3.2. Unknown Words: Open Versus Closed Vocabulary Tasks
• 4.4. Evaluating N-Grams: Perplexity
• 4.5. Smoothing
• 4.5.1. Laplace Smoothing
• 4.5.2. Good-Turing Discounting
• 4.5.3. Some Advanced Issues in Good-Turing Estimation
• 4.6. Interpolation
• 4.7. Backoff
• 4.7.1. Advanced: Details of Computing Katz Backoff α and P*
• 4.8. Practical Issues: Toolkits and Data Formats
• 4.9. Advanced Issues in Language Modeling
• 4.9.1. Advanced Smoothing Methods: Kneser-Ney Smoothing
• 4.9.2. Class-Based N-Grams
• 4.9.3. Language Model Adaptation and Web Use
• 4.9.4. Using Longer-Distance Information: A Brief Summary
• 4.10. Advanced: Information Theory Background
• 4.10.1. Cross-Entropy for Comparing Models
• 4.11. Advanced: The Entropy of English and Entropy Rate Constancy
• 4.12. Summary
• Bibliographical and Historical Notes
• Exercises

Syllabus
• Coverage
  – Intro to programming
    • we’re going to use Perl
    • Python is another (perhaps more) popular language
  – Topics: selected chapters from J&M
    • Chapters 1–6, skip Speech part (7–11), 12–25

Homework: Reading
• Chapter 1 from JM
  – introduction and history
  – available online
  – http://www.cs.colorado.edu/~martin/SLP/Updates/1.pdf
• Whole book is available as an e-book
  – www.coursesmart.com

Homework: Install Perl
• Install Perl on your laptop
  – should be pre-installed on macs and Linux (Ubuntu), check your machine
  – on Windows PCs, if you don’t already have it, it’s freely available here
  – http://www.activestate.com/ (don’t pay, get the free version)

Homework: Install Perl

• Ubuntu:

• Mac:

perl -v
which perl

Homework: Install Perl

Other methods
See http://learn.perl.org/installing/

Learning Perl
• Learn Perl
  – Books…
  – Online resources
    • http://learn.perl.org/
    • Next time, we begin with ...
    • http://perldoc.perl.org/perlintro.html

Language and Computers
• Enormous amounts of data stored
  – world-wide web (WWW)
  – corporate databases
  – your own hard drive
• Major categories of data
  – numeric
  – Language: words, text, sound
  – pictures, video

Language and Computers
• We know what we want from computer software
• “killer applications”
  – those that can make sense of language data
    • retrieve language data: (IR)
    • summarize knowledge contained in language data
    • sentiment analysis from online product reviews
    • answer questions (QA), make logical inferences
    • translate from one language into another
    • recognize speech: transcribe
    • etc...

Language and Computers
• In other words, we’d like computers to be smart about language
• possess “intelligence”
• pass the Turing Test …

Language and Computers
• In other words, we’d like computers to be smart about language
  – possess intelligence
  – well, perhaps not too smart…

From 2001… (HAL)

Language and Computers
• (Un)fortunately, we’re not there yet…
  – gap between what computers can do and
  – what we want them to be able to do

Often quoted (but not verified):

"The spirit is strong, but the flesh is weak" was translated into Russian and then back to English, the result was "The vodka is good, but the meat is rotten."

but with Google translate or babelfish, it’s not difficult to find (funny) examples…


Language and Computers

• and how can we tell if the translation is right anyway?

• http://fun.drno.de/pics/english/only-in-china/TranslateServerError.jpg



Language and Computers

• Obama: "At a certain point, I've just concluded that for me personally it is important for me to go ahead and affirm that I think same-sex couples should be able to get married."

Is this sentence complicated? Why?


Language and Computers

Executive Summarization

Language and Computers
Do you trust Google Translate?
• a real case: 4,000,000 yen or 40,000 yen?

Language and Computers

• Puzzle: translation of 4 万円以下

• Now fixed (almost)

with auto-detect on

4
万円: 10,000 yen
以下: less than/below/not exceeding

no spaces: segmentation task


Language and Computers

• Non-compositionality Puzzle

Language and Computers

• What happened? 4 万円以下
  – can be segmented as follows:

4
万円: 10,000 yen
以下: less than/below/not exceeding

4
万円以下: Million yen
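The segmentation puzzle above can be sketched in a few lines of Python. This is only an illustration, not how any real translation system works: the toy lexicon is hypothetical, the glosses for 万円 and 以下 are copied from the slide, and the fused entry 万円以下 is an assumption added to reproduce the "Million yen" reading. The function lists every way the unspaced string can be split into lexicon entries.

```python
from typing import Dict, List

# Toy lexicon (hypothetical). The glosses for 万円 and 以下 come from the slide;
# the fused entry 万円以下 is an assumption that models the "Million yen" glitch.
LEXICON: Dict[str, str] = {
    "4": "4",
    "万円": "10,000 yen",
    "以下": "less than/below/not exceeding",
    "万円以下": "Million yen",
}


def segmentations(text: str, lexicon: Dict[str, str]) -> List[List[str]]:
    """Return every way to split text into a sequence of lexicon entries."""
    if not text:
        return [[]]  # the empty string has exactly one segmentation: no words
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


if __name__ == "__main__":
    for words in segmentations("4万円以下", LEXICON):
        print(" | ".join(f"{w} ({LEXICON[w]})" for w in words))
```

With this lexicon the string has two segmentations, so a system needs some further criterion to choose between them.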

Language and Computers

• Still problems remain (as of August 27 2013):

another glitch but an order of magnitude in the other direction: 10,000 -> 1,000
but better than “million”

Language and Computers
• a visit to the Peabody Essex Museum (Massachusetts)
  – Qing dynasty Huīzhōu (徽州)-style house

… so what do those 3 characters (Yin Yu Tang), the name of the house, actually mean? 蔭餘堂 (simplified 荫余堂)


Language and Computers

• Meaning of 荫余堂 / 蔭餘堂 (simplified/traditional spelling)

the strange romanization is not the translation I’m looking for…


Language and Computers

• Meaning of 荫余堂 / 蔭餘堂

Meaning in language is (mostly) compositional


Language and Computers

• Meaning of 1st character: 荫 / 蔭 


Language and Computers

• Meaning of 2nd character: 余 /  餘


Language and Computers

• Meaning of 余 /  餘


Language and Computers

• Meaning of 3rd character: 堂

Language and Computers

• Meaning of 蔭 餘 堂
• shade I Church
• shady remainder Hall
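If meaning is (mostly) compositional, one naive approach is to pick one gloss per character and join them. The sketch below assumes exactly that. The glosses are the two rows shown on the slide, while the dictionary and function names are made up for illustration. It shows how quickly character-level choices multiply into candidate translations.

```python
from itertools import product
from typing import Dict, List

# Candidate glosses per character, taken from the two rows on the slide.
GLOSSES: Dict[str, List[str]] = {
    "蔭": ["shade", "shady"],
    "餘": ["I", "remainder"],
    "堂": ["Church", "Hall"],
}


def candidate_readings(name: str) -> List[str]:
    """Combine one gloss per character, in order, into candidate translations."""
    options = [GLOSSES[ch] for ch in name]
    return [" ".join(choice) for choice in product(*options)]


if __name__ == "__main__":
    for reading in candidate_readings("蔭餘堂"):
        print(reading)
```

Three characters with two glosses each already give 2 × 2 × 2 = 8 candidates, and nothing in the glosses themselves says which one is the name of the house.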

Applications
  – technology is still in development
    • even if we are willing to pay...
  – machine translation has been worked on since after World War II (1950s)
  – still not perfected today
  – why?
  – what are the properties of human languages that make it hard?

Natural Language Properties

• which properties are going to be difficult for computers to deal with?
• grammar (Rules for putting words together into sentences)
  – How many rules are there?
    • 100, 1000, 10000, more …
  – Portions learnt or innate
  – Do we have all the rules written down somewhere?
• lexicon (Dictionary)
  – How many words do we need to know?
    • 1000, 10000, 100000 …
• meaning and inference (semantic interpretation, commonsense world knowledge)

Computers vs. Humans

• Knowledge of language
  – Computers are way faster than humans
    • They kill us at arithmetic and chess
  – But human beings are so good at language, we often take our ability for granted
    • Processed without conscious thought
    • Do pretty complex things

and now Jeopardy as well …