A Bayesian Approach to the Poverty of the Stimulus
Amy Perfors (MIT), with Josh Tenenbaum (MIT) and Terry Regier (University of Chicago)
Transcript
  • A Bayesian Approach to the Poverty of the Stimulus. Amy Perfors (MIT), with Josh Tenenbaum (MIT) and Terry Regier (University of Chicago)

  • [Diagram: Innate vs. Learned]

  • [Diagram: Innate vs. Learned × Explicit structure vs. No explicit structure]

  • Language has hierarchical phrase structure? No vs. Yes

  • Why believe that language has hierarchical phrase structure? Formal properties, plus an information-theoretic, simplicity-based argument (Chomsky, 1956). Dependency structure of language:

    A finite-state grammar cannot capture the infinite sets of English sentences with dependencies like this. If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them, but this grammar "will be so complex as to be of little use or interest."

  • Why believe that structure dependence is innate? The Argument from the Poverty of the Stimulus (PoS) (Chomsky, 1965, 1980; Crain & Nakayama, 1987):
    Data. Simple declarative: The girl is happy; They are eating. Simple interrogative: Is the girl happy? Are they eating?
    Hypotheses. Linear: move the first "is" (auxiliary) in the sentence to the beginning. Hierarchical: move the auxiliary in the main clause to the beginning.
    Test. Complex declarative: The girl who is sleeping is happy.
    Result. Children say: Is the girl who is sleeping happy? NOT: *Is the girl who sleeping is happy?

  • Why believe it's not innate? (1) There are actually enough complex interrogatives (Pullum & Scholz, 2002). (2) Children's behavior can be explained via statistical learning of natural language data (Lewis & Elman, 2001; Reali & Christiansen, 2005). (3) It is not necessary to assume a grammar with explicit structure.

  • [Diagram revisited: Innate vs. Learned × Explicit structure vs. No explicit structure]

  • Our argument

  • Our argument. We suggest that, contra the PoS claim, it is possible, given the nature of the input and certain domain-general assumptions about the learning mechanism, that an ideal, unbiased learner can realize that language has a hierarchical phrase structure; therefore this knowledge need not be innate. The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data.

  • Plan. Model: data (corpus of child-directed speech, CHILDES); grammars (linear & hierarchical; both hand-designed & result of local search; linear also learned by automatic, unsupervised ML); evaluation (complexity vs. fit). Results. Implications.

  • The model: data. Corpus from the CHILDES database (Adam, Brown corpus): 55 files, age range 2;3 to 5;2, sentences spoken by adults to children. Each word was replaced by its syntactic category (det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c). Ungrammatical sentences and the most grammatically complex sentence types were removed, keeping 21,792 of 25,876 utterances. Removed: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444).

  • Data. The final corpus contained 2336 individual sentence types, corresponding to 21,792 sentence tokens.

  • Data: variation. Amount of evidence available at different points in development.

  • Data: variation. (1) Amount of evidence available at different points in development. (2) Amount comprehended at different points in development.

  • Data: amount available. Rough estimate, split by age (cumulative):
    Epoch 0: 1 file, age 2;3, 173 types (7.4% of types)
    Epoch 1: 11 files, ages 2;3 to 2;8, 879 types (38%)
    Epoch 2: 22 files, ages 2;3 to 3;1, 1295 types (55%)
    Epoch 3: 33 files, ages 2;3 to 3;5, 1735 types (74%)
    Epoch 4: 44 files, ages 2;3 to 4;2, 2090 types (89%)
    Epoch 5: 55 files, ages 2;3 to 5;2, 2336 types (100%)

  • Data: amount comprehended. Rough estimate, split by frequency:
    Level 1: frequency 500+, 8 types (0.3% of types), 28% of tokens
    Level 2: frequency 100+, 37 types (1.6%), 55% of tokens
    Level 3: frequency 50+, 67 types (2.9%), 64% of tokens
    Level 4: frequency 25+, 115 types (4.9%), 71% of tokens
    Level 5: frequency 10+, 268 types (12%), 82% of tokens
    Level 6: frequency 1+ (all), 2336 types (100%), 100% of tokens

  • The model. Data: child-directed speech (CHILDES). Grammars: linear & hierarchical; both hand-designed & result of local search; linear also learned by automatic, unsupervised ML. Evaluation: complexity vs. fit.

  • Grammar types:
    Context-free grammar (hierarchical): rules may have any mix of terminals and non-terminals on the right-hand side (e.g., NT → NT NT, NT → t NT, NT → NT, NT → t).
    Regular grammar (linear): rules of the form NT → t NT or NT → t.
    Flat grammar (linear): a list of each sentence.
    1-state grammar (linear): anything is accepted.
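
    As a concrete illustration of these grammar classes, here is a minimal way to write them down in Python; the toy rules and probabilities are illustrative only and are not the actual grammars evaluated in the talk:

```python
# Each grammar maps a non-terminal to a list of (right-hand side, probability)
# pairs; terminals are syntactic categories (det, n, pro, vi, ...).

cfg = {  # context-free (hierarchical): RHS may mix terminals and non-terminals
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("det", "n"), 0.5), (("pro",), 0.5)],
    "VP": [(("vi",), 0.5), (("v", "NP"), 0.5)],
}

regular = {  # regular (linear): every rule is NT -> terminal [NT]
    "S": [(("det", "X"), 0.5), (("pro", "Y"), 0.5)],
    "X": [(("n", "Y"), 1.0)],
    "Y": [(("vi",), 0.6), (("v", "S"), 0.4)],
}

flat = {  # flat: one rule per sentence type, terminals only
    "S": [(("pro", "aux", "det", "n"), 0.5), (("det", "n", "vi"), 0.5)],
}

cats = ("det", "n", "pro", "prop", "prep", "adj", "aux",
        "wh", "comp", "to", "v", "vi", "part")
one_state = {  # 1-state: any category continues the sentence or ends it
    # 13 categories x 2 = 26 rules, matching the 1-STATE grammar's rule count
    "S": ([((t, "S"), 0.5 / len(cats)) for t in cats] +
          [((t,), 0.5 / len(cats)) for t in cats]),
}
```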

  • Specific hierarchical grammars (hand-designed):
    CFG-S (standard CFG): designed to be as linguistically plausible as possible; 77 rules, 15 non-terminals.
    CFG-L (larger CFG): derived from CFG-S; contains additional productions corresponding to different expansions of the same NT (puts less probability mass on recursive productions); 133 rules, 15 non-terminals.

  • Specific linear grammars (hand-designed), ranging from exact fit / no compression to poor fit / high compression:
    FLAT: a list of each sentence; 2336 rules, 0 non-terminals (exact fit, no compression).
    REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals.
    REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals.
    REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals.
    1-STATE: anything is accepted; 26 rules, 0 non-terminals (poor fit, high compression).

  • Automated search. Local search around the hand-designed grammars. Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007), a Bayesian model for acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories it learns a regular grammar).
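
    The slides do not spell out the search moves or acceptance rule; below is a minimal greedy hill-climbing sketch of "local search around a grammar" with stand-in neighbor and scoring functions (the Goldwater & Griffiths trigram-HMM learner is a separate, sampling-based method not shown here):

```python
import random

def local_search(initial, neighbors, score, n_steps=1000, seed=0):
    """Greedy local search: repeatedly propose a neighbor (e.g., a grammar with
    a production added, deleted, or two non-terminals merged) and keep it if
    its score (e.g., log posterior = log prior + log likelihood) improves."""
    rng = random.Random(seed)
    current, best = initial, score(initial)
    for _ in range(n_steps):
        proposal = rng.choice(neighbors(current))
        s = score(proposal)
        if s > best:
            current, best = proposal, s
    return current, best

# Toy demonstration only: "grammars" are just rule counts, and the score
# pretends the posterior peaks at 15 rules.
if __name__ == "__main__":
    toy_score = lambda g: -abs(g - 15)        # stand-in for the log posterior
    toy_neighbors = lambda g: [g - 1, g + 1]  # stand-in for add/remove a rule
    print(local_search(40, toy_neighbors, toy_score))  # typically -> (15, 0)
```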

  • The model. Data: child-directed speech (CHILDES). Grammars: linear & hierarchical; hand-designed & result of local search; linear also learned by automatic, unsupervised ML. Evaluation: complexity vs. fit.

  • Grammars. T: type of grammar; G: specific grammar; D: data. The prior over grammar types is unbiased (uniform).

  • Grammars. T: type of grammar; G: specific grammar; D: data. Grammars are scored by data fit (likelihood) and complexity (prior).

  • Tradeoff: complexity vs. fit. Low prior probability = more complex; low likelihood = poor fit to the data. [Illustration: three hypotheses ranging from fit low / simplicity high, through fit moderate / simplicity moderate, to fit high / simplicity low.]
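
    In symbols, the tradeoff is Bayes' rule applied to a specific grammar G and grammar type T given data D; this decomposition follows the slide's variables, with the unbiased (uniform) prior over types mentioned above:

```latex
P(G, T \mid D) \;\propto\;
  \underbrace{P(D \mid G)}_{\text{data fit (likelihood)}}\;
  \underbrace{P(G \mid T)}_{\text{complexity (prior)}}\;
  \underbrace{P(T)}_{\text{uniform over types}}
```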

  • Measuring complexity: the prior. Designing a grammar (God's-eye view).

    Grammars with more rules and non-terminals will have lower prior probability. Notation: n = number of non-terminals; P_k = number of productions of non-terminal k; N_i = number of items in production i; V = vocabulary size; θ_k = production-probability parameters for non-terminal k.
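
    The prior formula itself did not survive the transcript. One form consistent with the variables listed above is sketched below; the geometric distributions over counts, the uniform 1/V choice of each item, and the Dirichlet prior on each θ_k are assumptions of this sketch, not read off the slide:

```latex
P(G \mid T) \;=\; P(n)\,\prod_{k=1}^{n}\Bigg[\, P(P_k)\; P(\theta_k)
   \prod_{i=1}^{P_k} P(N_i)\,\Big(\tfrac{1}{V}\Big)^{N_i} \Bigg],
\qquad P(n),\, P(P_k),\, P(N_i)\ \text{geometric},\quad
\theta_k \sim \mathrm{Dirichlet}.
```

    Under any prior of this shape, each extra non-terminal, production, or production item multiplies in another factor less than one, so more complex grammars receive lower prior probability, which is the point of the slide.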

  • Measuring fit: the likelihood. The probability of the grammar generating the data is the product of the probability of each parse. Example: a parse of "pro aux det n" whose productions have probabilities 0.5, 0.25, 1.0, 0.25, and 0.5 has probability 0.5 × 0.25 × 1.0 × 0.25 × 0.5 ≈ 0.016.
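
    A minimal sketch of that computation, assuming each sentence's parse is available as a list of production probabilities; only the 0.016 example comes from the slide, the rest is illustrative:

```python
import math

def parse_log_prob(production_probs):
    """Log probability of one parse: the product of its production probabilities."""
    return sum(math.log(p) for p in production_probs)

def corpus_log_likelihood(parses):
    """Log P(D | G): sum of log parse probabilities over all sentence types."""
    return sum(parse_log_prob(probs) for probs in parses)

# The example from the slide: a parse of "pro aux det n" using productions
# with probabilities 0.5, 0.25, 1.0, 0.25, 0.5.
example = [0.5, 0.25, 1.0, 0.25, 0.5]
print(round(math.exp(parse_log_prob(example)), 3))  # 0.016
```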

  • Plan. Model: data (corpus of child-directed speech, CHILDES); grammars (linear & hierarchical; hand-designed & result of local search; linear also learned by automated, unsupervised ML); evaluation (complexity vs. fit). Results. Implications.

  • Results: data split by frequency levels (estimate of comprehension). [Chart: log posterior probability; lower magnitude = better.]

  • Results: data split by age (estimate of availability). [Chart: log posterior probability; lower magnitude = better.]

  • Generalization: how well does each grammar predict sentences it hasn't seen? Complex interrogatives:

    [Table: for each sentence type, whether it is in the corpus and whether each grammar (REG-N, REG-M, REG-B, AUTO, 1-ST, CFG-S, CFG-L) predicts it. Sentence types:]
    Simple declarative: Eagles do fly. (n aux vi)
    Simple interrogative: Do eagles fly? (aux n vi)
    Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
    Complex interrogative (grammatical): Do eagles that are alive fly? (aux n comp aux adj vi)
    Complex interrogative (ungrammatical): Are eagles that alive do fly? (aux n comp adj aux vi)

  • Take-home messages. Given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input.

    This paradigm is valuable: it makes any assumptions explicit and enables us to rigorously evaluate how different representations capture the tradeoff between simplicity and fit to data. In some ways, higher-order knowledge may be easier to learn than specific details (the blessing of abstraction).

  • Implications for innateness? This is an ideal-learner analysis. Strong(er) assumption: the learner can find the best grammar in the space of possibilities. Weak(er) assumptions: the learner has the ability to parse the corpus into syntactic categories; the learner can represent both linear and hierarchical grammars. We also assume a particular way of calculating complexity & data fit. Have we actually found representative grammars?

  • The End. Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler, Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman.

  • Grammars. T: grammar type; G: specific grammar; D: data. [Model diagram.]

  • The Argument from the Poverty of the Stimulus (PoS):
    P1. Children show a specific pattern of behavior B.
    P2. A particular generalization G must be grasped in order to produce B.
    P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
    C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.

  • The Argument from the Poverty of the Stimulus (PoS):
    P1. Children show a specific pattern of behavior B.
    P2. A particular generalization G must be grasped in order to produce B.
    P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
    C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
    Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known.
    C2. T must be innate.

  • The Argument from the Poverty of the Stimulus (PoS), instantiated:
    P1. Children show a specific pattern of behavior B.
    P2. A particular generalization G must be grasped in order to produce B.
    P3. It is impossible to reasonably induce G simply on the basis of the data D that children receive.
    C1. Some abstract knowledge T, limiting which specific generalizations G are possible, is necessary.
    Corollary: the abstract knowledge T could not itself be learned, or could not be learned before G is known.
    C2. T must be innate.
    G: a specific grammar. D: typical child-directed speech input. B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses).

    T: language has hierarchical phrase structure

  • Data. The final corpus contained 2336 individual sentence types corresponding to 21,792 sentence tokens. Why types? (1) Grammar learning depends on what sentences are generated, not on how many of each type there are. (2) It is much more computationally tractable. (3) The distribution of sentence tokens depends on many factors other than the grammar (e.g., pragmatics, semantics, discussion topics) [Goldwater, Griffiths & Johnson, 2005].

  • Specific linear grammars (hand-designed; repeated): FLAT (list of each sentence; 2336 rules, 0 non-terminals; exact fit, no compression), REG-N (narrowest regular derived from the CFG; 289 rules, 85 non-terminals), REG-M (mid-level regular derived from the CFG; 169 rules, 14 non-terminals), REG-B (broadest regular derived from the CFG; 117 rules, 10 non-terminals), 1-STATE (anything accepted; 26 rules, 0 non-terminals; poor fit, high compression).

  • Why these results? Natural language actually is generated from a grammar that looks more like a CFG. The other grammars (e.g., the flat grammar) overfit and therefore do not capture important language-specific generalizations.

  • Computing the prior: CFG vs. REG. Context-free grammar: productions such as NT → NT NT, NT → t NT, NT → NT, NT → t. Regular grammar: productions only of the form NT → t NT or NT → t.

  • Likelihood, intuitively. [Illustration with three candidate sources X, Y, Z:] Z is ruled out because it does not explain some of the data points; X and Y both explain the data points, but X is the more likely source.

  • Possible empirical tests. Present people with data from which the model learns FLAT, REG, and CFG grammars; see which novel productions they generalize to. With non-linguistic stimuli? With small children? Examples of learning regular grammars in real life: does the model do the same?

  • Do people learn regular grammars? Children's songs: line-level grammar. Lines of the form s1 s2 s3 w1 w1 w1:
    Miss Mary Mack, Mack, Mack / All dressed in black, black, black / With silver buttons, buttons, buttons / All down her back, back, back / She asked her mother, mother, mother, ...
    Lines of the form X, s1 s2 s3:
    Spanish dancer, do the splits. / Spanish dancer, give a kick. / Spanish dancer, turn around.

  • Do people learn regular grammars? Children's songs: song-level grammar, lines of the form X X s1 s2 s3:
    Teddy bear, teddy bear, turn around. / Teddy bear, teddy bear, touch the ground. / Teddy bear, teddy bear, show your shoe. / Teddy bear, teddy bear, that will do. / Teddy bear, teddy bear, go upstairs. ...
    Bubble gum, bubble gum, chew and blow, / Bubble gum, bubble gum, scrape your toe, / Bubble gum, bubble gum, tastes so sweet, ...
    Dolly Dimple walks like this, / Dolly Dimple talks like this, / Dolly Dimple smiles like this, / Dolly Dimple throws a kiss.

  • Do people learn regular grammars? Songs containing items represented as lists (where order matters):
    A my name is Alice / And my husband's name is Arthur, / We come from Alabama, / Where we sell artichokes. // B my name is Barney / And my wife's name is Bridget, / We come from Brooklyn, / Where we sell bicycles.
    Dough, a thing I buy beer with / Ray, a guy who buys me beer / Me, the one who wants a beer / Fa, a long way to the beer / So, I think I'll have a beer / La, -gers great but so is beer! / Tea, no thanks I'll have a beer
    Cinderella, dressed in yella, / Went upstairs to kiss a fella, / Made a mistake and kissed a snake, / How many doctors did it take? 1, 2, 3, ...

  • Do people learn regular grammars? Most of the song is a template, with a repeated (varying) element:
    You put your [body part] in / You put your [body part] out / You put your [body part] in / And you shake it all about / You do the hokey pokey / And you turn yourself around / And that's what it's all about!
    If I were the marrying kind / I thank the lord I'm not, sir / The kind of rugger I would be / Would be a rugby [position/item], sir / Cos I'd [verb phrase] / And you'd [verb phrase] / We'd all [verb phrase] together
    If you're happy and you know it, [verb] your [body part] / If you're happy and you know it, then your face will surely show it / If you're happy and you know it, [verb] your [body part]

  • Do people learn regular grammars? Other interesting structures:
    There was a farmer had a dog, / And Bingo was his name-O. / B-I-N-G-O! / B-I-N-G-O! / B-I-N-G-O! / And Bingo was his name-O! (each subsequent verse, replace a letter with a clap)
    I know a song that never ends, / It goes on and on my friends, / I know a song that never ends, / And this is how it goes: (repeat)
    Oh, Sir Richard, do not touch me (each subsequent verse, remove the last word at the end of the sentence)

  • New PRG: 1-state. [Diagram: a single state S; any of det, n, pro, prop, prep, adj, aux, wh, comp, to, v, vi, part can be emitted, either looping back to S or going to End.] Log(prior) = 0; no free parameters.

  • Another PRG: standard + noise. For instance, a level-1 PRG + noise would be the best regular grammar for the corpus at level 1, plus the 1-state model. This could parse all levels of evidence. Perhaps this would be better than a more complicated PRG at later levels of evidence.

  • Results: frequency levels (comprehension estimates). [Chart: log prior and log likelihood (absolute values), with log posterior (smaller is better), for each grammar.]

  • Results: availability by age. [Chart: log prior and log likelihood (absolute values), with log posterior (smaller is better), for each grammar.]

  • Specific grammars of each type. One type of hand-designed grammar: one with 69 productions and 14 non-terminals, another with 390 productions and 85 non-terminals.

  • Specific grammars of each type. The other type of hand-designed grammar: one with 126 productions and 14 non-terminals, another with 170 productions and 14 non-terminals.

  • The Argument from the Poverty of the Stimulus (PoS):
    P1. It is impossible to have made some generalization G simply on the basis of data D.
    P2. Children show behavior B.
    P3. Behavior B is not possible without having made G.
    C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.
    G: a specific grammar. D: typical child-directed speech input. B: children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses). T: language has hierarchical phrase structure.

  • #1: Children hear complex interrogatives. Well, a few, but not many. Adam (CHILDES): 0.048% (no yes-no questions; four wh-questions, e.g., "What is the music it's playing?"). Nina (CHILDES): 0.068% (no yes-no questions; 14 wh-questions). In all, most estimates are ...
  • #2: Can get the behavior without structure. There is enough statistical information in the input to be able to conclude which type of complex interrogative is ungrammatical (Reali & Christiansen, 2004; Lewis & Elman, 2001). Rare: comp adj aux. Common: comp aux adj.

  • #2: Can get the behavior without structure. Response: there is enough statistical information in the input to be able to conclude that "Are eagles that alive can fly?" is ungrammatical (Reali & Christiansen, 2004; Lewis & Elman, 2001). Rare: comp adj aux. Common: comp aux adj.

    But this sidesteps the question: it does not address the innateness of structure itself (the abstract knowledge T), and it is explanatorily opaque.

  • Why do linguists believe that language has hierarchical phrase structure? Formal properties, plus an information-theoretic, simplicity-based argument (Chomsky, 1956):
    A sentence S has an (i,j) dependency if replacement of the i-th symbol a_i of S by b_i requires a corresponding replacement of the j-th symbol a_j of S by b_j.
    If S has an m-termed dependency set in L, at least 2^m states are necessary in the finite-state grammar that generates L.
    Therefore, if L is a finite-state language, then there is an m such that no sentence S of L has a dependency set of more than m terms in L.
    The mirror language, made up of sentences consisting of a string X followed by X in reverse (e.g., aa, abba, babbab, aabbaa, etc.), has the property that for any m we can find a dependency set D = {(1,2m), (2,2m-1), ..., (m,m+1)}. Therefore it cannot be captured by any finite-state grammar.
    English has infinite sets of sentences with dependency sets with more than any fixed number of terms. E.g., in "the man who said that S5 is arriving today", there is a dependency between "man" and "is". Therefore English cannot be finite-state.
    Possible counterargument: since any finite corpus could be captured by a finite-state grammar, English is only non-finite-state in the limit; in practice it could be finite-state.
    Easy counterargument: simplicity considerations. Chomsky: "If the processes have a limit, then the construction of a finite-state grammar will not be literally impossible (since a list is a trivial finite-state grammar), but this grammar will be so complex as to be of little use or interest."
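
    Restating the mirror-language step in symbols (this only formalizes what the slide already says):

```latex
L_{\text{mirror}} = \{\, x\,x^{R} : x \in \{a,b\}^{+} \,\},
\qquad
S = x\,x^{R},\ |x| = m
\;\Longrightarrow\;
D = \{(1,\,2m),\ (2,\,2m-1),\ \dots,\ (m,\,m+1)\}.
```

    Since D has m terms, any finite-state grammar generating the language needs at least 2^m states; m is unbounded, so no finite-state grammar suffices.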

  • The big picture. [Diagram: Innate vs. Learned]

  • Grammar acquisition (Chomsky). [Diagram: Innate vs. Learned]

  • The Argument from the Poverty of the Stimulus (PoS), built up step by step:
    P1. Children show behavior B.
    P2. Behavior B is not possible without having some specific grammar or rule G.
    P3. It is impossible to have learned G simply on the basis of data D.
    C1. Some constraints T, which limit what type of grammars are possible, must be innate.

  • Replies to the PoS argument:
    P1. It is impossible to have made some generalization G simply on the basis of data D. Reply: there are enough complex interrogatives in D (e.g., Pullum & Scholz, 2002).
    P2. Children show behavior B.
    P3. Behavior B is not possible without having made G. Reply: there is a route to B other than G, via statistical learning (e.g., Lewis & Elman, 2001; Reali & Christiansen, 2005).
    C1. Some constraints T, which limit what type of generalizations G are possible, must be innate.

  • [Diagram: Innate vs. Learned × Explicit structure vs. No explicit structure]

  • Our argument. Assumptions: the learner is equipped with (1) the capacity to represent both linear and hierarchical grammars (no bias), (2) a rational Bayesian learning mechanism and probability calculation, and (3) the ability to effectively search the space of possible grammars.

  • Take-home message. Given reasonable domain-general assumptions, an unbiased rational learner could realize that languages have a hierarchical structure based on typical child-directed input.

    We can use this paradigm to explore the role of recursive elements in a grammar. The winning grammar contains additional non-recursive counterparts for complex NPs. Perhaps language, while fundamentally recursive, contains duplicate non-recursive elements that more precisely match the input?

  • The role of recursion

    We evaluated an additional grammar (CFG-DL) that contained no recursive complex NPs at all; instead, it used multiply-embedded, depth-limited ones.

    No sentence in the corpus occurred with more than two levels of nesting

  • The role of recursion: results. [Chart: log posterior probability; lower magnitude = better.]

  • The role of recursion: Implications

    The optimal tradeoff results in a grammar that goes beyond the data in interesting ways: auxiliary fronting; recursive complex NPs.

    A grammar with recursive complex NPs is more optimal, even though recursive productions hurt the likelihood and there are no sentences with more than two levels of nesting in the input.