Grammar Induction Through Machine Learning, Part 1
Linguistic Nativism Reconsidered

Alexander Clark (Department of Computer Science, Royal Holloway, University of London)
Shalom Lappin (Department of Philosophy, King's College, London)

February 5, 2008

Outline

1 The Machine Learning Paradigm
2 Supervised vs. Unsupervised Learning
3 Supervised Learning with a Probabilistic Grammar
4 A Bayesian Reply to the APS (the Argument from the Poverty of the Stimulus)

Machine Learning Algorithms and Models of the Learning Domain

A machine learning system implements a learning algorithm that defines a function from a domain of input samples to a range of output values.
A corpus of examples is divided into a training set and a test set.
The learning algorithm is specified in conjunction with a model of the phenomenon to be learned.
This model defines the space of possible hypotheses that the algorithm can generate from the input data.
When the values of the model's parameters are set by training the algorithm on the training set, an element of the hypothesis space is selected.
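
The following toy Python sketch (not from the original slides; the task and model are assumptions for illustration) shows this setup: a corpus is divided into training and test sets, and fitting the model's single parameter on the training portion selects one element of its hypothesis space.

import random

# Toy labelled corpus: inputs are numbers, outputs are booleans.
corpus = [(x, x > 5.0) for x in (random.uniform(0, 10) for _ in range(200))]
random.shuffle(corpus)
train, test = corpus[:150], corpus[150:]   # divide the corpus into training and test sets

def fit_threshold(data):
    # Select an element of the hypothesis space (a threshold value) from the training data:
    # the candidate threshold with the fewest training errors.
    candidates = [x for x, _ in data]
    return min(candidates, key=lambda t: sum((x > t) != y for x, y in data))

def accuracy(threshold, data):
    return sum((x > threshold) == y for x, y in data) / len(data)

theta = fit_threshold(train)               # parameter value set through training
print("learned threshold:", round(theta, 2), "test accuracy:", accuracy(theta, test))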

Evaluating a Parsing Algorithm (Supervised Learning)

If one has a gold standard of correct parses for a corpus, then it is possible to compute the percentage of correct parses that the algorithm produces for a blind test subpart of this corpus.
A more common procedure for scoring an ML algorithm on a test set is to determine its performance for recall and precision.
The recall of a parsing algorithm A is the percentage of labelled brackets of the test set that it correctly identifies.
A's precision is the percentage of the brackets that it returns which correspond to those in the gold standard.
A unified score for A, known as an F-score, can be computed as an average (standardly the harmonic mean) of its recall and its precision.
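
A small sketch of these measures over labelled brackets (illustrative only; the representation of a bracket as a (label, start, end) span is an assumption, not taken from the slides).

def bracket_scores(gold_brackets, predicted_brackets):
    # Precision, recall, and F-score computed over sets of labelled brackets.
    gold, pred = set(gold_brackets), set(predicted_brackets)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
print(bracket_scores(gold, pred))   # (0.75, 0.75, 0.75)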

Learning Biases and Priors

The choice of parameters and their possible values defines a bias for the language model by imposing prior constraints on the set of learnable hypotheses.
All learning requires some sort of bias to restrict the set of possible hypotheses for the phenomenon to be learned.
This bias can express strong assumptions about the nature of the domain of learning.
Alternatively, it can define comparatively weak domain-specific constraints, with learning driven primarily by domain-general procedures and conditions.

Prior Probability Distributions on a Hypothesis Space

One way of formalising this learning bias is as a prior probability distribution on the elements of the hypothesis space that favours some hypotheses as more likely than others.
The paradigm of Bayesian learning in cognitive science implements this approach.
The simplicity and compactness measure that Perfors et al. (2006) use is an example of a very general prior.

Learning Bias and the Poverty of Stimulus

The poverty of stimulus issue can be formulated as follows:
What are the minimal domain-specific linguistic biases that must be assumed for a reasonable learning algorithm to support language acquisition on the basis of the training set available to the child?
If a model with relatively weak language-specific biases can sustain effective grammar induction, then this result undermines poverty of stimulus arguments for a rich theory of universal grammar.

Supervised Learning

When the samples of the training set are annotated with the classifications and structures that the learning algorithm is intended to produce as output for the test set, then learning is supervised.
Supervised grammar induction involves training an ML procedure on a corpus annotated with the parse structures of the gold standard.
The learning algorithm infers a function for assigning input sentences to appropriate parse outputs on the basis of a training set of argument-value pairs, where each sentence argument is paired with its parse value.

Unsupervised Learning

If the training set is not marked with the properties to be returned as output for the test set, then learning is unsupervised.
Unsupervised learning involves using clustering patterns and distributional regularities in a training set to identify structure in the data.

Supervised Learning and Language Acquisition

It could be argued that supervised grammar induction is not directly relevant to poverty of stimulus arguments.
It requires that target parse structures be represented in the training set, while children have no access to such representations in the data they are exposed to.
If negative evidence of the sort identified by Saxton (1997) and Chouinard and Clark (2003) is available and plays a role in grammar induction, then it is possible to model the acquisition process as a type of supervised learning.
If, however, children acquire language solely on the basis of positive evidence, then it is necessary to treat acquisition as unsupervised learning.

Probabilistic Context-Free Grammars

A Probabilistic Context-Free Grammar (PCFG) conditions the probability of a child symbol sequence on the parent nonterminal.
It provides conditional probabilities of the form P(X1 ... Xn | N) for each nonterminal N and sequence X1 ... Xn of items from the vocabulary of the grammar.
It also specifies a probability distribution Ps(N) over the label of the root of the tree.
The conditional probabilities P(X1 ... Xn | N) correspond to probabilistic parameters that govern the expansion of a node in a parse tree according to a context-free rule N → X1 ... Xn.
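
A brief sketch of how such a model assigns a probability to a parse tree (the tree representation and the toy rule probabilities are assumptions for illustration): the probability is the root distribution Ps(N) times one conditional probability P(X1 ... Xn | N) per node expansion.

# Rule probabilities P(X1 ... Xn | N), keyed by (parent, children); root_prob is the distribution Ps.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("D", "N")):   0.6,
    ("NP", ("N",)):       0.4,
    ("VP", ("V", "NP")):  1.0,
}
root_prob = {"S": 1.0}

# A tree node is (label, children); preterminals have no children here (words omitted).
tree = ("S", [("NP", [("N", [])]),
              ("VP", [("V", []), ("NP", [("D", []), ("N", [])])])])

def tree_probability(node, at_root=True):
    label, children = node
    p = root_prob.get(label, 0.0) if at_root else 1.0
    if not children:
        return p                                    # no expansion to score at a preterminal
    child_labels = tuple(c[0] for c in children)
    p *= rule_prob.get((label, child_labels), 0.0)  # P(children | parent)
    for child in children:
        p *= tree_probability(child, at_root=False)
    return p

print(tree_probability(tree))
# 1.0 (root) * 1.0 (S -> NP VP) * 0.4 (NP -> N) * 1.0 (VP -> V NP) * 0.6 (NP -> D N) = 0.24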

Probabilistic Context-Free Grammars

The probabilistic parameter values of a PCFG can be learned from a parse-annotated training corpus by computing the relative frequency of CFG rules in accordance with a Maximum Likelihood Estimation (MLE) condition:

P(A → β1 ... βk) = c(A → β1 ... βk) / Σγ c(A → γ)

where c(·) is the number of occurrences of a rule in the training corpus.

Statistical models of this kind have achieved F-measures in the low 70% range against the Penn Treebank.
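
A small sketch of this relative-frequency estimation (the toy treebank rules and their representation are assumptions for illustration):

from collections import Counter, defaultdict

# Rules read off a parse-annotated corpus, as (parent, children) pairs.
treebank_rules = [
    ("S", ("NP", "VP")), ("NP", ("D", "N")), ("VP", ("V", "NP")), ("NP", ("N",)),
    ("S", ("NP", "VP")), ("NP", ("N",)),     ("VP", ("V",)),
]

counts = Counter(treebank_rules)
parent_totals = defaultdict(int)
for (parent, _), c in counts.items():
    parent_totals[parent] += c

# P(A -> beta) = c(A -> beta) / sum over gamma of c(A -> gamma)
rule_prob = {rule: c / parent_totals[rule[0]] for rule, c in counts.items()}
for rule, p in sorted(rule_prob.items()):
    print(rule, round(p, 3))
# e.g. ('NP', ('N',)) gets 2/3, ('VP', ('V', 'NP')) gets 1/2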

Lexicalized Probabilistic Context-Free Grammars

It is possible to significantly improve the performance of a PCFG by adding additional bias to the language model that it defines.
Collins (1999) constructs a Lexicalized Probabilistic Context-Free Grammar (LPCFG) in which the probabilities of the CFG rules are conditioned on the lexical heads of the phrases that the nonterminal symbols represent.
In Collins' LPCFGs nonterminals are replaced by nonterminal/head pairs.

Lexicalized Probabilistic Context-Free Grammars

The probability distributions of the model are of the form Ps(N/h) and P(X1/h1 ... H/h ... Xn/hn | N/h).
Collins' LPCFG achieves an F-measure performance of approximately 88%.
Charniak and Johnson (2005) present an LPCFG with an F-score of approximately 91%.
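
An illustrative rendering of the nonterminal/head pairing (toy categories, head words, and probabilities; not Collins' actual parameterisation): under lexicalization the rule S → NP VP becomes, for a particular sentence, something like S/bought → NP/IBM VP/bought, and its probability is conditioned on the head word as well as the parent category.

# Lexicalized rule probabilities P(X1/h1 ... H/h ... Xn/hn | N/h), with each
# nonterminal given as a (category, head word) pair.  Numbers are made up.
lexicalized_rule_prob = {
    (("S", "bought"),  (("NP", "IBM"), ("VP", "bought"))):  0.9,
    (("VP", "bought"), (("V", "bought"), ("NP", "Lotus"))): 0.7,
}

parent = ("S", "bought")
children = (("NP", "IBM"), ("VP", "bought"))
print(lexicalized_rule_prob[(parent, children)])   # 0.9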

Bias in an LPCFG

Rather than encoding a particular categorical bias into his language model by excluding certain context-free rules, Collins allows all such rules.
He incorporates bias by adjusting the prior distribution of probabilities over the lexicalized CFG rules.
The model imposes the requirements that
sentences have hierarchical constituent structure,
constituents have heads that select for their siblings, and
this selection is determined by the head words of the siblings.

LPCFG as a Weak Bias Model

The biases that Collins, and Charniak and Johnson, specify for their respective LPCFGs do not express the complex syntactic parameters that have been proposed as elements of a strong bias view of universal grammar.
So, for example, these models do not contain a head-complement directionality parameter.
However, they still learn the correct generalizations concerning head-complement order.
The bias of a statistical parsing model has implications for the design of UG.

Bayesian Learning

Let D be data, and H a hypothesis.
Maximum likelihood chooses the H which makes D most likely: argmax_H P(D | H).
The posterior probability is proportional to the prior probability times the likelihood: P(H | D) ∝ P(H) P(D | H).
The maximum a posteriori (MAP) approach chooses the H which maximises the posterior probability: argmax_H P(H) P(D | H).
The bias is explicitly represented in the prior P(H).
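
A toy sketch of the distinction (the hypotheses and numbers are made up purely to illustrate how the prior, i.e. the bias, can change which hypothesis is selected):

# Hypotheses with a prior P(H) and a likelihood P(D | H) for some fixed data D.
hypotheses = {
    "large_overfitted_grammar": {"prior": 0.01, "likelihood": 2e-28},
    "compact_grammar":          {"prior": 0.30, "likelihood": 1e-28},
    "flat_grammar":             {"prior": 0.69, "likelihood": 1e-40},
}

# Maximum likelihood ignores the prior; MAP maximises prior times likelihood.
ml_choice  = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])
map_choice = max(hypotheses, key=lambda h: hypotheses[h]["prior"] * hypotheses[h]["likelihood"])

print("maximum likelihood choice:", ml_choice)      # large_overfitted_grammar
print("maximum a posteriori choice:", map_choice)   # compact_grammar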

Acquisition with a Weak Bias (Perfors et al. 2006)

Perfors et al. define a very general prior that does not have a bias towards constituent structure.
It includes both grammars that impose hierarchical constituent structure and those that don't.
In general it favours smaller, simpler grammars, as expressed in terms of the number of rules and symbols.
Perfors et al. compute the posterior probability of three types of grammar for a subset of the CHILDES corpus.
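
A rough sketch of a simplicity prior of this general kind (the scoring function is an assumption for illustration, not Perfors et al.'s exact measure):

import math

def grammar_size(rules):
    # Size measured as the number of rules plus the number of symbols on their right-hand sides.
    return len(rules) + sum(len(rhs) for _, rhs in rules)

def log_prior(rules):
    # Log-prior proportional to minus the description length, so smaller grammars score higher.
    return -grammar_size(rules) * math.log(2)

small_grammar = [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V", "NP"))]
big_grammar   = small_grammar + [("NP", ("D", "N")), ("VP", ("V",)), ("NP", ("NP", "PP"))]

print(log_prior(small_grammar) > log_prior(big_grammar))   # True: the smaller grammar is favoured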

Acquisition without a Constituent Structure Bias

The three types of grammar that Perfors et al. consider are (toy illustrations of each follow below):
a flat grammar that generates strings directly from S without intermediate non-terminal symbols,
a probabilistic regular grammar (PRG), and
a probabilistic context-free grammar (PCFG).
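
Toy rule sets of each type (the specific rules are only illustrative and the probabilities are omitted; these are not the grammars Perfors et al. evaluated):

# Flat grammar: S rewrites directly to whole word strings, with no intermediate nonterminals.
flat_grammar = [
    ("S", ("the", "dog", "barks")),
    ("S", ("dogs", "bark")),
]

# Probabilistic regular grammar: right-linear rules, a terminal followed by at most one nonterminal.
regular_grammar = [
    ("S",  ("the", "X1")),
    ("X1", ("dog", "X2")),
    ("X2", ("barks",)),
]

# Probabilistic context-free grammar: hierarchical constituent structure.
context_free_grammar = [
    ("S",  ("NP", "VP")),
    ("NP", ("D", "N")),
    ("VP", ("V",)),
]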

Acquisition without a Constituent Structure Bias

The PCFG receives a higher posterior probability value and covers significantly more sentence types in the corpus than either the PRG or the flat grammar.
The grammar with maximum a posteriori probability makes the correct generalisation.
This result suggests that it may be possible to decide among radically distinct types of grammars on the basis of a probabilistic model with relatively weak learning priors, for a realistic data set.