Statistical Learning and Language Acquisition

The potential contribution of statistical learning to second language acquisition

Luca Onnis

1. Introduction

Many fundamental aspects of human learning can be characterized as

problems of induction – finding patterns and generalizations in space and

time in conditions of uncertainty and from limited exposure. Some of

these problems include deriving abstract categories from experience (e.g.,

Tenenbaum & Griffiths, 2001); learning word meanings from their

co-occurrence with perceived events in the world (e.g., Frank, Goodman, &

Tenenbaum, 2009; Yu & Smith, 2007); learning the similarity and difference

of meanings from their co-occurrence with other words (Landauer

& Dumais, 1997); and acquiring the different levels of linguistic structure

(e.g., Bod, 2009; Solan, Horn, Ruppin, & Edelman, 2005). Although the

areas of application and specific theoretical claims vary, all these forms

of inductive learning can be described under a common framework for

problems arising in developmental psychology (Gopnik & Tenenbaum,

2007), inductive reasoning (Chater & Oaksford, 2008), language acquisition

(Bannard, Lieven, & Tomasello, 2009; Solan, Horn, Ruppin & Edelman,

2005), computational linguistics (Bod, 2002; Chater & Manning, 2006;

Jurafsky, 2003), and machine learning (MacKay, 2003). This common

framework encompassing experimental and computational approaches

can be termed distributional or statistical learning, because the focus is on

how learners discover structure from probabilistic information in the

environment1.

Behavioral studies have indicated that infants, toddlers, and adults can

rapidly extract structural properties of stimuli from probabilistic infor-

mation inherent in the input they are exposed to. For example, before the

age of three children implicitly use frequency distributions to learn which

phonetic units distinguish words in their native language (Kuhl, 2002; 2004;

1. Here I mainly use the term statistical learning to comply with the general trend in the literature. Terms like distributional/probabilistic learning/approaches are equally viable and often used interchangeably in the literature.


Maye, Werker, & Gerken, 2002), use the transitional probabilities between

syllables to segment words (Saffran, Aslin, & Newport, 1996), use word

distributions to discover syntactic-like relations among adjacent and non-

adjacent elements (see also Hay & Lany, this volume), learn to form

abstract categories (Gomez & Gerken, 2000), and rapidly establish form-

meaning mappings under conditions of uncertainty (Smith & Yu, 2008).

The behavioral findings on humans’ remarkable statistical learning

abilities have been enhanced and complemented by the rapid development

of robust and sophisticated computational methods that learn from corpora

of natural language (e.g. Bayesian, connectionist, dynamical systems; see

Chater & Manning, 2006; Griffiths et al., 2010; McClelland et al., 2010).

Such methods make it possible to obtain detailed information on the

nature of the input to which learners are exposed (e.g., distributional prop-

erties of language), as well as the structured environment in which learners

interact (e.g. parent-child naturalistic dyadic interactions, Dale & Spivey,

2006; Goldstein, King, & West, 2003; Roy, 2003). Importantly, these

methods now allow researchers to simulate the putative mechanisms

responsible for the behavioral findings.

This chapter asks what specific role statistical learning might play in

understanding processes of second language (L2) learning after the acqui-

sition of a first language. My first goal is to propose that L2 learners may

(be made to) become attuned to useful distributional regularities (to be

discussed below). Such regularities correlate non-randomly with structural

properties of language, for instance, phonetic boundaries, word units in

connected speech, phrasal constituents, morphemic structure, and lexical

semantics, suggesting that at least part of the acquisition of language

may involve the acquisition of knowledge of distributional regularities. In

particular, I want to propose four general learning principles that can be

gleaned from the statistical learning literature and applied to L2 learning

scenarios. These principles are: (1) Integrate information sources; (2) Seek

invariant structure; (3) Reuse learning mechanisms for different tasks;

and (4) Learn to predict. These principles are exemplified in four studies

that highlight the benefits of statistical learning at the sublexical, lexical,

morpho-syntactic, phrasal and lexico-semantic levels. They all explore

how distributional information can be brought to bear on assisting second

language learning2.

2. Coverage here is selective and illustrative rather than comprehensive. Other important related literature can be found in the accompanying chapters of this book (e.g., Ellis & O’Donnell; Johnson; Williams & Rebuschat, this volume).




My second goal is to elaborate on how these principles derived from

experimental studies in the laboratory can be put to use for specific prob-

lems arising in second language acquisition, and sketch out some practical

suggestions for bridging this laboratory-style research with L2 instruc-

tional practices. The upshot is that statistical learning can be used as a

diagnostic toolkit for identifying learner needs and pinpointing specific

areas for improvement in adult learners of language. In addition, statistical

learning principles can be effectively implemented as supportive solutions

for enhancing instruction and curricula. By considering the implications

that statistical learning research may hold for practical aspects of learning,

I hope to indicate some tentative directions in the integration of basic and

applied research. Lastly, a third goal is to propose that computational

analyses of language corpora and behavioral experiments can be jointly

used in the service of the two goals above.

2. Origins and development of statistical learning

The origins of probabilistic approaches to language can be traced back to

structural linguistics, and the focus on finding regularities in languages. In

the 1950s Zelig Harris (1954) proposed a series of heuristics for discover-

ing phonemes and morphemes, based on the distributional properties of

these units in natural languages. For Harris and distributional linguists,

the process of discovering the structure of an unknown language (e.g., an

indigenous language of the Amazon basin) was akin to cracking a code

created with a secret language. This intuitive idea was explored mathemat-

ically by Claude Shannon (1948), who developed encryption/decryption

systems during World War II based on the statistical structure of sequences

of letters in an encrypted message. His concept of entropy in information

theory describes how much ‘uncertainty’ there is in a signal transmitted

over time. This uncertainty can be reduced if, for example, one knows that

the character E is much more frequent in English than the character Z.

Shannon (1951) contributed the first

rigorous statistical approach to language as a sequence of letters.
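
The link between letter frequencies and uncertainty can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from Shannon's papers: the function name `entropy_bits` and the two toy strings are assumptions chosen to show that a skewed letter distribution carries less uncertainty per letter than a uniform one.

```python
import math
from collections import Counter


def entropy_bits(text: str) -> float:
    """Shannon entropy, in bits per letter, of the letter distribution in text."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # H = -sum(p * log2(p)) over the observed letter types.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical toy strings: four equally likely letters versus one dominant letter.
uniform = entropy_bits("ABCD")
skewed = entropy_bits("AAAAAAAB")
print(uniform, skewed)
```

With four equally likely letters the entropy is 2 bits per letter; when one letter dominates, the value drops below 1 bit, which is the sense in which knowing the frequencies reduces uncertainty.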

In the 1950s information-theoretic ideas inspired much work in the

nascent cognitive psychology, including the first experimental studies on the

learnability of formal linguistic systems. In a project named Grammarama,

George Miller (1956, 1967) asked adult participants to memorize strings

of letters such as XLLVXL that – unbeknownst to them – were either

random or followed a set of grammatical rules of sequencing (e.g.,

L must follow X ). The grammatical strings were generated by devices




called finite-state grammars, a class of possible formal grammars, and were

hence called artificial grammars. Miller found that participants memorized

grammatical strings much more quickly than random strings, suggesting

they had become sensitive to some of the rules generating such strings.

In the 1960s interest declined in both theoretical and behavioral ap-

proaches to distributional learning. The distributional methods of Harris

were seen as insufficient to capture the hierarchical linguistic relations

postulated by Chomsky (1957). In a similar way, Miller’s attempts to

explore the learnability of language-like systems were questioned because

of a perceived lack of common ground between artificial grammars and

natural languages to make plausible generalizations from one to the other.

For a couple of decades distributional approaches played a

minor role in language research. Artificial grammars were seen as better suited to

the study of general processes of implicit learning, not necessarily related

to language acquisition (e.g., Reber, 1967).

It was not until the 1990s that artificial grammars were applied to

infants and toddlers (e.g., Mattys, Jusczyk, Luce & Morgan, 1999; Saffran,

Aslin & Newport, 1996), documenting their remarkable abilities to use

a variety of probabilistic regularities in the speech signal. Adult learners

were also studied on the assumption that they usefully approximated

‘human simulations’ of infant learning (Gillette, Gleitman, Gleitman, &

Lederer, 1999; Redington & Chater, 1996). Departing from the implicit

learning studies of the earlier decades, researchers began to create miniature

languages that more closely mimicked distributional and structural aspects

of natural languages (see sections 3 to 5 below for practical examples). The

1990s also saw the development of sophisticated computational analyses of

language corpora. For example, Nick Chater and colleagues (Redington,

Chater, and Finch, 1998) provided large-scale computational analyses of

child-directed language transcriptions showing that distributional information may

actually be extremely useful to children in acquiring the abstract syntactic

categories of words, such as nouns and verbs (see also Redington and

Chater, 1996, 1997).

Recently, an even more direct link between statistical learning and

natural language has been documented. Studies that compare directly statis-

tical learning and language processing (e.g., in within-subject designs where

the same participants are tested on statistical learning as well as natural

language tasks) are finding that similar cognitive and neural mechanisms

may be recruited for both syntactic processing of linguistic stimuli and sta-

tistical learning of structured sequence patterns more generally (Christiansen,

Conway, & Onnis, 2012; Misyak & Christiansen, 2010). Moreover, a breakdown

in statistical learning abilities has been documented in children diagnosed

with agrammatic aphasia (Christiansen, Kelly, Shillcock & Greenfield,

2010) and with specific language impairments (SLI; Evans, Saffran, &

Robe-Torres, 2009).

Having provided a brief historical overview of statistical learning, in

the next sections I present research exemplifying the four learning principles

to be applied to second language learning.

3. Learning principle I: Integrate probabilistic sources of information

Words in natural languages exhibit a rich statistical sublexical structure.

Speech sounds (and letters in written words), both within and between words, do

not occur with equal frequency or in arbitrary order, but display distributional

regularities in the sequences they form. Are these regularities related

to the order of sounds (phonotactics) and letters (orthotactics) a mere

epiphenomenon, or can they actually provide useful cues to learn a language?

Research to date already suggests that these properties of words are used

in the context of segmenting speech (Saffran, Aslin, & Newport, 1996),

identifying phonetic contrasts (Thiessen, 2007), detecting orthographic

(Pacton, Perruchet, Fayol, & Cleeremans, 2001) and phonotactic restric-

tions (Chambers, Onishi, & Fisher, 2003), and constraining speech pro-

duction errors (Dell, Reed, Adams, & Meyer, 2000).

Furthermore, knowledge of phoneme distributions may aid in different

aspects of language learning simultaneously, such as speech segmentation

and identification of lexical categories (Christiansen, Onnis, & Hockema,

2009). In line with this work, a preliminary study in my laboratory is

investigating whether implicit knowledge of phonotactics (distribution of

sounds in speech) and orthotactics (distribution of letters in text) might

assist the learning of novel phonetic contrasts. As practical examples,

both English and Chinese native speakers are insensitive to the singleton/

geminate distinction in Italian, e.g., /pala/ (shovel) versus /palla/ (ball),

and Japanese learners of English need to learn a new phonemic distinction

between /l/ and /r/. Methods targeted at improving speech perception

contrasts have mainly focused on training learners with minimally dis-

similar word pairs, such as ‘right-light’ (e.g., Akahane-Yamada et al., 2004;

Lively, Pisoni, & Yamada, 1994). Although partially successful, this type

of training potentially loses structural cues like phonotactic probability,

since the training regime makes /r/ and /l/ equally likely in all contexts.

The computational analyses and experiment described next are an initial




demonstration that sublexical distributional information may facilitate the

identification of novel contrasts in adult second language learners.

3.1. Corpus analyses of probabilistic phonotactics

In order to assess the informativeness of contextual cues for predicting an

/l/ or /r/ segment and L and R letters in English words, statistical analyses

were carried out on the English Lexicon Project (ELP; Balota et al.,

2007)3, which includes phonetic transcriptions. I calculated the probability

of phonetic and orthographic sequences that immediately precede and

follow the segments /l/ and /r/ and the letters L and R respectively. Imme-

diate context was operationalized as two elements (segment or letter) to

the left and to the right of a target element (letter L and R or segment /l/

or /r/). I refer to such immediate contexts as phonotactic and orthotactic

frames. For example, the word CURTAIN yields the frame U*T (if one

flanking letter is considered to each side of R) and the frame CU*TA (if

two flanking letters are considered to each side of R). The question was

whether frames can be used to reliably predict the target segment. The

degree of informativeness of a given frame can be estimated as the condi-

tional probability as follows:

P(/l/ | frame) = freq(frame occurring with /l/) / (freq(frame occurring

with /l/) + freq(frame occurring with /r/)).

For example:

P(L | CU_TA) = freq(CULTA) / (freq(CULTA) + freq(CURTA))
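
The estimate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis actually run on the ELP: the function name `frame_probabilities`, the toy lexicon and its frequency counts are invented, occurrences are weighted by word frequency, and targets lacking two flanking letters on either side are simply skipped.

```python
from collections import defaultdict


def frame_probabilities(words, targets=("L", "R"), width=2):
    """Map each frame to the proportion of its L/R occurrences that contain L.

    `words` maps a word to a frequency count (hypothetical values here).
    """
    counts = defaultdict(lambda: {t: 0 for t in targets})
    for word, freq in words.items():
        w = word.upper()
        for i, ch in enumerate(w):
            if ch not in targets:
                continue
            # Assumption: ignore targets without `width` letters on both sides.
            if i < width or i + width >= len(w):
                continue
            frame = w[i - width:i] + "_" + w[i + 1:i + 1 + width]
            counts[frame][ch] += freq
    first = targets[0]
    return {f: c[first] / sum(c.values()) for f, c in counts.items()}


# Invented toy lexicon (CULTAN is a made-up item, counts are arbitrary).
toy_lexicon = {"CURTAIN": 9, "CULTAN": 1, "SALAD": 4}
probs = frame_probabilities(toy_lexicon)
print(probs)
```

In this toy lexicon the frame CU_TA occurs nine times with R and once with L, so its estimated probability of flanking an L is 0.1, while SA_AD occurs only with L and receives a probability of 1.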

For each frame type found in the ELP corpus, the conditional probability

was estimated as above. Figure 1 provides a histogram showing the distri-

bution of phonotactic frames (left panel) and orthotactic frames (right

panel) as a function of how likely they are to flank an L, given the proportion

of occurrences in the English corpus for which an L or an R was found.

The bar height indicates the number of frame types with a given probability

of having an L between them. There are 100 bins in the histogram, so each

bin accounts for a probability range of .01. The figure illustrates that the

3. This corpus is composed of more than 40,000 English word types accompanied by their log-frequency of use. The corpus data reported here are part of a manuscript in preparation. The experimental data reported in Section 3.2 form part of a thesis for the Advanced Graduate Certificate in Second Language Studies at the University of Hawaii (Uchida, 2010).




distribution of frames in English words is strongly bimodal. Most frames

are associated only with an L or with an R segment, but not both. Indeed,

the left- and right-most bins account for 60% of the frame types occurring

with L and R in speech. This analysis shows the distribution to be very

informative in terms of identifying two distinct categories, and is similar

in the case of letters and phonemes. In other words, a typical L1 or L2

speaker exposed to reasonable amounts of natural English input will expe-

rience L (but not R) mostly in the frame K_E and R (but not L) mostly

in the frame Z_A. Even if this fact is unbeknownst to speakers at the

conscious level, the information is important for SLA researchers and

language teachers when identifying learning difficulties or when designing

materials that may support the learning of the L/R distinction. This is

because these consistent frames become predictive of when an L or an

R is more likely. Thus, speakers may act upon this information in their

regular language use. As a corollary, if learners are helped to become

sensitive to this type of cue, their implicit mechanisms may be tuned to

use it predictively as well.

Figure 1. Phonemes and letters immediately flanking L and R in English words are highly predictive of either L or R. y axis: the number of frame types that predict an L (as opposed to an R); x axis: the probability of predicting an L versus an R. Of the 589 phonotactic frame types and 589 orthotactic frame types found in English, most predict an L with a high degree of certainty (rightmost column) or predict an R with a high degree of certainty (leftmost column; a probability = 0 for predicting L equals a probability = 1 for predicting R). Analyses with frame tokens exhibit a similar bimodal distribution.
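
The bimodality claim amounts to asking what share of frame types falls into the two extreme bins of a 100-bin histogram. The sketch below shows one way to compute that share; the function name `extreme_bin_share` and the list of probabilities are invented for illustration and do not reproduce the ELP figures.

```python
def extreme_bin_share(probabilities, n_bins=100):
    """Share of frame types that fall in the lowest and highest histogram bins."""
    bins = [0] * n_bins
    for p in probabilities:
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index] += 1
    total = sum(bins)
    return (bins[0] + bins[-1]) / total if total else 0.0


# Hypothetical per-frame probabilities of flanking an L.
toy_probs = [0.0, 0.0, 1.0, 1.0, 1.0, 0.5, 0.3, 0.004, 0.996, 0.75]
share = extreme_bin_share(toy_probs)
print(share)
```

For these ten invented values, seven fall in the first or last bin, giving a share of 0.7. A strongly bimodal distribution of the kind described for English frames would show a similarly large share at the extremes.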

3.2. An experiment with English pseudowords

The computational analyses carried out on natural language such as the

one above provide a way to estimate the potential informativeness of a cue

inherent in a language – here phonotactics and orthotactics of English. Do

people actually use such cues? This question was tested in a letter guessing

game similar to the classic ‘hangman’, in which English native speakers

and Japanese learners of English were presented with a list of ortho-

graphic pseudowords lacking one letter (e.g., SA*G ). The game consisted

of guessing which one of two letters is the most likely for a given pseudo-

word. Critical trials contained the R-L pair (‘‘Is L or R the missing letter?’’),

while filler trials contained other letter pairs (e.g., ‘‘Is M or N the missing

letter?’’). The critical trials were 90 frames from the corpus analyses above.

One third were sampled from the 30 most frequent frames in the right-most

bin of Figure 1, or those being in principle very informative in predicting

an L (L-informative). Another 30 critical trials were chosen among the 30

most frequent frames in the left-most bin of Figure 1 (R-informative).

As a control, another 30 critical trials were frames sampled among those

having a probability close or equal to 0.5, or those being the least

informative (LR-ambiguous).

Results indicated that both native English speakers and Japanese learners

of English preferred L-responses most for trials containing frames predictive

of L and least for trials containing frames predictive of R. For frames that

were ambiguous between L and R, the difference in preference for L or R

was not significant for either the native or the non-native group. Thus, both

groups’ responses reflected the bimodal distribution of orthographic frames

in English. Participants did not just make a random guess about a single

letter in isolation. Rather, their linguistic choices under uncertainty were

guided intuitively by the integration of a larger context of information,

the sublexical distribution of letters in English. There were two further

interesting results pertaining to the Japanese participants. First their prefer-

ence for reading in English (measured on a self-assessment scale) correlated

with better predictions for the missing letter, suggesting that experience with

reading texts in a second language may naturally induce sensitivity to

orthotactics. Second, Japanese participants were poor on the classic perception

discrimination task with various spoken tokens of /l/ versus /r/, which tests

perception of sounds in isolation. The encouraging results on their ability

to use orthotactics leave open the possibility that when tested on a

perception task that involves phonotactic frames these participants may

improve their perception judgments and better discriminate /l/ and /r/

sounds. Thus, the hard problem of perceiving novel speech contrasts may

be hardest when tested in isolation, and yet it may be attenuated in the

presence of other sources of information available in the signal (coarticu-

lation may be another cue not investigated here).

At present, these results suggest that pronunciation practices that situate

learning targets within highly informative phonotactic contexts may also

be advantageous in principle. Inviting L2 English learners to listen to

statements, then say whether they are ‘obviously true’, ‘strongly implied’,

or ‘clearly false’ may help surmount problems with training using minimal

pairs. Because these phrases exhibit a variety of /l/ and /r/ frames, some of

them strongly favoring one phoneme or the other (refer to the underlined

segments), they may be easier for learners who have already had some

exposure to the language to produce. Also, the pedagogical emphasis can

be on communication and intelligibility, rather than the far more difficult

ability to distinguish between sounds in contexts disguising such distribu-

tional tendencies (e.g., ‘light’ and ‘right’). In the next section, I illustrate a

case of useful distributional regularities above the lexical level, the discovery

of non-adjacent morphosyntactic relations in language.

4. Learning principle II: Seek invariance

Various aspects of inflectional morphology, such as gender and number

agreement on noun phrases and verb phrases, remain particularly difficult

to master even for second language learners at advanced levels of

proficiency (e.g., Montrul, Foote, & Perpiñán, 2008; Slabakova, 2008). The

phenomenon has generated a lively debate on the nature of such insensi-

tivity, with some accounts claiming lack of accessibility to L1-like linguistic

knowledge, and others placing the burden on online processing deficiencies

(for a review, see Clahsen & Felser, 2006). While much attention has been

devoted to the theoretical underpinnings of such insensitivity, and peda-

gogical research has addressed improving morphosyntax (see, e.g., Spada

& Tomita’s 2010 meta-analysis of the effects of instruction on simple and




complex linguistic features), few studies have taken underlying statistical

learning ability into account, although studies have looked at the influence

of frequency of forms (e.g., Ellis & Schmidt, 1998) and the role of implicit

and explicit learning (e.g., Robinson, 2005). Are there ways to improve L2

processing abilities, for instance by making the target structures distribu-

tionally salient?

4.1. Artificial languages that mimic natural language

As noted in the introduction, artificial grammar learning tasks have been

used extensively to inquire into the nature of human implicit processes

(Cleeremans, Destrebecqz, & Boyer, 1998; Shanks, 2005) and their

relationship to language knowledge (Kaufman, DeYoung, Gray, Jiménez, Brown,

& Mackintosh, 2010; Misyak & Christiansen, 2011). Tasks that tap into

language processes typically involve exposing participants to sentence-like

sequences of word-like stimuli (presented either visually or auditorily),

such as these: pel wadim jic, vot puser tood, dak wadim rud, vot loga tood.

While these pseudo-sentences appear random, they respect some under-

lying rule defined a priori by the experimenter, and learners exposed to

limited exemplars in relatively brief sessions end up becoming sensitive

to such rules, even though they cannot often explicitly verbalize what

the hidden rules were. For example, the pseudo-sentences above were

constructed by Gomez (2002; Figure 2) to simulate the learning of non-

adjacencies similar to morphosyntactic agreement and other non-local

structural regularities in natural languages: each specific first word predicts

a specific third word all the time. In the examples above, pel predicts jic,

vot predicts tood, and dak predicts rud consistently (e.g., probability

P(jic | pel) = 1), while the second (middle) word has no predictive value,

for example wadim precedes any third word with equal probability

(P(jic | wadim) = 0.33). It is possible to test learners’ implicit knowledge

after training by presenting grammatical (pel wadim jic) as well as

ungrammatical sentences (*pel wadim rud), and even sentences that have

zero probability: for instance, the sentence pel hiftam jic is not

encountered during the learning phase, but it crucially maintains the correct

structural non-adjacent relations (pel __ jic).
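
The two conditional probabilities can be checked on a miniature version of such a language. The sketch below is an illustration rather than a reconstruction of the original stimuli: it assumes three middle words that each combine with every first word (the published conditions used other set sizes), and the names `PAIRS`, `MIDDLES` and `conditional` are hypothetical.

```python
from itertools import product

# First word -> third word: the non-adjacent dependencies.
PAIRS = {"pel": "jic", "vot": "tood", "dak": "rud"}
# Assumed set of interchangeable middle words (size 3 chosen for illustration).
MIDDLES = ["wadim", "puser", "loga"]

# Every first word combines with every middle word; the first word fixes the third.
sentences = [(a, x, PAIRS[a]) for a, x in product(PAIRS, MIDDLES)]


def conditional(sentences, given_pos, given, target_pos, target):
    """P(word at target_pos is `target` | word at given_pos is `given`)."""
    matching = [s for s in sentences if s[given_pos] == given]
    hits = sum(1 for s in matching if s[target_pos] == target)
    return hits / len(matching)


p_nonadjacent = conditional(sentences, 0, "pel", 2, "jic")
p_adjacent = conditional(sentences, 1, "wadim", 2, "jic")
print(p_nonadjacent, p_adjacent)
```

Under these assumptions the first word predicts the third with probability 1, whereas the middle word predicts it with probability 1/3, so only the non-adjacent relation is informative.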

In my brief review of the origins of statistical learning I noted that an

important development in the use of artificial grammars has been their

much closer contact with natural language phenomena. For example,

Gomez (2002) noted that sequences in natural languages typically involve




some items belonging to a relatively small set (functor words and mor-

phemes like am, the, -ing, -s, are) interspersed with items belonging to a

very large set (e.g. nouns, verbs, adjectives). Such asymmetry translates

into patterns of highly invariant nonadjacent items separated by highly

variable material (am cooking, am working, am going, etc.). How do learners

detect non-adjacent invariant structures? Gomez showed that the variability

of the material intervening between dependent elements (the first and third

word in her study) modulates the ability to detect non-local dependencies

in the grammar above. Learning improves consistently as the variability of

elements that occur between two dependent items increases. One explana-

tion for this pattern is that when the set of items that participate in the

dependency is small relative to the set of elements intervening, the non-

adjacent dependencies stand out as invariant structure against the changing

background of more varied material, as in pel wadim jic, pel puser jic, pel

coomo jic, pel loga jic, dak coomo rud, dak wadim rud, dak puser rud, etc.

(see Figures 2 and 3, columns 2–5; the different intervening words are

indicated as indexed Xs).

Figure 2. The underlying structure of the artificial grammars used by Gomez (2002; columns 2–5) and Onnis et al. (2003; 2004; columns 1–5). Sentences with three non-adjacent dependencies are constructed with an increasing number of possible intervening X words. Gomez used 2, 6, 12, and 24 intervening words. Onnis et al. added a new condition in which X = 1.




This effect also holds in the absence of variability of intervening words

shared by different nonadjacent items, as in pel wadim jic, vot wadim tood,

dak wadim rud (the word wadim is common to all sentences, see Figures 2

and 3, first column), as the intervening material becomes invariant with

respect to the variable dependencies (Onnis, Christiansen, Chater, &

Gomez, 2003). In natural languages, long-distance relationships such as

singular and plural agreement between noun and verb may in fact be

separated by the same material, for example the books on the shelf are dusty and the book[0] on the shelf is dusty.

Importantly, while artificial grammars appear at first limited in their

generativity, they can be used to test learners’ knowledge to generalize cor-

rectly to novel sentences never encountered in the training. For example,

the ability of adult learners to endorse pel hiftam jic while rejecting *pel

hiftam rud (where hiftam is a new word) is modulated by the same con-

ditions of zero or high variability (Onnis, Monaghan, Christiansen, &

Chater, 2004). The upshot of these studies on variability is that there is a

Figure 3. Data from Onnis et al. (2003) incorporating the original Gomez experiment. Learning of non-adjacent dependencies results in a U-shaped curve as a function of the variability of intervening X words, in five conditions of increasing variability.




striking tendency among learners to detect variant versus invariant structure that is in turn

adaptive to the informational demands of their input (for putative

mechanisms responsible for this effect see Gomez & Maye 2005). And learning a

rule such as a non-adjacent dependency is not an all-or-none phenomenon,

but is mediated by the distributional properties under which such dependencies

happen to be experienced.

4.2. Invariance is at hand

A complementary line of studies using artificial grammars and corpora of

naturalistic child-directed speech has highlighted how invariant linguistic

structure may be at hand, namely available to learners in a short window

of time. Here I would like to show how invariant features of language can

be detected when two sentences are allowed to partially overlap in imme-

diate succession in a text or in the speech stream. In a study by Onnis,

Waterfall and Edelman (2008) adult learners were asked to find the novel

words of an alien language out of unparsed (unsegmented) whole sentences

such as kedmalburafuloropesai. In the absence of acoustic and prosodic cues

(the sentences were generated by a speech synthesis software), each sentence

could in principle be composed of a range of possible words, from a single

long word to as many words as there were identifiable syllabic clusters.

Again, as the words in the sentences were all novel, the task was difficult,

and it simulated some of the features and conditions involved in second

language learning. It was found that learners were significantly better at

the word segmentation task when a portion of sentences in the training

set were ordered so as to partially repeat themselves one after the other

(e.g., kedmalburafuloropesai, rafuloro), as opposed to a control learning

situation in which no sentences overlapped immediately (although the train-

ing set was composed of exactly the same sentences in both conditions).

Such immediate partial repetition across sentences facilitates comparison.

When aligned, the partial overlap of the two sentences suggests three

candidate units (kedmalbu, rafuloro, pesai) without the need for learners to

entertain all possible unit candidates over several sentences. Importantly,

the study also found evidence for a global learning effect. That is, not only

did learners more reliably prefer word units heard in partially self-repeated

sentences during learning, but they also segmented units that never occurred

in such order more accurately (e.g. gianaber, kiciorudanamjeisulcaz). Similar

results were found in a second experiment in which the phrasal structure of

sentences was to be discovered, suggesting that the same mechanisms of

comparison of invariant structure can signal structure at different levels

of linguistic analysis.
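
The alignment step described above can be sketched as a simple string operation. This is a deliberately simplified illustration, not a model of what learners do: it assumes the second utterance is contained whole in the first, and the function name `split_on_overlap` is hypothetical.

```python
def split_on_overlap(long_sentence: str, short_sentence: str):
    """Split long_sentence into candidate units around the repeated stretch."""
    before, match, after = long_sentence.partition(short_sentence)
    if not match:
        # No overlap: this pair of sentences suggests no segmentation.
        return [long_sentence]
    # Drop empty edges when the overlap sits at the start or end.
    return [unit for unit in (before, match, after) if unit]


units = split_on_overlap("kedmalburafuloropesai", "rafuloro")
print(units)
```

For the example pair from the study, the repeated stretch rafuloro divides the longer sentence into the three candidate units kedmalbu, rafuloro and pesai, without any search over all possible segmentations.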




How can these laboratory studies inform L2 instruction? First, it is

possible to construct L2 teaching materials that reflect the principles of

variability over invariant structure described above. Given the instructed

nature of L2 learning, the input to an L2 learner can be manipulated to

a large extent – and much more flexibly than the input to a child. For

instance, applying the concept of large variability to morphosyntactic rela-

tions in Spanish might involve a sequence of sentences like:

Tengo las botas para el matrimonio.

Tengo las películas para el fin de semana.

Tengo las pelotas para el niño.

In the examples above the feminine gender and plural number agreements

(las __ -as) are repeated while the intervening lexical items are modified

(bot-, películ-, pelot-). The prediction is that a large enough number of

intervening words should facilitate the extraction of the invariant non-

adjacent relations (las __ -as), either implicitly or by promoting the explicit

noticing of the target structure (Schmidt, 2001). It may also be useful to

keep constant the non-target elements of the sentence, so that the target

elements can vary. For instance, according to the zero variability condi-

tion described in Onnis et al. (2003; 2004) the following learning situation

could also have a facilitative effect, where the same lexical item is shared

between the two gender-agreement constructions in Spanish:

Mabel es la amiga de Carlos

Carlos es el amigo de Maria.
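To make the notion of an invariant frame with variable fillers concrete, the following Python sketch pulls out the material intervening between las and -as in the three Tengo las ... sentences given earlier. The function name and the regular expression are illustrative assumptions of mine; they are a convenience for inspecting teaching materials, not a model of how learners extract the frame.

```python
import re


def extract_frame_fillers(sentences):
    """Return the stems that fill the non-adjacent frame 'las __ -as'.

    Hypothetical helper: the frame is hard-coded as a regular expression.
    """
    pattern = re.compile(r"\blas\s+(\w+?)as\b", flags=re.IGNORECASE)
    fillers = []
    for sentence in sentences:
        fillers.extend(pattern.findall(sentence))
    return fillers


sentences = [
    "Tengo las botas para el matrimonio.",
    "Tengo las películas para el fin de semana.",
    "Tengo las pelotas para el niño.",
]
print(extract_frame_fillers(sentences))  # ['bot', 'películ', 'pelot']
```

Counting the distinct fillers returned for a set of sentences gives a simple check on how much variability a set of materials provides for a given frame.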

I have described principles of distributional learning that exploit the

contrast between variable and constant materials. As mentioned earlier,

one of the goals of this chapter is to show how such principles may apply

to a range of learning situations, and thus be reusable in different tasks.

Theoretically, this raises the possibility of understanding human learning in

terms of a relatively small set of mechanisms. Practically, second language

learning problems that are treated as different or unrelated may be amenable

to similar distributional solutions.

5. Learning principle III: Reuse learning mechanisms

In the introduction section I discussed how statistical learning can be

viewed as a framework for identifying inductive learning mechanisms.

How many mechanisms can we identify? Are they restricted to specific

tasks and/or modalities? While there are different types of sensitivities

(e.g., to conditional probabilities – see Section 3 – and invariant structure –

Section 4), an assumption of parsimony makes it reasonable to assume that

only a small number of such mechanisms exists, and these are recruited for

different learning tasks and linguistic levels of representation.

The aim of this section is thus to illustrate how the same learning mech-

anisms that apply to finding words in running speech and discovering

phrasal units (discussed in Section 4) may also apply when learning form-

meaning mappings. Expanding a paradigm used by Yu & Smith (2007),

Onnis, Edelman and Waterfall (2011) demonstrated that form-meaning

mappings in a word learning task were learned significantly better when

the input was structured as partial self-repetitions. During a learning

phase, adult participants saw multiple novel pictures simultaneously and

heard multiple novel words, creating ambiguity regarding correct word-

to-picture mappings for a given trial. For instance, when four words and

four pictures were presented simultaneously on each trial, there could be

4 × 4 possible word-referent combinations (Figure 4).

The participants’ task was to infer the correct word-picture mappings

across these training trials. At test, they heard a single word and selected

the picture (among four) that they thought mapped onto that word.

Importantly, this task can only be solved if relations between words and

referents are tracked across multiple trials, hence the term cross-situational

learning (e.g., Yu & Smith, 2007). Onnis and colleagues were able to show

that a learning condition where a specific single word-referent pair repeated

successively across any two given trials (while all other pairs differed) contributed

to the immediate disambiguation of the word-scene mappings.

Importantly, even pairs that had not appeared in such contiguous condi-

tions were shown to be learned, suggesting that principles of local alignment

and comparison did not only affect the pairs involved locally, but had a

global benefit on learning the form-meaning pairs.
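The logic of the task can be illustrated with a short Python sketch. It is a toy illustration under stated assumptions, not the model or the stimuli of the studies cited: the novel words, the referent labels R1 to R7, and both function names are invented here. The first function tallies word-referent co-occurrences across trials (the cross-situational strategy); the second compares two successive trials directly, as in the partial self-repetition condition.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(trials):
    """Count how often each word co-occurs with each referent across trials."""
    counts = Counter()
    for words, referents in trials:
        counts.update(product(words, referents))
    return counts


def shared_across(trial_a, trial_b):
    """Return the words and referents that two successive trials have in common."""
    words = set(trial_a[0]) & set(trial_b[0])
    referents = set(trial_a[1]) & set(trial_b[1])
    return words, referents


# Hypothetical stimuli: four novel words and four referents per trial.
trial_1 = (["bosa", "gasser", "manu", "colat"], ["R1", "R2", "R3", "R4"])
trial_2 = (["bosa", "kaki", "regli", "stigson"], ["R1", "R5", "R6", "R7"])

# Within a single trial every one of the 4 x 4 pairings is equally plausible.
print(len(set(product(*trial_1))))  # 16

# Across the two trials only one pairing has occurred twice.
counts = cooccurrence_counts([trial_1, trial_2])
print(counts.most_common(1))  # [(('bosa', 'R1'), 2)]

# Direct comparison of the successive trials isolates the repeated pair.
print(shared_across(trial_1, trial_2))  # ({'bosa'}, {'R1'})
```

When only one pair repeats across adjacent trials, local comparison resolves the mapping in a single step, whereas the tallying strategy needs more trials before any one pairing stands out when repeated pairs are scattered through the training set.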

Cross-situational learning offers a useful approach to modeling naturalistic

L2 learning, in addition to yielding results that can potentially inform

foreign language instruction. For example, Robinson and Ellis (2008),

following Slobin (1996), have referred to the adjustment required to use

conceptually novel form-meaning correspondences as rethinking for speak-

ing. In this respect, word-referent learning experiments may shed light on

the learning of lexical items and the structuring of conceptual domains in a

second language. In addition, L2 vocabulary acquisition takes place in a

rich extra-linguistic context. Quine (1960) illustrated this in his hypothe-

tical account of a field linguist attempting to discern the meaning of the




word gavagai (see also Ellis, 2005). In order to approximate the com-

plexity of cross-situational learning in naturalistic environments, labora-

tory studies may need to increase the number of referents in a given trial,

since visual scenes in the real world typically present much richer evidence

for learning (see Cenoz & Gorter, 2008, for a multimodal account of the

L2 linguistic landscape). Because the composition of actual visual scenes

may overwhelm learners’ computational abilities, prior knowledge of con-

ceptual and social domains, which is readily available to L2 learners,

could also be incorporated in further experiments to inform the design

of computer-based vocabulary tutorials and explore how learners might

solve Quine’s dilemma outside of virtual learning environments.

Figure 4. Two possible trials in the cross-referential learning paradigm used by Onnis, Edelman, & Waterfall (2011). In each single trial the simultaneous presentation of 4 novel words and referents makes the form-meaning mapping task impossible. However, across trials learners were able to reduce uncertainty by comparing the elements that changed versus those that stayed constant. In the example here, one word-referent pair remains constant across the two trials. Which one is it?

The training interventions briefly envisioned here await robust evidence

before they can be adapted to real-world L2 scenarios, but they open up

ways to connect basic research to instructional concerns. In the following

section, I conclude my overview of illustrative examples by looking at how

meaning can be inferred from distributions of words across texts, and how

knowledge of lexical distributions improves reading fluency.

6. Learning principle IV: Learn to predict

Corpus analyses suggest that many words entail probabilistic semantic

consequences that can be expressed as expectations for upcoming words.

For instance, in English, the verb provide typically precedes positive words

as in to provide assistance/benefits/relief, while the verb cause typically

precedes negative items, as in to cause death/damage/disruptions (Sinclair,

1996). Interestingly, while the denotational meaning of say cause involves

an agent and an effect, there is little reason to assume a priori that in

actual use cause may be associated with negative words (Guo et al., in

press). Furthermore, while many speakers are fortunate enough to never

directly experience negative events such as bleeding, war, and death, they

learn that for instance wars are caused by famine rather than wars are pro-

vided by famine. Thus, although not the only way of discovering meaning,

the connotational meaning of certain words may emerge as being distributed

over the co-text and co-speech of their occurrences in natural language.

On these assumptions, connotational meaning naturally lends itself to

being modeled by distributional analyses of corpora.

One class of available computational models of semantic knowledge –

semantic space models – represents each word as a vector in a high-

dimensional state space (Rogers & McClelland, 2004; Vigliocco, Vinson,

Lewis, & Garrett, 2004). The meaning of a word is obtained from the

frequency distributions of the words that occur in the immediate context

of a target word, over a large corpus. This method captures empirically

the intuition that words that occur in the same sorts of contexts tend to

be similar in meaning. For example, road and street are similar because

they occur in similar co-texts (down the road/street, cross the road/street,

the road/street to the left) and are dissimilar from tea and coffee, which

co-occur with other words (coffee/tea and sugar, pour a cup of coffee/tea).
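The intuition behind semantic space models can be sketched in Python with a toy corpus built from the phrases just mentioned. Everything in the sketch is an assumption made for illustration (the eight-phrase corpus, the two-word context window, raw counts as vector entries, cosine as the similarity measure); the published models cited above are trained on large corpora and differ in their details.

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """Map each word to a count vector of the words occurring within `window` positions."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            vectors.setdefault(target, Counter()).update(left + right)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus (an assumption for illustration only).
corpus = [
    "walk down the road", "walk down the street",
    "cross the road", "cross the street",
    "pour a cup of coffee", "pour a cup of tea",
    "coffee and sugar", "tea and sugar",
]
vectors = context_vectors(corpus)

print(cosine(vectors["road"], vectors["street"]))  # close to 1: same contexts
print(cosine(vectors["road"], vectors["coffee"]))  # 0.0: no shared contexts
```

In this toy corpus road and street share all of their contexts and none with coffee or tea, so the similarity structure falls out of co-occurrence counts alone.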

Using a vector space model, Onnis, Farmer, Baroni et al. (2009) were

able to derive the semantic orientation (valence tendency) of a number

of words such as cause, provide, encounter, markedly, largely, impressive,




purely on distributional grounds. This orientation was measured as a signed

value for each word. The authors further obtained independent human

values of semantic orientation in a sentence continuation task. Native

speakers of English were asked to provide a free completion for sentences

like “The mayor was surprised when he encountered...”. When the portions

of sentence continuations were scored as positive or negative on a Likert

scale by a different group of participants, their values correlated significantly

with those assigned by the vector space model in a purely distributional

way. This suggests that a) native speakers are sensitive to the general

semantic orientation of a word, and constrain their free productions to

accommodate it; and b) the semantic orientation of a word can be inferred

automatically by simple distributional properties of texts (the computer

model does not have any inbuilt notion of semantics). Computer models

like this one might approximate to a fair degree the cognitive mechanisms

available to human learners.
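One simple way to obtain a signed orientation value from text is sketched below in Python. It should not be read as the procedure of Onnis, Farmer, Baroni et al. (2009), whose details are not given here; it is a deliberately minimal seed-word count, and the toy corpus, the seed lists, and the function name are all assumptions introduced for illustration. The score is the proportion of positive minus negative seed words found shortly after the target word.

```python
def semantic_orientation(target, sentences, positive, negative, window=2):
    """Signed score in [-1, 1]: +1 if only positive seeds follow the target, -1 if only negative."""
    pos = neg = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            following = tokens[i + 1:i + 1 + window]
            pos += sum(word in positive for word in following)
            neg += sum(word in negative for word in following)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Toy corpus and seed lists (assumptions), echoing the collocations cited from Sinclair (1996).
corpus = [
    "agencies provide assistance", "employers provide benefits", "donors provide relief",
    "storms cause damage", "wars cause death", "strikes cause disruptions",
]
positive_seeds = {"assistance", "benefits", "relief"}
negative_seeds = {"death", "damage", "disruptions"}

print(semantic_orientation("provide", corpus, positive_seeds, negative_seeds))  # 1.0
print(semantic_orientation("cause", corpus, positive_seeds, negative_seeds))    # -1.0
```

In a realistic corpus the scores would fall between these extremes, and it is such graded, signed values that could be compared against human ratings of sentence continuations.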

The presence of valence tendencies may facilitate language compre-

hension in real-time situations. If producing a given word in a sentence,

say the verb to encounter, prompted speakers to narrow down the set of

possible sentence continuations, then on the comprehender’s side sensi-

tivity to this semantic valence tendency may help anticipate the sentence

continuation, resulting in a measurable gain in comprehension fluency.

This idea was tested in a self-paced reading experiment in which words

in sentences were presented one by one incrementally, and participants

pressed a key on the keyboard to read the next word. This allows the mea-

surement of reading times for each given word in a sentence. It was found

that on-line reading was slowed down significantly in sentences that

contained an incongruent semantic orientation (e.g., the news on television

caused optimism in the audience), as opposed to when the sentences

contained a congruent semantic orientation (the news on television caused

pessimism in the audience). There is mounting evidence in the sentence

processing literature that humans use expectations as the sentence unfolds

in order to reduce the set of possible competitors to a word or sentence

continuation (e.g., Altmann and Kamide, 1999; Tanenhaus et al., 1995).

At each time step the linguistic processor uses the currently available input

and the lexical information associated with it to anticipate possible ways

in which the input might continue.

An important consideration is that distributional patterns of words

afford speakers the necessary fluent generativity to understand and produce

not only crystallized collocations (e.g. to cause damage which has a high

co-occurrence and is probably learned by rote), but also novel ‘on-the-fly’

combinations of words that are nonetheless congruent with the general

valence tendency of a given word. Thus, learning about distributions of

words in the lexicon may support generative processes and is not limited

to rote memorization processes.

Explaining how learners acquire new vocabulary as well as how they

become fluent speakers figures prominently in second language research.

In this section I have offered a glimpse of a distributional account of how

lexical semantics may be acquired and how it improves language fluency.

Researchers have long recognized the role of learning phraseology in

developing proficiency, for example collocations and other extended units

of meaning (e.g., Boers et al., 2006; Gries, 2008). The study reported here

further shows that having knowledge of language-specific selectional

restrictions and probabilistic tendencies is not a mere matter of sounding

‘native-like’ from a stylistic point of view. Rather, there is a correlation

between knowledge of language-specific phraseology and language fluency

in native speakers (for studies of L2 see Howart, 1998; Onnis, 2001; Towell

et al., 1996). The study also offers some methodological advances. Often

proposals of vocabulary learning have been described in qualitative mentalistic

terminology that may not entirely provide causal and mechanistic

explanations. Exactly how are the denotational and connotational meanings

of words learned? I have argued that at least some aspects of lexical

semantics can in principle be derived distributionally from a corpus using

simple computational procedures. While still underdeveloped for instruc-

tional purposes, this approach opens up ways to think about what types

of texts and word distributions within texts can optimize the salience of,

for example, semantic orientations. Thus, one promise of computational

modeling for second language learning is the possibility of making assump-

tions explicit and testable under specific conditions in computer simulations,

as well as in testable conditions with human learners.

7. Discussion: Distributional approaches to SLA

In this chapter I have proposed that by looking at language learning as

induction of patterns and generalizations over patterns, important insights

can be gained, not only for L1 but also in L2 research. I have further

suggested some ways in which L2 instruction inspired by principles as

well as methodologies offered by statistical learning may help adult learners

capitalize on distributional information that correlates with different types

of linguistic structure at different levels of analysis – sublexical, morphosyntactic,

lexical and phrasal, and lexico-semantic. My overarching goal

has been to make the case for a closer integration of the research para-

digm and methods of statistical learning and research on second language

acquisition. I also wanted to stress the role of miniature artificial languages

for unveiling principles of adult human learning. To date, most miniature

languages involving adults have been intended to simulate scenarios of child

language acquisition. Adult learners are thought of as useful ‘human simu-

lations’ (Gillette, Gleitman, Gleitman, & Lederer, 1999) that approximate

some learning behavior in infancy. However, these studies may also be

directly linked to adult second language acquisition, because adults already

possess knowledge of a linguistic system when they engage in learning a

novel miniature language. As such, artificial grammar experiments with

adults can be seen as useful human simulations of second language learn-

ing processes. Sections 3 to 6 reviewed relevant literature and proposed

four principles of learning.

Section 3 contributed the idea that learning difficulties can be overcome

by integrating different probabilistic sources to the task at hand. A traditional

view that sees language separated in modular representational levels

(e.g., phonetic – phonemic – sublexical) may underestimate the large redun-

dancy of probabilistic information available in the signal. Accordingly, the

perception of a foreign sound would be treated as a purely acoustic prob-

lem, and as such its solution sought at the acoustic level only. Instead,

phonotactic and orthotactic regularities (along with other information yet

to be assessed) may come in handy in recognizing the difficult sound.

Sections 4 and 5 discussed the principle that learners seek invariance in

the signal. Becoming sensitive to what changes versus what stays constant

in the linguistic environment can highlight structural relations in language

such as word boundaries, non-adjacent dependencies, syntactic phrases,

and form-meaning mappings. Importantly, the putative underlying mech-

anisms of alignment and comparison of candidate structures are simple

enough general learning mechanisms and can be ‘recycled’ at different

levels of linguistic representation, providing a general framework for learn-

ing structure (see further below).

In Section 6 I discussed how probabilistic lexico-semantic constraints

impose choices on sentence continuations in free productions. In addition,

knowledge of lexical semantics improves fluency in realistic conditions

such as when reading text. Finally, Sections 3 and 6 together contributed

the idea of integrating computational analyses of language to make experi-

mental predictions about which statistical properties are useful for learning

and processing language. Computational analyses of corpora allow one to

assess the a priori usefulness of one or more probabilistic cues, which can

then be evaluated empirically with language learners. In sum, statistical

approaches to language contribute a diagnostic toolkit for testing what is

easy and difficult to learn in experimentally controlled settings, and may

further offer supportive solutions to instructional needs.

7.1. Implications for L2 instruction

While it is early to sketch a map of how statistical learning will inform

educational practices in meaningful ways, I speculate here on a few possi-

bilities. For instance, statistical learning can be seen as complementary

to existing techniques of input-based enhancement, which attempt to make

certain features of the language more salient (e.g., Sharwood Smith, 1991).

While textual enhancement can be achieved via manipulation of typo-

graphical cues such as bolding or italics, meta-analytic reviews of this

research domain show that learners exposed to enhanced texts barely out-

perform those exposed to unenhanced, flooded texts on targeted gram-

matical features (Lee & Huang, 2008). It may be possible, therefore, to

structure texts such that certain distributional properties enhance a particular

target structure. In this respect, presenting a difficult structure in

variation sets might inherently bring it to the attention of the learner,

giving rise to the establishment of form-meaning connections. In addition,

attempts to direct attention to L2 mappings may result in even greater

performance gains when cues are made salient. That is, instructional inter-

ventions that orient learners to multiple distributional cues in ways that

take advantage of the contribution of each cue in the real-time compre-

hension or production of fully-formed sentences or utterances may further

reinforce learning.

Such proposals are consistent with an emerging consensus on the

part of researchers from both generative (Slabakova, 2008) and cognitive-

interactionist (Ortega, 2007) traditions who recommend practicing form

and function in meaningful contexts. Therefore, one major advantage of

applying statistical learning to second language teaching is its potential

applicability to actual learning scenarios. If certain distributional properties

of the input accelerate learning (as documented in several independent

experiments on adult artificial language learning in this volume), then

it is possible in principle to tailor the learner’s experience to reflect such

optimal conditions, providing conditions of ‘statistically structured input’,

in line with existing work (e.g., Lee & Van Patten, 2003).




Statistical learning research on L2 also has practical advantages that

work in L1 settings does not. The initial stages of the development of

language in infants and young children are mostly under parental control

and difficult to modify with explicit interventions. Conversely, modulating

the input an L2 learner receives can be practically achieved in various

flexible ways, either in the classroom, or via educational software, or via

the construction of materials that incorporate statistical learning principles.

7.2. The relationship with implicit learning

While artificial language studies have been used in SLA, most have focused

on the nature of implicit learning (see Dienes, this volume; Shanks, 2005)

and knowledge in L2 (see Hamrick and Rebuschat, this volume; Leung

and Williams, 2006; Schmidt, 1994), rather than on providing mechanisms

of statistical learning. In most cases these studies do not directly include

manipulations of distributional information in their designs, as opposed

to the studies presented here. In this respect, research on statistical learning

can be seen as complementary and orthogonal to the implicit/explicit dis-

tinction, the latter still being a useful framework for investigating processes

of human learning. Statistical learning may occur on a cline from com-

pletely implicit to explicit. For example, a textbook or a learning task

may present scenarios and sentences that implicitly form variation sets

(see Section 4). The outcome of learning may at this point be fully explicit

(a sort of aha experience: “I recognize that what stays constant here may

be an L2 construction”), or less so, with the construction standing out

without direct awareness on the part of the learner – who is perhaps

engaged in encoding or decoding the meaning or the pragmatic relevance

of the event.

Furthermore, it is possible to direct L2 learners to explicitly find patterns

of invariance in collections of texts, as indicated by pedagogical uses of

corpora (e.g., Aston, Bernardini, & Stewart, 2004). The relation between

statistical regularities and implicit learning can be quite complex in second

language learning. While certain distributional properties of language,

especially low-level ones such as probabilistic phonotactics, are definitely

learnt implicitly in one’s first language and may appear difficult to teach

explicitly, there is also evidence to the contrary. Al-jasser (2008) reported

on a pre-post test intervention study investigating the effect of teaching

English phonotactics to Arabic speakers with the purpose of improving

their lexical segmentation abilities. His post-test results showed significant

gains in the lexical segmentation of running speech in English. Therefore,

while it is quite reasonable to assume that statistical learning in infancy

and childhood is implicit, for second language learning this line of research

offers non-intuitive insights beyond the classic implicit/explicit divide.

7.3. Defusing the internalist/externalist debate

Research into statistical learning, in addition to guiding the development

of novel instructional interventions, may also provide theoretical insight

into the mechanisms underlying existing forms of L2 instruction, the effectiveness

of which has already been demonstrated. The trend in L2 research

toward meta-analytic reviews (Norris & Ortega, 2006, 2011) has generated

robust evidence for, among other areas, the role of interaction in learning

in another language (Keck, Iberri-Shea, Tracy-Ventura, & Wa-Mbaleka, 2006;

Mackey & Goo, 2007).

Many researchers now see the divide between social and cognitive

dimensions of learning as hurtful to a better understanding of language

and communication, in both first and second language research. While in

this chapter I have focused on finding language-internal regularities in the

input, such regularities need not be effective in isolation, because there

is already evidence that they do take effect in social settings. Statistical

sensitivity develops both within the linguistic input learners are exposed

to, and across the linguistic and non-linguistic exchanges with their

interlocutors during social interaction. Thus, distributional information

inherent in the input along with social interaction can provide reliable

cues to discovering structural and abstract properties of language (for a

review, see Meltzoff, Kuhl, Movellan, & Sejnowski, 2009).

In this respect, one general framework for statistical learning that

invokes cognitive principles directly relevant to interactionist approaches

has been put forth by Goldstein and colleagues (2010). This framework

uses the acronym ACCESS as a mnemonic for several key principles in

learning from distributional patterns (Align Candidates, Compare, Evaluate

Statistical/Social significance). Each of these components has a clear ana-

logue in interactionist SLA research. To begin, L2 interaction is funda-

mentally a matter of exposure to input through conversational discourse,

as illustrated by the following example, adapted from a classroom study

on learner interaction in computer-mediated communication. Here, Kin

and Gin are exchanging opinions in a communicative task:

(1) Kin: If you don’t have much money, you can’t go university.

Gin: but why do you go to the university?




Her interlocutor’s response offers Kin an immediate opportunity to

align candidates. For example, she may pay attention (Schmidt, 2001) to

the partial reformulation of the verb phrase ‘go to the university’. Kin’s

ability to restructure her knowledge of the usage required here may rely

on cognitive comparison, during which learners’ output “must be compared

with the relevant data available from the contingent utterances of

their more competent interlocutors” (Doughty, 2001, p. 225). As hypothesized

by Laufer and Hulstijn (2001), task-induced involvement is what

drives L2 learning in this case, through need, search, and evaluation. The

involvement load hypothesis acknowledges that motivational as well as

cognitive components are involved in incidental second language learning

(see also Dörnyei, 2009). The statistical significance of the information Kin

is presented with is registered according to mechanisms detailed throughout

this chapter (but see Ellis, 2006 on related factors that impede learning from

input). Finally, SLA theory offers several theoretical perspectives emphasizing

the sociocultural (Lantolf, 2000), sociocognitive (Atkinson, 2011), and

socially distributed (Markee & Seo, 2009) aspects of L2 interaction that

may help interpret the social significance of the linguistic choices in the

present dyadic exchange. In sum, an interactionist account of SLA that

incorporates principles of statistical learning is not merely possible; in

many respects it already exists. What remains to be done is to more ex-

plicitly articulate these connections in order to strengthen future empirical

work.

To conclude, I have argued that there is an important potential role for

statistical learning research in terms of direct links to practical aspects of

second language learning and instruction, namely diagnosing learner needs,

enhancing instruction and curricula, and defining principles to put into

practice in a variety of ways, as called for by the specific details of the learn-

ing context.

Acknowledgements

I would like to thank Shimon Edelman, Kevin Gregg, Daniel Jackson,

Hannah Jones, Elizabeth Kissling, Phillip Hamrick, Julie Lake, William

O’Grady, Lourdes Ortega, Patrick Rebuschat, Dick Schmidt, and two

anonymous reviewers for their comments on earlier versions of this chapter.

The manuscript also benefited from useful discussions with several graduate

students in the SLS program at the University of Hawaii. The author was

partially supported by a Language Learning Research Grant.

References

Akahane-Yamada, R., Kato, H., Adachi, T., Watanabe, H., Komaki, R., Kubo, R., Takada, T., & Ikuma, Y.
2004 ATR CALL: A speech perception/production training system utilizing speech technology. The 18th International Congress on Acoustics, III, 2319–2320.

Al-jasser, F.
2008 The effect of teaching English phonotactics on the lexical segmentation of English as a foreign language. System, 36, 1, 94–106.

Altmann, G.T.M., & Kamide, Y.
1999 Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73, 247–264.

Atkinson, D.
2010 Extended, embodied cognition and second language acquisition. Applied Linguistics, 31, 599–622.

Aston, G., Bernardini, S., & Stewart, D. (Eds.)
2004 Corpora and language learners. Amsterdam: Benjamins.

Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H., Nelson, D.L., Simpson, G.B., & Treiman, R.
2007 The English Lexicon Project. Behavior Research Methods, 39, 445–459.

Bannard, C., Lieven, E., & Tomasello, M.
2009 Modeling children’s early grammatical knowledge. PNAS, 106, 41, 17284–17289.

Bod, R.
2009 From exemplar to grammar: A probabilistic analogy-based model of language learning. Cognitive Science, 33, 752–793.

Boers, F., Eyckmans, J., Kappel, J., Stengers, H., & Demecheleer, M.
2006 Formulaic sequences and perceived oral proficiency: Putting a lexical approach to the test. Language Teaching Research, 10, 245–261.

Cenoz, J., & Gorter, D.
2008 The linguistic landscape as an additional source of input in second language acquisition. IRAL, 46, 267–287.

Chambers, K.E., Onishi, K.H., & Fisher, C.
2003 Infants learn phonotactic regularities from brief auditory experience. Cognition, 87, B69–B77.

Chater, N., & Manning, C.D.
2006 Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 335–344.

Chater, N., & Oaksford, M. (Eds.)
2008 The probabilistic mind: Prospects for Bayesian cognitive science. Oxford: Oxford University Press.




Chomsky, N.
1957 Syntactic Structures. Mouton.

Christiansen, M.H., Kelly, M.L., Shillcock, R.C., & Greenfield, K.
2010 Impaired artificial grammar learning in agrammatism. Cognition, 116, 382–393.

Christiansen, M., Onnis, L., & Hockema, S.
2009 The secret is in the sound: From unsegmented speech to lexical categories. Developmental Science, 12(3), 388–395.

Christiansen, M.H., Conway, C., & Onnis, L.
2007 Neural responses to structural incongruencies in language and statistical learning point to similar underlying mechanisms. In Proceedings of the 29th Annual Meeting of the Cognitive Science Society.

Clahsen, H., & Felser, C.
2006 How native-like is non-native language processing? Trends in Cognitive Sciences, 10, 564–570.

Cleeremans, A., Destrebecqz, A., & Boyer, M.
1998 Implicit learning: News from the front. Trends in Cognitive Sciences, 2, 406–416.

Dale, R., & Spivey, M.J.
2006 Unraveling the dyad: Using recurrence analysis to explore patterns of syntactic coordination between children and caregivers in conversation. Language Learning, 56, 3, 391–430.

Dell, G.S., Reed, K.D., Adams, D.R., & Meyer, A.S.
2000 Speech errors, phonotactic constraints, and implicit learning: A study of the role of experience in language production. Journal of Experimental Psychology: Learning, Memory, & Cognition, 26, 1355–1367.

Dienes, Z.
this volume Conscious versus unconscious learning of structure.

Dörnyei, Z.
2009 Individual differences: Interplay of learner characteristics and learning environment. Language Learning, 59, 230–248.

Doughty, C.
2001 Cognitive underpinnings of focus on form. In P. Robinson (Ed.), Cognition and second language instruction (pp. 206–257). Cambridge: Cambridge University Press.

Ellis, N.C.
2005 At the interface: Dynamic interactions of explicit and implicit language knowledge. Studies in Second Language Acquisition, 27, 305–352.

Ellis, N.C.
2006 Selective attention and transfer phenomena in L2 acquisition: Contingency, cue competition, salience, interference, overshadowing, blocking, and perceptual learning. Applied Linguistics, 27(2), 164–194.

Ellis, N.C., & Schmidt, R.1998 Rules or associations in the acquisition of morphology? The fre-

quency by regularity interaction in human and PDP learning ofmorphosyntax. Language and Cognitive Processes, 13, 307–336.

Ellis, N., & O’Donnell, M. This volume. Statistical construction learning: Does a Zipfian problem space ensure robust language learning?

Evans, J.L., Saffran, J.R., & Robe-Torres, K. 2009. Statistical learning in children with Specific Language Impairment. Journal of Speech, Language and Hearing Research, 52, 321–335.

Frank, M.C., Goodman, N.D., & Tenenbaum, J.B. 2009. Using speakers’ referential intentions to model early cross-situational word learning. Psychological Science, 20, 578–585.

Gillette, J., Gleitman, H., Gleitman, L., & Lederer, A. 1999. Human simulations of vocabulary learning. Cognition, 73, 135–176.

Goldstein, M.H., King, A.P., & West, M.J. 2003. Social interaction shapes babbling: Testing parallels between birdsong and speech. Proceedings of the National Academy of Sciences, 100(13), 8030–8035.

Goldstein, M.H., Waterfall, H.R., Lotem, A., Halpern, J.Y., Schwade, J.A., Onnis, L., et al. 2010. General cognitive principles for learning structure in time and space. Trends in Cognitive Sciences, 14(6), 249–258.

Gomez, R.L. 2002. Variability and detection of invariant structure. Psychological Science, 13(5), 431–436.

Gomez, R.L., & Gerken, L. 2000. Infant artificial language learning and language acquisition. Trends in Cognitive Sciences, 4, 178–187.

Gomez, R.L., & Maye, J. 2005. The developmental trajectory of nonadjacent dependency learning. Infancy, 7(2), 183–206.

Gopnik, A., & Tenenbaum, J. 2007. Bayesian networks, Bayesian learning and cognitive development. Developmental Science (special section on Bayesian and Bayes-Net approaches to development), 10(3), 281–287.

Gries, S. 2008. Corpus-based methods in analyses of SLA data. In P. Robinson & N.C. Ellis (Eds.), Handbook of cognitive linguistics and second language acquisition (pp. 406–431). New York: Routledge, Taylor & Francis Group.

Griffiths, T.L., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J.B. 2010. Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14, 357–364.




Guo, X., Zheng, L., Zhu, L., Yang, Z., Chen, C., Zhang, L., Ma, W., & Dienes, Z. In press. Acquisition of conscious and unconscious knowledge of semantic prosody. Consciousness & Cognition.

Hay, J., & Lany, J. This volume. Sensitivity to statistical information begets learning in early language development.

Hamrick, P., & Rebuschat, P. This volume. How implicit is statistical learning?

Harris, Z.S. 1954. Distributional structure. Word, 10, 146–162.

Howarth, P. 1998. Phraseology and second language proficiency. Applied Linguistics, 19(1), 24–44.

Jurafsky, D. 2003. Probabilistic modeling in psycholinguistics: Linguistic comprehension and production. In R. Bod, J. Hay, & S. Jannedy (Eds.), Probabilistic linguistics. MIT Press.

Kaufman, S.B., DeYoung, C.G., Gray, J.R., Jimenez, L., Brown, J., & Mackintosh, N. 2010. Implicit learning as an ability. Cognition, 116, 321–340.

Keck, C.M., Iberri-Shea, G., Tracy-Ventura, N., & Wa-Mbaleka, S. 2006. Investigating the empirical link between task-based interaction and acquisition: A meta-analysis. In J.M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 91–131). Amsterdam: John Benjamins.

Kuhl, P.K. 2004. Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5, 831–843.

Kuhl, P.K. 2000. A new view of language acquisition. Proceedings of the National Academy of Sciences, 97, 11850–11857.

Landauer, T.K., & Dumais, S.T. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.

Lantolf, J. (Ed.). 2000. Sociocultural theory and second language learning. Oxford: Oxford University Press.

Laufer, B., & Hulstijn, J. 2001. Incidental vocabulary acquisition in a second language: The construct of task-induced involvement. Applied Linguistics, 22, 1–26.

Lee, J., & VanPatten, B. 2003. Making communicative language happen. New York: McGraw-Hill.


Lee, S., & Huang, H. 2008. Visual input enhancement and grammar learning: A meta-analytic review. Studies in Second Language Acquisition, 30, 307–331.

Leung, J., & Williams, J.N. 2006. Implicit learning of form-meaning connections. In R. Sun & N. Miyake (Eds.), Proceedings of the Annual Meeting of the Cognitive Science Society (pp. 465–470). Mahwah, NJ: Lawrence Erlbaum.

MacKay, D.J.C. 2003. Information theory, inference, and learning algorithms. Cambridge University Press.

Mackey, A., & Goo, J. 2007. Interaction research in SLA: A meta-analysis and research synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition (pp. 407–452). Oxford: Oxford University Press.

Markee, N., & Seo, M. 2009. Learning talk analysis. IRAL, 47, 37–63.

Miller, G. 1967. The psychology of communication. New York: Basic Books.

Misyak, J.B., & Christiansen, M.H. In press. Statistical learning and language: An individual differences study. Language Learning.

Lively, S.E., Pisoni, D.B., & Yamada, R.A. 1994. Training Japanese listeners to identify English /r/ and /l/: III. Long-term retention of new phonetic categories. Journal of the Acoustical Society of America, 96(4), 2076–2087.

Maye, J., Werker, J.F., & Gerken, L. 2002. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3), B101–B111.

McClelland, J.L., Botvinick, M.M., Noelle, D.C., Plaut, D.C., Rogers, T.T., Seidenberg, M.S., & Smith, L.B. 2010. Letting structure emerge: Connectionist and dynamical systems approaches to understanding cognition. Trends in Cognitive Sciences, 14, 348–356.

McClelland, J.L. 1998. Connectionist models and Bayesian inference. In M. Oaksford & N. Chater (Eds.), Rational models of cognition (pp. 21–53). Oxford: Oxford University Press.

Meltzoff, A.N., Kuhl, P.K., Movellan, J., & Sejnowski, T.J. 2009. Foundations for a new science of learning. Science, 325, 284–288.

Miller, G.A. 1956. Information and memory. Scientific American, 195(2), 42–47.




Miller, G.A. 1958. Free recall of redundant strings of letters. Journal of Experimental Psychology, 56, 485–491.

Misyak, J.B., Christiansen, M.H., & Tomblin, J.B. 2010. Sequential expectations: The role of prediction-based learning in language. Topics in Cognitive Science, 2, 138–153.

Montrul, S., Foote, R., & Perpinan, S. 2008. Gender agreement in adult second language learners and Spanish heritage speakers: The effects of age and context of acquisition. Language Learning, 58(3), 503–553.

Norris, J., & Ortega, L. 2010. Research synthesis. Language Teaching, 43, 461–479.

Norris, J., & Ortega, L. (Eds.). 2006. Synthesizing research on language learning and teaching. Amsterdam: John Benjamins.

Onnis, L. 2001. Fluency in native and non-native speakers. In A. Carli (Ed.), Aspetti linguistici e interculturali del bilinguismo (pp. 20–139). Milano: Franco Angeli.

Onnis, L., Farmer, T., Baroni, M., Christiansen, M.H., & Spivey, M.J. 2009. Generalizable distributional regularities aid fluent language processing: The case of semantic valence tendencies. Special issue of the Italian Journal of Linguistics, 20(1), 129–156.

Onnis, L., Christiansen, M.H., Chater, N., & Gomez, R. 2003. Reduction of uncertainty in human sequential learning: Evidence from artificial language learning. In Proceedings of the 25th Annual Conference of the Cognitive Science Society (pp. 886–891). Mahwah, NJ: Lawrence Erlbaum.

Onnis, L., Monaghan, P., Christiansen, M.H., & Chater, N. 2004. Variability is the spice of learning, and a crucial ingredient for detecting and generalizing nonadjacent dependencies. In Proceedings of the 26th Annual Conference of the Cognitive Science Society.

Onnis, L., Waterfall, H., & Edelman, S. 2008. Learn locally, act globally: Learning language from variation set cues. Cognition, 109, 423–430.

Onnis, L., Edelman, S., & Waterfall, H. 2011. Local statistical learning under cross-situational uncertainty. In L. Carlson, C. Holscher, & T. Shipley (Eds.), Proceedings of the 33rd Annual Conference of the Cognitive Science Society.

Onnis, L., Uchida, Y., & Magnuson, J. In preparation. Distributional phonotactic cues assist the perception of speech contrasts.

Ortega, L. 2007. Meaningful L2 practice in foreign language classrooms: A cognitive-interactionist SLA perspective. In R.M. Dekeyser (Ed.), Practice in a second language: Perspectives from applied linguistics and cognitive psychology (pp. 180–207). Cambridge: Cambridge University Press.

Pacton, S., Perruchet, P., Fayol, M., & Cleeremans, A. 2001. Implicit learning out of the lab: The case of orthographic regularities. Journal of Experimental Psychology: General, 130, 401–426.

Quine, W.V.O. 1960. Word and object. Cambridge, MA: MIT Press.

Reber, A.S. 1967. Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.

Redington, M., & Chater, N. 1998. Connectionist and statistical approaches to language acquisition: A distributional perspective. Language and Cognitive Processes, 13, 129–191.

Redington, M., Chater, N., & Finch, S. 1998. Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22, 425–469.

Redington, M., & Chater, N. 1997. Probabilistic and distributional approaches to language acquisition. Trends in Cognitive Sciences, 1, 273–281.

Redington, M., & Chater, N. 1996. Transfer in artificial grammar learning: A reevaluation. Journal of Experimental Psychology: General, 125, 123–138.

Robinson, P. 2005. Cognitive abilities, chunk-strength, and frequency effects in implicit artificial grammar and incidental L2 learning: Replications of Reber, Walkenfeld, and Hernstadt (1991) and Knowlton and Squire (1996) and their relevance for SLA. Studies in Second Language Acquisition, 27(2), 235–268.

Robinson, P., & Ellis, N.C. 2008. Conclusion: Cognitive linguistics, second language acquisition, and L2 instruction – issues for research. In P. Robinson & N.C. Ellis (Eds.), Handbook of cognitive linguistics and second language acquisition (pp. 489–545). New York: Routledge.

Rogers, T.T., & McClelland, J.L. 2004. Semantic cognition: A parallel distributed processing approach. Cambridge, MA: MIT Press.

Roy, D. 2009. New horizons in the study of child language acquisition. In Proceedings of Interspeech 2009. Brighton, England.

Saffran, J.R., Aslin, R.N., & Newport, E.L. 1996. Statistical learning by 8-month-old infants. Science, 274(5294), 1926–1928.




Shanks, D.R. 2005. Implicit learning. In K. Lamberts & R. Goldstone (Eds.), Handbook of cognition (pp. 202–220). London: Sage.

Shannon, C. 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30, 50–64. Reprinted in D. Slepian (Ed.) (1974), Key papers in the development of information theory. New York: IEEE Press.

Shannon, C. 1948. A mathematical theory of communication. Bell System Technical Journal, 27, 379–423 and 623–656. Reprinted in D. Slepian (Ed.) (1974), Key papers in the development of information theory. New York: IEEE Press.

Sharwood Smith, M. 1991. Speaking to many minds: On the relevance of different types of language information for the L2 learner. Second Language Research, 7(2), 118–132.

Schmidt, R. 2001. Attention. In P. Robinson (Ed.), Cognition and second language instruction (pp. 3–32). Cambridge: Cambridge University Press.

Schmidt, R. 1994. Implicit learning and the cognitive unconscious: Of artificial grammars and SLA. In N.C. Ellis (Ed.), Implicit and explicit learning of languages (pp. 165–209). London: Academic Press.

Sinclair, J. 1996. The search for units of meaning. Textus, IX, 75–106.

Slabakova, R. 2008. Meaning in the second language. Berlin: Mouton de Gruyter.

Slobin, D.I. 1996. From “thought and language” to “thinking for speaking”. In J.J. Gumperz & S.C. Levinson (Eds.), Rethinking linguistic relativity (pp. 70–96). Cambridge: Cambridge University Press.

Smith, L.B., & Yu, C. 2008. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106, 1558–1568.

Solan, Z., Horn, D., Ruppin, E., & Edelman, S. 2005. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102, 11629–11634.

Spada, N., & Tomita, Y. 2010. Interactions between type of instruction and type of language feature: A meta-analysis. Language Learning, 60(2), 263–308.

Tanenhaus, M., Spivey-Knowlton, M., Eberhard, K., & Sedivy, J. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268, 1632–1634.


Tenenbaum, J.B., & Griffiths, T.L. 2001. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24, 629–641.

Thiessen, E.D. 2007. The effect of distributional information on children’s use of phonemic contrasts. Journal of Memory and Language, 56, 16–34.

Tokowicz, N., & Warren, T. In press. Beginning adult L2 learners’ sensitivity to morphosyntactic violations: A self-paced reading study.

Towell, R., Hawkins, R., & Bazergui, N. 1996. The development of fluency in advanced learners of French. Applied Linguistics, 17, 84–119.

Uchida, Y. 2010. Measuring knowledge of English orthotactics in Japanese learners of English: Towards the establishment of a training scheme for /l/-/r/ perception. Unpublished thesis for the Advanced Graduate Certificate, Department of Second Language Studies, University of Hawai‘i at Manoa.

Vigliocco, G., Vinson, D.P., Lewis, W., & Garrett, M.F. 2004. Representing the meanings of object and action words: The featural and unitary semantic space hypothesis. Cognitive Psychology, 48, 422–488.

Williams, J.N. 2004. Implicit learning of form-meaning connections. In J. Williams, B. VanPatten, S. Rott, & M. Overstreet (Eds.), Form meaning connections in second language acquisition (pp. 203–218). Mahwah, NJ: Lawrence Erlbaum Associates.

Yu, C., & Smith, L.B. 2007. Rapid word learning under uncertainty via cross-situational statistics. Psychological Science, 18(5), 414–420.





