Determinantal Point Processes for Memory and Structured Inference

Steven M. Frankland ([email protected])
Princeton Neuroscience Institute, Princeton, NJ

Jonathan D. Cohen
Princeton Neuroscience Institute, Princeton, NJ
Abstract

Determinantal Point Processes (DPPs) are probabilistic models of repulsion, capturing negative dependencies between states. Here, we show that a DPP in representation-space predicts inferential biases toward mutual exclusivity commonly observed in word learning (mutual exclusivity bias) and reasoning (disjunctive syllogism) tasks. It does so without requiring explicit rule representations, without supervision, and without explicit knowledge transfer. The DPP attempts to maximize the total "volume" spanned by the set of inferred code-vectors. In a representational system in which combinatorial codes are constructed by re-using components, a DPP will naturally favor the combination of previously un-used components. We suggest that this bias toward the selection of volume-maximizing combinations may exist to promote the efficient retrieval of individuals from memory. In support of this, we show the same algorithm implements efficient "hashing", minimizing collisions between key/value pairs without expanding the required storage space. We suggest that the mechanisms that promote efficient memory search may also underlie cognitive biases in structured inference.

Keywords: mutual exclusivity; determinantal point process; memory; binding; compositionality; probabilistic models
Background and Motivation

Imagine that Tim and Tom are playing an Argentinian board game called El Estanciero. Tim won. What happened to Tom? Although you may not be certain, you don't need any more information to infer that it is likely that Tom lost. This is true despite the fact that you have no familiarity with the particulars of the game or the people involved. This reflects an inferential bias toward mutual exclusivity (ME): the tendency to map individuals to 1/n possible relational positions, and not to multiple positions in the same instance. Although, in the El Estanciero example, you came equipped with rich knowledge about the relational structure of games that you could import, ME inferences have been observed surprisingly early in human development (Halberda, 2003; Cesana-Arlotti et al., 2018) and in non-human species (Pepperberg et al., 2019), and are not constrained to binary relations.
For example, a classic finding in developmental psychology is that, all else being equal, young children prefer to map a novel word ("zurp") to a novel referent (cathode-ray tube), rather than mapping many words to the same object, or many objects to the same word (Markman & Wachtel, 1988; Halberda, 2003), a phenomenon known as the mutual exclusivity bias (ME) in word-learning. At the same time, work in a different cognitive domain has found that pre-verbal infants assume that an individual object cannot be in two places at the same time, and thus make inferences that resemble a formal disjunctive syllogism (A or B (but not both). Not A. Therefore, B) (Cesana-Arlotti et al., 2018; see also Mody & Carey, 2016).
Here, we argue that the class of ME inferences can be fruitfully modeled using a Determinantal Point Process (DPP) operating over a representational space. DPPs are probabilistic models of repulsion between states: the more similar two states are, the less likely they are to co-occur (see Figure 1). DPPs originated in statistical physics to model the location of fermions at thermal equilibrium (Macchi, 1975), but have since been extended to other branches of mathematics and machine learning (Kulesza & Taskar, 2012). In machine learning, they have recently gained traction in the generation of sets of samples when sample diversity is desirable, such as recommender systems looking to present a broad sample of item-types to users (Kulesza & Taskar, 2012; Gillenwater et al., 2012).
Here, we consider the inferential biases afforded by DPPs over a representational space. Specifically, we show (1) that when representations of possible combinations (e.g., word/object or location/object combinations) are combinations of re-usable codes, DPPs naturally predict the mutual exclusivity bias in word learning and reasoning by disjunctive syllogism. We then (2) suggest that these inferential biases may owe to basic desiderata placed on the data structures employed by an efficient memory system, and provide evidence that DPPs effectively navigate a space/time tradeoff regarding storage space and access time in encoding and retrieval.
General Methods

We focus on two well-studied cases of mutual exclusivity in cognitive science: the "mutual exclusivity bias" in word learning (Markman & Wachtel, 1988; Merriman & Bowman, 1991; Halberda, 2003) and the ability to complete disjunctive syllogisms (Mody & Carey, 2016; Cesana-Arlotti et al., 2018). We first describe the modeling approach in abstract terms, explaining the features that are common to both use-cases.

Figure 1: Sampling in a plane for three types of point processes. PPPs contain no relational information, and therefore, although likely to be spread across the space, can be susceptible to "clumping". MPPs, by contrast, carry relational information about the similarity between states, directly promoting such clumping. DPPs use the same similarity representation as MPPs, but, critically, repel similar states, ensuring that samples are well distributed across the space.

©2020 The Author(s). This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY).
DPP framework. Our model assumes a point process P operating over a finite ground set Y of discrete items. Here, the items are possible representations (i.e., encodings) and, more specifically, representations of particular combinations (e.g., word/object or location/object combinations). We will often use the generic terms "keys" and "values" when talking about the representational components, motivated largely by the connections we wish to make to memory.
The process defines a probability of selecting particular subsets (S) of these combinations drawn from the ground set:

P(S ⊆ Y) = det(K_S)

where K is an N×N positive semi-definite kernel matrix whose entries encode pairwise similarities between the N possible discrete states, where N = m keys × n values. For all analyses reported here, we use a linear kernel, computed as a normalized inner product of each combination of the m×n key/value vectors under consideration (see Figure 2 for illustration).
DPPs select a particular configuration of items so as to maximize the determinant (det) of the corresponding sub-matrix of K, indexed by S (Macchi, 1975; Kulesza & Taskar, 2012). Thus, the central computation in the current case is arg max_S det(K_S). Geometrically, we can think of this determinant as the volume spanned by the parallelepiped of the code vectors. The more similar a set of vectors (small determinant), the less likely they are to co-occur in a set. The less similar the vectors (larger determinant), the more likely they are to co-occur in S. This enables the modeling of repulsion between possible states; it is central to the current work and its ability to model aspects of higher-level cognition.
Here, the relevant representational states are concatenations of two types of code vectors corresponding to pre-factorized representations. As we noted above, these factors may be thought of as "keys" and "values", or alternatively "relations" and "content". Concretely, however, in the situations we explore, they are representations of words and their referents in the ME-bias case, and spatial locations and objects in the disjunctive syllogism case. To generate the code for a possible combination, we simply concatenate vectors for the key/value components, keeping the ordering and codes consistent across uses, consistent with compositionality.1 Although we report simple key/value concatenations here, we obtain the same results when summing key/value representations instead.
Although finding the maximum a posteriori (MAP) subset in a DPP is NP-hard (Kulesza & Taskar, 2012), there exist greedy methods that can effectively approximate it (Gillenwater et al., 2012; Han et al., 2017). However, here, we make the simplifying assumption that Y is itself a subset of the vastly larger set of possible items that could have been under consideration. For present purposes, we assume that this context-dependent restriction of the possibility space can be carried out by standard attentional mechanisms. Within this small-cardinality space, we are able to exhaustively search for the MAP subset (the sub-matrix that maximizes the determinant). However, generating more plausible heuristic methods that can scale to larger spaces remains a focus for future work.
Throughout, we compare the DPP to two alternative point process models in order to emphasize the conceptual contribution of the DPP. First, a Poisson Point Process (PPP), which assumes no similarity kernel K, treating each item as independent. Here, we use P(S ⊆ Y) = ∏_{i∈S} p_i ∏_{i∉S} (1 − p_i), where p is a flat prior across states. This is random uniform selection. Second, a generic Markov Point Process (MPP), which selects items based on the kernel K used for the DPP, but selecting directly on the similarity scores of the sub-matrix, rather than its determinant. In this, the MPP over K can be considered in opposition to the DPP, favoring items that are nearby in the code-space rather than far apart. Taken together, one can think of these three models as capturing the possibility of (a) random inference (PPP), (b) inference by similarity (MPP), and (c) inference by repulsion (DPP). See Figure 1. We note that we are not choosing to compare the
1 We note that, although this encoding framework is simple, it is motivated by empirical evidence concerning the nature and organization of the projections from the mammalian entorhinal cortex (EC) to the hippocampal sub-fields: a key circuit both for simple forms of reasoning and memory (Zeithamova et al., 2012). Specifically, a medial region of EC contains low-dimensional representations of the spatial structure of the environment, while a lateral region encodes sensory content (Behrens et al., 2018). These separate representations are believed to then be bound together in the hippocampus in order to encode different structure/content combinations (Whittington et al., 2018).
Figure 2: (A) Example of the word learning problem. The model's task is to select a word/object mapping, conditioned on 3 existing associations. Humans tend to map the held-out word to the held-out object. (B) shows the representations used by the models to guide inference. We assume separate, but combinable, codes for words (keys) and objects (values). The square matrix in (C) represents pairwise similarities between possible word/object combinations. Brighter colors reflect more similar combinatorial codes, darker colors less similar codes. Black bars across rows and columns reflect a hypothetical subset of word/object mappings, as in A. (D) We evaluate the probability that the held-out word is mapped to the held-out object (mutual exclusivity bias) across 1000 simulations with different word/object codes. A DPP naturally selects the novel un-used word and un-used object, exhibiting the mutual exclusivity bias.
DPP to these other point process models because we believe that MPPs and PPPs are a priori particularly plausible models of the relevant types of structured inference. Instead, we believe they are useful for highlighting fundamental properties of the DPP (e.g., repulsion), and that first comparing the predictions of a DPP against these baselines clearly situates the DPP in a consistent conceptual framework for expository purposes.
Mutual Exclusivity Bias in Word Learning. The "mutual exclusivity (ME) bias" in word learning refers to the empirical observation that, all else being equal, young children and adults prefer to map a novel word to a novel referent. They prefer this both to mapping many words to the same object, and to mapping many objects to the same word (Markman & Wachtel, 1988; Merriman & Bowman, 1989; Halberda, 2003; Lake, Linzen, & Baroni, 2019). Here, we suggest that this inferential bias follows directly from the consideration of an associative encoding system that selects which codes to bind under a Determinantal Point Process. A learner that performs inference using a DPP will prefer combinations that maximize the total volume of the representational space. Under the assumption that components re-use codes across possible uses (consistent with compositionality), a DPP naturally favors combinations of previously un-used words and objects.
We model a case involving 4 words and 4 objects, in which the learner has 3 extant word/object associations (see Figure 2a). Here, we are agnostic as to whether those associations were acquired in this particular episode, or whether they were brought to the episode. Words and objects are random code vectors, sampled from a multivariate Gaussian N(0, 1). These codes are concatenated to form possible word/object combinations (see Figure 2c). We compute a linear kernel over the representations of these combinations, reflecting the covariance structure amongst the codes for different word/object pairs. The different point process models then select from the 13 remaining word/object combinations, conditioned on the 3 previous word/object combinations.
We ran 1000 simulations involving different random word and object vectors, and found that the DPP exhibits the mutual exclusivity bias 99.3% of the time (see Figure 2). As would be expected, the PPP model randomly selects from the 13 remaining possible conjunctions. The MPP prioritizes re-use of codes across instances (re-using a word to refer to multiple objects), given that its desideratum is to select combinations similar to those already encountered. It has an inferential bias of "many-to-one".
Thus, when codes for combinatorial states are compositions of words and objects (keys and values), a simple algorithm that maximizes the volume spanned by the code vectors naturally produces a mutual exclusivity bias in the word learning process. Re-using either words or objects across different mappings in the same context works against maximizing the volume spanned by the vectors, as the same vector will contribute to multiple combinations.

Disjunctive Syllogism. Our second example involves the ability to make inferences like those in a classical Disjunctive Syllogism (DS) (Mody & Carey, 2016; Cesana-Arlotti et al., 2018; Pepperberg et al., 2019). Formally, a disjunctive syllogism starts with the representation of a disjunction (premise 1: A or B), where 'or' is XOR (one or the other, but not both). Next, one acquires some piece of information (premise 2: not A). Finally, a rule is applied to derive the conclusion, conditioned on the premises (conclusion: therefore, B). Notably, young
Figure 3: (A) Rendering of Cesana-Arlotti et al. (2018)'s experimental paradigm, based on their Figure 1. (B) To model this, we assume a factored space of object and location codes under consideration and (C) populate a square matrix with the pairwise similarities between possible object/location combinations (K). We highlight in red the items selected by arg max_{S ⊆ Y} det(K_S), conditioned separately on each combination that could be observed (the diagonal). The DPP reliably favors the combination that is most dissimilar (dark blue) to the observed object/location combination (bright yellow). (D) Cesana-Arlotti et al. (2018) found that infants exhibit increased looking time to DS-inconsistent cases (results schematically depicted here). (E) The DPP model naturally selects a combination of the un-used location and un-used object. If these inferences are used to generate predictions and compared against the DS-consistent and DS-inconsistent cases, the DPP exhibits greater prediction errors when the revealed object/location combination is DS-inconsistent, like pre-verbal infants.
children (Mody & Carey, 2016), including infants as young as 12 months (Cesana-Arlotti et al., 2018), and non-human animals (Pepperberg et al., 2019) all show aspects of this inferential ability. Here, we show that a DPP defined over the space of combinatorial representations predicts the key empirical pattern.
For expository purposes, we focus on Cesana-Arlotti et al. (2018)'s paradigm with pre-verbal infants (see Figure 3a for a schematic of a trial). A trial begins with two objects on a screen. Both are temporarily hidden behind an occluder, obstructing the objects from the infant's view. One object is then seen to be scooped out from behind the occluder, though the infant is unable to determine which of the two objects it was. The occluder is then removed, revealing (e.g.) object A. The inference by disjunctive syllogism, of course, is that object B must therefore be the object in the bucket. Infants' expectations are assessed by measuring their looking time. If it is then revealed that the bucket contains object A, rather than object B (the "DS-inconsistent" condition), infants as young as 12 months old are surprised, evidenced by increased looking time relative to the alternative outcome in which object B is in the bucket (the "DS-consistent" condition).
To model this, we assume 1×100 random code vectors drawn from a multivariate Gaussian N(0, 1) for each of two objects (values) and two locations (keys). We concatenate these to form a 4×200 matrix, in which the rows are compositions of possible object/location combinations, and the columns are the random features. As above, we compute the covariance between each of the m×n combinations, here obtaining a 4×4 kernel K encoding the similarities between the codes for possible combinatorial states. For our analyses, we simulated 1000 different possible instances of random vectors, while also randomly selecting different superficial trial structures (e.g., whether the DS-consistent combination was object A/location 1, object A/location 2, object B/location 1, or object B/location 2). As expected, given one conjunction (e.g., object A in location 1), a DPP reliably selects the un-used object and the un-used location (here, object B in location 2), as this maximizes the volume spanned by vectors encoding the combinations (see Figure 3). To more directly relate the models' inferences to the infant looking time data, we next computed the MSE between the object/location combination selected by the model and the code for the stimulus in the DS-consistent (low-surprise)
and DS-inconsistent (high-surprise) conditions. As expected, the prediction error is high for the DPP model in the DS-inconsistent condition, and at zero for the DS-consistent condition. The MPP (similarity-based) and PPP (random) models do not predict this direction of the "surprisal" effect.
DPP Hashing for Collision-Free Encoding

The empirical findings that we model here demonstrate a number of notable features of ME biases, chief amongst them: (a) they appear to be present remarkably early in development (Cesana-Arlotti et al., 2018; Lewis et al., 2020), (b) they have been observed, in different forms, across representational domains (word learning and logical reasoning about physical states), and (c) related work suggests that at least one type (DS-like inferences) may be present in some non-human species (Pepperberg et al., 2019). This raises a family of interesting theoretical questions regarding their acquisition and nature. For example, are there separate domain-specific biases, the existence of each owing to its utility in a particular domain, or do these emerge from a shared system? What are the relevant representational and inferential systems, and how are they implemented? And, of course, familiar questions remain regarding whether such inferential biases are acquired over phylogenetic or ontogenetic time.2
Although we certainly do not intend to definitively answer these questions here, we add one theoretical suggestion to the literature: we propose that ME may be a consequence of a more general strategy for efficiently encoding and retrieving unique tokens of key/value associations in memory. On such an account, it would not be surprising that ME biases are present across disparate representational domains, as long as the domain requires binding of component parts and storage for later retrieval. Moreover, it seems possible that general mechanisms for encoding and retrieving tokens (individuals) might be present early in development and would be employed by different species.
Why would one think that encoding and retrieving unique tokens of key/value associations in memory has anything to do with mutual exclusivity? Recall, first, that we assume that the bindings under consideration (word/object or object/location combinations) are fundamentally compositional: re-using codes to promote generalization to novel combinations. However, a compositional encoding scheme also creates the possibility of collisions between distinct instances of similar states (imagine the classic example of where you parked your car yesterday vs. two days ago). One might therefore wish to index the representations of individual instances in such a way as to avoid mapping similar states to the same address. One
2 The early onset of the DS results of Cesana-Arlotti et al. (2018) at least raises the possibility that pieces of the relevant machinery could be innate. However, recent computational modeling work from Lake (2019) provides an existence proof that an ME bias can itself be induced from domain experience. Lake (2019) shows that a neural network equipped with an external memory and trained in a meta-learning sequence-to-sequence paradigm can learn to apply an ME bias to novel instances of word-object pairings.
way to achieve this is to spread the keys broadly across the representational space (maximizing volume, as in a DPP). That is, we suggest that ME biases may exist across domains because those domains all draw on a shared memory system for associative binding, and an important feature of this system is its ability to avoid collisions by maximizing the representational "volume" of the memory keys. On this view, ME biases would arise in any representational domain that requires inference regarding novel combinations of familiar components (word/object, object/location).
To better illustrate the potential benefit of dispersing keys for efficient memory retrieval, it is instructive to consider data structures for "hashing" in computer science. A hash function is a way of mapping from a datum to a unique index, such as a position in an array. Effective hashing seeks to avoid the time demands produced by sequential search, which has a time complexity of O(n), or binary search, which is O(log n). Instead, a good hash function enables O(1) access times, in which readout time is invariant to the number of items in the memory. This is a classic example of a space/time tradeoff (Sedgewick & Wayne, 2011): if one is willing to expend the resources necessary to construct a vast associative array, data would almost never be mapped to the same position, and collisions would be minimized. However, this is costly in terms of space. By contrast, restricting the size of the array reduces the amount of space consumed, but risks dramatically increasing retrieval time, as one would have to search all the items in the particular location currently indexed (i.e., linear probing and chaining methods). Hashing seeks to effectively navigate this trade-off, constraining both the size of the array that is needed (space), while also minimizing the amount of computation spent resolving collisions (time). DPPs in representational space effectively avoid this tradeoff.
To see this, consider a toy case in which locations in memory are indexed by a finite set of keys, and we are able to select a key for each datum. We compare hypothetical key-selection algorithms that hash based on PPPs, MPPs, and DPPs, where the latter two cases are defined over the similarity kernel for the space of key/value combinations in the dataset, as above. DPP-based key selection begins by randomly sampling a key/value pair. Then, each subsequent value in the dataset is tagged with the particular key that maximizes the total volume of the dataset of key/value pairs that have been hashed to that point (when concatenated with the value). DPP-based key selection (unlike PPP and MPP) thus implicitly discourages the re-use of keys, as this would reduce the volume of the parallelepiped spanned by the code vectors for the association. Figure 4 shows the results of 1000 simulations comparing the performance of these different models. The probability of a collision in such an idealized memory system is near 0 (Figure 4a).3 Notably, this "collision-free" property is accomplished
3 We find that these infrequent collisions in the DPP can be completely eliminated by use of Pearson correlation to compute the similarity matrix.
Figure 4: DPP-driven codes enable efficient retrieval of unique items. We allow the keys of key/value pairs to be selected as an MPP, PPP, or DPP. DPPs (A) minimize collisions between items. They do so in virtue of (B) selecting un-used keys to maximize the volume spanned by the key/value code vectors, across the dataset. They thus efficiently manage the resource tradeoff between search time and storage space.
without dramatically expanding the size of the array, as the array size is no larger than the number of individual states we wish to encode (Figure 4b). A DPP-based hash algorithm, unlike random selection in a PPP, thus exploits the repulsive property to distribute codes evenly across the representational space.
Discussion

We have shown that Determinantal Point Processes (DPPs), probabilistic models of the negative interactions between states, predict a class of commonly observed biases in structured inference: specifically, inferential biases toward (a) mutual exclusivity in word learning (Markman & Wachtel, 1988; Halberda, 2003) and (b) completion of disjunctive syllogisms (Mody & Carey, 2016; Cesana-Arlotti et al., 2018). These inferences arise naturally from a DPP because a DPP selects subsets so as to maximize the volume spanned by the vectors (here, a subset of the possible combinations). When the similarity is defined over re-usable keys and values, the DPP prefers combinations of previously unused components.
This framework does not require that the cognitive system have explicit representations of rules, receive direct supervision, or have mechanisms for transferring knowledge between domains. This puts ME biases well within the cognitive reach of pre-verbal infants, language-learning children, and non-human species that may lack the relevant experience, cortical machinery, or both necessary to represent and operate over abstract logical rules (see Mody & Carey, 2016, for related discussion regarding disjunctive syllogism in young children).
Instead, we suggest that the central driver of this bias may be the promotion of an efficient memory system. Maximizing the volume spanned by the vectors in code space promotes memory retrieval by minimizing interference ("collisions"). This is closely related to classic ideas regarding pattern separation in an episodic memory system (Marr, 1971; Treves & Rolls, 1994; O'Reilly & McClelland, 1994): a reduction in the similarity between two states in a function's output relative to their similarity in the input. Pattern separation is canonically implemented by projecting codes into a high-dimensional space where the probability of collisions is low. The DPP has a similar motivation here. However, a DPP precludes the need to project into a higher dimensionality in order to avoid collisions, as it is able to uniquely map items to locations with an array size equal to the number of keys (see Figure 4). Thus, a DPP may better navigate the time/space tradeoff (Sedgewick & Wayne, 2011) than the strategy of interference-reduction through dimensionality-expansion standard in pattern separation. However, this savings in storage may come at a computational cost, as computing determinants has a time complexity of either O(n^3) or O(n!) (depending on the algorithm).4 These quantities therefore likely need to be approximated in order to be implemented in a neural substrate. Approximating them in a biologically plausible algorithm remains a topic of ongoing work. Although pattern separation is conventionally studied in the episodic memory literature, the theoretical points that we make throughout regarding DPPs in representation-space apply to working memory as well. In some ways, the considerations regarding structured inference are more closely tied to what is conventionally thought
4 We note, however, that although such exponential (or factorial) scaling is detrimental in applied use-cases of hashing, it has an intriguing connection to some empirically observed set-size effects in uniform domains, characterized by a rapid, non-linear decrease in performance as n grows (Miller, 1956; Luck & Vogel, 1997). One possibility is that capacity limits that appear to stem from a fixed number of "slots" may instead owe to the computational complexity of computing (or approximating) the determinants necessary to encode unique (non-colliding) conjunctions. At the moment, however, this remains speculative.
of as "working memory", as we assume that attentional mechanisms have already windowed into a smaller region of the possibility space so that we can easily compute the MAP over a sub-set of the broader set of possible items. Better understanding how DPPs may relate to the particular factorizations of memory systems standard in cognitive science (e.g., episodic/working/semantic), as well as specific aspects of the entorhinal/hippocampal system,5 also remains an important topic of ongoing work. For present purposes, however, the central distinction in memory systems is simply that between a stable set of re-usable representations and combinations of those representations in particular instances (key/value pairs). While re-using codes promotes generalization, it increases the risk of collisions in the memory system. Here, we have suggested that an algorithm that seeks to maximize the total volume of the constructed combinations in a representational space (exhibiting repulsion) not only promotes efficient memory encoding and retrieval, but may also underlie inferential biases toward mutual exclusivity.
Acknowledgments

This project/publication was made possible through the support of grants from the John Templeton Foundation and NIH grant T32MH065214. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation.
5 See Chanales et al. (2017) for intriguing fMRI evidence of "repulsion" in hippocampal codes, in which similar states have dissimilar representations.

References

Behrens, T. E., Muller, T. H., Whittington, J. C., Mark, S., Baram, A. B., Stachenfeld, K. L., & Kurth-Nelson, Z. (2018). What is a cognitive map? Organizing knowledge for flexible behavior. Neuron, 100(2), 490–509.

Cesana-Arlotti, N., Martín, A., Téglás, E., Vorobyova, L., Cetnarski, R., & Bonatti, L. L. (2018). Precursors of logical reasoning in preverbal human infants. Science, 359(6381), 1263–1266.

Chanales, A. J., Oza, A., Favila, S. E., & Kuhl, B. A. (2017). Overlap among spatial memories triggers repulsion of hippocampal representations. Current Biology, 27(15), 2307–2317.

Gillenwater, J., Kulesza, A., & Taskar, B. (2012). Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems (pp. 2735–2743).

Halberda, J. (2003). The development of a word-learning strategy. Cognition, 87(1), B23–B34.

Kulesza, A., & Taskar, B. (2012). Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3), 123–286.

Lake, B. M. (2019). Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems (pp. 9788–9798).

Lake, B. M., Linzen, T., & Baroni, M. (2019). Human few-shot learning of compositional instructions. arXiv preprint arXiv:1901.04587.

Lewis, M., Cristiano, V., Lake, B. M., Kwan, T., & Frank, M. C. (2020). The role of developmental change and linguistic experience in the mutual exclusivity effect. Cognition, 198, 104191.

Luck, S. J., & Vogel, E. K. (1997). The capacity of visual working memory for features and conjunctions. Nature, 390(6657), 279–281.

Markman, E. M., & Wachtel, G. F. (1988). Children's use of mutual exclusivity to constrain the meanings of words. Cognitive Psychology, 20(2), 121–157.

Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 262(841), 23–81. Retrieved from http://www.jstor.org/stable/2417171

McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419.

Merriman, W. E., Bowman, L. L., & MacWhinney, B. (1989). The mutual exclusivity bias in children's word learning. Monographs of the Society for Research in Child Development, i–129.

Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81.

Mody, S., & Carey, S. (2016). The emergence of reasoning by the disjunctive syllogism in early childhood. Cognition, 154, 40–48.

O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus, 4(6), 661–682.

Pepperberg, I. M., Gray, S. L., Mody, S., Cornero, F. M., & Carey, S. (2019). Logical reasoning by a Grey parrot? A case study of the disjunctive syllogism. Behaviour, 156(5-8), 409–445.

Sedgewick, R., & Wayne, K. (2011). Algorithms. Addison-Wesley Professional.

Treves, A., & Rolls, E. T. (1994). Computational analysis of the role of the hippocampus in memory. Hippocampus, 4(3), 374–391.

Whittington, J., Muller, T., Mark, S., Barry, C., & Behrens, T. (2018). Generalisation of structural knowledge in the hippocampal-entorhinal system. In Advances in Neural Information Processing Systems (pp. 8484–8495).

Zeithamova, D., Schlichting, M. L., & Preston, A. R. (2012). The hippocampus and inferential reasoning: Building memories to navigate future decisions. Frontiers in Human Neuroscience, 6, 70.