
Language Design as Information Renormalization

Ángel J. Gallego, Departament de Filologia Espanyola, Facultat de Filosofia i Lletres,

Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain

Román Orús, Donostia International Physics Center, Paseo Manuel de Lardizabal 4, E-20018 San Sebastián, Spain

Ikerbasque Foundation for Science, Maria Diaz de Haro 3, E-48013 Bilbao, Spain, and Institute of Physics, Johannes Gutenberg University, 55099 Mainz, Germany

Here we consider some well-known facts in syntax from a physics perspective, allowing us to establish equivalences between both fields with many consequences. Mainly, we observe that the operation MERGE, put forward by N. Chomsky in 1995, can be interpreted as a physical information coarse-graining. Thus, MERGE in linguistics entails information renormalization in physics, according to different time scales. We make this point mathematically formal in terms of language models. In this setting, MERGE amounts to a probability tensor implementing a coarse-graining, akin to a probabilistic context-free grammar. The probability vectors of meaningful sentences are given by stochastic tensor networks (TN) built from diagonal tensors and which are mostly loop-free, such as Tree Tensor Networks and Matrix Product States, thus being computationally very efficient to manipulate. We show that this implies the polynomially-decaying (long-range) correlations experimentally observed in language, and also provides arguments in favour of certain types of neural networks for language processing. Moreover, we show how to obtain such language models from quantum states that can be efficiently prepared on a quantum computer, and use this to find bounds on the perplexity of the probability distribution of words in a sentence. Implications of our results are discussed across several ambits.

I. INTRODUCTION

Linguistics can be defined as “the scientific study of language, and its form, meaning, and context” [1]. The field itself is a broad science, sometimes even a philosophy, embracing interdisciplinary ideas from a wide variety of contexts: syntax, mathematics, computer science, neuroscience... all in all, there is no common agreement concerning why human language is as it is, or even about its basic defining properties. From the point of view of Artificial Intelligence (AI), for instance, one is worried about developing accurate algorithms for speech and text recognition/prediction [2]. Additionally, the generative approach led by Noam Chomsky tries to understand the linguistic capacity from a biological perspective, as part of human cognition. As Chomsky et al. observe [3], the point of departure is Descartes’ observation that, among all animal species, only humans seem to have a language ability [4]. Work on comparative cognition has endorsed this insight: only humans appear to possess a mental grammar – an “I-language,” where the “I” stands for intensional, internal, and individual – that allows us to create infinitely many meaningful expressions from a finite stock of discrete units [5, 6]. Within the generative models, the Minimalist Program [7] tries to attribute the properties of human language to what Chomsky [8] calls the “third factor”, namely “to language-independent principles of data processing, structural architecture, and computational efficiency” [8].

This picture is not different from the general study of organic systems, and D’Arcy Thompson’s and Alan Turing’s works on form and morphogenesis can be seen as an example [9]. In this framework, Chomsky proposed a basic operation, called MERGE, to build up linguistic structures [10]. In MERGE, two syntactic objects X and Y are combined to form a new syntactic unit K, i.e.,

MERGE : X, Y −→ K = {X, Y},    (1)

where the brackets mean that the information in K is obtained from that in X and Y. The operation can be applied recursively, thus having the ability to create different generations of units.

In parallel to this, physics aims to understand how the universe behaves. Some of its subfields search for the fundamental mathematical laws of the building blocks of Nature, such as high-energy physics and quantum gravity. However, the knowledge of such fundamental rules (the so-called reductionism) does not imply a priori the knowledge of the emergent laws for aggregates of many fundamental entities (the so-called emergentism [11]). Typical examples of this are condensed matter and solid-state physics, where the knowledge of the rules governing the fundamental entities at a short length scale (such as atoms and molecules described by Schrödinger’s equation) does not imply, at least directly, the knowledge of the rules governing the collective behavior of aggregates at a large length scale (such as phase diagrams of matter). In the famous words of P. Anderson, “more is different” [12].

The key concept in the above discussion is that of emergence: the collective properties of aggregates of systems may be, for different reasons, very different from the ones of the individual systems themselves. The


mathematical formalization of this paradigm in physics isachieved by the so-called Renormalization Group (RG),or simply renormalization [13]. Originally developed(mainly) by K. Wilson and L. Kadanoff, renormaliza-tion is a strategy for dealing with problems at differentphysical scales. This scale is typically a length, energyor time scale, and allows for different descriptions of theproblem as it changes. For instance, going from shortto long length scales corresponds intuitively to “zoomingout” the system, effectively going from, e.g., a descrip-tion of individual atoms (short scale) to a description ofa solid with O(1023) interacting atoms (long scale). Atits root, a renormalization transformation is built in twosteps: first, one keeps the most relevant information de-grees of freedom to describe the system at a new scalediscarding the ones believed not to be relevant, and sec-ond, one implements a rescaling of the physical variableand operators/functions in order to maintain the origi-nal picture. Physics is full of successful applications ofrenormalization in different ambits, from condensed mat-ter physics [14] to quantum field theory [15] and quantuminformation [16] (each one with its own peculiarities), tothe point that it has become one of the basic pillars inour current understanding of the laws of Nature. Phys-ical theories that cannot be renormalized are consideredwrong or incomplete.

Having said all this, our aim with this paper is twofold.On the one hand, we establish equivalences betweenphysics and linguistics in the context of the Minimal-ist Program (for linguistics) and emergence (for physics)which, once mathematically formalized, turn out to haveimportant consequences in ambits as diverse as AI, the-oretical linguistics, computer science, RNA / protein se-quencing, quantum many-body systems, quantum com-puting, and beyond. On the other hand, we strengthenthe relation between physics and linguistics, where lan-guage is the system to be understood using the machineryof physics, and of information theory in particular.

Let us be more specific: here we observe that MERGEcan be understood physically as a type of informa-tion coarse-graining according to time scales. Roughlyspeaking, the linguistic information in sequences ofwords (short time scale) gets renormalized by succes-sive MERGE operations up to meaningful sentences (longtime scale) [61]. This simple observation, somehow trivialfrom a physics perspective, turns out to have deep con-sequences. In particular we show that language models(i.e., probability distributions over word sequences) [17],widely used in AI applications, admit a natural descrip-tion in terms of Tensor Networks (TN) [18]. For instance,the simplest MERGE corresponds to a 3-index tensor ofcomponents Mαβγ accounting for a probability distribu-tion of three variables α, β and γ. And this is nothingbut a Probabilistic (or Weighted) Context-Free Grammar(PCFG), in a way to be made precise later. Probabil-ities of meaningful sentences with a given syntax treeare naturally given in this framework by (mostly) loop-free TNs made of diagonal tensors. Such mathematical

FIG. 1: MERGE operation, taking two lexical elements α and β, and projecting them into a new one, namely K, with label α. The fact that the label is also α means that the element resulting from the projection has the syntactic properties of α (the “head” of the syntactic object). (b) A label-free representation of the application of MERGE, compatible with the recent claim [19] that labels should be dispensed with.

structures have a number of nice properties which make them particularly amenable to manipulations of their information content, as we shall explain. Moreover, we show that they naturally encompass the long-range correlations that are observed experimentally in language, as well as provide arguments in favour of certain types of neural networks for language processing. Then, we move on to show that the TN structure and the particularities of PCFGs allow for the description of the probability distributions in terms of some TN quantum states. Such an exotic description using quantum mechanics is only to be understood at the practical level, but it happens to provide a useful connection between computational linguistics and quantum information and computation, opening the door to unprecedented results and developments in language processing algorithms. As examples of this, we show how such states can be built efficiently on a quantum computer, and prove lower bounds on the perplexity of the probability distribution of a set of words in a sentence [62] by using mathematical tools borrowed from the theory of quantum many-body entanglement [20]. We envisage consequences of our results in several ambits. For instance, one can use the full machinery of TNs and quantum information to validate, simulate, assess, and improve current language models. Moreover, the fact that such probabilistic models can be fed into a quantum computer means that we have, in fact, a quantum algorithm that allows perfect random sampling of language, which is simply impossible with classical computing. All in all, and together with other implications, we propose that our physical picture is in fact related to the conjectured “perfect design and economy” of language in Chomsky’s Minimalist Program, as well as to the (also conjectured) efficient processing of linguistic information in the human brain [21].

The structure of this paper is as follows. In Sec.II weintroduce our basic equivalence, namely, that MERGE inlinguistics entails information-renormalization in physics.In Sec.III we explain a direct consequence of this: thequasi-loop-free TN structure of language models. We alsoderive properties of such structures using tools from TNs.In Sec.IV we prove that our formalism encompasses theobserved long-range correlations in language. In Sec.V


FIG. 2: (Color online) Pictorial representation of a renormalization process in real space for a 1d lattice. The horizontal axis is coordinate x (say, a space coordinate), and the vertical axis is coordinate z, which parametrizes the renormalization scale. Short renormalization scales (small z) amount to a microscopic description of the system at small distances in x, whereas large renormalization scales (large z) amount to a coarse-grained, macroscopic description of the relevant physics of the system at large distances in x. We codify short scales with “blue” and long scales with “red”, following the intuition in physics that renormalization may take you from high energies (ultraviolet) to low energies (infrared). In our case, though, the colors have no special meaning and are just a convenient way of indicating the different scales z1, z2, ..., which are also shown for convenience. Formally, an RG step amounts to a coarse-graining followed by a rescaling of the lattice and associated operators/functions, which we implicitly assumed in the picture.

we establish the novel connection to TN quantum states,and show in Sec.VI how this can be used to derive resultson the perplexity of probabilistic language models. Then,Sec.VII considers arbitrary grammars and language mod-els. In Sec.VIII we discuss some implications of our ob-servations in different ambits. Finally, in Sec.IX we wrapup our conclusions, include a table of the main equiva-lences discussed in the paper, and discuss future perspec-tives. We also include Appendix A with formalities forthe readers with background on theoretical linguistics,and which allows us to find even more equivalences be-tween linguistic and physical concepts, all of them linkedto MERGE and renormalization. Overall, though, thepaper is written assuming that the reader has mostly aphysics & maths background, even though the style ishighly heterogeneous, being this a consequence of the in-terdisciplinary nature of our results.

II. THE BASIC EQUIVALENCE

Our guiding principle is the following equivalence,which we write in the form of an equation:

MERGE = Coarse-graining

FIG. 3: (Color online) The linguistic MERGE operation, seen as a physical coarse-graining process. The horizontal axis is time t, and the vertical axis is the renormalization scale z. In this case, the operation takes a verb (eat) and a noun (pasta), and coarse-grains them into a verb phrase (eat pasta). At the scale z2, all the relevant syntactic information is that the compound object is a verb phrase (VP). Unless stated otherwise, we assume that the basic building blocks are words together with their label, as shown in the grey boxes, though one could also interpret them more fundamentally as the result of a MERGE operation between a word and a set of lexical categories. For simplicity, we shall also assume here that no other information is carried by MERGE (such as gender, number, case, etc.). In any case, this extra information can always be accounted for with minor trivial modifications of the scheme that we present here. The diagram provides the structure of linguistic correlations in the physical ⟨z, t⟩ plane.

The left hand side of the above expression is a purely lin-guistic concept. MERGE is a basic operation in syntax,picking up a set of linguistic elements (such as lexical cat-egories) and returning a new element describing the mainfeatures of the combination of the original, see Fig.1. Onthe other hand, the right hand side of the equation isa purely physical concept. A coarse-graining of informa-tion means the removal of superfluous degrees of freedomin order to describe a given physical system at a differentscale (of, e.g., length, energy or time), see Fig.2. Com-bined with rescaling, it is the procedure entailing renor-malization, by which the rules describing the macroscopicemerge from those describing the microscopic. The aboveequation establishes that both concepts are, in fact, thesame basic idea but in different contexts. Chomsky’sMERGE operation entails then the renormalization oflinguistic information. Moreover, this renormalizationaccounts for different time scales.

To understand better why this is the case, see the ex-ample in Fig.3. In terms of information processing, theMERGE operation picks up two information units at agiven time scale, namely

[V eat] and [N pasta], (2)

and keeps for the next time scale the most relevant in-formation of their coarse-grained combination, i.e., that

[V P [V eat] [N pasta]] (3)

is a verb phrase (i.e., lexical category V P ), where weused bracketed phrase markers to represent the syntax


FIG. 4: (Color online) Syntax tree for “The happy cat eats the fish”, seen as a renormalization process. The flow in z goes from the individual words to the sentence, labeled by S. The different labels correspond to the different types of syntagmas (Noun Phrase, Verb Phrase, and so on). A rescaling of the time variable at every scale is also implicitly assumed.

tree. The operation MERGE is non-associative, as generally corresponds to a coarse-graining. Moreover, as linguists know very well, grammar rules for individual words are not the same as those governing more complex syntagmas such as noun and verb phrases. So, we have two different descriptions of a system at different scales, and with different linguistic information units. The physical variable with different scales must be time, since language is spoken and thought as time-ordered sequences of concepts, written language being just a graphical representation of this; see the more complex example in Fig.4.
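As a toy illustration of this recursion (our own sketch, not part of the original formalism; the three-rule label table is hypothetical), MERGE can be mimicked in Python as a pair-forming operation that projects a label:

# Minimal sketch: MERGE as recursive pair formation with a projected label.
# The rule table below is a hypothetical toy grammar, for illustration only.
RULES = {('V', 'N'): 'VP', ('D', 'N'): 'NP', ('N', 'VP'): 'S'}

def merge(x, y):
    # MERGE: X, Y -> K = {X, Y}, labeled by the (toy) projection rule for (X, Y)
    label = RULES[(x[0], y[0])]
    return (label, x, y)

eat, pasta = ('V', 'eat'), ('N', 'pasta')
vp = merge(eat, pasta)            # ('VP', ('V', 'eat'), ('N', 'pasta')), cf. Eq.(3)
noam = ('N', 'Noam')
print(merge(noam, vp))            # recursive application builds a deeper tree, labeled S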

This observation is ubiquitous in syntax and, when seen from the perspective of physics, entails the renormalization of linguistic information at different time scales. Consider for instance syntax trees (or parse trees, as they are known in computational linguistics) like the one in Fig.4. Such analyses are of the kind linguists use to describe how different elements (words) come together in order to produce a meaningful (active or passive) sentence, and have long been widely used in the study of language. In practice, such syntax trees are nothing but the concatenation of several MERGE operations at different scales [63]: from words to other syntagmas, from these syntagmas to more complex syntagmas... and finally up to a sentence. According to our basic equivalence, the syntax tree that one obtains from such an analysis is nothing but the graphical representation of the renormalization of the (linguistic) information of a sentence. This is how the information in different words comes together, hierarchically at different time scales, up to an emergent meaningful sentence that we can interpret semantically.

Moreover, the syntax tree also encodes the structure of physical correlations at different time scales in the sentence. More precisely, because of the local nature of MERGE, correlations in a sentence are (essentially) built locally at different time scales. Of course, it could be possible that other potentially-necessary operations in syntax, different from MERGE, introduce other dependencies (e.g., long-range movement). But still, it should be possible to codify them pictorially in the syntax tree. In the worst-case scenario, such extra elements would introduce some loop in the tree. But even in such a case, the renormalization picture still holds, as we shall show in explicit examples.

Importantly, this observation is completely general, and therefore must hold for any reasonable model of language following the Minimalist Program. In particular, theoretical linguistic models trying to account for the observed rules of grammar, as well as probabilistic language models in artificial intelligence accounting for the chances of finding a given sentence in a corpus, should somehow encompass the renormalization of linguistic information. This, in turn, has deep implications for the structure of correlations in syntax. As we shall see in the next section, a direct consequence of our observation is a natural description of probabilistic language models in terms of quasi-loop-free TNs [18] accounting for different time scales.

III. TENSOR NETWORKS AND PROBABILISTIC LANGUAGE MODELS

We now consider the implications of our basic equivalence for language models, i.e., probability distributions over sequences of words [17]. Such models produce probabilities p_{w_1,...,w_n} for a sequence of n words, represented by the random variables w_1, ..., w_n, and are widely used in several technological ambits such as speech recognition, machine translation, text prediction, and so forth. In practice, such probabilities are obtained by training the model (in this context, computing the frequencies of sequences) with very large corpora of text. Here we focus on the general constraints that renormalization imposes on the structure of these probability distributions. As we shall see, a very natural description in terms of TNs just pops out, linking directly to Probabilistic Context-Free Grammars (PCFG), but not necessarily restricted to them only.

A. The MERGE tensor

To begin with, let us consider the probability distribution that two given linguistic elements α and β (e.g., two words) merge into a new element γ (e.g., some other syntagma). This probability distribution M([α, β] → γ) = M(α ∩ β ∩ γ) can in fact be described by a probability map M,

M : V_{in1} ⊗ V_{in2} −→ V_{out},    (4)

with V_{in1}, V_{in2} and V_{out} the input and output vector spaces. The coefficients of this map are given by a 3-index probability tensor M_{αβγ}. The entries of this tensor are


FIG. 5: (Color online) (a) Two concatenated MERGE operations, where we write different greek letters for all the possible lexical variables. For language models, this structure can be represented by the tensor network in (b), where M^{[1]} and M^{[2]} are two different MERGE probability tensors (see text). The contraction of the tensor network gives the probability tensor p_{µγαβ}, see Eq.(7). In this picture we used a diagrammatic notation for tensors and their contractions, see text.

the probabilities of merging α and β (the linguistic input of MERGE) into γ (the linguistic output of MERGE). Physically, the tensor coarse-grains the variables α and β, at a given time scale, and retains the fundamental degrees of freedom of the common object at a different time scale. The result of this coarse-graining is variable γ.

The tensor M_{αβγ} obeys the usual normalization condition for probabilities,

∑_{α,β,γ} M_{αβγ} = 1,    (5)

i.e., the sum of all the probabilities is equal to 1. One can also compute residual probability distributions in the usual way, i.e., by summing up over the variables that are discarded. For instance, one could have

M′_γ = ∑_{α,β} M_{αβγ},    (6)

with M′_γ the residual probability distribution of obtaining γ as the output of MERGE, no matter the input.

From a linguistic point of view, the tensor M_{αβγ} is the implementation, at a mathematical level, of the MERGE operation for a probabilistic language model. If the same tensor is to be used everywhere in a syntactic structure, then this is nothing but the realization of a PCFG [22], i.e., a Context-Free Grammar with probabilities assigned to its merging rules. From the perspective of physics, though, this tensor coarse-grains degrees of freedom α and β at a given time scale into a new degree of freedom γ at a different time scale. From a mathematical perspective, this tensor describes the probability of obtaining the information codified in γ from the information in α and β. Regardless of the interpretation, this object constitutes the fundamental LEGO® brick of probabilistic language models following the Minimalist Program.
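As a numerical illustration (ours, not the authors'), a minimal NumPy sketch of Eqs.(5) and (6), assuming a hypothetical dimension of 4 for each lexical variable:

import numpy as np

d = 4                                # hypothetical dimension of each lexical variable
rng = np.random.default_rng(0)

# A random MERGE probability tensor M[alpha, beta, gamma], normalized as in Eq.(5)
M = rng.random((d, d, d))
M /= M.sum()
assert np.isclose(M.sum(), 1.0)      # all probabilities sum to 1

# Residual distribution of the output gamma, Eq.(6): sum over the inputs alpha and beta
M_out = M.sum(axis=(0, 1))
print(M_out, M_out.sum())            # a length-d distribution that also sums to 1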

B. Syntactic tensor networks

FIG. 6: (Color online) Syntactic TN for the sentence “The man from Boston drives well the car”, where we included also the t and z axis, as well as the different renormalization scales. Linguistic information is naturally encoded in the TN at every possible scale. The contraction of the TN gives the probability of this sentence. In this particular example, the TN is a (binary) Tree Tensor Network.

Next, we notice that the structure of a syntax tree maps directly into a tensor network (TN) [18] for the probability distribution p_{w_1,...,w_n} of the sentence. Specifically, every syntactic MERGE^{[i]} corresponds to a 3-index tensor M^{[i]}_{αβγ}, with i simply a label to identify individual tensors, which could in principle be different. Now let us consider the case in which a variable µ is the result of merging δ and γ, with δ itself being the result of merging α and β. In such a case, following the usual mathematical treatment of probabilities, one has that the probability of obtaining µ from α, β and γ (i.e., no matter the value of δ) is given by the expression

p_{µγαβ} = ∑_δ M^{[2]}_{µδγ} M^{[1]}_{δαβ},    (7)

i.e., we sum over all the possible intermediate events represented by δ. This admits a very intuitive diagrammatic representation, see Fig.5. In that figure, every tensor is a shape and every index is a line. Open indices, i.e., those over which there is no sum, are just “free” lines, whereas sums over all the possible values of a common index between tensors are represented by lines connecting the tensors. Such sums are called contractions, i.e., in this example we just contracted index δ. These types of structures, where one has a set of tensors whose indices are contracted according to some network pattern, are called tensor networks (TN) [18], and always admit a convenient diagrammatic representation as in Fig.5. With this in mind, we arrive at the conclusion that syntax trees of sentences map into TNs of MERGE tensors M^{[i]}_{αβγ} at the level of probabilistic language models. We call such structures syntactic TNs.
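For concreteness, a minimal sketch (ours, with hypothetical dimensions and random entries) of the contraction in Eq.(7) via numpy.einsum, where the shared index δ is the one being summed over:

import numpy as np

d = 4                                    # hypothetical dimension for every index
rng = np.random.default_rng(1)

# Two toy MERGE tensors: M1[delta, alpha, beta] and M2[mu, delta, gamma]
M1 = rng.random((d, d, d)); M1 /= M1.sum()
M2 = rng.random((d, d, d)); M2 /= M2.sum()

# Eq.(7): p[mu, gamma, alpha, beta] = sum_delta M2[mu, delta, gamma] * M1[delta, alpha, beta]
p = np.einsum('mdg,dab->mgab', M2, M1)
print(p.shape)                           # (4, 4, 4, 4): the contracted two-tensor (loop-free) network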

Let us be more precise: if the syntax tree does not have long-range dependencies (i.e., it is made only of MERGEs), then the TN is loop-free and corresponds generically to a Tree Tensor Network (TTN) [23], see Fig.6.


FIG. 7: (Color online) Syntactic TN for the sentence “Noam drives the car”. The Tree Tensor Network in (a) can be understood as a Matrix Product State, as shown in (b).

If the MERGEs are sequential in time, then the TN is in fact a special case of TTN called a Matrix Product State (MPS), see Fig.7 [24]. These two types of structures appear quite often in the study of strongly correlated classical and quantum lattice systems in one spatial dimension [18, 23, 24], as well as in tensor calculus [25], and their properties are very well known by physicists and mathematicians. Moreover, if the syntax tree has some long-range dependency (e.g., movement, agree, c-command...), then this introduces some extra index in the network, correlating variables at different positions, and therefore introducing some loop in the diagram. To be precise, such an extra index correlates the (perhaps distant) probability distributions for such variables, and can normally be cast into redefined tensors in order to keep the overall tree structure, as shown in the figure. As an example, this is in fact the case of the so-called CHAINS, which we mentioned in the introduction (Sec.I), and where a lexical object is intrinsically interpreted in different contexts of a sentence but only externalized in one of them, see Fig.8 for an example. More intricate cases, such as those involving a concatenation of chains (the so-called successive cyclicity), can also be accounted for similarly, see Fig.9 for an example. At any rate, though, the number of loops in the TN is always quite small, as long as the syntax tree is based on a Phrase-Structure (Constituency) Grammar [26], such as PCFGs. For the sake of clarity we restrict our explanation to these grammars. Other plausible situations, such as those arising in Dependency Grammars [27], will be briefly discussed in Sec.VII.

The syntactic structure of a sentence implies, there-fore, that correlations in its probability distribution areorchestrated according to a (mostly loop-free) TN ofMERGE tensors, which organize the degrees of freedomaccording to different time scales.

FIG. 8: (Color online) Syntactic TN for the sentence “Should Einstein play violin?”, as an example of syntactic movement. The element “Should” is created at the position of t_k but externalized at the position of T_k (hence it “moved”). At the level of the TN, this can easily be accounted for by an extra correlation between these two positions, i.e., an extra link between them (and perhaps two new tensors, as shown in the figure). This introduces a loop in the TN. However, as shown in (b), it is possible to redefine the overall structure as a loop-free TN with tensors as those shown in the dotted red boxes, and reshaped (or fused) tensor indices (i.e., whenever there are two indices together, fuse them into a single big index).

C. Properties

Let us now enumerate some important properties of theprobability structures that we just found, which come outnaturally from their TN description. Some of them werealready mentioned briefly, but we revisit them again forclarity:

1. Locally-built syntactic correlations at every scale

Correlations in the probability distribution are builtlocally at every renormalization time scale by MERGE.Distant parts of the sentence become correlated at longtime scales (i.e., up in the syntax tree), whereas thosethat are close become correlated at short time scales (i.e.,down in the syntax tree). This locality implies a nicefeature of loop-free syntax trees: for a sentence with nwords, there are always exactly n − 1 merged objects.Translated into syntactic TNs, this means that if the TNhas n indices at the first renormalization scale z1 (i.e.,those corresponding to the words in the sentence), thenthere will be exactly n− 1 indices on the whole at higherrenormalization scales zm,m > 1. This can be easilychecked by inspection in all the figures with loop-freesyntax trees and syntactic TNs of this paper. The con-sequence is that to specify the full syntax of a typical


FIG. 9: (Color online) Syntactic TN for the sentence “Ángel seems to be likely to come today”, as an example of concatenation of chains, or successive cyclicity. The syntactic information of the element “Ángel” is at different places, leaving traces t_k, t′_k, ..., but is externalized at only one position T_k. At the level of the TN this can be easily accounted for by an extra index correlating different sites, similarly as in Fig.8.

sentence of n words, one requires on the whole 2n − 1 units of syntactic information. In the case of TNs with loops, as with the long-range movement in Fig.8 and Fig.9, the index creating the loop establishes a correlation between distant positions in the sentence, some of them with only syntactic information and no word present (the so-called traces). In such cases it is clear that the number of required syntactic information units is larger than 2n − 1, though not much larger.
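As a quick sanity check of this counting (our own toy sketch, with made-up phrase labels), a loop-free binary syntax tree over n words always contains n − 1 merged objects, hence 2n − 1 units in total:

def count_units(tree):
    # A toy tree node is either a word (string) or a tuple (label, left, right).
    if isinstance(tree, str):
        return 1                                         # one unit per word
    _, left, right = tree
    return 1 + count_units(left) + count_units(right)    # one extra unit per MERGE output

# A hypothetical parse of "The happy cat eats the fish": n = 6 words -> 2n - 1 = 11 units
tree = ('S', ('NP', 'the', ('NP', 'happy', 'cat')),
             ('VP', 'eats', ('NP', 'the', 'fish')))
print(count_units(tree))                                 # 11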

2. Diagonal tensors and correlated factorization

The TN, when contracted from top to bottom, reproduces the different probability distributions of the linguistic variables at every renormalization time scale. In other words, the TN encodes the probabilities of the relevant degrees of freedom at all possible time scales. Moreover, it is possible to obtain the residual probability of any of the variables just by contracting all the rest. Quite importantly, in syntactic TNs one does not even need to perform any tensor contraction since, once the sentence is fixed or partially fixed, there is a correlated factorization of the whole TN because of the way human language turns out to be, which we explain in what follows.

A well-known fact in grammar is that the output of a MERGE operation is always uniquely determined by its input. That is, given two objects being merged, there is only one possible output, no matter the context. This is a simple observation about how human language seems to work: the human brain does not merge an adjective

FIG. 10: (Color online) Syntactic TN for the sentence “Román plays his ...”, where the last word is unspecified. The syntactic environment inside the dashed area forces the upper index of tensor M^{[1]} to be NP. The first index of M^{[1]} is forced to be the determiner “his”. This constrains the probability of finding a given word at the last place of the sentence: whatever it is, it needs to merge with a determiner to become a noun phrase. There are not too many options: the word needs to be a noun. Notice that this is fully determined by the immediate neighbourhood in the sentence (the determiner), as well as the syntactic environment (the dashed region).

A and a noun N into an object that sometimes behaveslike a noun phrase NP , and sometimes like an adjectivalphrase AP . Instead the combined object behaves alwayslike a noun phrase NP . So, given the input of MERGE,its output becomes fixed uniquely [64].

This turns out to have an important consequence forus: MERGE tensors are diagonal. As a consequence,once the sentence is given, or partially given, then theTN factorizes in a correlated way. To see why this is so,notice that if the output of MERGE is always uniquelydetermined by its input, then all the indices in the syn-tactic TN become fixed once the indices at the shortesttime scale are fixed, i.e., once a specific sentence is given.Because of this, the probability of a specific sentenceactually factors out in terms of correlated probabilitiesand no TN contraction is needed at all. The overallcorrect syntactic structure of the sentence is the global,non-local property that correlates all the probabilitiesamongst themselves. Moreover, the residual probabilityof, say, finding a specific word in a sentence that is par-tially given, can be easily computed using one MERGEtensor only, which contains information about both theimmediate neighborhood of the word, as well as aboutthe overall syntactic neighborhood, see Fig.10. This isa very remarkable property that has its roots in the pe-culiarities of human language. In particular, it impliesthat the calculation of probabilities is extremely efficient,and that if the correct syntactic structure of a sentenceis fully or partially known, then the statistical perplexi-ties of reduced probability distributions are remarkablylow, as we shall discuss in more detail in the forthcomingsections.
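As a toy sketch of this determinism (ours, with a hypothetical rule table), the non-zero structure of a diagonal MERGE tensor behaves like a lookup from an input pair to a single output label:

# Hypothetical deterministic merge table: each input pair fixes the output uniquely,
# which is the content of the "diagonal tensor" property discussed above.
MERGE_LABEL = {('D', 'N'): 'NP', ('A', 'N'): 'NP', ('V', 'NP'): 'VP', ('N', 'VP'): 'S'}

def output_label(left, right):
    # For a given input pair there is never a second competing output label.
    return MERGE_LABEL[(left, right)]

print(output_label('A', 'N'))    # always 'NP', never an adjectival phrase 'AP'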


For a given sentence, therefore, the formalism produces a correlated structure of 3-index tensors linking all possible renormalization scales, see Fig.10. For example, the overall probability of, e.g., the 4-word sentence “Román plays his guitar” (an actual possibility in Fig.10) reads

p_{w*_1, w*_2, w*_3, w*_4} = M^{[3]}_{w*_1, VP, S} M^{[2]}_{w*_2, NP, VP} M^{[1]}_{w*_3, w*_4, NP},    (8)

where w*_1, ..., w*_4 are the fixed words of the sentence, and no tensor contraction is needed at all.

The above equation is a correlated product of coefficients from 3-index probability distributions, which encode all the syntactic information of the sentence at all time scales. The effect of this is more dramatic when it comes to residual probabilities: consider for instance predicting the word “drank” in the sentence “The man John met yesterday drank Japanese whisky”. A 3-gram model [28] (a rather common option in speech recognition) would give a probability distribution such as

p_{w*_4, w*_5, w_6}    (3-gram model),    (9)

i.e., correlating the word w_6 only to “met” and “yesterday”. The predictive power of this distribution is thus not very good, because there is no use whatsoever of the syntactic information from the rest of the sentence. However, in our TN description, the residual probability distribution, as shown in Fig.11, is given by

M^{[6]}_{w_6, NP, VP}    (Syntactic TN model),    (10)

which includes all the relevant syntactic information of the environment needed to predict w_6 in the sentence. In other words, having [NP [A Japanese] [N whisky]], the rest of the sentence imposes that whatever goes in w_6 needs to combine together with this NP necessarily into a verb phrase VP. To put it simply, the marginal probability distribution is governed by this question: with whom do I merge to form what, as constrained by the rest of the sentence? In hindsight, this description includes all the relevant syntactic information required to predict the word exactly at that point.
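A minimal numerical sketch (ours, with invented coefficients) of the correlated factorization in Eq.(8): once the syntax tree is fixed, the sentence probability is a plain product of MERGE coefficients, with no tensor contraction:

# Hypothetical MERGE coefficients for the tree of "Roman plays his guitar";
# the numerical values are made up purely for illustration.
M1 = {('his', 'guitar', 'NP'): 0.02}
M2 = {('plays', 'NP', 'VP'): 0.05}
M3 = {('Roman', 'VP', 'S'): 0.10}

# Eq.(8): a correlated product of coefficients, one factor per MERGE in the tree
p_sentence = (M3[('Roman', 'VP', 'S')]
              * M2[('plays', 'NP', 'VP')]
              * M1[('his', 'guitar', 'NP')])
print(p_sentence)    # 1e-4 in this toy example

# Eq.(10)-style residual prediction: a candidate word only needs the single MERGE
# coefficient that combines it (with its syntactic environment) into the right phrase.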

From the above derivations, it is clear that all probabilities can be computed very efficiently, and exactly, from the TN. To be more specific, the fact that the structures are mostly loop-free implies that the calculation of probabilities, which amounts to the contraction of the tensors in the TN, can be done in polynomial time in the number of words, i.e., O(poly(n)) [18, 23, 24]. From the perspective of complexity theory, this is a consequence of the fact that the contraction of a loop-free TN is a problem in the complexity class P [29] [65]. But in the case of syntactic TNs like the ones described here, the situation is even better because of the correlated factorization explained above, which implies that no contraction of tensors needs to be done at all. The calculation of the probability of a given sentence amounts, simply, to determining the relevant syntax tree for the sentence and then multiplying the corresponding MERGE coefficients. For a sentence with

FIG. 11: (Color online) Syntactic TN for the sentence “The man John met yesterday drank Japanese whisky”. The full syntactic environment of the word “drank” is highlighted in the dashed region, and determines the probability distribution of finding a specific lexical element at that place.

n words, it is easy to see that both steps have a computa-tional cost of O(n), and therefore the overall cost is alsoO(n). Therefore, the renormalization structure imposedby MERGE implies a very economical manipulation ofthe linguistic information in terms of computational re-sources such as time, memory, and energy.

3. Syntactic correlations

The two-point correlations in the probability distri-butions depend on the sentence (specifically, the syntaxtree) and the renormalization time scale chosen to com-pute the correlations. This is also a well-known propertyof loop-free TNs, and in our case it means that the cor-relation between two words in the sentence decays expo-nentially fast with their separation distance in the syntaxtree (i.e., their separation in the network), which may beequal to the actual separation distance in the sentence ornot.

Mathematically, this means the following: consider the two-point correlation function

C(i, j) ≡ ⟨f(w_i) f′(w_j)⟩ − ⟨f(w_i)⟩⟨f′(w_j)⟩,    (11)

with

⟨f(w_i) f′(w_j)⟩ = ∑_{w_1,...,w_n} f(w_i) f′(w_j) p_{w_1,...,w_n},
⟨f(w_i)⟩ = ∑_{w_1,...,w_n} f(w_i) p_{w_1,...,w_n},
⟨f′(w_j)⟩ = ∑_{w_1,...,w_n} f′(w_j) p_{w_1,...,w_n},    (12)

and f(w_i), f′(w_j) some functions of the variables w_i, w_j.


We could think of these variables as those representing words at times i and j, but they could also be the variables for other (renormalized) syntagmas at a longer time scale (i.e., somewhere up in the tree). It is possible to prove mathematically [18, 23, 24] that this correlation function decays asymptotically as

C(i, j) ≈ e^{−d(i,j)/τ}   for d(i, j) ≫ τ,    (13)

with d(i, j) the size of the path between w_i and w_j in the syntax tree, and τ a sentence-dependent (finite) correlation time, see Fig.12 and Fig.13. As is well known from the theory of TNs, the parameter τ does not depend on the choice of functions f(w_i) and f′(w_j), so it depends only on the type of sentence and the MERGE probabilities. This conclusion also holds if the TN has a small number of loops. Importantly, the quantity d(i, j) can depend a lot on the type of syntax tree that one has. Consider for instance the two examples “Noam drives the car”, and “The man from Boston drives well the car”, with syntax trees as in Fig.12 and Fig.13. In the first case, Fig.12, the syntax tree is purely sequential, and therefore the TN for the probability distribution is a Matrix Product State [24]. In such a case it is clear that the distance d(i, j) between two words is the actual separation distance in the sentence, i.e., d(i, j) = |j − i|. However, in the second case, Fig.13, the syntax tree is a binary tree, and therefore the corresponding TN is a Tree Tensor Network [23]. In such a case, the path along the tree between two words in the sentence necessarily goes also along the vertical axis, and one can prove that it is given by d(i, j) ≈ log_2 |j − i|, again with |j − i| the separation in the sentence, and where ≈ means that it is correct up to some possible additive constant term [66]. Therefore, in cases such as the one in Fig.12 (“Noam drives the car”), the correlation function between two words will behave like

C(i, j) ≈ e^{−|j−i|/τ}   for |j − i| ≫ τ,    (14)

whereas in cases such as Fig.13 (“The man from Boston drives well the car”) it will behave like

C(i, j) ≈ e^{−(log_2 |j−i|)/τ} ≈ 1/|j − i|^{1/τ}   for |j − i| ≫ τ.    (15)

In both cases the correlation falls towards zero with the separation distance |j − i| in the sentence, but in the first case it decreases exponentially fast, whereas in the second it decreases only polynomially fast, and therefore much slower than in the first case. Notice, however, that at the level of renormalized objects up in the syntax tree, the two situations are completely equivalent, see Fig.13. This means that the correlation functions for the second sentence (Fig.13), but at some longer time scales (i.e., at the level of renormalized syntagmas up in the tree), decay exactly in the same way as those in the first sentence (see Fig.12).
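To make the contrast between Eqs.(14) and (15) concrete, a small numerical sketch (ours, with an illustrative correlation time τ = 2) comparing the two decay profiles:

import numpy as np

tau = 2.0                               # illustrative correlation time
sep = np.array([1, 2, 4, 8, 16, 32])    # separation |j - i| within the sentence

mps_like = np.exp(-sep / tau)           # Eq.(14): sequential tree, d(i,j) = |j-i|
ttn_like = sep ** (-1.0 / tau)          # Eq.(15): binary tree, d(i,j) ~ log2|j-i|

for s, a, b in zip(sep, mps_like, ttn_like):
    print(f"|j-i| = {s:2d}   exponential: {a:.2e}   polynomial: {b:.2e}")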

Three remarks are in order. First, in intermediate sit-uations between those described by the two examples

FIG. 12: (Color online) For a TN structure such as the one for “Noam drives the car”, the syntactic distance d(i, j) is the same as the time separation distance, i.e., d(i, j) = |j − i|. This is because the structure of correlations can be written as a Matrix Product State. Two-point correlation functions in this type of sentences decay exponentially fast in the time separation |j − i|, as explained in the text. The syntactic path between i and j is shown with a red thick line.

FIG. 13: (Color online) For a TN structure such as the one for “The man from Boston drives well the car”, the syntactic distance d(i, j) is not the time distance, but rather its logarithm, i.e., d(i, j) ≈ log_2 |j − i|. This is so because the syntactic path between positions i and j goes also through the renormalization scale. Consequently, there are two-point correlation functions for these types of sentences which can decay polynomially fast towards zero in the time separation |j − i|, hence much slower than in the case of Fig.12 (see text). The path between i and j is shown with a red thick line. Notice, however, that at the level of the renormalized syntagmas in the red dotted boxes, the structure is exactly the same as the one in Fig.12.

above, we also expect an intermediate regime between the two limiting cases of Eq.(14) and Eq.(15), but always obeying Eq.(13) asymptotically. Second, notice that the correlation time τ measures roughly how fast these correlations decay: the shorter τ is, the faster they decay towards zero. And third, notice that Eqs.(13), (14) and (15) essentially imply that language, at least within this description, always has very short-distance correlations within the syntax tree, which does not necessarily imply short-distance correlations in the separation distance within a sentence, since these two distances are different,


as shown in Eq.(15). In fact, we will show in Sec.IV that for actual separation distances in the sentence one finds that, on average, language has long-range correlations, i.e., they decay polynomially, in turn matching the experimental observations [30]. Similar conclusions apply as well in the case of having a small number of loops in the network, e.g., in linguistic chains, or in situations such as the German language, where a word correlated with the beginning of the sentence is actually sent to the end.

4. Positivity

By construction, the syntactic TNs presented here are such that all the tensors are non-negative, i.e., they are made entirely of non-negative coefficients. This is because of the stochastic nature of the MERGE tensor, which has been defined in terms of probabilities. These are sometimes called stochastic tensor networks. It is well known that such a positivity restriction on the coefficients of the tensors is in fact very stringent [32] and usually implies a very large dimension for the vector spaces of the coarse-grained variables in the TN. We may therefore expect that a TN description of the overall probability distribution in terms of non-positive tensors, i.e., dropping the positivity restriction, lowers this dimension, thus making the representation computationally more efficient. The price to pay, however, is that we lose the interpretation of the MERGE tensor as a tensor of probabilities. Still, a non-positive TN may be computationally more convenient in some situations.

D. Refinement levels

The TN structure of MERGE tensors that we justdescribed admits different levels of refinement, when itcomes to determining the actual probability of a sentencein a given language model. A practical evaluation of suchprobabilities, once a parsed corpus (a Penn TreeBank) isgiven, proceeds as follows:

(i) First, one does a frequency count of all the words,and computes the probability of being some lexical cat-egory (N , V , etc) conditioned to being a certain word.This probability distribution corresponds, formally, to aMERGE operation at an initial time scale z0 between aset of words and a set of lexical categories, as mentionedbriefly in the caption of Fig.3. In practice, though, it can

be accounted for by a 2-index probability matrix Mαβ ,with the first index referring to a particular word, andthe second to its lexical category.

(ii) Second, one considers every sentence in the corpusand the respective syntax tree, and computes the proba-bilities corresponding to the coefficients of the MERGEtensors. This is done by counting the frequency of howmany times two given lexical elements merge into a given

object. Quite importantly, there are (at least) four different levels of refinement of the computed tensors, depending on their position in the ⟨z, t⟩ plane and the structure of the syntax tree. In increasing order of refinement (a minimal lookup sketch follows the list), these are:

1. One single MERGE tensor M for all possible posi-tions in the 〈z, t〉 plane.

2. One MERGE tensor M [z] for each possible renor-malization scale, each one for all possible positionsin t at the corresponding scale.

3. One MERGE tensor M [z,t] for each possible posi-tion in the 〈z, t〉 plane.

4. One MERGE tensor M [T,z,t] for each possible posi-tion in the 〈z, t〉 plane, and for each possible syntaxtree T .
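A minimal sketch (ours) of how the four refinement levels could be keyed in practice when counting frequencies from a TreeBank; the key layouts are an assumption for illustration, not a prescription from the paper:

def tensor_key(level, alpha, beta, gamma, z=None, t=None, T=None):
    # Build the dictionary key under which a MERGE probability would be stored,
    # for the chosen refinement level (1 to 4) described in the list above.
    if level == 1:
        return (alpha, beta, gamma)              # a single PCFG-like tensor
    if level == 2:
        return (z, alpha, beta, gamma)           # one tensor per renormalization scale z
    if level == 3:
        return (z, t, alpha, beta, gamma)        # one tensor per position in the <z, t> plane
    return (T, z, t, alpha, beta, gamma)         # additionally, one per syntax tree T

counts = {}                                      # frequency counts accumulated from the corpus
key = tensor_key(3, 'D', 'N', 'NP', z=1, t=4)
counts[key] = counts.get(key, 0) + 1             # later normalized into probabilities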

The more refined the information included in the computed MERGE tensors, the more accurate the probability distribution, and therefore the better the language model. The first of the refinement levels described above corresponds to the probabilistic language models provided by PCFGs [22]. These models are known to work reasonably well in some circumstances, although on average not as well as, say, N-gram models [28]. But this is understandable, because one does not expect a priori the same MERGE tensor at all the possible positions in the ⟨z, t⟩ plane. Importantly, there are still three more levels of refinement, which should account for better models. The second level drops the assumption of being “ancestor-free” (akin to “scale invariance” in physical jargon), so that the tensors may depend on the scale z. The third level additionally drops the assumption of “place invariance” (akin to “translation invariance” in physical jargon), so that the tensors may also depend on the variable t. Finally, the fourth level of refinement drops the assumption of the MERGE tensors being tree-independent. In principle, the four refinement levels are computable from a TreeBank, implying increasing levels of precision for the language model. As for the computational cost of retrieving the MERGE tensors, in the first three levels it should be O(Mn) both for time and memory, with n the average number of words per sentence in a corpus containing M sentences. In the fourth case, the time cost is the same but the memory cost may be larger since, for a large text, we expect to find roughly all possible syntax trees for every sentence length, which for n words is in turn given by the (n − 1)th Catalan number,

C_{n−1} = (2(n − 1))! / (n! (n − 1)!) ≈ (4^n / (√π n^{3/2})) (1 + O(1/n)),    (16)

where the approximation holds in the limit n ≫ 1, and therefore the count scales exponentially. However, typical sentences in human language do not usually involve a dramatically large number of words (we elaborate more on this in Sec.VIII), and therefore the number of different syntax


trees to be stored in memory may not be as large in prac-tice as the above number.
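For reference, a short sketch (ours) evaluating the exact count in Eq.(16) for typical sentence lengths, illustrating that the number of binary trees stays manageable for the modest n of ordinary sentences:

from math import comb

def catalan(m):
    # C_m = (2m)! / (m! (m+1)!): the number of binary syntax trees with m merges
    return comb(2 * m, m) // (m + 1)

for n in (5, 10, 15, 20):          # n = number of words in the sentence
    print(n, catalan(n - 1))       # e.g. 4862 distinct trees for n = 10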

Once the MERGE tensors have been computed fromthe TreeBank, the numerical probability for a sentenceof n given words can be obtained in a two-step process:

(i) First, compute the possible syntax trees of the sentence (there may be more than one valid tree in ambiguous cases).

(ii) Second, evaluate the probability for each tree following the correlated factorization procedure explained in the previous section, according to the four refinement levels mentioned above. The overall probability is the sum of the probabilities for each valid syntax tree.
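A minimal sketch of this evaluation step, assuming the single-tensor (level-1) model and made-up word-to-category and MERGE coefficients (they play the role of the matrix M_{αβ} of step (i) and of one MERGE tensor): for a fixed binary syntax tree, the correlated factorization turns the probability into a product of one coefficient per word and one per merge.

```python
# Hypothetical word -> category coefficients (the matrix M_{alpha beta} of step (i)).
lexical = {("the", "D"): 1.0, ("cat", "N"): 0.6, ("sleeps", "V"): 0.5}

# Hypothetical MERGE coefficients M_{alpha beta gamma}; toy numbers only.
merge = {("D", "N", "NP"): 0.4, ("NP", "V", "S"): 0.5}

def prob(node):
    """node = (word, category) for a leaf, (gamma, left, right) for a merge.
    Returns (probability, category), using the correlated factorization."""
    if len(node) == 2:                       # leaf: a word with its category
        return lexical[node], node[1]
    gamma, left, right = node
    p_l, cat_l = prob(left)
    p_r, cat_r = prob(right)
    return p_l * p_r * merge[(cat_l, cat_r, gamma)], gamma

tree = ("S", ("NP", ("the", "D"), ("cat", "N")), ("sleeps", "V"))
p, _ = prob(tree)
print(p)      # 1.0 * 0.6 * 0.4 * 0.5 * 0.5 = 0.06
# For an ambiguous sentence, the overall probability is sum(prob(t)[0] for t in trees).
```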

As the probabilities are computed, it is possible to calculate standard benchmark measures of language models, such as the so-called perplexity P,

P = 2^{H(p)} = 2^{-\sum_{w_1,\ldots,w_n} p_{w_1,\cdots,w_n} \log_2 p_{w_1,\cdots,w_n}}, \qquad (17)

with H(p) the Shannon entropy of the probability distribution. The lower the perplexity, the more peaked the distribution, and thus the better it predicts the sample. So, the better the language model, the lower its perplexity, at least a priori. In our case we also expect the perplexity to decrease substantially as the refinement of the coefficients of the MERGE tensors increases, according to the four refinement levels mentioned above. Moreover, the perplexity also goes down with the precision of the probabilities in our MERGE tensors. We prove these points in Sec. V, using at some steps a novel reformulation of language models in terms of quantum states.
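A minimal sketch of Eq. (17), with a toy distribution over sentences whose numbers are made up for illustration: a more refined (more peaked) distribution gives a lower perplexity.

```python
import numpy as np

def perplexity(probs):
    """Perplexity 2**H(p) of a normalized probability vector, as in Eq. (17)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # 0 * log 0 = 0 by convention
    H = -np.sum(p * np.log2(p))       # Shannon entropy in bits
    return 2.0 ** H

coarse = [0.25, 0.25, 0.25, 0.25]     # unrefined model: flat over 4 sentences
refined = [0.70, 0.20, 0.05, 0.05]    # more refined model: more peaked
print(perplexity(coarse), perplexity(refined))   # prints 4.0 and roughly 2.4
```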

IV. LONG-RANGE CORRELATIONS IN LANGUAGE

What do correlations look like, on average, in language? It turns out that we can provide an answer to this question using the results presented so far. We saw in Eqs. (13, 14, 15) that correlation functions C(i, j) between two points i and j in a sentence decay exponentially with the syntactic distance d(i, j), i.e., the distance between positions i and j in the network. And this distance may be the actual distance in the sentence, as in Fig. 12, or not, as in Fig. 13.

In practice, it is clear that all languages tend to have more structures like the one in Fig. 13 than like the one in Fig. 12. The simple reason for this is that MERGE favours the formation of tree-like structures. So, if we were to count the number of possible tree-like sentences in a language (i.e., those like Fig. 13), they would certainly outnumber the “linear” structures (i.e., those like Fig. 12). The conclusion from this is that the average syntactic distance d(i, j) over all the sentences of a given language obeys

\overline{d(i,j)} \approx \log_2 |j - i|\,, \qquad (18)

where the overline means “average”. This must be the case, simply because most of the syntactic structures are trees as in Fig. 13. In turn, this implies that the average correlation function C(i, j) behaves like

C(i,j) \approx \frac{1}{|j-i|^{1/\tau}}\,, \qquad (19)

with τ a correlation time that depends only on the average over all sentences of the specific language being analyzed. Such a behaviour is also inherited by other quantities like the mutual information I(i, j), since it satisfies the inequality [33]

I(i,j) \ge a \times C(i,j)^2, \qquad (20)

with a some constant prefactor. Thus, on average, for the mutual information one finds

I(i,j) \ge a \times \frac{1}{|j-i|^{2/\tau}}\,, \qquad (21)

meaning that it also decays at least polynomially with the separation distance |j − i| in the sentence, and with a characteristic correlation time τ/2.
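The chain of reasoning behind Eqs. (18)-(19) can be checked numerically with a minimal sketch. We assume, for concreteness, a base-2 exponential decay of the correlations in the tree distance (the value of τ is an arbitrary illustrative choice); combined with the logarithmic average distance of Eq. (18), the decay in the word separation |j − i| comes out polynomial with exponent 1/τ.

```python
import numpy as np

tau = 2.0                                  # illustrative correlation "time"
seps = np.array([2, 4, 8, 16, 32, 64])     # word separations |j - i|

d = np.log2(seps)                          # average tree distance, Eq. (18)
C = 2.0 ** (-d / tau)                      # exponential decay in d, cf. Eqs. (13-15)

# Since 2**(-log2|j-i| / tau) = |j-i| ** (-1/tau), the decay in |j - i|
# is polynomial, which is precisely Eq. (19).
slope = np.polyfit(np.log(seps), np.log(C), 1)[0]
print(slope, -1.0 / tau)                   # both equal -0.5 here
```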

The result we just found in Eqs. (19, 21) deserves some discussion. First, notice that we obtained these expressions exactly. They therefore imply that languages, on average, have a polynomial decay of correlation functions and mutual information between two points. Importantly, this has been observed in experiments and numerical analyses of specific texts and languages, and is what people often call “long-range correlations” in language [30]. Our results thus provide a mathematical proof from first principles of this observed phenomenon.

Second, notice that some properties of Eqs. (19, 21) are language-independent, and therefore universal, whereas others are language-dependent. In particular, the average polynomial decay is universal. However, the characteristic time scale for this decay, namely the exponent τ, is language-dependent. The number τ therefore provides a quantitative way to classify different texts and different languages. Those texts and languages with similar syntactic structures will have similar values of τ, and it is therefore natural to think that they somehow belong to the same family, i.e., that they probably come from the same ancestor (e.g., Spanish, Italian and French all coming from Latin). And this is important, because such an analysis may provide hints about the genealogical (or “genetic”) root of languages whose origin we do not know [34], such as Basque, Sumerian, and others.

V. LANGUAGE MODEL QUANTUM STATES

Let us now define the following quantum state:

|\Psi(T_n)\rangle = \frac{1}{Z(T_n)^{1/2}} \sum_{w_1,\ldots,w_n} \left(p_{w_1,\cdots,w_n}\right)^{1/2} |w_1, \ldots, w_n\rangle, \qquad (22)


FIG. 14: (Color online) TN diagram for Eq. (25). The matrix on the right-hand side is diagonal, with entries p^{[i]}_γ δ_{γγ′}.

with p_{w_1,···,w_n} the probability of a sentence with words w_1, ···, w_n and syntax tree T_n, and |w_1, . . . , w_n⟩ an orthonormal (tensor product) basis of some Hilbert space for n parties, each party corresponding to the position of a word in the sentence. The normalization factor Z(T_n) is actually the partition function of the probability distribution, i.e.,

\langle \Psi(T_n)|\Psi(T_n)\rangle = \frac{1}{Z(T_n)} \sum_{w_1,\ldots,w_n} p_{w_1,\cdots,w_n} = 1. \qquad (23)

We call the state in Eq. (22) a language model quantum state.
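A minimal sketch of Eqs. (22)-(23), assuming a made-up (unnormalized) probability table over two-word sentences from a three-word vocabulary: the amplitudes are square roots of probabilities divided by Z(T_n)^{1/2}, and the resulting vector has unit norm.

```python
import numpy as np

# Toy unnormalized probabilities p_{w1 w2} over a 3-word vocabulary; made up.
p = np.array([[0.10, 0.05, 0.00],
              [0.20, 0.00, 0.15],
              [0.00, 0.25, 0.10]])

Z = p.sum()                                  # partition function Z(T_n), Eq. (23)
psi = np.sqrt(p / Z).reshape(-1)             # amplitudes (p / Z) ** (1/2), Eq. (22)

print(np.dot(psi, psi))                      # <Psi|Psi> = 1, cf. Eq. (23)
# Measuring this state in the word basis samples sentences with probability p / Z.
```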

Because of the correlated factorization of syntactic TNs explained in previous sections, one can easily see that these language model quantum states admit a TN representation of their coefficients, i.e., they are really TN states in the strict quantum-mechanical sense. The TN structure of the coefficient (p_{w_1,···,w_n})^{1/2} is simply the same as that of the probability distribution p_{w_1,···,w_n} (the syntactic TN), but with every coefficient of a MERGE tensor replaced by its square root. More specifically, it is the same TN but with 3-index tensors A^{[i]} of coefficients

A^{[i]}_{\alpha\beta\gamma} \equiv \left( M^{[i]}_{\alpha\beta\gamma} \right)^{1/2}, \qquad (24)

again with i simply a label for the different tensors. This simple prescription is a direct consequence of the tensors being diagonal in the syntactic TN. Notice also that these tensors obey the condition

\sum_{\alpha,\beta} A^{[i]}_{\alpha\beta\gamma} \left( A^{[i]}_{\alpha\beta\gamma'} \right)^* = \sum_{\alpha,\beta} M^{[i]}_{\alpha\beta\gamma} \, \delta_{\gamma\gamma'} = p^{[i]}_{\gamma} \, \delta_{\gamma\gamma'}, \qquad (25)

with p^{[i]}_γ the probability of merging, at position i, any two given lexical objects into γ, and δ_{γγ′} the Kronecker delta; see Fig. 14.

A. Properties

The language TN quantum state that we just defined is interesting for a number of reasons. These are described in what follows.

FIG. 15: (Color online) Iterative procedure to get the quantum circuit producing a language model quantum state for a given syntax tree (see text). The red dashed lines in the upper diagram correspond to QR decompositions. The process is iterated at every scale, until reaching the top.

1. Truly random sampling

First, notice that if this quantum state becomes (somehow) experimentally available in an actual quantum system, then it can be used to do truly random sampling of the probability distribution of sentences with that particular syntax tree. For comparison, all classical samplings are based on pseudo-random number generators, which are known to induce errors in the long run for, e.g., Monte Carlo methods. The state could also be useful, for instance, to find the most likely sentences in a language model, and the like.

2. Language model quantum circuit

Second, the state can, in fact, be created by a quantum circuit with as many two-body gates as A-tensors. The procedure is sketched in Fig. 15: starting from the shortest renormalization scale z_1, one reshapes the indices of the A-tensors as a matrix and performs a QR decomposition [35], as shown in the figure. Since the A-tensors are real and positive, the matrix Q is orthogonal, i.e., Q^T Q = I. Reshaping Q back into a 3-index tensor provides an isometric tensor, which we keep at the particular sites of the network at that renormalization scale. The matrices R, however, are contracted with the A-tensors at the next renormalization scale z_2, see Fig. 15. The resulting tensors, call them B, are then also QR-decomposed, where the Qs again define isometries, which we keep in the network, and the Rs are contracted with the A-tensors at the next renormalization scale. By iterating this process up to the top level, one gets a TN of isometric 3-index tensors Q^{[i]}, and a quantum state |Ω⟩ at the very top carrying non-local information about the probability of the whole sentence. In particular, since the tensors Q^{[i]} are isometries, one has that

\langle \Psi(T_n)|\Psi(T_n)\rangle = \frac{1}{Z(T_n)} \langle \Omega|\Omega\rangle = 1, \qquad (26)

(where the last equality follows from the normalization of the state), and therefore

\langle \Omega|\Omega\rangle = Z(T_n) = \sum_{w_1,\ldots,w_n} p_{w_1,\cdots,w_n}, \qquad (27)

which means that the norm of the quantum state |Ω⟩ is the overall probability of having an n-word sentence (whichever) with syntax tree T_n in the language model. This global information just moved up to the top level of the TN and, importantly, we constructed it locally at every renormalization scale by a sequence of QR decompositions, therefore very efficiently (notice that we never needed to compute each one of the terms p_{w_1,···,w_n} individually!) [67]. Connecting to the usual developments in quantum-mechanical TN states, this is an example of an isometric TTN state [23]. Finally, in order to promote this structure to a quantum circuit, we simply notice that an isometric tensor can be understood as a two-body unitary gate, where one of the indices is fixed to some ancillary state |0⟩ [16], see Fig. 16. The resulting diagram is nothing but the picture of the quantum circuit producing the desired quantum state. The conclusion is that if the MERGE tensors are given, then one could in principle produce these quantum states efficiently on a quantum computer or a quantum simulator. Last but not least: the description above has been for TNs without loops, but it can be generalized to other situations. In the case of a small number of loops in the network (e.g., in CHAINS), there is a similar procedure to the one indicated here, playing with several tensor decompositions (QR, Singular Value Decomposition, etc.), always sending the non-unitary parts upwards in the syntactic network. We envisage that this may trigger applications of near-term quantum processors for computational tasks related to language.
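A minimal sketch of the QR sweep of Fig. 15, for a toy two-level tree with two A-tensors at the bottom scale and one at the top. The shapes, random entries, and bookkeeping are our own illustrative choices: each A-tensor is reshaped into a matrix, QR-decomposed, the isometry Q is kept, and the factor R is absorbed into the tensor at the next scale.

```python
import numpy as np

d = 4                                     # toy bond / physical dimension
rng = np.random.default_rng(0)

def random_A(shape):
    # Nonnegative toy tensors playing the role of A = sqrt(M); illustrative only.
    T = rng.random(shape)
    return T / np.linalg.norm(T)

# Scale z1: two 3-index A-tensors with legs (input_left, input_right, output).
A1 = random_A((d, d, d))
A2 = random_A((d, d, d))
# Scale z2 (top): one 3-index A-tensor merging the two z1 outputs.
A_top = random_A((d, d, d))

def qr_step(A):
    """Reshape (in_l, in_r, out) -> matrix, QR-decompose, return isometry Q and R."""
    mat = A.reshape(d * d, d)
    Q, R = np.linalg.qr(mat)              # Q has orthonormal columns: Q^T Q = I
    return Q.reshape(d, d, d), R          # R is passed to the next scale

Q1, R1 = qr_step(A1)
Q2, R2 = qr_step(A2)

# Absorb R1, R2 into the top tensor along its two input legs, then QR again.
B_top = np.einsum('ia,jb,abk->ijk', R1, R2, A_top)
Q_top, R_top = qr_step(B_top)

# The leftover factor at the very top plays the role of |Omega> in the text:
# since the Qs are isometries, its squared Frobenius norm carries the overall
# (unnormalized) weight of the whole network.
print(np.linalg.norm(R_top) ** 2)
```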

VI. PERPLEXITY FROM ENTANGLEMENT

An interesting application of the language model quantum states defined in the previous section concerns the perplexity P of a language model, which was defined in Eq. (17). For a given sentence, it turns out that we can give lower bounds on the perplexity of a given subset of words, using tools from quantum information theory, as we show next.

FIG. 16: (Color online) Quantum circuit of 2-body gates producing a language model quantum state for a given syntax tree. Ancillary degrees of freedom are fixed to the quantum state |0⟩. The state |Ω⟩ at the top may be produced from |0⟩ by some extra 1-body gate, and its squared norm codifies the overall probability of the tree.

FIG. 17: (Color online) Subset of n′ contiguous words in an arbitrary sentence, as described by the language quantum state in Eq. (22). The clouds indicate some arbitrary piece of an arbitrary syntactic tree.

Let us start by considering a sentence with n words, and a subset of n′ < n contiguous words within the sentence. These form a block of n′ words. The question we want to answer now is: how much entanglement does this block of n′ words have in a given quantum state |Ψ(T_n)⟩ for a syntax tree T_n? Following the usual procedure for bipartite entanglement, we first compute the reduced density matrix of the block,

\rho(n') = {\rm tr}_{n-n'} \, |\Psi(T_n)\rangle\langle\Psi(T_n)|, \qquad (28)

with tr_{n−n′}(·) the partial trace over the rest of the system (the environment). As shown in the diagrams of Fig. 17, this can be achieved by “cutting” out the relevant sub-tree linking the n′ words from the rest of the sentence. After the appropriate contractions, this reduced density matrix can always be written as

\rho(n') = \frac{1}{Z(T_n)} \, W X W^{\dagger}, \qquad (29)

with W some rectangular matrix accounting for the contraction of the sub-tree for the block, and X a square matrix whose rank is the number of lexical categories N_l in our grammar, this being also the rank of ρ(n′), see Fig. 18. It is easy to see, moreover, that matrix X is in fact diagonal,

X_{\alpha\alpha'} \propto p(n-n')_{\alpha} \, \delta_{\alpha\alpha'}, \qquad (30)

with p(n − n′)_α the overall probability of the string of n − n′ words merging into lexical category α, no matter the words in the string. One can also see that the (unnormalized) eigenvectors of ρ(n′) are given by

(v_{\alpha})_{\omega} = \left(W^{\dagger}\right)_{\omega\alpha}, \qquad (31)

with (v_α)_ω the ω-th coefficient of the α-th eigenvector, and eigenvalues λ_α given by

\lambda_{\alpha} = p(n')_{\alpha} \; p(n-n')_{\alpha}, \qquad (32)

with p(n − n′)_α as described above, and similarly for p(n′)_α but for the set of n′ words, see Fig. 19. Using Eq. (32), one can get the entanglement entropy S(ρ(n′)) and the single-copy entanglement E_1(ρ(n′)) of the block of n′ words [36], which are given respectively by

S(\rho(n')) = -\sum_{\alpha} \lambda_{\alpha} \log_2 \lambda_{\alpha},
E_1(\rho(n')) = -\log_2 \left( \max_{\alpha} \lambda_{\alpha} \right). \qquad (33)

The above entanglement measures obey the chain of inequalities

E_1(\rho(n')) \le S(\rho(n')) \le \log_2 N_l, \qquad (34)

which implies that the entanglement of the block can never be too large, since the number of lexical categories N_l in a typical grammar for human language is usually quite small.
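A minimal sketch of Eqs. (32)-(34), with made-up merging probabilities p(n′)_α and p(n−n′)_α for a toy grammar of N_l = 4 categories; the eigenvalues are normalized by hand so that ρ(n′) has unit trace.

```python
import numpy as np

Nl = 4                                             # toy number of lexical categories
p_block = np.array([0.50, 0.30, 0.15, 0.05])       # p(n')_alpha      (illustrative)
p_rest  = np.array([0.40, 0.40, 0.10, 0.10])       # p(n - n')_alpha  (illustrative)

lam = p_block * p_rest                             # Eq. (32), up to normalization
lam = lam / lam.sum()                              # enforce tr rho(n') = 1

S  = -np.sum(lam * np.log2(lam))                   # entanglement entropy, Eq. (33)
E1 = -np.log2(lam.max())                           # single-copy entanglement, Eq. (33)

print(E1 <= S <= np.log2(Nl))                      # chain of inequalities, Eq. (34)
```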

Next, we notice that the probability distribution p_ω for the n′ words in the block is actually given by the diagonal elements of ρ(n′) in the basis of Eq. (22) restricted to the block, i.e.,

p_{\omega} = \rho(n')_{\omega\omega}. \qquad (35)

One can check from the derivations above that this probability distribution and that of the eigenvalues λ_α obey the majorization relation [37]

\vec{p} \prec \vec{\lambda}, \qquad (36)

which implies

H(p_{\omega}) \ge S(\rho(n')), \qquad (37)

i.e., the Shannon entropy of the reduced probability distribution for the block of n′ words is at least as large as the entanglement entropy of the block. This relation, combined with Eq. (34), directly implies that

P = 2^{H(p_{\omega})} \ge 2^{S(\rho(n'))} \ge 2^{E_1(\rho(n'))}, \qquad (38)

FIG. 18: (Color online) Reduced density matrix of a block of n′ contiguous words in the language state of Eq. (22).

FIG. 19: (Color online) TN diagram for the eigenvalue equation of the reduced density matrix ρ(n′).

with P the perplexity of the distribution of the n′ words as defined in Eq. (17). Combining this with Eq. (32) and Eq. (33), in the end we arrive at the result

P \ge \min_{\alpha} \left( \frac{1}{p(n')_{\alpha} \; p(n-n')_{\alpha}} \right), \qquad (39)

which is our main lower bound for the perplexity of the probability distribution of the n′ words.
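With the same toy merging probabilities used in the sketch above, Eq. (39) gives a concrete number; a minimal sketch follows, where the numbers are again purely illustrative.

```python
import numpy as np

p_block = np.array([0.50, 0.30, 0.15, 0.05])   # p(n')_alpha     (toy numbers)
p_rest  = np.array([0.40, 0.40, 0.10, 0.10])   # p(n - n')_alpha (toy numbers)

bound = 1.0 / np.max(p_block * p_rest)         # Eq. (39): min_alpha 1 / (p p)
print(bound)                                   # here the block perplexity is >= 5.0
```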

Some remarks are in order. First, notice that Eq. (39) is a fully classical result, even though we used the machinery of quantum information theory to find it. Second, the inequality gives us a fundamental lower bound on how well our language model can predict sentences, just because of its statistical nature. Third, we can roughly estimate the scaling of this lower bound: if p_max is the maximum merging probability over all MERGE tensors in the network, it is easy to see that

P \gtrsim \left( \frac{1}{p_{\rm max}} \right)^{n-1}, \qquad (40)

which implies, also roughly, that the perplexity gets worse (increases) exponentially fast with the number of words n in the sentence, but also that it improves (decreases) exponentially fast if the MERGE probabilities of the language model get more refined and accurate. This inequality clearly shows the route required to improve the performance of syntax-based probabilistic language models: short sentences, and accurate probabilities.

FIG. 20: (Color online) Possible MERA-like TN for some possible dependency grammar. Probability distributions (tensors) are correlated at every renormalization scale. The structure is no longer a tree if all possible dependencies are taken into account at every scale, as shown in the diagram.

VII. ARBITRARY GRAMMARS AND LANGUAGE MODELS

We would like to say a couple of words about other types of grammars, not necessarily context-free, as well as other language models. Importantly, the tensor network picture of language is not necessarily restricted to the cases that we presented above, and can in fact be used to describe the correlation structure of, essentially, any type of grammar and/or language model. For instance, the trees of dependency grammars [27], though not based on the MERGE operation, also admit a TN representation of their correlations when cast as a probabilistic language model. We could even add long-range dependencies between the probability distributions in constituency grammars, as was shown for the case of chains in Fig. 8, and which can in fact be generalized over the whole ⟨z, t⟩ plane, obtaining what is known in physics as a MERA-like tensor network [16], see Fig. 20. As a matter of fact, it would be possible to model with TNs any grammatical correlation structure, even if not directly linked to human language. An example would be a syntactic structure based on a hypothetical MERGE operation with multiple outputs for a given input. Such structures would not have the property of “correlated factorization” discussed above, but most of the key properties that we mentioned would still hold, including those related to computational efficiency and short-range syntactic correlations.

From a practical perspective, the so-called N-gram models [28], where the probability of observing a word is assumed to depend only on the history of the preceding N − 1 words, also admit a similar description. For instance, the case of 1-grams corresponds to the product probability distribution

p_{w_1,\ldots,w_n} = p^{[1]}_{w_1} \cdots p^{[n]}_{w_n}, \qquad (41)

FIG. 21: (Color online) TN for a 1-gram language model. Only the time axis is relevant, and there is no correlation between the words w_1, ..., w_n. In physics, this is the analogue of the so-called mean-field theory approximation.

which can be represented by the TN diagram of Fig. 21. Such a 1-gram TN does not include any correlation between the words. For comparison, similar separable TNs are also used in the so-called mean-field approximation to strongly correlated systems, where correlations between different sites are discarded [38], and which is known to fail whenever correlations are important. For the case of more complicated N-grams, one can actually define an appropriate language model quantum state, i.e.,

|\Psi(N\text{-gram})\rangle = \frac{1}{Z^{1/2}} \sum_{\alpha \in N\text{-grams}} \left(p_{\alpha}\right)^{1/2} |\alpha\rangle, \qquad (42)

with α an index running over all possible N-grams, p_α their probabilities, |α⟩ a set of orthonormal states, one for every N-gram (which is rather easy to construct), and Z the partition function of the distribution. Once such a state is available, one can do the same things as for the TN language models discussed previously, such as truly random sampling, and so forth.
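A minimal sketch of the 1-gram factorization of Eq. (41), with made-up unigram probabilities: the sentence probability is a plain product, i.e., a TN with no bonds between the words (the mean-field structure of Fig. 21). For simplicity the distribution is taken position-independent, whereas Eq. (41) allows a different p^{[i]} at every position.

```python
import numpy as np

# Hypothetical unigram probabilities p[w]; illustrative numbers only.
unigram = {"the": 0.20, "cat": 0.05, "sat": 0.03}

def prob_1gram(sentence):
    """Eq. (41): p_{w1...wn} = p[w1] * ... * p[wn]; no correlations at all."""
    return float(np.prod([unigram[w] for w in sentence]))

print(prob_1gram(["the", "cat", "sat"]))       # 0.2 * 0.05 * 0.03 = 3e-4
```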

VIII. IMPLICATIONS

Our “renormalization picture” of syntax and the results presented above demand a detailed discussion of their implications, which extend into different ambits. In what follows we elaborate on some of them, taking a somewhat more philosophical perspective than in the previous sections, though one well-grounded in our rigorous observations so far.

A. Good and bad models for language processing

The first practical implication, as we have already hinted in the previous sections, is that “good” language models (of any kind) should be compatible with the coarse-graining picture that we presented. From a generic perspective, one should expect a language model to reproduce the way humans seem to organize correlations in sentences, and from our perspective, this is given by the organization of coarse-grained information at different time scales. Concerning the field of artificial intelligence, we thus believe that a good starting point to obtain better language-processing algorithms is to also include this organization of linguistic information according to time scales. This is in fact already partially achieved by the so-called “syntactic language models” [39]. The same applies to theoretical models of language in theoretical linguistics [68]. Notice, importantly, that in this work we never hypothesized about the fundamental theory of grammar behind the known properties of MERGE. Questions such as “why do a noun and an adjective merge into a noun phrase?”, or “why is the output of MERGE uniquely determined by its input?”, are beyond the scope of this work. In other words: we observed how correlations in human language get organized, explained this organization using the tools of physics, and exploited the consequences. And that is everything we did. We never discussed where these correlations could come from, or why they are as they are. In any case, and this is the point that we wish to make here, models attempting to explain this, either computational or theoretical, should encompass the picture presented here to be legitimate, since our observations are general.

We are now in a position to answer the following question: why are some neural networks good at processing language, whereas others are bad? The reason is now clear: those neural networks that are good, such as deep convolutional networks, are nothing but Tree Tensor Networks (TTNs) [45], and therefore codify the correct renormalization structure. As we explained before, TTNs can encompass the long-range correlations observed in language (i.e., the polynomial decay of correlation functions and of mutual information for two words in a sentence, see Sec. IV). Additionally, those neural networks that are bad at language processing, such as recurrent neural networks and hidden Markov models, also have a TN structure [45], but it corresponds to an MPS, i.e., a structure such as the one in Fig. 12. As we also explained before, MPS have exponentially-decaying correlation functions and thus cannot account for the average long-range correlations of language. The message is then clear: some neural networks work well because they have the correct renormalization structure (e.g., deep convolutional networks), whereas others work badly because they do not have such a structure (e.g., recurrent neural networks and hidden Markov models). In fact, the connection between neural networks and renormalization had already been pointed out, see for instance Ref. [46]. Notice that, as a byproduct, our observation implies that one could have discovered the structure of biological neural networks not by looking at how the actual neurons are organized in our brains, but rather by analyzing the correlation structure of the language output of these brains. From such an analysis we could have already concluded, as the only logical option, that something like neurons must somehow be interconnected in a way compatible with the renormalization of information. As we know, this is in fact the case, and it is precisely what artificial neural networks try to mimic. The point is that one could have hinted at this biological structure without having to open any brain!

B. Universal and non-universal properties

Given the renormalization structure and the properties of TN language models, one can predict universal features, i.e., properties that should be the same no matter the language, and which only depend on the correlation structure of syntax [69]. In this paper we have already found some such universal properties, as discussed in Sec. IV: the polynomial decay of correlation functions and mutual information is universal. However, the characteristic time scale τ for this decay is language-dependent and should depend on external factors, such as cultural heritage.

We wish to remark that universal properties of language had already been observed by analyzing linguistic information with the tools of complex networks [41]. This is the field of physics and mathematics that analyzes complex systems and their structure from the network perspective (examples are ubiquitous: the internet, the power grid of a country, financial networks, the synaptic network in the brain...). In this setting, the so-called linguistic networks allow for a study of the properties of syntax from a pure network-theory perspective. In order to avoid confusion, we stress that our approach here is radically different, since we start from a very different physical perspective: renormalization, and how it orchestrates correlations. This led to a TN picture of language models which is different from, but complementary to, the one obtained using complex-network theory.

C. Optimality of language

Several of the properties from the previous section seem to be related to the conjectured “perfection and economy” of human language in the Minimalist Program, as well as to the conjectured efficient processing of linguistic information in the brain [7, 21]. Let us take for concreteness the language models that we analyzed before. The fact that the TN structures are mostly loop-free automatically implies that the retrieval of information can be done efficiently in terms of all computational resources (a problem in the complexity class P). Such computational efficiency strongly depends on the quasi-loop-free renormalization structure of syntax trees, and is therefore generically valid, i.e., not just for the case of language models. In fact, loop-free structures are well known to be the cheapest non-trivial class of correlation structures in terms of the manipulation of their information [23]. The surprising fact is that human language is even more efficient than this, because of the properties of MERGE. In particular, we saw that the uniqueness of the output of MERGE once the input is specified implies diagonal tensors and thus a correlated factorization in the TN, which leads to a dramatic efficiency in the calculation of probabilities for TN language models. It looks, therefore, as if human language chose the cheapest possible option able to keep non-trivial correlations between information units. Our brain could have evolved to use a MERGE where the output is non-unique for a given input and still maintain a large part of the computational efficiency in the manipulation of information, but this just did not happen. This observation makes precise the common-lore statement that language is, indeed, the cheapest non-trivial computational system. This may be one of the reasons why our brains chose to work with such correlation structures instead of different ones. And we manage to externalize it through a physiological interface pretty well: we communicate (on average and most of the time) via sequential sounds in time produced with one mouth, instead of producing correlated sounds with, say, each one of our fingers, which would amount to 20 mutually correlated outputs, and thus a syntax full of correlation loops, in turn implying computational inefficiency in the processing of its information.

D. Non-Markovian memory environment

A coarse-graining is a process that finds effective degrees of freedom to describe an emergent object, and inherently involves an information loss when moving from one scale to the next. It is well known in physics that renormalization is, usually, irreversible (the so-called “irreversibility of RG flows”) [42]. In language, however, it is clear that even if syntax manipulates coarse-grained objects at some long time scale, we still know about the information content of the short time scales. That is, our brain seems to organize information according to different time scales, but does not seem to fully erase the information when going from one scale to the next, at least for some period of time. For instance, when we say a sequence of the type [NP [A X] [N Y]] (an adjective X followed by a noun Y), we remember for a while what it actually refers to: “happy cat”, “hot meal”, “interesting paper”, and so on. This indicates that the “discarded” information is not immediately erased, but just put apart for a while in some memory degree of freedom. To put it in physical jargon, one would say that the “memory environment” is non-Markovian, in the sense that there seems to be access to the discarded information for some period of time, should this be needed. Understanding how and why this happens is indeed a relevant question, but a different one from that addressed in this paper.

E. Context-free grammars in other ambits

An interesting observation is that probabilistic context-free grammars (PCFG), though originally developed in linguistics, have recently proven very powerful in the probabilistic modelling of RNA and protein structures. In particular, PCFGs offer a way of determining the secondary structure of RNA, with an accuracy comparable to that obtained by energy minimization methods [43]. Concerning proteins, the situation is more complex, but several achievements have already been reported using PCFG methods [44]. Many of the things that we mentioned previously in this work for the case of language therefore apply as well to the study of RNA and protein sequences. Even though the scenario is very different, the relevant correlation structures that appear in these biological problems happen to be similar to the ones that we described in this work, and therefore the same derivations could be applied to study them. The same is also true for the correlation structures present in programming languages, such as C++, Java, and so on. From a theoretical perspective, programming languages actually apply the rules of some grammar, i.e., rules by which words in a computer code are interpreted into meaningful machine instructions.

Intriguingly, one can also turn the derivation that we presented here around, and consider some TN structures as the natural correlation output of grammars. To be more precise, one could argue that TTNs and MPS can, in general, always be regarded as the output of some set of “generalized” context-free grammar rules where one allows for several possible outputs of a MERGE operation for a given input, with the outputs associated to complex “weights”. As such, this then implies that ground states of gapped 1d local quantum many-body Hamiltonians, which are known to have an MPS structure [18], are (roughly speaking) nothing but generalized grammatical structures. Whether this simple observation has consequences in the (analytical and numerical) study of quantum many-body systems remains a provocative open question.

F. On typical human abilities

Intriguingly, structures similar to the ones presented here for the case of language and grammar have also been found in different but related scenarios. For instance, as we said before, the correlation structure of neural network algorithms (which mimic in part the behavior of neurons in the brain) is, in fact, that of a Tree Tensor Network [45]. Renormalization-like algorithms are also common in the study of image compression, such as those based on wavelets [47], and even on Matrix Product States [48], where the information of a picture gets organized according to different 2d length scales. Matrix Product States have also been used in the context of machine learning [49]. Moreover, it has been argued that the harmonic structure of tonal music may in fact also be a result of the MERGE syntactic operation [50]. As a matter of fact, it is believed that the faculty of language appeared in evolution almost simultaneously with the faculties of mathematics and music, with some people arguing in favour of the three faculties actually being three different manifestations of the same basic ability, which became available to our ancestors due to some genetic mutation throughout evolution [51]. A subtle but key point in this regard, often missed, is that the mathematical faculty itself also looks like a coarse-graining of (mathematical) information. This is in fact a consequence of MERGE being the successor function in mathematics [52]. In order to make this point more explicit, let us directly cite a rather popular paragraph (at least in the linguistics community) from one of the recent works of N. Chomsky [53]:

“Suppose that a language has the simplest possible lexicon: just one lexical item, call it “one”. Application of MERGE to the lexical item yields {one}, call it “two”. Application of MERGE to {one} yields {{one}, one}, call it “three”. And so on. In effect, MERGE applied in this manner yields the successor function. It is straightforward to define addition in terms of MERGE(X, Y), and in familiar ways, the rest of arithmetic. The emergence of the arithmetical capacity has been puzzling ever since Alfred Russel Wallace, the co-founder of modern evolutionary theory, observed that the “gigantic development of the mathematical capacity is wholly unexplained by the theory of natural selection, and must be due to some altogether distinct cause”, if only because it remained unused. It may, then, have been a side product of some other evolved capacity (not Wallace’s conclusion), and it has often been speculated that it may be abstracted from the faculty of language by reducing the latter to its bare minimum. Reduction to a single-membered lexicon is a simple way to yield this consequence.”

Moreover, at an experimental level, neuroscientists have recently discovered what could be the signature of the MERGE operation in neural activity, by analyzing the neural activation of epileptic patients performing several language tasks [54].

Given all this, we take the liberty to hypothesize, somewhat philosophically and because everything seems to point in this direction, that the human abilities of language, mathematics, and probably others, may actually be different manifestations of a single fundamental ability of the human brain, namely, the ability to organize and process information according to different physical scales. To put it simply: one could say that the human brain is, among other things, a biological information-renormalization machine. When it comes to human language, this allows the brain to build a language system of discrete infinity, i.e., a discrete and recursive system able to produce infinitely many outputs.

IX. CONCLUSIONS AND PERSPECTIVES

The observations and results in this paper are highly interdisciplinary. Let us briefly summarize here the main points. We have argued that the linguistic MERGE operation entails renormalization in physics: the information content in, e.g., sequences of words (short time scale) gets renormalized by MERGEs up to sentences (long time scale). We have made this observation concrete for language models, and have found that probabilities of meaningful sentences are naturally given by quasi-loop-free TNs, which in turn organize correlations according to different renormalization time scales. Such language models are naturally related to probabilistic context-free grammars, though not restricted only to them. We have discussed some of the properties of these TN language models: locally-built syntactic correlations at every scale, very high efficiency of information processing because of the correlated factorization of the TN, syntactic correlations, and practical refinement levels. We also proved that long-range correlations in language follow naturally from our approach. Moreover, we proposed how to promote probabilistic language models to probability distributions of quantum states, argued that such quantum states may be useful when it comes to sampling the distribution, showed how they can be built efficiently on a quantum computer, and used their entanglement properties to provide a classical lower bound on the statistical perplexity of finding a set of words in a sentence. We also discussed how this useful formalism may be generalized to other types of grammars, and discussed a number of implications of our observations in several ambits. These concern the legitimacy of language models and neural networks for language processing, universality and optimality of language, some required properties of the memory environment, the potential application of our formalism to RNA and protein sequencing as well as programming languages and quantum many-body systems via context-free grammars, and the overall picture of several human faculties all somehow boiling down to MERGE. In the end, we have taken the liberty to hypothesize that the human brain seems to have a natural fundamental ability to organize information according to different physical scales, from which other faculties may materialize.

Our work opens the possibility of using all the mathematical and physical knowledge about TN states, both classical and quantum, in the theoretical and computational study of language and grammar. This includes a wide variety of applications not just in linguistics, but also in RNA and protein sequencing [43, 44] and the design of computer languages, just to name some well-known examples. In particular, the different ways to quantify correlations and the information content in the network, as well as the associated numerical algorithms [18], should find useful applications in these scenarios. Moreover, the efficient description of probability distributions of relevant grammars by means of quantum states opens the exciting possibility of using quantum computers and quantum simulators to deal with problems in all these ambits. A prominent example is AI, where our results show that quantum information tools can be used to validate, simulate, assess, and improve state-of-the-art language models, and that quantum computers can be used to implement perfect random sampling of language, which is simply impossible with classical technology. This is particularly relevant given the recent big advances in the development of experimental quantum processors.

By digging deeper into linguistic concepts it is indeed possible to take our equivalences further. We do this in Appendix A. All in all, our conjecture that MERGE in linguistics is connected to RG in physics turns out to be extremely fruitful, since many of the key linguistic ideas from the last century fit perfectly with known physical concepts linked to renormalization. We have also seen that, as a consequence, many concepts in computational linguistics also match perfectly with well-known physical concepts. The main equivalences discussed in this paper, including those in the appendix, are summarized in Table I.

Linguistics                      Physics
MERGE                            Coarse-graining
Relabelling                      Rescaling
Derivation                       RG flow
Phase                            RG scale
Phase impenetrability            RG irreversibility
Optimality and efficiency        Loop-free structures
Prob. language model             1d tensor network
N-gram models                    Mean-field theory
Prob. context-free grammar       3-index tensor & MPS/TTN
Dependency grammar               (k > 3)-index tensor & 1d MERA
√(Prob. language model)          Quantum circuit
Perplexity                       Quantum entanglement

TABLE I: Main equivalences and connections between linguistics and physics proposed in this paper. The upper part corresponds to concepts usually discussed in theoretical linguistics, and the lower part to concepts in computational linguistics. 1d means that the “physical” degrees of freedom span one dimension, which in the case of language is time. The square-root symbol in the lower-left entry is a way of saying that the corresponding quantum circuit produces probability amplitudes that are the square roots of the actual probabilities given by the language model.

Only good things can happen by studying language from the perspective of physics [55]. The fields of physics and linguistics have traditionally been very far away from each other. But indeed, linguistics focuses on the study of the laws of language, and physics on the study of the laws of Nature. For a linguist, human language is the universe, and it has deep connections with how our brain processes and manipulates information, as well as with other situations whose correlations are orchestrated by grammar-like rules. From the perspective of physics, it feels just natural to think that classical and quantum information theories should somehow be useful for this purpose. Being able to formalize mathematically some of the most relevant aspects of language and grammar in terms of physical ideas is already an important achievement. We strongly believe that the cross-fertilization of physics and linguistics will become increasingly relevant in the future. Philosophical questions, such as those sometimes encountered in linguistics, usually lead to deep, profound scientific problems, and our work here is no exception to this rule.

Acknowledgments

We acknowledge the Universitat Autonoma de Barcelona, the Max Planck Institute of Quantum Optics, and the University of Mainz, where the ideas in this paper materialized from interdisciplinary discussions between the two authors over the past nine years. Discussions with Ondiz Aizpuru, Gustavo Ariel Schwartz, Gemma de las Cuevas, Laura García-Alvarez, Geza Giezdke, Jose I. Latorre, Enrique Solano, Miles Stoudenmire and Juan Uriagereka are acknowledged. A special acknowledgement to Sergi Quintana, for explaining to us the basics of Chomsky's generative grammar 26 years ago: qui sembra, recull.

Appendix A: More equivalences by digging deeper

Our paper is written with a reader with a background in physics and mathematics in mind. However, the topic itself is strongly interdisciplinary. Because of this, in this appendix we would like to add some extra information useful for the reader with knowledge of theoretical linguistics. In particular, we would like to define a few concepts more precisely in linguistic jargon. Thanks to this, we will see that by digging deeper into the linguistic jargon, more equivalences with physics show up, in turn strengthening our thesis that MERGE in linguistics and RG in physics are deeply linked to each other.

To begin with, the term Universal Grammar (UG) is nothing but a label for the striking difference in cognitive capacity between “us and them”, i.e., humans versus the rest of animal species. UG is thus the research topic of generative grammar in its attempt to understand what it is and how it evolved in our species. Finding a satisfying answer to the latter question may be impossible with the tools we have right now, but any theory of UG seeking to address the former must meet a criterion of evolvability: any properties, mechanisms, etc. attributed to UG should have emerged in what appears to have been a unique and relatively sudden event on the evolutionary timescale [56]. This line of thought presupposes that UG (the genetic encoding of the human linguistic capacity) manifests bona fide traits of perfect design, in the sense that it contains operations and mechanisms that follow from conceptual necessities, efficiency principles or interface demands. In this respect, linguistic expressions (sentences, phrases, words, etc.) are built up by adhering to these principles, and therefore in an optimal fashion. While these notions are intuitively clear, their precise formulations remain vague and controversial.

One of the most important mathematical achievements of generative grammar is the so-called “Chomsky Hierarchy” [57], a classification of formal grammars according to their complexity. As Chomsky showed sixty years ago, human languages manifest both context-free and context-sensitive properties, needed to construct PHRASES and CHAINS respectively, shown in Sentences A1(a,b):

a. John killed John. (A1)

b. John was killed < John >

In Sen. A1(a) (a PHRASE) we have two tokens of the lexical item “John” that participate in phrasal dependencies to yield a compositional interpretation whereby the first John is the agent of a killing event, and the second John is the patient of such an event. What we have in Sen. A1(b) (a CHAIN) is more complex. This time, we do not have two tokens of “John”, but two occurrences of the same lexical item – as if they were one and the same object in two positions at the same time, where the notation < John > means that the word itself is not pronounced at that position, but it is also interpreted there from the logical point of view. This is what is called a CHAIN in linguistics. In languages of the English type, the first (leftmost) occurrence is spelled out, whereas the second (rightmost) is necessary to keep a syntax-semantics homomorphism (that is, to capture the desideratum that a specific interpretation is tied to a specific position). Notice that the same type of object (a CHAIN) is necessary in Sen. A2, where “John” is pronounced to the left of “seems”, although it is interpreted as the patient of “killed”.

John seems to have been killed < John > (A2)

In order to account for these properties, generative grammar has resorted to phrase structure rules (PSR) and transformations. The most articulated version of PSR is known as X-bar Theory, which resorted to different devices that have been subject to revision within minimalism. In particular, Chomsky [10] argued that the basic properties of PSR could be understood by means of a computational operation, dubbed MERGE [10], which captures two empirical properties of human language that are non-negotiable: discrete infinity and displacement. To be able to account for those properties, one must assume an operation that constructs hierarchically structured expressions with displacement. And that is what MERGE does. MERGE applies to two objects X and Y (be these words or bigger units), yielding a new one, K, which is the set containing X and Y, i.e., {X, Y}. If X, Y are distinct (taken directly from the lexicon or independently assembled), K is constructed by what is called EXTERNAL MERGE (EM); if Y is part of X (if Y is contained in X), then we have what is called INTERNAL MERGE (IM). The latter scenario is that of Sentences A1(b) and A2 above, where MERGE turns “John” into a discontinuous object (a CHAIN). For completeness, if the operation is at the beginning of a derivation (e.g., with bare lexical items from a lexicon), it is called FIRST MERGE, and if it operates with partially-derived items (phrases), it is called ELSEWHERE MERGE.

Chomsky [10] takes MERGE to be strictly binary, as it is what is minimally necessary to create hierarchical structure. Generation by MERGE thus entails a restrictive class of recursively defined, binary-branching and discrete-hierarchical structures.

It is also worth mentioning that in X-bar Theory, the label identifies the properties of the entire phrase, at the cost of this being a theory-internal symbol that departs from inclusiveness demands. An alternative to this is a label-free representation (see Fig. 1), where endocentricity (the assumption that all phrases must be headed) is not preserved. This entails that syntactic objects can be exocentric, as seems to be necessary for objects formed by the combination of two phrases, {XP, YP}. Syntactic objects are “endocentric” if they contain an element that can be determined by Minimal Search – typically, a head. Given this logic, {X, YP} is endocentric and {XP, YP} exocentric. Consequently, such a system freely generates objects of different kinds, without stipulating their endocentric nature.

Moreover, MERGE is subject to efficiency and economy conditions. One such condition is inclusiveness, which precludes the introduction of extraneous objects, like the ones that X-bar Theory deployed: traces, bar-levels, projections, etc. Inclusiveness also bars introduction of features that are not present in lexical items.

To further clarify MERGE, we stress that the combination of two objects, X and Y, yields a new one, K, which is the set {X, Y}. Once we have {X, Y}, we may want to merge K and some object W, which can be either internal to K or external to it (see above). In any event, the merger of W cannot change or tamper with {X, Y}, which behaves as a unit. More precisely, subsequent applications of MERGE must yield Eq. A3(a), not A3(b):

a. MERGE(K, W) = {{X, Y}, W} (A3)
b. MERGE(K, W) = {X, W, Y}

The driving force of this work is the fact that MERGE and renormalization seem to play a similar role in various respects. As noted above, MERGE takes two objects, X and Y, to yield a new one, K, thus removing X and Y from the computational workspace (WS). In the simplest scenario, MERGE maps WS = [X, Y] onto WS′ = [{X, Y}], reducing the complexity of WS. Notice that MERGE never extends the WS, at least in terms of cardinality; thus WS = [{X, Y}] and WS′ = [{W, {X, Y}}] are equally big, since they only contain one set. A new element can be added to WS (or WS′) in only one way: by taking two items W, Z from the lexicon and introducing {W, Z} into WS as a new element, yielding WS′′ = [{W, Z}, {X, Y}]. Of course, cardinality can be reduced if we apply EM (EXTERNAL MERGE) and neither of the elements is taken from the lexicon, as when we map WS′′ = [{W, Z}, {X, Y}] onto WS′′′ = [{{W, Z}, {X, Y}}]. This idea is indeed very similar to that of a coarse-graining in physics, in the sense made precise throughout the paper.

Additionally, the possibility that the computational load is reduced by MERGE is perhaps somewhat new, as this typically follows from a principle in linguistics called STRICT CYCLICITY. The notion of cycle (and thus cyclicity) goes back to the fifties, when work in phonology [59] showed that stress-assigning rules apply from innermost to outermost units of a word, putting aside linear order information. More generally, an object is built under cyclic principles if it is COMPOSITIONAL, which means that its interpretation is fixed by the elements it contains and the way in which they are combined. Consider this with Sentences A4, where the interpretation is crucially different (Brutus is an agent in (a), and a patient in (b)), although both examples contain the same three words:

a. Brutus stabbed Caesar. (A4)

b. Caesar stabbed Brutus.

The concept of STRICT CYCLICITY is a stronger version of cyclicity. The key intuition behind it is that, for a certain linguistic object constructed in a derivation (say, a VP), further computation should not modify it. Let us see this with the example in Eq. A5, where the verb “leave” is merged with the NP “the room” to yield the complex VP “leave the room”, which we can call K for ease of reference.

MERGE(leave, {the, room}) = {leave, {the, room}} (A5)

What is of interest here is that the interpretation of K (that is, of “leave the room”) is determined at that stage of the derivation (at that “cycle”), and cannot be changed at subsequent stages (“cycles”). Therefore, if we add “Mary” to obtain “Mary leaves the room” (call it K′), as in Eq. A6, the interpretation of K will be the same in Eq. A5 and in Eq. A6.

MERGE(Mary, K) = {Mary, {leaves, {the, room}}} (A6)

In a nutshell, the interpretation of complex objects is constructed stepwise, and whatever has been done at a stage s cannot be undone at stage s + 1 (Eqs. A5 and A6 above). This, in turn, is quite analogous to the idea of the irreversibility of RG flows in physics, which matches perfectly with our interpretation of MERGE as a coarse-graining of information.

Such stages of a derivation, where a “computation” is done and cannot be altered afterwards, correspond to the so-called linguistic PHASES, and the device responsible for ensuring that the interior of a PHASE is no longer accessible is the PHASE IMPENETRABILITY CONDITION (PIC for short). What has been called a phase roughly corresponds to the notion of cycle described above. Using the physical interpretation that we introduce in this paper, one would say that a PHASE in linguistics is the analogue of an RG scale in physics.

To be more precise, a phase is defined in linguistics as a domain D where uninterpretable features (number and person features of verbs) are valued. When a phase is closed off, the complement domain Ω (which can itself be complex, in the sense of having some inner structure) of the phase head P cannot be modified from the outside; this means, for instance, that the case of an NP within Ω (e.g., “the book” in the VP “read the book”) cannot be changed once the phase headed by P is complete [60]. Among other things, this entails that “the book”, which is the Direct Object of “read” in Sentence A7 (it receives accusative case from “read”), cannot also be the Direct Object of the matrix verb “believe”:

I believe that John read the book. (A7)

That “the book” is the Direct Object of “read” and not of “believe” is shown in Sentences A8, where we see that this NP can be passivized in the embedded clause, but not in the matrix clause (∗ signals ungrammaticality):

a. I believe that the book was read. (A8)

b. ∗The book was believed that John read.

This “shielding” effect that makes the VP impenetrable is captured by the phase impenetrability condition mentioned above. Physically, this is the irreversibility of the RG flow when moving from one RG scale to the next. There are various approaches to Phase Theory [58], but all of them share the key intuition that PHASES are domains where complexity is reduced by somehow allowing the system to “forget” about an amount of structure that has been created and which will no longer be accessible. This process of “forgetting” is, in fact, analogous to the process of “discarding irrelevant degrees of freedom” in an RG step in physics.

Moreover, the “rescaling” step in RG has not been discussed in this paper, but it also appears naturally when particularizing to specific models of language. For instance, in the Matrix Syntax model [40] this rescaling appears naturally in order to recover the correct linguistic labels after a MERGE operation (see Ref. [40] and the discussions therein for more information). We believe that this is a general feature: the “rescaling” in physics is nothing but the “mathematical relabelling” that one needs in order to recover the correct labels (NP, VP, etc.) after a MERGE operation when dealing in practice with models of language.

[1] See, e.g., https://en.wikipedia.org/wiki/Linguistics
[2] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (3rd ed.), Upper Saddle River, New Jersey: Prentice Hall (2009).
[3] N. Chomsky, A. J. Gallego, and D. Ott, Generative Grammar and the Faculty of Language: Insights, Questions, and Challenges. Ms., MIT / UAB / UOttawa (2017). Available at http://ling.auf.net/lingbuzz/003507
[4] R. Descartes, Discours de la methode, 1662.
[5] N. Chomsky, The Language Capacity: Architecture and Evolution. Psychonomic Bulletin and Review 24: 200-203 (2017).
[6] M. D. Hauser, N. Chomsky and W. T. Fitch, The Faculty of Language: What is It, Who Has It, and How Did It Evolve?, Science 298: 1569-1579 (2002); S. R. Anderson, Doctor Dolittle’s Delusion. Animals and the Uniqueness of Human Language. New Haven, CT: Yale University Press (2004); N. Chomsky, Some Simple Evo-devo Theses: How True Might They Be for Language?, in R. K. Larson, V. Deprez, and H. Yamakido (eds.), The Evolution of Human Language: Biolinguistic Perspectives, 45-62. Cambridge: Cambridge University Press (2012).
[7] N. Chomsky, A minimalist program for linguistic theory, MIT occasional papers in linguistics no. 1. Cambridge, MA: Distributed by MIT Working Papers in Linguistics, 1993.
[8] N. Chomsky, Three factors in language design. Linguistic Inquiry 36: 1-22 (2005).
[9] D. W. Thompson, On Growth and Form, Cambridge University Press (1917); A. M. Turing, Philosophical Transactions of the Royal Society B, 237 (642): 37-42 (1952).
[10] N. Chomsky, Bare Phrase Structure, in Evolution and Revolution in Linguistic Theory: Essays in honor of Carlos Otero, eds. Hector Campos and Paula Kempchinsky, 51-109, 1995.
[11] See, e.g., https://en.wikipedia.org/wiki/Emergentism
[12] P. W. Anderson, Science, New Series, Vol. 177, No. 4047, 393-396 (1972).
[13] There are plenty of books and introductory articles on renormalization in physics. Some good original sources, though, are L. Kadanoff, Physics 2, 263 (1966); K. G. Wilson, Rev. Mod. Phys. 47, 4, 773 (1975); K. G. Wilson, Sci. Am. 241, 140-157 (1979); also K. G. Wilson’s Nobel Prize lecture from 1982, available at www.nobelprize.org.

[14] See, e.g., R. Shankar, Rev. Mod. Phys. 66, 129 (1994); S. R. White, Phys. Rev. Lett. 69, 2863 (1992).
[15] See, e.g., S. Weinberg, The Quantum Theory of Fields (3 volumes), Cambridge University Press (1995).
[16] F. Verstraete et al., Phys. Rev. Lett. 94, 140601 (2005); G. Vidal, Phys. Rev. Lett. 99, 220405 (2007).
[17] See, e.g., https://en.wikipedia.org/wiki/Language_model
[18] F. Verstraete, J. I. Cirac, and V. Murg, Adv. Phys. 57, 143 (2008); J. I. Cirac and F. Verstraete, J. Phys. A: Math. Theor. 42, 504004 (2009); J. Eisert, Modeling and Simulation 3, 520 (2013); N. Schuch, QIP, Lecture Notes of the 44th IFF Spring School (2013); R. Orus, Eur. Phys. J. B 87, 280 (2014); R. Orus, Ann. Phys.-New York 349, 117-158 (2014).
[19] N. Chomsky, Problems of Projection. Lingua 130: 33-49 (2013).
[20] M. Nielsen and I. Chuang, Quantum Computation and Quantum Information, Cambridge University Press, New York (2000).
[21] B. Sengupta and M. N. Stemmler, Proceedings of the IEEE, Vol. 102, No. 5 (2014).
[22] N. A. Smith, M. Johnson, Computational Linguistics 33 (4): 477 (2007).
[23] Y. Shi, L. Duan and G. Vidal, Phys. Rev. A 74, 022320 (2006); L. Tagliacozzo, G. Evenbly and G. Vidal, Phys. Rev. B 80, 235127 (2009); V. Murg et al., Phys. Rev. B 82, 205105 (2010); M. Gerster et al., Phys. Rev. B 90, 125154 (2014).
[24] M. Fannes, B. Nachtergaele, R. F. Werner, Commun. Math. Phys. 144, 443-490 (1992); A. Klumper, A. Schadschneider, J. Zittartz, J. Phys. A 24, L955 (1991); A. Klumper, A. Schadschneider, J. Zittartz, Europhys. Lett. 24, 293 (1993); U. Schollwock, Ann. Phys. 326, 96 (2011).
[25] I. V. Oseledets, SIAM J. Sci. Comput. 33 (5), 2295-2317 (2011).

[26] N. Chomsky, Syntactic Structures. The Hague/Paris: Mouton (1957).
[27] See, e.g., H. Liu, Dependency Grammar: from Theory to Practice. Beijing: Science Press (2009).
[28] See https://en.wikipedia.org/wiki/N-gram
[29] C. H. Papadimitriou, Computational Complexity, Addison Wesley (1994).
[30] E. Alvarez-Lacalle, B. Dorow, J.-P. Eckmann, E. Moses, PNAS 103 (21): 7956-7961 (2006); E. G. Altmann, G. Cristadoro, M. D. Esposti, PNAS 109 (29): 11582-11587 (2012); H. W. Lin, M. Tegmark, Entropy 19, 299 (2017).
[31] N. Schuch et al., Phys. Rev. Lett. 98, 140506 (2007).
[32] K. Temme, F. Verstraete, Phys. Rev. Lett. 104, 210502 (2010); G. De las Cuevas et al., New J. Phys. 15, 123021 (2013).
[33] M. M. Wolf, F. Verstraete, M. B. Hastings, J. I. Cirac, Phys. Rev. Lett. 100, 070502 (2008).
[34] See, e.g., https://en.wikipedia.org/wiki/Language_isolate
[35] See, e.g., https://en.wikipedia.org/wiki/QR_decomposition
[36] C. Holzhey, F. Larsen, and F. Wilczek, Nucl. Phys. B 424, 443 (1994); G. Vidal et al., Phys. Rev. Lett. 90, 227902 (2003); J. I. Latorre, E. Rico, and G. Vidal, Quantum Inf. Comput. 4, 48 (2004); J. Eisert and M. Cramer, Phys. Rev. A 72, 042112 (2005); R. Orus et al., Phys. Rev. A 73, 060303(R) (2006).
[37] See, e.g., R. Bhatia, Matrix Analysis, Springer-Verlag, New York (1997); M. Nielsen and G. Vidal, QIC Vol. 1, No. 1, 76-93 (2001).
[38] L. P. Kadanoff, J. Stat. Phys. 137: 777 (2009).
[39] G. Sidorov et al., Syntactic Dependency-based n-grams as Classification Features, LNAI 7630: 1-11 (2012).
[40] R. Orus, R. Martin and J. Uriagereka, arXiv:1710.00372; R. Martin, R. Orus and J. Uriagereka, to appear in the conference proceedings of Generative Syntax: Questions, Crossroads and Challenges, edited by UAB.

[41] R. Ferrer i Cancho and R. V. Sole, Proc. R. Soc. Lond. B 268, 2261-2265 (2001).
[42] A. B. Zamolodchikov, JETP Lett. 43: 730-732 (1996); J. I. Latorre et al., Phys. Rev. A 71, 034301 (2005); R. Orus, Phys. Rev. A 71, 052327 (2005).

[43] S. R. Eddy and R. Durbin, Nucleic Acids Research 22 (11): 2079-2088 (1994); Y. Sakakibara et al., Nucleic Acids Research 22 (23): 5112-5120 (1994); R. Durbin, S. Eddy, A. Krogh and G. Mitchison, eds., Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press (1998).
[44] D. Searls, Biopolymers 99 (3): 203-217 (2013); A. Krogh et al., J. Mol. Biol. 235: 1501-1531 (1994); C. Sigrist et al., Brief Bioinform. 3 (3): 265-274 (2002); W. Dyrka and J.-C. Nebel, BMC Bioinformatics 10: 323 (2009); W. Dyrka, J.-C. Nebel and M. Kotulska, Algorithms for Molecular Biology 8: 31 (2013).
[45] Y. Levine et al., arXiv:1704.01552.
[46] H. W. Lin, M. Tegmark, D. Rolnick, J. Stat. Phys. 168 (6), 1223-1247 (2017).
[47] A. Graps, IEEE Computational Science and Engineering, Volume 2, Issue 2, 50-61 (1995).
[48] J. I. Latorre, arXiv:quant-ph/0510031.
[49] E. M. Stoudenmire and D. J. Schwab, Advances in Neural Information Processing Systems 29, 4799 (2016).
[50] J. Katz, D. Pesetsky, http://ling.auf.net/lingBuzz/000959
[51] J. K. Alcock et al., Brain Lang. 75, 34-46 (2000); C. S. L. Lai et al., Nature 413, 519-523 (2001); I. Peretz, Psychol. Belg. 49, 157-175 (2009); K. Rimfeld et al., Scientific Reports 5, 11713 (2015).
[52] See, e.g., https://en.wikipedia.org/wiki/Successor_function, and also Paul R. Halmos, Naive Set Theory, Van Nostrand (1968).
[53] N. Chomsky, On Phases, MIT Press (2008).
[54] M. J. Nelson et al., PNAS Vol. 114, No. 18 (2017).
[55] For other examples in different contexts see, e.g., M. Piattelli-Palmarini, G. Vitiello, arXiv:1506.08663; M. Piattelli-Palmarini, G. Vitiello, Journal of Physics: Conf. Series 880, 012016 (2017); R. Sole, Phil. Trans. R. Soc. B 371: 20150438 (2016); R. Orus, R. Martin, J. Uriagereka, to appear soon.
[56] J. Bolhuis, I. Tattersall, N. Chomsky, and R. C. Berwick, How Could Language Have Evolved? PLoS Biology 12: e1001934 (2014); R. C. Berwick and N. Chomsky, Why Only Us, Cambridge, MA: MIT Press (2016).
[57] N. Chomsky, Three models for the description of language, IRE Transactions on Information Theory 2: 113-124 (1956).
[58] A. J. Gallego (ed.), Phases. Developing the Framework. Berlin: De Gruyter (2012).
[59] N. Chomsky, M. Halle, F. Lukoff, On Accent and Juncture in English. In For Roman Jakobson: Essays on the occasion of his sixtieth birthday, M. Halle et al. (eds.), 65-80. The Hague: Mouton and Co. (1956); N. Chomsky, M. Halle, The Sound Pattern of English. New York: Harper & Row (1968).
[60] N. Chomsky, Problems of Projection. Lingua 130: 33-49 (2013); N. Chomsky, Problems of Projection: Extensions. In E. di Domenico et al. (eds.), Structures, Strategies and Beyond, 1-16. Amsterdam: John Benjamins (2015).

[61] A clarification is in order: here we understand “renormalization” as the tool that allows the mathematical description of the information in a system at different physical scales, accounted for by relevant degrees of freedom at every scale. Of course, the implementation of this idea in several contexts leads to different consequences. Well-known examples in physics are the existence of critical systems, critical exponents, universality classes, phase transitions, the c-theorem, the reshuffling of Hilbert spaces, the β-function, fixed points, RG flows, scaling laws, relevant / irrelevant / marginal perturbations... the list is unending. In our case, however, we do not necessarily assume the existence of any of these in the case of language (though some of them may also be there), and adopt instead the most fundamental and general perspective of what “renormalization” means at its very basic core at the level of information.
[62] To be defined in Eq. (17).
[63] Other operations can be accounted for by introducing extra links in the graphical representation, as we shall explain, but the renormalization picture still holds.
[64] Notice that the converse is not true.
[65] Unlike some TNs with loops, which are #P-hard and therefore need exponential time to be contracted [31].
[66] To be precise, this is correct on average, since depending on the tree it is possible to choose specific pairs of points with longer separation [23].
[67] This is in fact a very efficient procedure to compute the overall probability of a given tree in a language model.
[68] A recent attempt in this direction, also related to quantum physics and linear algebra, is the Matrix Syntax model in Ref. [40].
[69] In the words of N. Chomsky, “there is only one human language” (private communication).