Object-oriented Neural Programming (OONP)

for Document Understanding

Zhengdong Lu 1, Haotian Cui 2*, Xianggen Liu 2*, Yukun Yan 2*, Daqi Zheng 1

1 DeeplyCurious.ai  {luz,da}@deeplycurious.ai

2 Department of Bio-medical Engineering, Tsinghua University  {cht15, liuxg16, yanyk13}@mails.tsinghua.edu.cn

Abstract

We propose Object-oriented Neural Programming (OONP), a framework for semantically parsing documents in specific domains. Basically, OONP reads a document and parses it into a predesigned object-oriented data structure (referred to as ontology in this paper) that reflects the domain-specific semantics of the document. An OONP parser models semantic parsing as a decision process: a neural net-based Reader sequentially goes through the document, and during the process it builds and updates an intermediate ontology to summarize its partial understanding of the text it covers. OONP supports a rich family of operations (both symbolic and differentiable) for composing the ontology, and a big variety of forms (both symbolic and differentiable) for representing the state and the document. An OONP parser can be trained with supervision of different forms and strength, including supervised learning (SL), reinforcement learning (RL) and a hybrid of the two. Our experiments on both synthetic and real-world document parsing tasks have shown that OONP can learn to handle fairly complicated ontologies with training data of modest sizes.

1 Introduction

Mapping a document into a structured “machine readable” form is a canonical and probably the most effective way for document understanding. There have been quite a few recent efforts on designing neural net-based learning machines for this purpose, which can be roughly categorized into two groups: 1) sequence-to-sequence models with the neural net as the black box [Dong and Lapata, 2016, Liang et al., 2017], and 2) neural nets as components in a pre-designed statistical model [Zeng et al., 2014]. We however argue that both approaches have their own serious problems and cannot be used on documents with relatively complicated structures. Towards solving this problem, we propose Object-oriented Neural Programming (OONP), a framework for semantically parsing in-domain documents. OONP is neural net-based, but it also has sophisticated architecture and mechanisms designed for taking and outputting discrete structures, hence nicely combining symbolism (for interpretability and formal reasoning) and connectionism (for flexibility and learnability). This ability, as we argue in this paper, is critical to document understanding.

OONP seeks to map a document to a graph structure with each node being an object, as illustrated in Figure 1. We borrow the name from Object-oriented Programming [Mitchell, 2003]

* This work was done while the authors were interns at DeeplyCurious.ai.

to emphasize the central position of “objects” in our parsing model: indeed, the representation of objects in OONP allows neural and symbolic reasoning over complex structures, and hence makes it possible to represent much richer semantics. Similar to Object-oriented Programming, OONP has the concepts of “class” and “object”, with the following analogy: 1) each class defines the types and organization of the information it contains, and we can define inheritance for classes with different abstraction levels as needed; 2) each object is an instance of a certain class, encapsulating a number of properties and operations; 3) objects can be connected with relations (called links) of pre-determined types. Based on objects, we can define the ontology and operations that reflect the intrinsic structure of the parsing task.

For parsing, OONP reads a document and parses it into this object-oriented data structure through a series of discrete actions along reading the document sequentially. OONP supports a rich family of operations for composing the ontology, and flexible hybrid forms for knowledge representation. An OONP parser can be trained with supervised learning (SL), reinforcement learning (RL) and a hybrid of the two. Our experiments on one synthetic dataset and two real-world datasets have shown the efficacy of OONP on document understanding tasks with a variety of characteristics.

Figure 1: Illustration of OONP on a parsing task.

In addition to the work on semantic parsing mentioned above, OONP is also related to multiple threads of work in natural language processing and machine learning. It is inspired by [Daume III et al., 2009] on modeling parsing as a decision process, and also by state-tracking models in dialogue systems [Henderson et al., 2014] for the mixture of symbolic and probabilistic representations of dialogue state. OONP is also related to [Johnson, 2017] for modeling the transition of symbolic states, and to [Henaff et al., 2016] for having explicit (although not thorough) modeling of entities. OONP is also obviously related to the recent work on neural-symbolism [Mou et al., 2017, Liang et al., 2017].

1.1 Overview of OONP

An OONP parser (as illustrated through the diagram in Figure 2) consists of a Reader equipped with read/write heads, Inline Memory that represents the document, and Carry-on Memory that summarizes the current understanding of the document at each time step. For each document to parse, OONP first preprocesses it and puts it into the Inline Memory, and then Reader controls the read-heads to sequentially go through the Inline Memory (possibly multiple times, see Section 6.3 for an example) and at the same time update the Carry-on Memory.

Figure 2: The overall diagram of OONP, where S stands for symbolic representation, D stands for distributed representation, and S+D stands for a hybrid representation with both symbolic and distributed parts.

The major components of OONP are described in the following:

• Memory: we have two types of memory, Carry-on Memory and Inline Memory. Carry-on Memory is designed to save the state∗ of the decision process and summarize the current understanding of the document based on the text that has been “read”. Carry-on Memory has three compartments:

– Object Memory: denoted $M_{obj}$, the object-based ontology constructed during the parsing process; see Section 2.1 for details;

– Matrix Memory: denoted $M_{mat}$, a matrix-type memory with fixed size, for differentiable read/write by the controlling neural net [Graves et al., 2014]. In the simplest case, it could be just a vector serving as the hidden state of a conventional Recurrent Neural Network (RNN);

– Action History: denoted $M_{act}$, saving the entire history of actions made during the parsing process.

Intuitively, $M_{obj}$ stores the extracted knowledge with defined structure and strong evidence, while $M_{mat}$ keeps the knowledge that is fuzzy, uncertain or incomplete, waiting for future information to confirm, complete and clarify. Inline Memory, denoted $M_{inl}$, is designed to save location-specific information about the document. In a sense, the information in Inline Memory is low-level and unstructured, waiting for Reader to fuse and integrate it into a more structured representation (see the data-structure sketch after this list).

• Reader: Reader is the control center of OONP, coordinating and managing all the operations of OONP. More specifically, it takes input in different forms (reading), processes it (thinking), and updates the memory (writing). As shown in Figure 3, Reader contains a Neural Net Controller (NNC) and multiple symbolic processors, and the Neural Net Controller has Policy-net as its sub-component. Similar to the controller in a Neural Turing Machine [Graves et al., 2014], the Neural Net Controller is equipped with multiple read-heads and write-heads for differentiable read/write over Matrix Memory and (the distributed part of) Inline Memory, with possibly a variety of addressing strategies [Graves et al., 2014]. Policy-net, however, issues discrete outputs (i.e., actions), which gradually build and update the Object Memory over time (see Section 2.1 for more details). The actions can also update the symbolic part of Inline Memory if needed. The symbolic processors are designed to handle information in symbolic form from Object Memory, Inline Memory, Action History, and Policy-net, while that from Inline Memory and Action History is eventually generated by Policy-net.

∗This is not entirely accurate: since the Inline Memory can be modified during the reading process, it also records some of the state information.
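To make the memory layout concrete, here is a minimal Python sketch of how the components above could be organized as data structures. All names (InlineCell, CarryOnMemory) and the 128-dimensional matrix size are our own illustrative assumptions, not an API from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

import numpy as np


@dataclass
class InlineCell:
    """One cell of Inline Memory: a language unit with hybrid content."""
    token: str                                   # the language unit (e.g., a word)
    embedding: Optional[np.ndarray] = None       # distributed part, from preprocessing
    tags: Dict[str, str] = field(default_factory=dict)  # symbolic part (NER tags, notes, ...)


@dataclass
class CarryOnMemory:
    """State of the decision process: ontology + fuzzy memory + action trace."""
    objects: List[Any] = field(default_factory=list)   # M_obj: the object-based ontology
    matrix: np.ndarray = field(                        # M_mat: fixed-size differentiable memory
        default_factory=lambda: np.zeros(128))         # (here just a vector, as in a plain RNN)
    action_history: List[str] = field(default_factory=list)  # M_act: every action taken so far
```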

Figure 3: The overall diagram of OONP.

We can show how the major components of OONP collaborate to make it work through the following sketchy example, based on the text in Figure 1. Suppose OONP has reached the underlined word “BMW” in Inline Memory. At this moment, OONP has two objects (I01 and I02) in Object Memory, for the Audi-06 and the BMW respectively. Reader determines that the information it is currently holding is about I02 (after comparing it with both objects) and updates its Status property to sold, along with other updates on both Matrix Memory and Action History.

OONP in a nutshell: The key properties of OONP can be summarized as follows

1. OONP models parsing as a decision process: as the “reading and comprehension” agent goes through the text, it gradually forms the ontology as the representation of the text through its actions;

2. OONP uses a symbolic memory with graph structure as part of the state of the parsing process. This memory will be created and updated through the sequential actions of the decision process, and will be used as the semantic representation of the text at the end;

3. OONP can blend supervised learning (SL) and reinforcement learning (RL) in tuning its parameters, to suit supervision signals of different forms and strength;

4. OONP allows different ways to add symbolic knowledge into the raw representation of the text (Inline Memory) and into its Policy-net in forming the final structured representation of the text.

Roadmap of the paper: The rest of the paper is organized as follows. We will elaborate on the components of OONP in Section 2 and the actions of OONP in Section 3. After that, we will give a detailed analysis of the neural-symbolism in OONP in Section 4. Then in Section 5 we will discuss learning for OONP, which is followed by experiments on three datasets in Section 6. Finally, we conclude the paper in Section 7.

2 OONP: Components

In this section we will discuss the major components of OONP, namely Object Memory, Inline Memory and Reader. We omit the discussion of Matrix Memory and Action History since they are straightforward given the description in Section 1.1.

2.1 Object Memory

Object Memory stores an object-oriented representation of the document, as illustrated in Figure 4. Each object is an instance of a particular class†, which specifies the internal structure of the object, including internal properties, operations, and how this object can be connected with others. The internal properties can be of different types, for example string or category, which usually correspond to different actions in composing them: a string-type property is usually “copied” from the original text in Inline Memory, while a category property usually needs to be rendered by a classifier. The links are by nature bi-directional, meaning that they can be added from both ends (e.g., in the experiment in Section 6.1), but for modeling convenience we may choose to make them one-directional (e.g., in the experiments in Sections 6.2 and 6.3). In Figure 4, there are six “linked” objects of three classes (namely, Person, Event, and Item). Taking the Item-object I02 for example, it has five internal properties (Type, Model, Color, Value, Status), and is linked with two Event-objects through a stolen and a disposed link respectively.

In addition to the symbolic part, each object also has its own distributed representation (named object-embedding), which serves as its interface with other distributed representations in Reader (e.g., those from the Matrix Memory or the distributed part of Inline Memory). For simplicity of description, we will refer to the symbolic part of this hybrid representation of objects as the ontology, with some slight abuse of the word. The object-embedding serves as a dual representation to the symbolic part of an object, recording all the relevant information associated with it that is not represented in the ontology, e.g., the context of the text when the object is created.

The representations in Object Memory, including the ontology and object-embeddings, will be updated in time by the operations defined for the corresponding classes. Usually, the actions are the driving force of those operations, which not only initiate and grow the ontology, but also coordinate other differentiable operations. For example, the object-embedding associated with a certain object changes with any non-trivial action concerning this object, e.g., any update on the internal properties or the external links, or even a mention (corresponding to an Assign action described in Section 3) without any update.

†In this paper, we limit ourselves to a flat structure of classes, but it is possible and even beneficial to have a hierarchy of classes. In other words, we can have classes with different levels of abstractness, and allow an object to go from an abstract class to its child class during the parsing process, as more and more information is obtained.
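To illustrate classes, objects, properties, and typed links in code, below is a hedged Python sketch of the ontology fragment around the Item-object I02 described above; the Person/Event/Item class names and the I02 property names come from the paper, while the data-structure design, the event identifiers, and the placeholder values are our own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OONPObject:
    """An instance of a class: internal properties plus typed links to other objects."""
    cls: str                                     # class name: "Person", "Event", or "Item"
    obj_id: str                                  # identifier, e.g. "I02"
    properties: Dict[str, str] = field(default_factory=dict)
    links: List[Tuple[str, str]] = field(default_factory=list)  # (link type, target obj_id)


# Item-object I02: five internal properties and two event links, as in Figure 4.
# All property values except Model are placeholders.
i02 = OONPObject(
    cls="Item",
    obj_id="I02",
    properties={"Type": "", "Model": "BMW", "Color": "", "Value": "", "Status": ""},
)
i02.links.append(("stolen", "E01"))      # linked to one Event-object (hypothetical id)
i02.links.append(("disposed", "E02"))    # and to another
```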

Figure 4: An example of the objects from three classes.

According to the way the ontology evolves with time, the parsing task can be roughly classified into two categories:

• Stationary: there is a final ground truth that does not change with time. So with any partial history of the text, the corresponding ontology is always part of the final one, while the missing part is due to the lack of information. See the tasks in Sections 6.2 and 6.3 for examples.

• Dynamical: the truth changes with time, so the ontology corresponding to a partial history of the text may be different from that of the final state. See the task in Section 6.1 for an example.

It is important to notice that this categorization depends not only on the text but also heavily on the definition of the ontology. Taking the text in Figure 1 for example: if we define an ownership relation between a Person-object and an Item-object, the ontology becomes dynamical, since ownership of the BMW changed from Tom to John.

2.2 Inline Memory

Inline Memory stores the relatively raw representation of the document that follows the temporal structure of the text, as illustrated through Figure 2. Basically, Inline Memory is an array of memory cells, each corresponding to a pre-defined language unit (e.g., a word) in the same order as they appear in the original text. Each cell can have a distributed part and a symbolic part, designed to save 1) the results of preprocessing the text with different models, and 2) certain outputs from Reader, for example from previous reading rounds. Following are a few examples of preprocessing:

• Word embedding: context-independent vectorial representations of words.

• Hidden states of NNs: we can put the context into the local representation of words through gated RNNs like LSTM [Greff et al., 2015] or GRU [Cho et al., 2014], or particular designs of convolutional neural nets (CNNs) [Yu and Koltun, 2015].

• Symbolic preprocessing: this refers to a big family of methods that yield symbolic results, including various sequential labeling models and rule-based methods. As a result we may have tags on words, extracted sub-sequences, or even relations between two pieces of text.

During the parsing process, Reader can write to Inline Memory with its discrete or continuous outputs, a process we call “notes-taking”. When the output is continuous, the notes-taking process is similar to the interactive attention in machine translation [Meng et al., 2016], and is performed by an NTM-style write-head [Graves et al., 2014] on the Neural Net Controller. When the output is discrete, the notes-taking is essentially an action issued by Policy-net.
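A minimal sketch of the two flavors of notes-taking, under the InlineCell layout assumed in the earlier sketch: the discrete version writes a symbolic tag issued by Policy-net, while the continuous version adds a write-head update to the distributed part.

```python
def take_note_symbolic(inline_memory, position, key, value):
    """Discrete notes-taking: an action from Policy-net writes a symbolic tag."""
    inline_memory[position].tags[key] = value


def take_note_distributed(inline_memory, position, delta):
    """Continuous notes-taking: an NTM-style write-head updates the embedding."""
    inline_memory[position].embedding = inline_memory[position].embedding + delta


# e.g., mark that the word at position 17 was put into event segment 2:
# take_note_symbolic(m_inl, 17, "event_segment", "event-2")
```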

Inline Memory provides a way to represent locally encoded “low-level” knowledge of the text, which will be read, evaluated and combined with the global semantic representation in Carry-on Memory by Reader. One particular advantage of this setting is that it allows us to incorporate the local decisions of some other models, including “higher-order” ones like local relations across two language units, as illustrated in the left panel of Figure 5. We can also have a rather “nonlinear” representation of the document in Inline Memory. As a particular example [Yan et al., 2017], at each location we can have the representation of the current word, the representation of the rest of the sentence, and the representation of the rest of the current paragraph, which enables Reader to see information of history and future at different scales, as illustrated in the right panel of Figure 5.

Figure 5: Left panel: Inline Memory with symbolic knowledge; right panel: one choice of nonlinear representation of the distributed part of Inline Memory used in [Yan et al., 2017].

2.3 Reader

Reader is the control center of OONP, which manages all the (continuous and discrete) operations in the OONP parsing process. Reader has three symbolic processors (namely, Symbolic Matching, Symbolic Reasoner, and Symbolic Analyzer) and a Neural Net Controller (with Policy-net as its sub-component). All the components in Reader are coupled through intensive exchange of information, as shown in Figure 6. Below is a snapshot of the information processing at time $t$ in Reader:

• STEP-1: let the Symbolic Analyzer check the Action History ($M^t_{act}$) to construct some symbolic features for the trajectory of actions;

• STEP-2: access Matrix Memory ($M^t_{mat}$) to get a vectorial representation for time $t$, denoted $s_t$;

• STEP-3: access Inline Memory ($M^t_{inl}$) to get the symbolic representation $x^{(s)}_t$ (through location-based addressing) and the distributed representation $x^{(d)}_t$ (through location-based and/or content-based addressing);

• STEP-4: feed $x^{(d)}_t$ and the embedding of $x^{(s)}_t$ to the Neural Net Controller to fuse with $s_t$;

• STEP-5: get the candidate objects (some may have been eliminated by $x^{(s)}_t$) and let them meet $x^{(d)}_t$ through the Symbolic Matching processor, for matching on the symbolic aspect;

• STEP-6: get the candidate objects (some may have been eliminated by $x^{(s)}_t$) and let them meet the result of STEP-4 in the Neural Net Controller;

• STEP-7: Policy-net combines the results of STEP-5 and STEP-6 to issue actions;

• STEP-8: update $M^t_{obj}$, $M^t_{mat}$ and $M^t_{inl}$ with the actions, on both symbolic and distributed representations;

• STEP-9: put $M^t_{obj}$ through the Symbolic Reasoner for some high-level reasoning and logic consistency.

Note that we consider only a single action per time step for simplicity, while in practice it is common to have multiple actions at one time step, which requires a slightly more complicated design of the policy as well as of the processing pipeline; one time step of this pipeline is sketched in code after Figure 6.

Figure 6: A closer look at a particular implementation of Reader, which reveals some details about the entanglement of neural and symbolic components. Dashed lines stand for continuous signals, and solid lines for discrete signals.
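To make the nine steps concrete, here is a hedged Python sketch of one Reader time step; every interface used here (the memory methods, controller.fuse, the embed function, etc.) is a placeholder for the corresponding component, not an API defined by the paper.

```python
def reader_step(t, m_obj, m_mat, m_inl, m_act, controller, policy_net, embed,
                symbolic_analyzer, symbolic_matching, symbolic_reasoner):
    """One time step of Reader, following STEP-1 .. STEP-9 (illustrative only)."""
    hist_feats = symbolic_analyzer(m_act)                 # STEP-1: features from the action trace
    s_t = m_mat.read()                                    # STEP-2: vectorial state at time t
    x_sym, x_dist = m_inl.read(t)                         # STEP-3: symbolic + distributed content
    fused = controller.fuse(s_t, x_dist, embed(x_sym))    # STEP-4: fuse in Neural Net Controller
    candidates = m_obj.candidates(filter_by=x_sym)        # STEP-5/6: surviving candidate objects,
    sym_match = symbolic_matching(candidates, x_dist)     #   matched on the symbolic aspect
    neural_match = controller.match(candidates, fused)    #   and on the distributed aspect
    actions = policy_net(fused, sym_match, neural_match, hist_feats)  # STEP-7: issue actions
    for a in actions:                                     # STEP-8: apply actions to the memories
        a.apply(m_obj, m_mat, m_inl)
        m_act.append(a)
    symbolic_reasoner(m_obj)                              # STEP-9: logic consistency on M_obj
    return actions
```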

3 OONP: Actions

The actions issued by Policy-net can be generally categorized as follows:

• New-Assign: determine whether to create a new object (a “New” operation) for the information at hand, or assign it to a certain existing object;

• Update.X: determine which internal property or external link of the selected object to update;

• Update2what: determine the content of the update, which could be a string, a category or links.

The typical order of actions is New-Assign → Update.X → Update2what, but it is very common to have a New-Assign action followed by nothing, for example when an object is mentioned but no substantial information is provided.

3.1 New-Assign

With any information at hand (denoted $S_t$) at time $t$, the choices of New-Assign typically include the following three categories of actions: 1) creating (New) an object of a certain type, 2) assigning $S_t$ to an existing object, and 3) doing nothing for $S_t$ and moving on. For Policy-net, the stochastic policy is to determine the following probabilities:

$$
\begin{aligned}
&\text{prob}(c, \text{new} \mid S_t), && c = 1, 2, \cdots, |C| \\
&\text{prob}(c, k \mid S_t), && \text{for } O^{c,k}_t \in M^t_{obj} \\
&\text{prob}(\text{none} \mid S_t) &&
\end{aligned}
$$

where $|C|$ stands for the number of classes, and $O^{c,k}_t$ stands for the $k$th object of class $c$ at time $t$. Determining whether to New an object always relies on the following two signals:

1. The information at hand cannot be contained by any existing object;

2. Linguistic hints that suggest whether a new object is being introduced.

Based on these intuitions, we take a score-based approach to determine the above probabilities. More specifically, for a given $S_t$, Reader forms a “temporary” object with its own structure (denoted $O_t$), including symbolic and distributed sections. In addition, we also have a virtual object for the New action of each class $c$, denoted $O^{c,\text{new}}_t$, which is typically a time-dependent vector formed by Reader based on the information in $M^t_{mat}$. For a given $O_t$, we can then define the following $2|C|+1$ types of score functions, namely

$$
\begin{aligned}
\text{New an object of class } c:&\quad \text{score}^{(c)}_{\text{new}}\big(O^{c,\text{new}}_t, O_t;\, \theta^{(c)}_{\text{new}}\big), \quad c = 1, 2, \cdots, |C| \\
\text{Assign to an existing object}:&\quad \text{score}^{(c)}_{\text{assign}}\big(O^{c,k}_t, O_t;\, \theta^{(c)}_{\text{assign}}\big), \quad \text{for } O^{c,k}_t \in M^t_{obj} \\
\text{Do nothing}:&\quad \text{score}_{\text{none}}\big(O_t;\, \theta_{\text{none}}\big)
\end{aligned}
$$

to measure the level of matching between the information at hand and existing objects, as well as the likeliness of creating a new object or doing nothing. This process is pictorially illustrated in Figure 7. We can therefore define the following probabilities for the stochastic policy:

$$
\text{prob}(c, \text{new} \mid S_t) = \frac{\exp\big(\text{score}^{(c)}_{\text{new}}(O^{c,\text{new}}_t, O_t;\, \theta^{(c)}_{\text{new}})\big)}{Z(t)}, \qquad
\text{prob}(c, k \mid S_t) = \frac{\exp\big(\text{score}^{(c)}_{\text{assign}}(O^{c,k}_t, O_t;\, \theta^{(c)}_{\text{assign}})\big)}{Z(t)}, \qquad
\text{prob}(\text{none} \mid S_t) = \frac{\exp\big(\text{score}_{\text{none}}(O_t;\, \theta_{\text{none}})\big)}{Z(t)}
$$

where

$$
Z(t) = \sum_{c' \in C} \exp\big(\text{score}^{(c')}_{\text{new}}(O^{c',\text{new}}_t, O_t;\, \theta^{(c')}_{\text{new}})\big)
+ \sum_{(c'', k') \in \text{idx}(M^t_{obj})} \exp\big(\text{score}^{(c'')}_{\text{assign}}(O^{c'',k'}_t, O_t;\, \theta^{(c'')}_{\text{assign}})\big)
+ \exp\big(\text{score}_{\text{none}}(O_t;\, \theta_{\text{none}})\big)
$$

is the normalizing factor.

Figure 7: A pictorial illustration of what Reader sees in determining whether to New an object, and of the relevant objects, when the read-head on Inline Memory reaches the last word in the sentence in Figure 2. The color of an arrow line stands for the matching function of an object class, and the dashed lines are for the new objects.

Many actions are essentially trivial on the symbolic part, for example when Policy-net chooses none in New-Assign, or assigns the information at hand to an existing object but chooses to update nothing in Update.X; but such an action will still affect the distributed operations in Reader. These distributed operations affect the representation in Matrix Memory or the object-embeddings in Object Memory.
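The stochastic New-Assign policy above is just a softmax over the $2|C|+1$ scores; the following numpy sketch makes that explicit, with the three score functions passed in as stand-ins for the learned matching networks.

```python
import numpy as np


def new_assign_distribution(o_t, virtual_new_objs, existing_objs,
                            score_new, score_assign, score_none):
    """Softmax over the choices: New per class, Assign per existing object, or none."""
    choices, scores = [], []
    for c, o_new in virtual_new_objs.items():            # one New score per class c
        choices.append(("new", c))
        scores.append(score_new(o_new, o_t, c))
    for (c, k), obj in existing_objs.items():            # one Assign score per existing object
        choices.append(("assign", c, k))
        scores.append(score_assign(obj, o_t, c))
    choices.append(("none",))                            # one score for doing nothing
    scores.append(score_none(o_t))
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores - scores.max())                # numerically stable softmax over Z(t)
    probs /= probs.sum()
    return choices, probs
```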

3.2 Updating objects: Update.X and Update2what

In the Update.X step, Policy-net needs to choose the property or external link (or none) to update for the object selected in the New-Assign step. If Update.X chooses to update an external link, Policy-net needs to further determine which object it links to. After that, Update2what updates the chosen property or link. In tasks with a static ontology, most internal properties and links will be “locked” after they are updated for the first time, with some exceptions for a few semi-structured properties (e.g., the Description property in the experiment in Section 6.2). For a dynamical ontology, on the contrary, many important properties and links are always subject to changes. A link can often be determined from both ends, e.g., the link stating the fact that “Tina (a Person-object) carries apple (an Item-object)” can be specified either from Tina (through adding the link “carry” to apple) or from apple (through adding the link “iscarriedby” to Tina), as in the experiment in Section 6.1. In practice, it is often more convenient to make it asymmetrical to reduce the size of the action space.

In practice, for a particular type of ontology, both Update.X and Update2what can often be greatly simplified. For example:

• when the selected object (in the New-Assign step) has only one property “unlocked”, the Update.X step will be trivial;

• in $S_t$, there is often information from Inline Memory that tells us the basic type of the current information, which can often automatically decide the property or link.

3.3 An example

In Figure 8, we give an example of an entire episode of OONP parsing on the short text given in the example in Figure 1. Note that, different from the treatment of actions above, we let some selection actions (e.g., the Assign) be absorbed into the updating actions to simplify the illustration.

Figure 8: A pictorial illustration of a full episode of OONP parsing, where we assume the descriptions of cars (highlighted with shadow) are segmented in preprocessing.

4 OONP: Neural-Symbolism

OONP offers a way to parse a document that imitates the cognitive process of humans when reading and comprehending a document: OONP maintains a partial understanding of the document as a mixture of a symbolic representation (for clearly inferred structural knowledge) and a distributed representation (for knowledge without complete structure or with great uncertainty). As shown in Figure 2, Reader takes and issues both symbolic signals and continuous signals, and they are entangled through the Neural Net Controller.

OONP has plenty of space for symbolic processing: in the implementation in Figure 6, it is carried out by the three symbolic processors. For each of the symbolic processors, the input symbolic representation could be rendered partially by neural models, therefore providing an intriguing way to entangle neural and symbolic components. Here are three examples we implemented for two different tasks:

1. Symbolic analysis of Action History: there are many symbolic summaries of history we can extract or construct from the sequence of actions, e.g., “the system just New-ed an object of Person-class five words ago” or “the system just put a paragraph starting with ‘(2)’ into event-3”. In the implementation of Reader shown in Figure 6, this analysis is carried out by the component called Symbolic Analyzer. Based on those more structured representations of history, Reader may be able to make an informed guess like “if the coming paragraph starts with ‘(3)’, we might want to put it into event-2” based on symbolic reasoning. This kind of guess can be directly translated into features to assist Reader’s decisions, resembling what we do with high-order features in CRFs [Lafferty et al., 2001], but the sequential decision process makes it possible to construct a much richer class of features from symbolic reasoning, including those with recursive structure. One example of this can be found in [Yan et al., 2017], as a special case of OONP for event identification.

2. Symbolic reasoning on Object Memory: we can use an extra Symbolic Reasoner to take care of high-order logic reasoning after each update of the Object Memory caused by the actions. This can be illustrated through the following example. Tina (a Person-object) carries an apple (an Item-object), and Tina moves from the kitchen (a Location-object) to the garden (a Location-object) at time $t$. Supposing we have both the Tina-carry-apple and the Tina-islocatedat-kitchen relations kept in Object Memory at time $t$, and OONP updates Tina-islocatedat-kitchen to Tina-islocatedat-garden at time $t+1$, the Symbolic Reasoner can help to update the relation apple-islocatedat-kitchen to apple-islocatedat-garden (a minimal rule sketch is given after this list). This is feasible since the Object Memory is supposed to be logically consistent. This external logic-based update is often necessary, since it is hard to let the Neural Net Controller see the entire Object Memory, due to the difficulty of finding a distributed representation of the dynamic structure there. Please see Section 6.1 for experiments.

3. Symbolic prior in New-Assign: when Reader determines a New-Assign action, it needs to match the information at hand ($S_t$) against existing objects. There is a rich set of symbolic priors that can be added to this matching process in the Symbolic Matching component. For example, if $S_t$ contains a string labeled as an entity name (in preprocessing), we can use some simple rules (part of the Symbolic Matching component) to determine whether it is compatible with an object with the internal property Name.
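Below is a minimal sketch of the location-propagation rule from example 2, assuming the Object Memory relations are exposed as (subject, predicate, object) triples; the triple store and the predicate names are our own simplification of the paper's links.

```python
def propagate_locations(relations):
    """Domain rule: an item carried by a person is located wherever the person is.

    `relations` is a set of triples such as ("Tina", "carry", "apple")
    and ("Tina", "is-located-at", "garden").
    """
    propagated = set()
    for person, pred, item in relations:
        if pred != "carry":
            continue
        for person2, pred2, place in relations:
            if person2 == person and pred2 == "is-located-at":
                propagated.add((item, "is-located-at", place))
    # drop stale locations of carried items, then add the propagated ones
    carried = {item for _, pred, item in relations if pred == "carry"}
    kept = {(s, p, o) for (s, p, o) in relations
            if not (p == "is-located-at" and s in carried)}
    return kept | propagated
```

With Tina-carry-apple in the store, updating Tina's location from kitchen to garden and re-running the rule moves apple-is-located-at-kitchen to apple-is-located-at-garden, keeping the Object Memory logically consistent.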

5 Learning

The parameters of OONP models (denoted $\Theta$) include those for all operations and those for composing the distributed sections in Inline Memory. They can be trained with different learning paradigms: OONP takes both supervised learning (SL) and reinforcement learning (RL), while allowing different ways to mix the two. Basically, with supervised learning, the oracle gives the ground truth about the “right action” at each time step during the entire decision process, with which the parameters can be tuned to maximize the likelihood of the truth. In a sense, SL represents rather strong supervision, which is related to imitation learning [Stefan, 1999] and often requires the labeler (expert) to give not only the final truth but also when and where a decision is made. For supervised learning, the objective function is given as

$$
J_{SL}(\Theta) = -\frac{1}{N}\sum_i^N \sum_{t=1}^{T_i} \log\big(\pi^{(i)}_t[a^\star_t]\big) \tag{1}
$$

where $N$ stands for the number of instances, $T_i$ stands for the number of steps in the decision process for the $i$th instance, $\pi^{(i)}_t[\cdot]$ stands for the probabilities of the feasible actions at $t$ from the stochastic policy, and $a^\star_t$ stands for the ground truth action at step $t$.

With reinforcement learning, the supervision is given as rewards during the decision process, for which an extreme case is to give the final reward at the end of the decision process by comparing the generated ontology with the ground truth, e.g.,

$$
r^{(i)}_t =
\begin{cases}
0, & \text{if } t \neq T_i \\
\text{match}\big(M^{T_i}_{obj}, G_i\big), & \text{if } t = T_i
\end{cases} \tag{2}
$$

where $\text{match}(M^{T_i}_{obj}, G_i)$ measures the consistency between the ontology in $M^{T_i}_{obj}$ and the ground truth $G_i$. We can use any policy search algorithm to maximize the expected total reward. With the commonly used REINFORCE [Williams, 1992] for training, the gradient is given by

$$
\nabla_\Theta J_{RL}(\Theta) = \mathbb{E}_{\pi_\Theta}\!\left[\nabla_\Theta \log \pi_\Theta(a^i_t \mid s^i_t)\; r^{(i)}_{t:T}\right]
\approx -\frac{1}{N\,T_i}\sum_i^N \sum_{t=1}^{T}\nabla_\Theta \log \pi_\Theta(a^i_t \mid s^i_t)\; r^{(i)}_{t:T_i}. \tag{3}
$$

When OONP is applied to real-world tasks, there are often quite natural SL and RL signals. More specifically, for a “static ontology” one can often infer some of the right actions at certain time steps by observing the final ontology, based on some basic assumptions, e.g.,

• the system should New an object the first time it is mentioned,

• the system should put an extracted string (say, one for Name) into the right property of the right object at the end of the string.

For decisions that cannot be fully reverse-engineered, say the categorical properties of an object (e.g., Type for Event-objects), we have to resort to RL to determine the time of the decision, while we also need SL to train Policy-net on the content of the decision. Fortunately, it is quite straightforward to combine the two learning paradigms in optimization. More specifically, we maximize the combined objective

$$
J(\Theta) = J_{SL}(\Theta) + \lambda J_{RL}(\Theta), \tag{4}
$$

where $J_{SL}$ and $J_{RL}$ are over the parameters within their own supervision modes, and $\lambda$ coordinates the weight of the two learning modes on the parameters they share. Equation 4 actually indicates a deep coupling of supervised learning and reinforcement learning, since for any episode the sampled actions related to RL may affect the inputs to the models under supervised learning.

For a dynamical ontology (see Section 6.1 for an example), it is impossible to derive most of the decisions from the final ontology since they may change over time. For those, we have to rely mostly on supervision at the time step to train the action (supervised mode), or count on the model to learn the dynamics of the ontology evolution by fitting the final ground truth. Both scenarios are discussed in Section 6.1 on a synthetic task.
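As a concrete reading of Equations (1)-(4), here is a hedged PyTorch-style sketch of the combined objective for one batch of episodes; the per-step episode fields (log_probs, supervised, returns) are assumed interfaces, not part of the paper.

```python
import torch


def combined_loss(episodes, lam):
    """J = J_SL + lambda * J_RL over a batch of parsed episodes (illustrative).

    Each episode is assumed to carry, per step: the log-probability of the taken
    action (a torch scalar), a flag for whether step-level supervision exists,
    and the return r_{t:T} for the RL-supervised steps.
    """
    sl_terms, rl_terms = [], []
    for ep in episodes:
        for log_prob, supervised, ret in zip(ep.log_probs, ep.supervised, ep.returns):
            if supervised:
                sl_terms.append(-log_prob)           # Eq. (1): negative log-likelihood
            else:
                rl_terms.append(-log_prob * ret)     # Eq. (3): REINFORCE surrogate
    j_sl = torch.stack(sl_terms).mean() if sl_terms else torch.tensor(0.0)
    j_rl = torch.stack(rl_terms).mean() if rl_terms else torch.tensor(0.0)
    return j_sl + lam * j_rl                         # Eq. (4), as a loss to minimize
```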

6 Experiments

We applied OONP to three document parsing tasks, to verify its efficacy on parsing documents with different characteristics and to investigate different components of OONP.

6.1 Task-I: bAbI Task

6.1.1 Data and task

We implemented OONP on an enriched version of the bAbI tasks [Johnson, 2017] with intermediate representations for histories of arbitrary length. In this experiment, we considered only the original bAbI task-2 [Weston et al., 2015], with an instance shown in the left panel of Figure 9. The ontology has three types of objects: Person-object, Item-object, and Location-object, and three types of links:

1. is-located-atA: between a Person-object and a Location-object;

2. is-located-atB: between an Item-object and a Location-object;

3. carry: between a Person-object and an Item-object;

each of which could be rendered by descriptions in different ways. All three types of objects have Name as the only internal property.

Figure 9: One instance of bAbI (a 6-sentence episode) and the ontology at two snapshots.

The task for OONP is to read an episode of the story and recover the trajectory of the evolving ontology. We chose this synthetic dataset because it has a dynamical ontology that evolves with time, with ground truth given for each snapshot, as illustrated in Figure 9. Compared with the real-world tasks we will present later, bAbI has almost trivial internal properties but relatively rich opportunities for links, considering that any two objects of different types could potentially have a link.

6.1.2 Implementation details

For preprocessing, we have a trivial NER to find the names of people, items and locations (saved in the symbolic part of Inline Memory) and a word-level bi-directional GRU for the distributed representations of Inline Memory. In the parsing process, Reader goes through the Inline Memory word by word in the temporal order of the original text, makes a New-Assign action at every word, and leaves Update.X and Update2what actions to the time steps when the read-head on Inline Memory reaches a punctuation mark (see more details of the actions in Table 1). For this simple task, we use an almost fully neural Reader (with MLPs for Policy-net) and a vector for Matrix Memory, with however a Symbolic Reasoner for some logic reasoning after each update of the links, as illustrated through the following example. Suppose at time $t$ the ontology in $M^t_{obj}$ contains the following three facts (among others):

• fact-1: John (a Person-object) is in kitchen (a Location-object);

• fact-2: John carries apple (an Item-object);

• fact-3: John drops apple;

where fact-3 has just been established by Policy-net at time $t$. The Symbolic Reasoner will add a new is-located-atB link between apple and kitchen based on domain logic‡.

Action                            Description
NewObject(c)                      New an object of class-c.
AssignObject(c, k)                Assign the current information to the existing object (c, k).
Update(c, k).AddLink(c′, k′, ℓ)   Add a link of type-ℓ from object-(c, k) to object-(c′, k′).
Update(c, k).DelLink(c′, k′, ℓ)   Delete the link of type-ℓ from object-(c, k) to object-(c′, k′).

Table 1: Actions for bAbI.

6.1.3 Results and Analysis

For training, we use 1,000 episodes with lengths evenly distributed from one to six. We use just REINFORCE with only the final reward, defined as the overlap between the generated ontology and the ground truth; step-by-step supervision on actions yields almost perfect results (results omitted). For evaluation, we use the following two metrics:

• the Rand index [Rand, 1971] between the generated set of objects and the ground truth, which counts both duplicate objects and missing ones, averaged over all snapshots of all test instances;

• the F1 [Rijsbergen, 1979] between the generated links and the ground truth, averaged over all snapshots of all test instances, since the links are typically sparse compared with all the possible pairwise relations between objects (a small sketch of both metrics follows below).

‡The logic says, an item is not “in” a location if it is held by a person.
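For concreteness, here is a hedged sketch of the two metrics as we understand them: the classical pairwise Rand index between two labelings, and the F1 between two link sets; averaging over snapshots and the exact handling of duplicate or missing objects is left to the caller.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Classical Rand index [Rand, 1971] between two labelings of the same items."""
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


def link_f1(predicted, gold):
    """F1 between two link sets, each link a hashable (src, type, dst) triple."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return float(predicted == gold)
    hit = len(predicted & gold)
    if hit == 0:
        return 0.0
    precision, recall = hit / len(predicted), hit / len(gold)
    return 2 * precision * recall / (precision + recall)
```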

The results are summarized in Table 2. OONP can learn fairly well to recover the evolving ontology with such a small training set and weak supervision (RL with only the final reward), which clearly shows that credit assignment to earlier snapshots does not cause much difficulty in the learning of OONP, even with a generic policy search algorithm. It is not so surprising to observe that the Symbolic Reasoner helps to improve the results on discovering the links, while it does not improve the performance on identifying the objects, although it is included in the learning. It is quite interesting that OONP achieves rather high accuracy on discovering the links while it performs relatively poorly on specifying the objects; this is probably due to the fact that the reward does not penalize the objects.

Model                  F1 (links) %    Rand index (objects) %
OONP (without S.R.)    94.80           87.48
OONP (with S.R.)       95.30           87.48

Table 2: The performance of an implementation of OONP on bAbI task-2.

6.2 Task-II: Parsing Police Report

6.2.1 Data & task

We implement OONP for parsing Chinese police reports (brief descriptions of criminal cases written by policemen), as illustrated in the left panel of Figure 10. We consider a corpus of 5,500 cases with a variety of crime categories, including theft, robbery, drug dealing and others. The ontology we designed for this task mainly consists of a number of Person-objects and Item-objects connected through an Event-object with several types of relations, as illustrated in the right panel of Figure 10. A Person-object has three internal properties: Name (string), Gender (categorical) and Age (number), and two types of external links (suspect and victim) to an Event-object. An Item-object has three internal properties: Name (string), Quantity (string) and Value (string), and six types of external links (stolen, drug, robbed, swindled, damaged, and other) to an Event-object. Compared with bAbI in Section 6.1, the police-report ontology has fewer pairwise links but much richer internal properties for objects of all three classes. Although the language in this dataset is reasonably formal, the corpus covers a big variety of topics and language styles, and has a high proportion of typos. The average length of a document is 95 Chinese characters, with a digit string (say, an ID number) counted as one character.

Figure 10: An example of a police report and its ontology.

6.2.2 Implementation details

The OONP model is designed to generate the ontology illustrated in Figure 10 through a decision process with the actions in Table 3. As preprocessing, we performed regular NER with a third-party algorithm (therefore not part of the learning) and simple rule-based extraction to yield the symbolic part of Inline Memory, as shown in Figure 11. For the distributed part of Inline Memory, we used dilated CNNs with different choices of depth and kernel size [Yu and Koltun, 2015], all of which are jointly learned during training. In making the New-Assign decision, Reader considers the matching between two structured objects, as well as hints from the symbolic part of Inline Memory as features, as pictorially illustrated in Figure 7. In updating objects with string-type properties (e.g., Name for a Person-object), we use a Copy-Paste strategy for the extracted string (whose NER tag already specifies which property of an object it goes to) as Reader sees it. For undetermined category properties in existing objects, Policy-net determines the object to update (a New-Assign action without the New option), the property to update (an Update.X action), and the updating operation (an Update2what action) at milestones of the decision process, e.g., when reaching a punctuation mark. For this task, since all the relations are between the single by-default Event-object and other objects, the relations can in practice be reduced to category-type properties of the corresponding objects. For category-type properties, we cannot recover New-Assign and Update.X actions from the label (the final ontology), so we resort to RL to determine that part, which is mixed with the supervised learning of Update2what and other actions for string-type properties.
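The Copy-Paste strategy for string-type properties can be pictured with a small sketch, reusing the InlineCell layout assumed earlier: an NER-tagged span is copied verbatim into the property its tag names. The tag-to-property convention is our own assumption.

```python
def copy_paste(inline_memory, start, end, target_object):
    """Copy an extracted span into the object property named by its NER tag."""
    span = inline_memory[start:end]
    property_name = span[0].tags.get("ner")      # e.g. "Name", "Quantity", "Value"
    if property_name is None:
        return
    # Chinese text: concatenate characters without inserting spaces.
    target_object.properties[property_name] = "".join(cell.token for cell in span)
```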

Action                              Description
NewObject(c)                        New an object of class-c.
AssignObject(c, k)                  Assign the current information to the existing object (c, k).
UpdateObject(c, k).Name             Set the name of object-(c, k) with the extracted string.
UpdateObject(Person, k).Gender      Set the gender of the Person-object indexed k.
UpdateObject(Item, k).Quantity      Set the quantity of the Item-object indexed k with the extracted string.
UpdateObject(Item, k).Value         Set the value of the Item-object indexed k with the extracted string.
UpdateObject(Event, 1).Items.x      Set the link between the Event-object and an Item-object, where x ∈ {stolen, drug, robbed, swindled, damaged, other}.
UpdateObject(Event, 1).Persons.x    Set the link between the Event-object and a Person-object, where x ∈ {victim, suspect}.

Table 3: Actions for parsing police reports.

6.2.3 Results & discussion

We use 4,250 cases for training, 750 for validation, and a held-out 750 for test. We consider the following four metrics in comparing the performance of different models:

• Assignment Accuracy: the accuracy of the New-Assign actions made by the model;

• Category Accuracy: the accuracy of predicting the category properties of all the objects;

• Ontology Accuracy: the proportion of instances for which the generated ontology is exactly the same as the ground truth;

• Ontology Accuracy-95: the proportion of instances for which the generated ontology achieves 95% consistency with the ground truth.

These metrics measure the accuracy of the model in making discrete decisions as well as in generating the final ontology. We empirically examined several OONP implementations and compared them with a Bi-LSTM baseline, with results given in Table 4.

Model                  Assign Acc. (%)   Type Acc. (%)   Ont. Acc. (%)   Ont. Acc-95 (%)
Bi-LSTM (baseline)     73.2 ± 0.58       -               36.4 ± 1.56     59.8 ± 0.83
OONP (neural)          88.5 ± 0.44       84.3 ± 0.58     61.4 ± 1.26     75.2 ± 1.35
OONP (structured)      91.2 ± 0.62       87.0 ± 0.40     65.4 ± 1.42     79.9 ± 1.28
OONP (RL)              91.4 ± 0.38       87.8 ± 0.75     66.7 ± 0.95     80.7 ± 0.82

Table 4: OONP on parsing police reports.

The Bi-LSTM baseline is essentially a simple version of OONP without a structured Carry-on Memory and without the designed operations (e.g., the sophisticated matching function in New-Assign). Basically, it consists of a Bi-LSTM Inline Memory encoder and a two-layer MLP on top of it acting as a simple Policy-net for predicting actions. Since this baseline does not have an explicit object representation, it does not support category-type predictions. We hence only train this baseline model to perform New-Assign actions, and evaluate it with the Assignment Accuracy (the first metric) and a modified version of Ontology Accuracy (the third and fourth metrics) that counts only the properties that can be predicted by the Bi-LSTM, hence in favor of the Bi-LSTM. We consider three OONP variants:

• OONP (neural): a simple version of OONP with only distributed representations in Reader for determining all actions;

• OONP (structured): OONP that considers the matching between two structured objects in New-Assign actions, with symbolic priors encoded in Symbolic Matching and other features for Policy-net;

• OONP (RL): a version of OONP (structured) that uses RL to determine the time for predicting the category properties, while OONP (neural) and OONP (structured) use a rule-based approach to determine the time.

As shown in Table 4, the Bi-LSTM baseline struggles to achieve around 73% Assignment Accuracy on the test set, while OONP (neural) boosts the performance to 88.5%. Arguably, this difference in performance is due to the fact that the Bi-LSTM lacks Object Memory, so all relevant information has to be stored in the Bi-LSTM hidden states along the reading process. When we start putting symbolic representations and operations into Reader, as shown in the result of OONP (structured), the performance is again significantly improved on all four metrics. More specifically, we have the following two observations (not shown in the table):

• Adding inline symbolic features as in Figure 11 improves New-Assign action prediction by around 0.5%, and category property prediction by 2%. The features we use include the type of the candidate strings and the relative distance to the marker character we chose.

Figure 11: Information in distributed and symbolic forms in Inline Memory.

• Using a matching function that can take advantage of the structure in objects helps generalization. The objects in this task have multiple property slots, such as Name, Gender, Quantity, and Value. We tried adding both the original text string of a property slot and its embedding as additional features, e.g., the length of the longest common string between the candidate string and a relevant property of the object.

When using REINFORCE to determine when to make the prediction for a category property, as shown in the result of OONP (RL), the prediction accuracy for category properties and the overall ontology accuracy are improved. It is quite interesting that this has some positive impact on the supervised learning task (i.e., learning the New-Assign actions) through shared parameters. The entanglement of the two learning paradigms in OONP is one topic for future research, e.g., the effect of predicting the right category property on the New-Assign actions when the predicted category property is among the features of the matching function for New-Assign actions.

6.3 Task-III: Parsing court judgment documents

6.3.1 Data and task

We also implement OONP for parsing court judgements on theft. Unlike in the two previous tasks, court judgements are typically much longer, containing multiple events of different types as well as bulks of irrelevant text, as illustrated in the left panel of Figure 12. The dataset contains 1,961 Chinese judgement documents, divided into training/dev/testing sets with 1,561/200/200 texts respectively. The ontology we designed for this task mainly consists of a number of Person-objects and Item-objects connected through a number of Event-objects with several types of links. An Event-object has three internal properties: Time (string), Location (string), and Type (category, ∈ {theft, restitution, disposal}), four types of external links to Person-objects (namely, principal, companion, buyer, victim) and four types of external links to Item-objects (stolen, damaged, restituted, disposed). In addition to its external links to Event-objects, a Person-object has only Name (string) as an internal property. An Item-object has three internal properties: Description (array of strings), Value (string) and Returned (binary), in addition to its external links to Event-objects, where Description consists of the words describing the corresponding item, which could come from multiple segments across the document. A Person-object or an Item-object can be linked to more than one Event-object; for example, a person could be the principal suspect in event A and also a companion in event B. An illustration of a judgement document and the corresponding ontology can be found in Figure 12.

6.3.2 Implementation details

We use a model configuration similar to that in Section 6.2, with however the following important difference: in this experiment, OONP performs a two-round reading of the text. In the first round, OONP identifies the relevant events, creates empty Event-objects, and does Notes-Taking on Inline Memory to save the information about event segmentation (see [Yan et al., 2017] for more details). In the second round, OONP reads the updated Inline Memory, fills the Event-objects, creates and fills Person-objects and Item-objects, and specifies the links between them. When an object is created during a certain event, it is given an extra feature (not an internal property) indicating this connection, which is used in deciding links between this object and Event-objects, as well as in determining future New-Assign actions. The actions of the two reading rounds are summarized in Table 5, and a control-flow sketch follows the table.

Figure 12: Left panel: the judgement document, with the highlighted part being the description of the facts of the crime; right panel: the corresponding ontology.

Action for 1st-round reading       Description
NewObject(c)                       New an Event-object, with c = Event.
NotesTaking(Event, k).word         Put an indicator of event-k on the current word.
NotesTaking(Event, k).sentence     Put an indicator of event-k on the rest of the sentence, and move the read-head to the first word of the next sentence.
NotesTaking(Event, k).paragraph    Put an indicator of event-k on the rest of the paragraph, and move the read-head to the first word of the next paragraph.
Skip.word                          Move the read-head to the next word.
Skip.sentence                      Move the read-head to the first word of the next sentence.
Skip.paragraph                     Move the read-head to the first word of the next paragraph.

Action for 2nd-round reading       Description
NewObject(c)                       New an object of class-c.
AssignObject(c, k)                 Assign the current information to the existing object (c, k).
UpdateObject(Person, k).Name       Set the name of the kth Person-object with the extracted string.
UpdateObject(Item, k).Description  Add the extracted string to the description of the kth Item-object.
UpdateObject(Item, k).Value        Set the value of the kth Item-object with the extracted string.
UpdateObject(Event, k).Time        Set the time of the kth Event-object with the extracted string.
UpdateObject(Event, k).Location    Set the location of the kth Event-object with the extracted string.
UpdateObject(Event, k).Type        Set the type of the kth Event-object among {theft, disposal, restitution}.
UpdateObject(Event, k).Items.x     Set the link between the kth Event-object and an Item-object, where x ∈ {stolen, damaged, restituted, disposed}.
UpdateObject(Event, k).Persons.x   Set the link between the kth Event-object and a Person-object, where x ∈ {principal, companion, buyer, victim}.

Table 5: Actions for parsing court judgements.
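A hedged skeleton of the two-round control flow driven by the Table 5 actions; the policy and action interfaces are placeholders for the learned components.

```python
def parse_judgement(m_inl, m_obj, policy_round1, policy_round2):
    """Two-round reading (illustrative): segment events first, then fill the ontology."""
    # Round 1: create empty Event-objects and mark event segments in Inline Memory
    # via NewObject / NotesTaking / Skip actions, which also move the read-head.
    pos = 0
    while pos < len(m_inl):
        action = policy_round1(m_inl, pos, m_obj)
        pos = action.apply(m_inl, m_obj, pos)    # returns the new read-head position
    # Round 2: re-read the annotated Inline Memory; New-Assign and Update actions
    # fill Event/Person/Item objects and the links between them.
    for pos in range(len(m_inl)):
        for action in policy_round2(m_inl, pos, m_obj):
            action.apply(m_inl, m_obj, pos)
```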

6.3.3 Results and Analysis

We use the same metrics as in Section 6.2, and compare two OONP variants, OONP (neural) and OONP (structured), with a Bi-LSTM baseline. The Bi-LSTM is tested only on the second-round reading, while both OONP variants are tested on the two-round reading. The results are shown in Table 6. The OONP parsers attain accuracy significantly higher than the Bi-LSTM models. Among them, OONP (structured) achieves over 64% accuracy on getting the entire ontology right and over 78% accuracy on getting 95% consistency with the ground truth.

Model                  Assign Acc. (%)   Type Acc. (%)   Ont. Acc. (%)   Ont. Acc-95 (%)
Bi-LSTM (baseline)     84.66 ± 0.20      -               18.20 ± 1.01    36.88 ± 0.01
OONP (neural)          94.50 ± 0.24      97.73 ± 0.12    53.29 ± 0.26    72.22 ± 1.01
OONP (structured)      97.49 ± 0.43      97.43 ± 0.07    64.51 ± 0.99    78.61 ± 0.95

Table 6: OONP on judgement documents, with the four metrics defined in Section 6.2.3.

7 Conclusion

We proposed Object-oriented Neural Programming (OONP), a framework for semantically parsing in-domain documents. OONP is neural net-based, but equipped with sophisticated architecture and mechanisms for document understanding, therefore nicely combining interpretability and learnability. Our experiments on both synthetic and real-world document parsing tasks have shown that OONP can learn to handle fairly complicated ontologies with training data of modest sizes.

References

[Cho et al., 2014] Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of EMNLP, pages 1724–1734.

[Daume III et al., 2009] Daume III, H., Langford, J., and Marcu, D. (2009). Search-based structured prediction.

[Dong and Lapata, 2016] Dong, L. and Lapata, M. (2016). Language to logical form with neural attention. ACL, pages 33–43.

[Graves et al., 2014] Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines. CoRR, abs/1410.5401.

[Greff et al., 2015] Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015). LSTM: A search space odyssey. CoRR, abs/1503.04069.

[Henaff et al., 2016] Henaff, M., Weston, J., Szlam, A., Bordes, A., and LeCun, Y. (2016). Tracking the world state with recurrent entity networks. CoRR, abs/1612.03969.

[Henderson et al., 2014] Henderson, M., Thomson, B., and Young, S. (2014). Word-based dialog state tracking with recurrent neural networks. Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 292–299.

[Johnson, 2017] Johnson, D. D. (2017). Learning graphical state transitions. International Conference on Learning Representations.

[Lafferty et al., 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning, pages 282–289.

[Liang et al., 2017] Liang, C., Berant, J., Le, Q., Forbus, K. D., and Lao, N. (2017). Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. ACL.

[Meng et al., 2016] Meng, F., Lu, Z., Li, H., and Liu, Q. (2016). Interactive attention for neural machine translation. CoRR, abs/1610.05011.

[Mitchell, 2003] Mitchell, J. C. (2003). Concepts in Programming Languages. Cambridge University Press.

[Mou et al., 2017] Mou, L., Lu, Z., Li, H., and Jin, Z. (2017). Coupling distributed and symbolic execution for natural language queries. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2518–2526.

[Rand, 1971] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.

[Rijsbergen, 1979] Rijsbergen, C. J. V. (1979). Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition.

[Stefan, 1999] Stefan, S. (1999). Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242.

[Weston et al., 2015] Weston, J., Bordes, A., Chopra, S., and Mikolov, T. (2015). Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698.

[Williams, 1992] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.

[Yan et al., 2017] Yan, Y., Zheng, D., Lu, Z., and Song, S. (2017). Event identification as a decision process with non-linear representation of text. CoRR, abs/1710.00969.

[Yu and Koltun, 2015] Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122.

[Zeng et al., 2014] Zeng, D., Liu, K., Lai, S., Zhou, G., and Zhao, J. (2014). Relation classification via convolutional deep neural network. In Proceedings of COLING.