

Robot Learning with a Spatial, Temporal, and Causal And-Or Graph

Caiming Xiong∗, Nishant Shukla∗, Wenlong Xiong, and Song-Chun Zhu

Abstract— We propose a stochastic graph-based framework for a robot to understand tasks from human demonstrations and perform them with feedback control. It unifies both knowledge representation and action planning in the same hierarchical data structure, allowing a robot to expand its spatial, temporal, and causal knowledge at varying levels of abstraction. The learning system can watch human demonstrations, generalize learned concepts, and perform tasks in new environments, across different robotic platforms. We show the success of our system by having a robot perform a cloth-folding task after watching a few human demonstrations. The robot can accurately reproduce the learned skill, as well as generalize the task to other articles of clothing.

I. INTRODUCTION

Writing automated software on robots is not nearly as robust as that on traditional computers. This is due to the heavy burden of matching software assumptions to physical reality. The complexities and surprises of the real world require robots to adapt to new environments and learn new skills to remain useful.

In robot automation, implicit motor control is widely used for learning from human demonstrations [1] [2] [3]. However, implicit motor control is insufficient for generalizing robot execution. For instance, a robot can imitate a human's demonstration to open a door; yet, it cannot execute a similar motion trajectory, such as opening a window, without an explicit representation of the task. Intuition such as how to rotate the joints of an arm is not something easily expressible, but rather is learned through experience. Uniting explicit and implicit knowledge allows immediate communication through natural language [8], as well as clear grounding of abstract concepts into atomic actions.

In this paper, we propose a unified framework to bridge implicit motor control with explicit high-level knowledge so the robot can understand human behavior, perform a task with feedback control, and reason in vastly different environments. As a proof of concept, we teach a robot how to fold a shirt through a few human demonstrations, and have it infer how to fold never-before-seen articles of clothing, such as pants or towels. The same causality-learning framework can be extrapolated to arbitrary tasks, not just cloth-folding. Specifically, the robot can learn different skills (e.g. flattening, stretching) depending on which features it tracks (e.g. smoothness, elastic stress). Moreover, since explicit knowledge is structured graphically, our framework naturally allows for the merging, trimming, and addition of knowledge from various human demonstrations, all with feedback control. The high-level concepts are human-understandable, so both the human and robot can communicate through this intermediate language [7]. Thus, programming the robot becomes an act of merely modifying a graph-based data structure.

∗ C. Xiong and N. Shukla contributed equally to this work. C. Xiong, N. Shukla, W. Xiong, and S.-C. Zhu are with the Center for Vision, Cognition, Learning, and Autonomy (VCLA), University of California, Los Angeles. [email protected], [email protected], [email protected], [email protected]

The contributions of this paper include the following:

• Proposes a cross-platform stochastic framework for robots to ground human demonstrations into hierarchical spatial, temporal, and causal knowledge.

• Demonstrates a robot capable of learning, correcting its mistakes, and generalizing in a cloth-folding task from human demonstrations.

• Establishes the first system to use a non-rigid physical simulation to model the robot's environment to improve task execution.

• Provides experimental evidence of our framework's ability to generalize a cloth-folding task across different clothes and different robot platforms.

II. RELATED WORKS

While precisely grounding a human demonstration to atomic robot actions has been done in various forms [6] [13] [26], we instead focus on the novel representation and generalizability of tasks. Beetz et al. integrate robot knowledge representation into the perception processes as well, but our framework allows alternative plans generated by probabilistic sampling to match observed expectations. For example, there are multiple ways to fold a t-shirt, and each of these ways has its own likelihood. Our probabilistic learning framework most closely resembles the human-inspired Bayesian model of imitation by Rao et al. [21]. However, we instead emphasize the hierarchical and ever-changing nature of spatial, temporal, and causal concepts in the real world.

Autonomously folding clothes has been demonstrated in various works. Wang et al. [29] were able to successfully design a perception-based system to manipulate socks for laundry. Miller et al. [11] have demonstrated sophisticated cloth-folding robots, and Doumanoglou et al. [28] have made substantial progress in autonomously unfolding clothes. On the other hand, our focus is to understand how to perform arbitrary tasks. There are other systems [6] that also learn concrete action commands from short video clips, but unlike those, our design allows a modifiable grammar, and our performance is measured on multi-step, long-term actions. Furthermore, our solution to knowledge representation is more powerful than the commonsense reasoning employed by first-order logic [19], since it takes advantage of probabilistic models under ambiguous real-world perception.

Fig. 1. The Spatial And-Or Graph on the left represents the ongoing perceptual knowledge of the world, i.e. a learned stochastic visual grammar. A specific instance of the And-Or graph is realized in the parse graph on the right.

Our work is based on the knowledge representation system incorporated by Tu et al. [12], augmented heavily for the robotics domain. We extend the learning of event And-Or grammars and semantics from video [4] to our real-time robotics framework. The And-Or graph encapsulates a conformant plan under partial observability, enabling an architecture that is cognitively penetrable, since an updated belief of the world alters the robot's behavior [14]. Unlike traditional graph planning [10], the hierarchical nature of the knowledge representation system enables a practical way of generating actions for a long-term goal.

III. METHOD

There is often a fine distinction between memorization and understanding, where the latter enables generalizing learned concepts. In order to understand a human task from demonstrations/videos, such as cloth-folding, a knowledge representation system is necessary to ensure actions are not simply memorized. Four types of knowledge are important for understanding and generalizing:

• Spatial knowledge expresses the physical configuration of the environment when performing the task. For a cloth-folding task, the table, the cloth, and each part of the cloth, such as the left and right sleeves of a shirt, need to be detected.

• Temporal knowledge reveals the series of human actions in the process of the task. In cloth-folding, the hand motion, grip opening, and grip closing actions are essential. These actions combine to form a fold action.

• Causal knowledge conveys the state change of an object under each dynamic human action. For example, a shirt may be folded in various ways, either by folding the left sleeve into the middle and then the right sleeve, or vice versa. Folding a cloth requires multiple hierarchical steps of reasoning.

• The interplay between the spatial, temporal, and causal concepts manifests a generalizable form of knowledge to be used in changing application domains. The robot must choose an action to achieve a state change by using a causal reasoning concept. Each of the three must work together to express learned knowledge.

A. Mathematical Formulation for Human Task

Given a set of human task demonstrations D = \{D_1, D_2, \dots, D_n\}, such as cloth-folding videos, the goal is to learn a joint model G_{STC} covering spatial, temporal, and causal concepts, which we formulate as

G^*_{STC} = \arg\max_{G_{STC}} P(G_{STC} | D)    (1)
          = P(G_S | D) \cdot P(G_T | D) \cdot P(G_C | D) \cdot P(R(G_S, G_T, G_C) | D)

where G_S is the model of spatial concepts, G_T is the model of temporal concepts, G_C is the model of causal concepts, and R(G_S, G_T, G_C) is the relational/conditional model between the spatial, temporal, and causal concepts.

To implement this formulation, we need to define a concrete representation for each symbol in Eq. 1. Due to the structured and compositional nature of spatial, temporal, and causal concepts, we adopt a hierarchical stochastic grammar model, the And-Or graph (AoG) [5], as the base of our model representation, introduced below. To simplify the learning process, we marginalize the complex STC-AoG (G_{STC}) into the S-AoG (G_S), T-AoG (G_T), and C-AoG (G_C); thus, we can learn G_S, G_T, and G_C separately as the model's initialization, and then jointly learn the conditional model between them.

B. And-Or Graph Overview

The And-Or Graph is defined as a 3-tuple G = (V, R, P), where V = V^{AND} ∪ V^{OR} ∪ V^{T} consists of disjoint sets of And-nodes, Or-nodes, and Terminal nodes, respectively. R is a set of relations between Or-nodes or subgraphs, each of which represents a generating process from a parent node to its children. P(r) is an expansion probability for each relation r ∈ R.

Fig. 2. The Temporal And-Or Graph on the left is a database of all actions currently known in the real world. Each action has an associated agent and patient. The realized parse graph on the right shows a generated sequence of actions directly executable by the robot.

Figure 1 shows an example of an And-Or graph. An And-node represents the decomposition of a graph into multiple sub-graphs. It is denoted by an opaque circle, and all of its out-going edges are opaque lines. An Or-node is a probabilistic switch deciding which of the sub-graphs to accept. It is denoted by an open circle with out-going edges drawn in dashed lines. The Terminal node represents grounded components, often referred to as a dictionary.

The nodes are organized in a hierarchical directed acyclic graph (DAG) structure. The AoG is a combination of a Markov tree and a Markov random field, where an And-node corresponds to a graphic template model, and an Or-node corresponds to a switch in a Markov tree [17].
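To make the definition concrete, the following minimal sketch (Python, with hypothetical class names that are not taken from our implementation) shows one way to encode And-nodes, Or-nodes, and Terminal nodes, and to realize a parse graph by expanding every And-node and letting each Or-node choose one child according to its expansion probabilities.

    import random

    class Node:
        """Base class for nodes in an And-Or graph."""
        pass

    class Terminal(Node):
        def __init__(self, symbol):
            self.symbol = symbol          # grounded component (dictionary entry)

    class AndNode(Node):
        def __init__(self, children):
            self.children = children      # decomposition into sub-graphs

    class OrNode(Node):
        def __init__(self, children, probs):
            self.children = children      # alternative sub-graphs
            self.probs = probs            # expansion probabilities P(r)

    def sample_parse_graph(node):
        """Realize a parse graph: keep all And-children, pick one Or-child."""
        if isinstance(node, Terminal):
            return node.symbol
        if isinstance(node, AndNode):
            return [sample_parse_graph(c) for c in node.children]
        if isinstance(node, OrNode):
            child = random.choices(node.children, weights=node.probs, k=1)[0]
            return sample_parse_graph(child)
        raise TypeError("unknown node type")

    # toy usage: a "cloth" Or-node choosing between a shirt and a towel terminal
    cloth = OrNode([Terminal("shirt"), Terminal("towel")], probs=[0.7, 0.3])
    print(sample_parse_graph(cloth))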

Given a set of human demonstrations D, the model G is composed of an AoG structure \mathcal{G} and parameters \theta. The nodes and rules/edges in the graph structure aim to maximize the objective function, given by the posterior probability:

P(G | D) = P(\mathcal{G}, \theta | D)    (2)
         = P(\mathcal{G} | D) P(\theta | D, \mathcal{G})    (3)

The first term models the structure of the And-Or graph \mathcal{G} given the human demonstrations D. For this term, we manually design the structure of the S-AoG, but we learn the T-AoG and C-AoG structures automatically [4] [25] [15].

The second term models the parameters \theta of the graph, given the learned graph structure. It is reformulated as follows:

P(\theta | D, \mathcal{G}) \propto \prod_{D_i \in D} P(D_i | \theta, \mathcal{G})    (4)
                           \approx \prod_{D_i \in D} \max_{pg_i} P(D_i | pg_i, \theta, \mathcal{G}) P(pg_i | \theta, \mathcal{G})    (5)

where pg_i is the parse graph of D_i. A parse graph is an instance of \mathcal{G} in which each Or-node selects one of its children. P(pg_i | \theta, \mathcal{G}) is the prior probability distribution of the parse graph pg_i given \mathcal{G}. To simplify the learning process, we set it to a uniform distribution. Thus,

P(\theta | D, \mathcal{G}) \propto \prod_{D_i \in D} \max_{pg_i} P(D_i | pg_i, \theta, \mathcal{G})    (6)

and

P(D_i | pg_i, \theta, \mathcal{G}) = \prod_{v \in V^{AND}} P(Ch_v | v, \theta_v^{AND})    (7)
                                     \cdot \prod_{v \in V^{OR}} P(Ch_v | v, \theta_v^{OR})    (8)
                                     \cdot \prod_{v \in V^{T}} P(D_i | v)    (9)

where Ch_v denotes the children of a non-terminal node v \in V^{AND} \cup V^{OR}. The probability derivation represents a generating process from a parent node to its children, terminating at the terminal nodes to generate the sample D_i. The parameters are learned in an iterative process through a Minimax Entropy algorithm, explained in more detail later.
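As an illustration of Eqs. 4-9, the sketch below (Python; the data layout is hypothetical, not our implementation) estimates Or-node branch parameters by counting how often each branch is selected in the best parse graph of each demonstration. Under the uniform prior on parse graphs described above, this relative-frequency count is one way to obtain the initial maximum-likelihood estimates before the iterative Minimax Entropy refinement.

    from collections import Counter, defaultdict

    def estimate_or_parameters(parse_graphs):
        """
        parse_graphs: list of dicts, one per demonstration D_i, mapping an
        Or-node id to the child id chosen in the best parse graph pg_i.
        Returns P(child | Or-node) estimated by relative frequency.
        """
        counts = defaultdict(Counter)
        for pg in parse_graphs:
            for or_node, chosen_child in pg.items():
                counts[or_node][chosen_child] += 1
        theta = {}
        for or_node, child_counts in counts.items():
            total = sum(child_counts.values())
            theta[or_node] = {c: n / total for c, n in child_counts.items()}
        return theta

    # toy example: three demonstrations choosing between two folding orders
    pgs = [{"fold_order": "left_first"}, {"fold_order": "right_first"},
           {"fold_order": "left_first"}]
    print(estimate_or_parameters(pgs))  # left_first ~ 0.67, right_first ~ 0.33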

C. S-AoG: Spatial Concepts Model

A powerful way to capture perceptual information is through a visual grammar that produces the most probable interpretations of observed images. Therefore, we represent spatial concepts through a stochastic Spatial And-Or Graph (S-AoG) [5]. Nodes in the S-AoG represent visual information at varying levels of abstraction. The deeper a node lies in the graph, the more concrete a concept it represents. An And-node signifies physical compositionality (i.e. a wheel is a part of a car), whereas an Or-node describes structural variation (i.e. a car is a type of vehicle).

As demonstrated in Figure 1, the root node of the S-AoG encompasses all possible spatial states a robot may perceive. Here, the "Indoor scene" is decomposed into "Foreground" and "Background," which are then further decomposed. The nodes deeper in the tree represent finer and finer concepts until they reach the terminal nodes, which consist of grounded perception units such as the sleeve of a t-shirt.


Fig. 3. The Causal And-Or Graph encapsulates the fluent changes per action. The parse graph on the right shows the reasoning system in action.

D. T-AoG: Temporal Concepts Model

The action-space of the world is often an assortment of compositional and variational sub-actions. The hierarchical nature of actions leads us to represent actions by a stochastic Temporal And-Or Graph (T-AoG) [4]. And-nodes correspond to a sequence of actions (e.g. close the door, then lock it), whereas Or-nodes correspond to alternate, conflicting actions (e.g. close the door, or open the door). The leaf nodes of this graph are atomic action primitives that the robot can immediately perform. Different sequences of atomic actions produce different higher-level actions.

The T-AoG structure is learned automatically using techniques from Si et al. [4], establishing an initial knowledge base of actions. Our T-AoG does not learn new atomic actions, but it may learn higher-level actions that are built from these atomic actions. By fixing the set of atomic actions, we ensure the grounding of higher-level actions and alleviate the correspondence problem. Our framework assumes detectors of such atomic actions as input.

As shown in Figure 2, the root node of the T-AoG represents all possible actions. As we traverse down the tree, the actions become less and less abstract, until they can no longer be simplified. Therefore, the robot can unambiguously perform the atomic actions represented by the leaf nodes.
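A minimal sketch of this traversal (Python, using a simple tuple encoding rather than our actual data structures): an And-node concatenates the expansions of its children in order, an Or-node commits to the alternative already selected in the parse graph, and the leaves are atomic action primitives the robot can execute directly.

    def expand_actions(node):
        """
        Expand a T-AoG parse into a flat list of atomic actions.
        A node is either a string (atomic action), ("AND", [children]) meaning
        a sequence, or ("OR", chosen_child) meaning the alternative already
        selected in the parse graph.
        """
        if isinstance(node, str):
            return [node]
        kind, payload = node
        if kind == "AND":
            seq = []
            for child in payload:
                seq.extend(expand_actions(child))
            return seq
        if kind == "OR":
            return expand_actions(payload)
        raise ValueError("unknown node kind")

    # toy example from the text: close the door, then lock it
    lock_up = ("AND", [("OR", "close the door"), "lock the door"])
    print(expand_actions(lock_up))  # ['close the door', 'lock the door']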

The T-AoG provides us a way to define the structure and sequence of actions, but how an action causes a change in state is incorporated in the causality data structure defined next.

E. C-AoG: Causal Concepts Model

Causality is defined as a fluent change due to a relevant action. We can think of fluents as functions on a situation, x_1(s), x_2(s), ..., such as the state of a car's engine (on vs. off) or its current speed (5 mph, 10 mph, etc.). We use the Causal And-Or Graph (C-AoG) to encapsulate causality learned from human demonstration [15], as shown in Figure 3. Each causal node is a fluent change operator, transforming an input fluent into an output fluent by using an action from the T-AoG. As shown in the diagram, there are various ways to reach the same state. Or-nodes capture the various ways a fluent may change from one state to another.

From the point of view of automated planning, fluents are multi-variate observations of a state. The fluents that change due to a relevant action are vital for predicting future actions. If a fluent does not change under a change-inducing action, then it is irrelevant with respect to that action. Such time-invariant properties are defined as "attributes" of the node (e.g. color, weight). Additionally, fluents that change due to an inertial action (i.e. an action that is irrelevant to a fluent change) are noted as inconsistent.

For example, given a cloth s, let fluent x_1(s) represent high-level abstract information such as the shape of the cloth, whereas, if the cloth is a shirt, fluent x_2(s) represents specific keypoints for shirts. The C-AoG structure is learned through an information projection pursuit outlined by Fire et al. [15]. The STC-AoG uses these relevant fluent changes to plan out tasks.
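To ground the idea, the following sketch (Python, with hypothetical names and a deliberately simplified state encoding) represents a causal node as a precondition fluent assignment, an action label from the T-AoG, and the expected fluent change, and checks whether an observed transition is explained by that node.

    from dataclasses import dataclass

    @dataclass
    class CausalNode:
        action: str          # action label from the T-AoG
        preconditions: dict  # fluent name -> required value before the action
        effects: dict        # fluent name -> expected value after the action

        def explains(self, before, after):
            """True if this node's action accounts for the observed fluent change."""
            pre_ok = all(before.get(f) == v for f, v in self.preconditions.items())
            post_ok = all(after.get(f) == v for f, v in self.effects.items())
            return pre_ok and post_ok

    # toy example: folding the left sleeve changes the 'left_sleeve' fluent
    fold_left = CausalNode("Fold(left_sleeve, center)",
                           {"left_sleeve": "unfolded"},
                           {"left_sleeve": "folded"})
    print(fold_left.explains({"left_sleeve": "unfolded"},
                             {"left_sleeve": "folded"}))  # True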

F. Relational Model between Spatial, Temporal, Causal And-Or Graph

The three And-Or graphs are unified into a common framework for a complete representation of the world [12]. This explicit knowledge is represented by a hierarchical graphical network specifying a stochastic context-sensitive grammar [16], called the Spatial, Temporal, and Causal And-Or Graph (STC-AoG) [12]. The cloth-folding task in our real-time robot framework is incorporated as described in Figure 4.

Formally, the fluent functions x_i(s_j), for all j, partition the reals \mathbb{R}. Two fluents x_i(s_a) and x_i(s_b) are identical if they belong to the same partition. Each spatial or temporal situation s_i may have multiple fluents (x_1, x_2, \dots):

x(s_i) = [x_1(s_i), x_2(s_i), \dots]^T    (10)

The fluent change between two states s_j and s_k is formally defined as a binary vector:

\Delta x(s_j, s_k) = [\Delta x_1(s_j, s_k), \Delta x_2(s_j, s_k), \dots]^T    (11)


Fig. 4. For illustrative purposes, this diagram shows simple interactions between the spatial, temporal, and causal And-Or graphs. When the width w or height h of the shirt is larger than the target width w_T or height h_T, the C-AoG triggers a fold action in an attempt to reach a smaller folded shirt. The robot then folds the shirt to produce the desired width and height (w ≤ w_T and h ≤ h_T).

\Delta x_i(s_j, s_k) = \begin{cases} 0 & \text{if } x_i(s_j) = x_i(s_k) \\ 1 & \text{otherwise} \end{cases}

By accumulating human demonstrations of an action, we obtain a set of video clips Q_a = \{q_1, q_2, \dots\} for a specific action a, where q_i is a video clip showing action a. The score w_j(a) of an action to make a fluent change is defined as

\forall j \quad w_j(a) = P(\Delta x_j = 1 | Q_a) = \frac{\sum_i \mathbb{1}[\Delta x_j = 1 \mid q_i]}{|Q_a|}    (12)

with the scores normalized by \sqrt{\sum_j w_j(a)^2}.

Fluents that represent specific properties, such as keypoints, tend to be weighted more heavily than those that represent broad, high-level concepts, such as shape [18]. The fluents are typically hand-chosen, but we suggest automatically generating various abstractions of fluents by varying the dimensionality of autoencoders. Recent work on spatial semantics [27] can also initialize nodes with a set of useful fluents.

The STC-AoG is not just a knowledge representation system, but also a hierarchical planning graph. Folding a shirt using shirt fluents x_1(s) and x_2(s) has greater affordance than folding it using just the abstract shape information x_1(s). That way, causal reasoning remains specific to the object, guaranteeing that when folding a shirt, there is less preference to use knowledge about how to fold pants if knowledge about how to fold shirts already exists. We define the affordance of transferring from state s_i to s_j using action a as aff(a, s_i, s_j) = w(a)^T \Delta x(s_i, s_j), suggesting that the automated planning and reasoning should only be based on the relevant features.
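A minimal sketch of Eq. 12 and the affordance score (Python; the clip encoding and fluent names below are hypothetical placeholders): each demonstration clip is reduced to a binary fluent-change vector, the weights w_j(a) are the normalized fractions of clips in which fluent j changed, and the affordance between two states is the dot product of those weights with the observed change vector.

    import math

    def action_weights(clips):
        """
        clips: list of binary fluent-change vectors, one per clip q_i of action a
        (1 if fluent j changed in that clip, else 0). Returns w(a), L2-normalized.
        """
        n = len(clips)
        dims = len(clips[0])
        w = [sum(clip[j] for clip in clips) / n for j in range(dims)]
        norm = math.sqrt(sum(v * v for v in w)) or 1.0
        return [v / norm for v in w]

    def fluent_change(state_a, state_b):
        """Binary change vector between two fluent vectors (Eq. 11)."""
        return [0 if a == b else 1 for a, b in zip(state_a, state_b)]

    def affordance(w, state_a, state_b):
        """aff(a, s_i, s_j) = w(a)^T Delta x(s_i, s_j)."""
        return sum(wj * dj for wj, dj in zip(w, fluent_change(state_a, state_b)))

    # toy example with two fluents: [shape, shirt_keypoints]
    clips = [[1, 1], [1, 1], [1, 0]]   # keypoints changed in 2 of 3 clips
    w = action_weights(clips)
    print(affordance(w, ["spread", "kp0"], ["half_folded", "kp1"]))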

Unifying the three sub-graphs produces a closed-loop framework for robots learning from demonstrations. Moreover, graphs can store relationships in an intuitive and highly regular structure, allowing for algorithms that rely on simple graph manipulations. The real world is encoded through perception into the S-AoG to form a physical belief state of the world. The learning algorithm constructs a C-AoG to understand actions from human demonstrations. And lastly, inference combines the reasoning from the C-AoG and the actuators from the T-AoG to physically perform the task. The energy of the joint parse graph [12] combines the energy terms of each:

E_{STC}(pg) = E_S(pg) + E_T(pg) + E_C(pg) + \sum_{r \in R^*_{pg}} E_R(r)    (13)

We use generative learning by the Minimax Entropy Principle [20] to learn the probability distribution of STC parse graphs, P(pg). Doing so assumes that the sample mean of each statistic \phi_j(pg) should approach its true expectation s_j from observations. The parameters are solved by minimizing the Kullback-Leibler divergence between the observed distribution f and the candidate distribution p, KL(f \| p) = E_f[\log f(pg)] - E_f[\log p(pg)]. This simplifies to a maximum likelihood estimate, formulated as

p^* = \arg\max_{p \in \Omega} E_f[\log p(pg)] = \arg\max_{p \in \Omega} \sum_{i=1}^{n} \log p(pg_i) + \epsilon    (14)

Iteratively, we choose the statistics F = \{\phi_1, \phi_2, \dots\} that minimize the entropy of the model, and the parameters \beta that yield maximum entropy:

p^* = \arg\min_{F} \max_{\beta} \; \text{entropy}(p(pg; \theta))    (15)

Effectively, the robot "daydreams" possible probability distributions of parse graphs to converge with observations. During inference, it samples a parse graph to perform the action.

G. Learning Motor Control

The STC-AoG expresses explicit knowledge in a graphical structure easily understandable by humans, acting as a gateway for communication. However, the STC-AoG only defines discrete, salient spatial, temporal, and causal concepts. The interpolation of how an individual action is performed requires a specification of the fine motor skills involved, as well as an assignment of probability distribution parameters.

The explicit knowledge captured by a causal node represents a conformant plan learned from human demonstrations. The information stored in the STC-AoG only provides results at discrete time-steps, t \in \mathbb{N}. Its state-action table represents fluent changes by x_{t+1}(s) = f(x_t(s), x_t(a)). To shift paradigms from explicit to implicit knowledge, we relax the assumption of null run-time observability and use a finer distinction in time, x_{t+\delta t}(s) = f(x_t(s), x_t(a)). By learning this continuous function f, the robot system is capable of verifying, correcting, and inferring causal relations to adapt to dynamic environments.

We make two assumptions to simplify the learning of f. First, we restrict the range of spatial and temporal changes to adhere to spatiotemporal continuity, rendering sudden changes impossible. Second, we use a physical simulator based on perception encoded by the STC parse graph (STC-pg) to compare with reality at rapid time intervals. When a discrepancy is detected, we assign fault to the robot's actions. The feedback learning system uses a simplified optimization process inspired by Atkeson et al. [22] to update the control mechanics. Adjusting the parameters of the simulator to adhere to reality also reveals useful knowledge, but that is out of scope for this study.
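A schematic sketch of this feedback loop follows (Python; the simulator, perception, and correction interfaces are placeholders, not our actual APIs): at each small time step the perceived state is compared against the simulated prediction, and a discrepancy beyond a threshold is blamed on the robot's action and triggers a control correction.

    def feedback_execute(plan, simulate, execute, perceive, correct,
                         threshold=0.05, dt=0.1):
        """
        plan: list of action commands sampled from the STC parse graph.
        simulate(state, action, dt): physical simulator's predicted next state.
        execute(action): send the command to the robot's motor controller.
        perceive(): current state estimated from the RGB-D input.
        correct(action, predicted, observed): adjusted command after a mismatch.
        """
        for action in plan:
            state = perceive()
            predicted = simulate(state, action, dt)
            execute(action)
            observed = perceive()
            # a discrepancy between simulation and reality is blamed on the action
            if state_distance(predicted, observed) > threshold:
                execute(correct(action, predicted, observed))

    def state_distance(a, b):
        """Placeholder metric, e.g. mean absolute difference of fluent values."""
        return sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)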

H. Inference

Fig. 5. The inference engine samples a parse graph to create a conformant action plan. There is feedback between the plan, its simulation, and the corresponding perceived execution.

Since the STC-AoG model is generatively learned, we infer a parse graph through a simple sampling process. As seen in Figure 5, the procedurally generated parse graph lays out a conformant action plan for the robot. The system then creates a simulation of the action by converting the STC-pg into a motion plan and the spatial objects into 3D meshes from the point cloud.

The simulation plan is matched against reality at small interval steps to verify that the robot is at its corresponding simulated state. In the case of a substantial mismatch between expected and actual states, the robot understands that the action did not complete, and that a new action plan must be generated based on the latest perception input. Concretely, the sampling procedure is encapsulated by the algorithm in Figure 6.

IV. EXPERIMENTS

We conduct our experiments on a cloth-folding task. The S-AoG models the physical status of the cloth, table, robot, human, and various decompositions of each. The T-AoG consists of three atomic actions that span the action-space for this simple task: MoveArm(a), Grab, and Release. A Fold action in the T-AoG is a higher-level And-node consisting of four children: MoveArm(a), Grab, MoveArm(b), and Release, with the corresponding textual representation Fold(a, b) = MoveArm(a); Grab; MoveArm(b); Release. Consequently, a specific instance of folding is a series of Fold actions: FoldStyle1 = Fold(a, b); Fold(c, d); ...; Fold(y, z).
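For reference, a small sketch of this composition (Python, reusing the tuple encoding from the T-AoG sketch in Section III-D; the fold coordinates are placeholders): Fold(a, b) is an And-node over the four atomic primitives, and a folding style is an And-node over a sequence of Folds. Flattening fold_style_1 with the earlier expand_actions function yields the eight atomic commands the robot executes.

    def fold(a, b):
        """Fold(a, b) = MoveArm(a); Grab; MoveArm(b); Release as a T-AoG And-node."""
        return ("AND", [f"MoveArm({a})", "Grab", f"MoveArm({b})", "Release"])

    # one folding style: left sleeve to center, then right sleeve to center
    fold_style_1 = ("AND", [fold("left_sleeve", "center"),
                            fold("right_sleeve", "center")])
    print(fold("left_sleeve", "center"))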

1: while camera is producing image I^t do
2:     pg_S^t ← Interpret(G_S, I^t)
3:     pg_T^t ← Sample(G_STC, pg_S^t)
4:     pg_C^t ← Sample(G_STC, pg_S^t, pg_T^t)
5:     pg_STC ← Merge(pg_S^t, pg_T^t, pg_C^t)
6:     PerformWithFeedback(pg_STC)
7: end while

Fig. 6. The robot inference algorithm performs tasks on a learned STC-AoG. It interprets the sensory input as spatial, temporal, and causal parse graphs, which are merged to form a joint representation that is sampled and acted on.

Lastly, the C-AoG nodes describe how to fold a shirt from one state to another, learned through human demonstrations.

We use Baxter, a two-armed industrial robot, to perform our cloth-folding task. Each arm has 7 degrees of freedom that are adjusted through inverse kinematics relative to the robot's frame of reference. The robot's primary perception sensor is an Asus PrimeSense camera that provides an aligned RGB-D (Red, Green, Blue, and Depth) point cloud in real-time. In order to use localization results from perception, we compute the affine transformation matrix from the camera coordinate system to that of the robot. All components interact through the Robot Operating System (ROS).
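As an illustration of this calibration step (a minimal sketch in Python with NumPy; the 4x4 matrix below is a made-up placeholder, not our calibrated values), a point from the camera's point cloud is mapped into the robot's frame with a homogeneous affine transform.

    import numpy as np

    # placeholder camera-to-robot transform: rotation (upper-left 3x3) and translation
    T_robot_from_camera = np.array([
        [0.0, -1.0, 0.0, 0.30],
        [1.0,  0.0, 0.0, 0.10],
        [0.0,  0.0, 1.0, 0.75],
        [0.0,  0.0, 0.0, 1.00],
    ])

    def camera_to_robot(point_xyz):
        """Map a 3D point from the camera frame to the robot frame."""
        p = np.append(np.asarray(point_xyz, dtype=float), 1.0)  # homogeneous coords
        return (T_robot_from_camera @ p)[:3]

    # e.g. a grip point localized on the cloth in the camera frame
    print(camera_to_robot([0.12, -0.05, 0.90]))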

The STC-AoG is stored in the platform-independent Graphviz DOT language and used by our platform, which is written in C++. The hand-designed perception logic combines off-the-shelf graph-based [24] and foreground/background [23] segmentation to localize a cloth in each frame. On top of that, we train a shirt detector model using a Support Vector Machine to facilitate narrowing down the search for an optimal S-AoG parse graph. Each cloth node has a fluent x_1 describing the low-level shape. If a cloth is a shirt, we represent the structure of its keypoints as another fluent x_2. We simplify learning the probability distribution of parse graphs by limiting the number of statistics to F = \{\phi_1\}, where \phi_1 is the affordance cost of the action sequence in an STC-pg.

Performance on a task is measured by the percent of successful actions throughout the task. The overall performance is the average of all task performances over multiple trials. An action is successful if performing the action satisfies the pre- and post-conditions of the causal relationship used.

A. Experiment Settings

In the first set of experiments, we measure the performance of representing learned knowledge from human demonstrations. After watching the human demonstrations, the robot generates an action plan step by step. The human performs the action suggested by the robot, and at each step qualitatively verifies whether the robot's suggestion was indeed the intended action as per the demonstration. If verification fails, the action is marked unsuccessful; otherwise it is marked successful. This performance score on learning sets the baseline for the next set of experiments.


In the second series of experiments, we measure the quality of grounding the learned knowledge in the robot's actions. This time we let the robot, instead of the human, perform the actions. We compare the performance of the robot folding clothes with the results from the first set of experiments to evaluate how well the grounded physical actions match those of a human. The expected performance should be less than the ground truth established in the previous experiment.

In the third series of experiments, we measure the improvements from a feedback system compared to no feedback. We expect that the performance score calculated through this step should be higher than that from the previous experiment, but lower than the ground truth.

Finally, we are also curious how far we can stretch the generalizability of a learned task. After demonstrating how to fold a t-shirt, we ask the robot to infer how to fold different articles of clothing, such as full-sleeve shirts, towels, and pants. The criterion for generalizability of knowledge follows the same performance procedure as in the previous experiments.

B. Results

Over 10 trials for each of four sets of different t-shirt folding demonstrations D_1, D_2, D_3, D_4, we measure the average performance of using our system to learn knowledge, ground robot actions, and control feedback.

Fig. 7. Our learning system successfully understood the various folding techniques. It had some difficulty executing the task using simply a conformant plan, but with added feedback the execution was highly successful.

As seen in Figure 7, our knowledge representation system was able to characterize the cloth-folding task well enough to faithfully communicate with a human, producing a learned representation with an average performance of 90%. This sets the upper bound for the next two inference experiments. As anticipated, our framework was able to ground the actions with a performance of 42.5%. The low score indicates that although the robot knows what to do, there is still a discrepancy between the human's action and that generated by the STC-AoG. By adding feedback correction that compares perception to the physical simulation, the performance leaped to 83.125%, also matching our expectation.

The performance of generalizability was measured after training the robot on only t-shirt folding videos. The results are visualized in Figure 8. For example, since a full-sleeve shirt may have the same width and height fluents as a t-shirt, the inference plan for folding a full-sleeve shirt performed very well. Moreover, the robot was able to generate reasonable action plans to fold a towel it had never seen, since a t-shirt with both its sleeves folded resembles the rectangular shape of a towel. However, generating a reasonable inference result for folding pants was less successful, due to the natural lack of knowledge transfer between a shirt-folding and a pant-folding task. Figure 9 shows a few qualitative results of successful folding plans and executions.

Fig. 8. Our knowledge framework correctly understood how to generalize a t-shirt folding instruction to long-sleeve shirts and towels; however, it expectedly had difficulty extrapolating its knowledge to fold pants.

Fig. 9. Some qualitative results on the robot execution after learning from human demonstrations.

V. DISCUSSION AND FUTURE WORK

The experiments show preliminary support for the expressive power of the robot learning and execution framework laid out in this paper. While we focus heavily on the cloth-folding domain, the framework may be used for training any goal-oriented task. In future work, we wish to continue improving the robustness of each of the spatial, temporal, and causal And-Or graphs to optimize for speed and accuracy.

The STC-AoG acts as a language to ground knowledge and reasoning into robot actions. Since the knowledge representation and robot action planning systems share the same And-Or graph data structure, the graph acts as a programming language for the robot, and self-updating the graph is an act of metaprogramming.

Due to the hierarchical nature of the STC-AoG, the higher-level nodes are readily articulated and understandable by humans. We are currently working on incorporating natural language statements, commands, and questions to more easily allow humans to manipulate the graph. To scale up the graph for life-long learning, we are investigating other practical storage solutions, including graph-based databases such as Neo4j [30]. Since the graph is sufficient to transfer knowledge, we can upload different skills to a cloud platform and share knowledge between different robots.

Limits in the physical reachability and dexterity of the robot arms posed a major difficulty in mapping action plans to motor control execution. If a grip location was unreachable, the conformant plan would fail to execute the action at all. Fortunately, by introducing the feedback control system, we were able to at least extend the reach as far as possible to grip a reasonable point.

Lastly, the performance of the causal learning system relies on successfully detecting fluent changes. This requires adjusting thresholds for the fluent-change detectors until the results are satisfactory. We solved this problem by offline supervised learning for our chosen fluents, but we leave the problem of learning these threshold parameters online to future work.

VI. CONCLUSIONS

The stochastic graph-based framework is capable of representing task-oriented knowledge for tractable inference and generalizability. It successfully unified the theoretical foundations of And-Or perception grammars with a practical robotics platform. The experimental results support our claims of grounding learned knowledge to execute tasks accurately. We also demonstrate the generalizability of our framework by extrapolating from human demonstrations of folding a t-shirt to other articles of clothing. Lastly, our framework makes use of perceived discrepancies between high-level action plans and low-level motor control to verify and correct actions.

ACKNOWLEDGMENT

The authors would like to acknowledge the support of DARPA SIMPLEX project N66001-15-C-4035 and DARPA MSEE project FA 8650-11-1-7149. In addition, we would like to thank SRI International and OSRF for their support.

REFERENCES

[1] E. Theodorou, J. Buchli, and S. Schaal, "Reinforcement learning of motor skills in high dimensions: A path integral approach," in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 2397–2403.
[2] D. Kulic, C. Ott, D. Lee, J. Ishikawa, and Y. Nakamura, "Incremental learning of full body motion primitives and their sequencing through human motion observation," The International Journal of Robotics Research, p. 0278364911426178, 2011.
[3] S. Calinon, F. Guenter, and A. Billard, "On learning, representing, and generalizing a task in a humanoid robot," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 37, no. 2, pp. 286–298, 2007.
[4] Z. Si, M. Pei, B. Yao, and S.-C. Zhu, "Unsupervised learning of event and-or grammar and semantics from video," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 41–48.
[5] S.-C. Zhu and D. Mumford, "A stochastic grammar of images," Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 4, pp. 259–362, 2006.
[6] Y. Yang, Y. Li, C. Fermuller, and Y. Aloimonos, "Robot learning manipulation action plans by unconstrained videos from the world wide web," in The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.
[7] N. Shukla, C. Xiong, and S.-C. Zhu, "A unified framework for human-robot knowledge transfer," in AAAI'15 Fall Symposium on AI for Human-Robot Interaction (AI-HRI 2015), 2015.
[8] C. Liu, J. Y. Chai, N. Shukla, and S.-C. Zhu, "Task learning through visual demonstration and situated dialogue," in AAAI'16 Workshop on Symbiotic Cognitive Systems, 2016.
[9] T. Xiao, J. Zhang, K. Yang, Y. Peng, and Z. Zhang, "Error-driven incremental learning in deep convolutional neural network for large-scale image classification," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 177–186.
[10] A. L. Blum and M. L. Furst, "Fast planning through planning graph analysis," Artificial Intelligence, vol. 90, no. 1, pp. 281–300, 1997.
[11] S. Miller, J. van den Berg, M. Fritz, T. Darrell, K. Goldberg, and P. Abbeel, "A geometric approach to robotic laundry folding," International Journal of Robotics Research (IJRR), vol. 31, no. 2, pp. 249–267, 2012.
[12] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu, "Joint video and text parsing for understanding events and answering queries," MultiMedia, IEEE, vol. 21, no. 2, pp. 42–70, 2014.
[13] J.-W. Ha, K.-M. Kim, and B.-T. Zhang, "Automated construction of visual-linguistic knowledge via concept learning from cartoon videos," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI 2015), Austin, 2015.
[14] G. W. Strong and Z. W. Pylyshyn, "Computation and cognition: Toward a foundation for cognitive science. Cambridge, Massachusetts: The MIT Press, 1984, 320 pp.," Behavioral Science, vol. 31, no. 4, pp. 286–289, 1986.
[15] A. S. Fire and S. Zhu, "Learning perceptual causality from video," in Learning Rich Representations from Low-Level Sensors, Papers from the 2013 AAAI Workshop, Bellevue, Washington, USA, July 15, 2013, 2013.
[16] J. Rekers and A. Schürr, "A parsing algorithm for context-sensitive graph grammars," Tech. Rep., 1995.
[17] H. Chen, Z. J. Xu, Z. Q. Liu, and S. C. Zhu, "Composite templates for cloth modeling and sketching," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1, June 2006, pp. 943–950.
[18] C. Chao, M. Cakmak, and A. Thomaz, "Towards grounding concepts for transfer in goal learning from demonstration," in Development and Learning (ICDL), 2011 IEEE International Conference on, vol. 2, Aug 2011, pp. 1–6.
[19] E. T. Mueller, Commonsense Reasoning. Morgan Kaufmann, 2006.
[20] S. C. Zhu, Y. N. Wu, and D. Mumford, "Minimax entropy principle and its application to texture modeling," Neural Computation, vol. 9, no. 8, pp. 1627–1660, 1997.
[21] R. P. N. Rao, A. P. Shon, and A. N. Meltzoff, "A Bayesian model of imitation in infants and robots," in Imitation and Social Learning in Robots, Humans, and Animals. Cambridge University Press, 2004, pp. 217–247.
[22] C. Atkeson and S. Schaal, "Learning tasks from a single demonstration," in Robotics and Automation, 1997. Proceedings., 1997 IEEE International Conference on, vol. 2, Apr 1997, pp. 1706–1712.
[23] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut - interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (SIGGRAPH), August 2004.
[24] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vision, vol. 59, no. 2, pp. 167–181, Sept. 2004.
[25] K. Tu, M. Pavlovskaia, and S.-C. Zhu, "Unsupervised structure learning of stochastic and-or grammars," in Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 1322–1330.
[26] Y. Yamakawa, A. Namiki, and M. Ishikawa, "Motion planning for dynamic folding of a cloth with two high-speed robot hands and two high-speed sliders," in ICRA. IEEE, 2011, pp. 5486–5491.
[27] K. Zampogiannis, Y. Yang, C. Fermüller, and Y. Aloimonos, "Learning the spatial semantics of manipulation actions through preposition grounding," in ICRA. IEEE, 2015, pp. 1389–1396.
[28] A. Doumanoglou, A. Kargakos, T. Kim, and S. Malassiotis, "Autonomous active recognition and unfolding of clothes using random decision forests and probabilistic planning," in 2014 IEEE International Conference on Robotics and Automation, ICRA 2014, Hong Kong, China, May 31 - June 7, 2014, 2014, pp. 987–993.
[29] P. C. Wang, S. Miller, M. Fritz, T. Darrell, and P. Abbeel, "Perception for the manipulation of socks," in IROS, 2011, pp. 4877–4884.
[30] Neo4j, Neo4j - The World's Leading Graph Database, Std., 2012. [Online]. Available: http://neo4j.org/