CINet: A Learning Based Approach to Incremental Context Modeling in Robots

Fethiye Irmak Doğan¹*, İlker Bozcan¹*, Mehmet Çelik¹ and Sinan Kalkan¹
* Equal contribution

Abstract— There have been several attempts at modeling context in robots. However, these attempts either assume a fixed number of contexts or use a rule-based approach to determine when to increment the number of contexts. In this paper, we pose the task of when to increment as a learning problem, which we solve using a Recurrent Neural Network. We show that the network successfully (with 98% testing accuracy) learns to predict when to increment, and demonstrate, in a scene modeling problem (where the correct number of contexts is not known), that the robot increments the number of contexts in an expected manner (i.e., the entropy of the system is reduced). We also present how the incremental model can be used for various scene reasoning tasks.

I. INTRODUCTION

Context is known to be crucial for human cognition, functioning as a modulator affecting our perception, reasoning, communication and action [1], [2]. The robots that we expect to play an important role in our daily lives in the near future should have the ability to perceive, to learn and to use context, like we do.

However, context modeling should be incremental since it is not possible to know beforehand the set of all possible situations that a robot is going to encounter. The types of situations (contexts) will differ even for a simple vacuum cleaning robot, depending on the environment and the users.

Although there have been incremental context modeling efforts on robots [3], [4] and incremental topic modeling efforts in linguistics [5], to the best of our knowledge, this is the first study that considers the question of “when to increment the number of contexts” as a learning problem. This is challenging since there is no ground truth on the correct number of contexts in any problem domain.

A. Related Work

Scene modeling: Scene modeling is the task of modeling what is in the scene. Such scene models are critical for robots since they allow reasoning about the scene and the objects in it. Many models have been proposed in computer vision and robotics using probabilistic methods such as Markov Random Fields [6], [7], Bayesian Networks [8], [9], Latent Dirichlet Allocation variants [10], [11], predicate logic [12], [13], and Scene Graphs [14].

¹All authors are with the KOVAN Research Lab at the Department of Computer Engineering, Middle East Technical University, Ankara, Turkey. {irmak.dogan, ilker.bozcan, mcelik, skalkan}@metu.edu.tr

[Fig. 1 diagram: a new scene (with object labels and locations detected) updates a contextualized scene model (Latent Dirichlet Allocation) linking objects (visible variables) to contexts (latent variables); each context's links are fed as inputs at different time steps of a Recurrent Neural Network (with Long Short Term Memory units), which answers “add new context?” (yes/no).]

Fig. 1: An overview of how we address incremental context modeling as a learning problem. When a new scene is encountered, the objects are detected (not a contribution of this paper), and the Latent Dirichlet Allocation model is updated. The updated model is fed as input to a Recurrent Neural Network, which predicts whether to increment the number of contexts or not.

There have also been many attempts at ontology-based scene modeling, where objects and various types of relations are modeled [13], [15], [16].

Context modeling: Although context has been widely studied in other fields, it has not received sufficient attention in the robotics community, except for, e.g., [6], who used spatial relations between objects as a contextual prior for object detection in a Markov Random Field; [4], who adapted Latent Dirichlet Allocation on top of object concepts for rule-based incremental context modeling; and [9], who proposed using a variant of Bayesian Networks for context modeling in underwater robots.

Incremental context or topic modeling: There have been some attempts at incremental context or topic modeling, in text modeling [5], computer vision [17] and robotics [3], [4]. However, these attempts are generally rule-based approaches, looking at the errors or the entropy (perplexity) of the system to determine when to increment. Since these rules are hand-crafted by intuition, they may fail to capture the underlying structure of the data for determining when to increment. There are also methods such as Hierarchical Dirichlet Processes [18] or its nested version [19] that assume the availability of all data to estimate the number of topics or assume an infinite number of topics, both of which are unrealistic for robots continually interacting with the environment and getting into new contexts throughout their lifetime.

In summary, we notice that there are no studies that address the learnability of incrementing the number of topics, which is a necessity for life-long learning robots.

B. Contributions

The main contributions of our work are the following: (i) We pose incremental context modeling as a learning problem. This is challenging since it requires a training dataset with the correct number of contexts, and no such dataset is available. We solve this issue by using Latent Dirichlet Allocation [20], which, being a generative model, allows one to generate artificial data with a given number of contexts. For any data generated with a certain number of contexts, we thus know the correct number of contexts and can train a model. (ii) We solve this learning problem using a deep network. To this end, we employ a Recurrent Neural Network and model the problem as a sequence-to-label problem. The input to the network at each time step is the weights linking a context to the objects, and the output is a binary decision on whether to increment the number of topics or not – see also Figure 1.

On an artificial dataset and a real dataset (the SUN-RGBD scene dataset [21]), we show that the network learns to add a new context when it decides it is necessary upon encountering new scenes. Moreover, we compared our method with another incremental Latent Dirichlet Allocation method [4] and incremental Boltzmann Machines [22], and demonstrated that it performs better.

II. LEARNING-BASED INCREMENTAL CONTEXT MODELING

We assume that objects in scenes occur in different contexts, and that we can model such contexts as latent variables defined over the objects, as shown in Figure 1 and in our previous work [4].
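For concreteness, the following is a minimal Python sketch of the loop in Figure 1, under the assumption that the model is refit with one additional context whenever the network's predicted increment probability exceeds 0.5; detect_objects, fit_lda, lda_state and cinet are hypothetical helpers, not APIs from this work.

```python
# A high-level sketch of the loop in Fig. 1; all helpers are hypothetical.
def process_scene(scene_image, scenes, k0, cinet):
    scenes.append(detect_objects(scene_image))   # bag of detected object labels
    lda = fit_lda(scenes, n_contexts=k0)         # update the contextualized scene model
    x = lda_state(lda)                           # sequence of per-context p(c|o) vectors
    if cinet(x) > 0.5:                           # "add new context?" answered yes
        k0 += 1                                  # increment and refit with one more context
        lda = fit_lda(scenes, n_contexts=k0)
    return lda, k0
```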

A. Contextualized Scene Modeling with Latent Dirichlet Allocation (LDA)

LDA [20] is a generative model widely used for topic modeling in document classification. Since it is more intuitive, we introduce LDA from a document modeling perspective: assuming a document d ∈ D is a set of words w_1, ..., w_N drawn from a fixed vocabulary (w_i ∈ W for a vocabulary of size |W|), LDA posits a finite mixture over a fixed set of topics z_1, ..., z_k (z_t ∈ Z with |Z| = k being the topic count). A document can then be described by the probability of its relating to these topics, p(z_t|d_i). Conversely, a topic is modeled by the likelihood for a document of this topic to contain each word in the vocabulary, p(w_j|z_t). LDA infers these document and topic probability distributions from a set of documents D.

For contextualized scene modeling, we replace a document d_i with a scene s_i, a word w_i with an object o_i, and a topic z_t with a context c_t, as suggested by our previous work [4].
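As an illustration, the minimal sketch below fits such a contextualized scene model with scikit-learn's LDA implementation; the scene counts are random placeholders standing in for real object detections, and the priors follow the values used later in Section II-B.

```python
# A sketch of contextualized scene modeling with LDA: scenes play the
# role of documents and objects the role of words.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

n_objects = 1000    # |W|: size of the object vocabulary
n_contexts = 5      # |Z| = k: assumed number of contexts (topics)

# Each scene is a bag-of-objects count vector of length n_objects;
# random counts here are placeholders for real detections.
scenes = np.random.randint(0, 3, size=(200, n_objects))

lda = LatentDirichletAllocation(n_components=n_contexts,
                                doc_topic_prior=0.9,    # alpha
                                topic_word_prior=0.01)  # beta
theta = lda.fit_transform(scenes)             # p(c_t | s_i) for each scene
phi = lda.components_                         # unnormalized topic-object weights
phi = phi / phi.sum(axis=1, keepdims=True)    # rows now approximate p(o_j | c_t)
```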

B. Dataset Collection

There is no dataset with the correct number of contexts, and it is not sensible to take the number of high-level categories in an existing dataset (such as those used in scene classification) as the number of contexts, since there may be more contexts in the data than high-level categories.

[Fig. 2 panels: (top) contexts 1–8 vs. objects in the artificially generated dataset; (bottom) office, library, bedroom, bathroom, living room, kitchen, classroom and dining room vs. objects in the SUN RGB-D dataset.]
Fig. 2: Context-object co-occurrence frequencies in the artificially generated dataset (top) and the real dataset, i.e., the SUN-RGBD dataset [21] (bottom). We see that the distribution of the artificial dataset and that of the real dataset are similar, suggesting that a model trained on the artificial dataset may generalize to the real dataset.

To address this issue, we used Latent Dirichlet Allocation (LDA) [20], which, being a generative model, allows one to sample artificial data with various numbers of contexts (topics). Since we are interested in the distribution between the contexts and the objects, and in whether we can determine when to increment the number of topics for an online model, the data being artificial does not pose a problem, as shown by our experiments. The only assumption we make here is that the contexts follow a Dirichlet distribution, which is a reasonable approximation since it is a family of distributions that can approximate the various types of distributions that many high-level categories and natural phenomena (e.g., objects, words, scenes, activities) may follow, e.g., power laws [23].


[Fig. 3 diagram: the unfolded RNN receives one context's object probabilities per time step, x_0 = {p(c_1|o_i)}_{i=1..O}, x_1 = {p(c_2|o_i)}_{i=1..O}, ..., x_{M−1} = {p(c_M|o_i)}_{i=1..O}, and outputs y ∈ {0, 1}.]

Fig. 3: An overall view of the unfolded RNN architecture used in the paper. Various types and numbers of hidden units, and numbers of layers, were experimented with.

Looking at the distribution of objects in contexts generated by the model and at the objects in scenes in a real dataset, we see that the distributions align closely, as shown in Figure 2.

In LDA, a Dirichlet prior α, describing the corpus (environment), is assumed. A parameter θ governing the distribution over the contexts is sampled from Dir(α). The number of objects N in a scene is randomly sampled from a suitable distribution, e.g., Poisson. Then, for each scene S to be generated, a context c_i is sampled from Discrete(θ) for each index i, and an object o_i is sampled using p(o_i|c_i, β). We selected α as 0.9 and β as 0.01, since these values yielded distributions matching the real data.
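A minimal sketch of this generative sampling with NumPy is given below; α = 0.9 and β = 0.01 follow the text, while the vocabulary size, context count and mean scene size are illustrative choices.

```python
# A sketch of the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)
n_objects, k = 1000, 5                             # vocabulary size, context count
obj_dists = rng.dirichlet([0.01] * n_objects, k)   # p(o | c) for each context c

def generate_scene(alpha=0.9, mean_objects=100):
    theta = rng.dirichlet([alpha] * k)             # scene's distribution over contexts
    n = rng.poisson(mean_objects)                  # number of objects, N ~ Poisson
    contexts = rng.choice(k, size=n, p=theta)      # c_i ~ Discrete(theta)
    return [rng.choice(n_objects, p=obj_dists[c])  # o_i ~ p(o | c_i, beta)
            for c in contexts]

D_k = [generate_scene() for _ in range(100)]       # dataset of scenes for k contexts
```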

Let us use D_k = {S_1^k, ..., S_{L_k}^k} to denote the dataset of scenes generated with k contexts. We can now train LDA models with k0 contexts s.t. k0 ≤ k. The training dataset of tuples (x, y) used for training the deep network can be constructed as follows:

• x: The input vector to the network, describing the current state of the LDA model. Since the number of contexts (hidden topics) is not fixed beforehand, x has variable length. Therefore, we take x to be a sequence of sub-vectors x_i = p_ci = {p(c_i|o_j)}_{j=1..O}, i.e., a sequence of conditional probabilities of each context given an object (a sketch of this construction follows the list). If an object is not used at all during the generation of any scene, the probability for that object is set to zero. We also considered using p_oi = {p(o_i|c_j)}_{j=1..C}, i.e., a vector of probabilities of each object given a context.

• y: The expected output of the network; a binary variable (0 or 1) describing whether to increment (y = 1) the number of contexts or not (y = 0).
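The sketch below constructs one such (x, y) pair from a fitted LDA model; obtaining p(c|o) from p(o|c) via Bayes' rule with a uniform prior over contexts is an assumption of this sketch, not a detail stated in the paper.

```python
# A sketch of building one (x, y) training pair from an LDA model with
# k0 contexts, when the data was generated with k_true contexts.
import numpy as np

def make_pair(phi, k0, k_true):
    """phi: (k0, n_objects) matrix of p(o_j | c_i), rows summing to 1."""
    joint = phi / k0                          # p(o, c) under a uniform p(c) (assumption)
    p_o = joint.sum(axis=0, keepdims=True)    # marginal p(o_j)
    p_c_given_o = np.divide(joint, p_o,       # Bayes: p(c_i | o_j)
                            out=np.zeros_like(joint), where=p_o > 0)
    x = [p_c_given_o[i] for i in range(k0)]   # sequence of k0 sub-vectors p_ci
    y = 1 if k0 < k_true else 0               # increment iff contexts are missing
    return x, y
```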

In total, we generated 14400 scenes (documents) over 1000 objects (vocabulary) with 10 different context counts (from 1 to 10). Each scene had 100 objects. From various combinations of these instances, we gathered 27000 (x, y) pairs for training and 3400 for testing. The limitation on gathering more data was the long computation time: for each instance, an LDA model needed to be trained.

C. Context Incrementing Network (CINet)

Since our input vectors x have varying lengths, we used Recurrent Neural Networks (RNNs) to cast the problem as a learning problem where, given x describing the state of the LDA model, y (whether to increment the number of contexts) is predicted. An overall view of a CINet architecture is provided in Figure 3. We evaluated different types (LSTM [24], GRU [25]) and numbers of hidden units and layers, and report only the most competitive ones in the next section.
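A minimal PyTorch sketch of such a sequence-to-label network is shown below; the 3 layers and 50 hidden units follow the paper, while the rest (e.g., a sigmoid readout of the last hidden state) is an illustrative choice.

```python
# A sketch of a CINet-like model: an LSTM reads one context sub-vector
# per time step; the final hidden state yields the increment decision.
import torch
import torch.nn as nn

class CINet(nn.Module):
    def __init__(self, n_objects=1000, hidden=50, layers=3):
        super().__init__()
        self.rnn = nn.LSTM(input_size=n_objects, hidden_size=hidden,
                           num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, M contexts, n_objects)
        _, (h, _) = self.rnn(x)        # h: (layers, batch, hidden)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)

model = CINet()
x = torch.rand(4, 7, 1000)             # a batch of LDA states with M = 7 contexts
p_increment = model(x)                 # predicted probability of adding a context
```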

For estimating the error, we formulated a binary cross-entropy loss J which CINet is expected to minimize:

J(W) = − (1/n) Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ],   (1)


TABLE I: Training and test accuracies for the different models of CINet. Accuracy is the percentage of correct increment decisions calculated over the artificial data.

                         Input: p_ci           Input: p_oi           Input: p_oi ⊕ p_ci
                         Training   Test       Training   Test       Training   Test
Vanilla RNN (1 layer)    98.0%      97.1%      72.6%      66.2%      99.3%      95.5%
Vanilla RNN (2 layers)   99.2%      97.7%      97.0%      69.5%      97.5%      94.8%
Vanilla RNN (3 layers)   99.7%      97.9%      99.5%      71.4%      94.8%      93.9%
GRU (1 layer)            99.4%      94.7%      97.2%      71.0%      99.5%      93.7%
GRU (2 layers)           99.3%      97.3%      99.9%      71.7%      99.4%      96.0%
GRU (3 layers)           99.4%      97.3%      99.8%      71.7%      94.3%      94.2%
LSTM (1 layer)           99.1%      92.9%      73.4%      67.5%      99.7%      89.6%
LSTM (2 layers)          99.4%      97.9%      99.7%      70.6%      99.7%      96.7%
LSTM (3 layers)          99.9%      98.0%      99.4%      70.9%      99.4%      94.2%

where W is the set of parameters of the network; ŷ_i is the prediction of the network for the i-th sample, y_i is the corresponding label; and n is the number of samples in the dataset or the batch. As a precaution against over-fitting, we also added L2 regularization on the weights.

D. Training

For training the network, as is common for deep networks, we used the Adam optimizer [26] with β1 = 0.9 and β2 = 0.999 (the default values), a batch size of m = 100, and early stopping (i.e., training was stopped when accuracy on the test set started to decrease).
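A sketch of this training setup follows; the optimizer betas, batch size and early stopping follow the text, whereas the learning rate, the L2 weight (via Adam's weight_decay) and the data loaders are illustrative assumptions.

```python
# A sketch of the training loop: BCE loss (Eq. 1), L2 regularization via
# weight_decay, Adam with the stated betas, and simple early stopping.
# `model` is the CINet sketch above; train_loader, test_loader and
# evaluate() are placeholder assumptions, not artifacts of the paper.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-4)
criterion = torch.nn.BCELoss()

best_acc, patience = 0.0, 0
for epoch in range(100):
    for xb, yb in train_loader:          # batches of size m = 100
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # binary cross-entropy of Eq. (1)
        loss.backward()
        optimizer.step()
    acc = evaluate(model, test_loader)   # test accuracy after each epoch
    if acc > best_acc:
        best_acc, patience = acc, 0
    else:
        patience += 1
        if patience >= 3:                # stop once test accuracy degrades
            break
```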

III. EXPERIMENTS AND RESULTS

In this section, we evaluate (i) the training and testing performance of CINet on the artificial dataset, (ii) how well CINet generalizes to scenarios where the number of contexts is larger than what the network was trained for, and (iii) how well CINet performs on a real scene dataset containing scenes of different categories, which roughly correspond to different contexts.

A. CINet Training and Testing Performance

The training and testing accuracies of CINet are listed in Table I. We experimented with different hidden memory units and numbers of layers. We also evaluated different types of inputs for the network: (i) the probabilities of contexts given objects, i.e., p_ci; (ii) the probabilities of objects given contexts, i.e., p_oi; and (iii) the concatenation of p_ci and p_oi – these terms were introduced in Section II-B. The number of hidden units in each layer was empirically selected as 50 (results not provided here for the sake of space).

We see that LSTM units with 3 layers on p_ci perform best. The better performance with p_ci suggests that how well a context can be predicted given an object (i.e., p(c|o)) provides crucial information for whether to add a new context. From the table, we also see that the differences between the training and testing accuracies are rather small in the cases where p_ci or the concatenation of p_ci and p_oi is used, suggesting that the network does not over-fit in these cases. The larger difference in the case of p_oi implies that p_oi is more complex and more difficult for the network to learn from.

B. Applying CINet to Incremental Context Modeling

In this part, we evaluate CINet (with a 3-layer LSTM) on the artificial test data (constructed as described in Section II-B) and on a real dataset, the SUN-RGBD dataset [21].

1) Artificially Generated Dataset: Figure 4 plots the probability that the network increments the number of contexts for various current numbers of contexts (k0) and ground-truth context counts (k = 5, 7, 10, 15, 20 – selected arbitrarily). We see that the network predicts to increment k0 when k0 < k with very high probability (on average 0.98 for k = 5, 7 and 10; and 0.84 for k = 15 and 20). Moreover, as expected, when k0 is close or equal to k, the network decreases its prediction probability. From this, we conclude that the network can determine well when to increase the number of contexts and when to stop.

In addition, Figures 4(d-e) show the predictions for k = 15 and k = 20. Note that our network was trained on artificially created data with k only up to 10. The results in Figures 4(d-e) suggest that our network generalizes well to numbers of contexts larger than what it was trained for, although it settles for a slightly lower number of contexts.

This nice generalization behavior can be explained by the fact that the data for k < 10 and k > 10 follow similar distributions, and that the network, given the probabilities between contexts and objects, is not affected by the number of contexts thanks to its weight sharing over time steps.

We also analyzed how the system behaves when a new context is added, as shown in Figure 5. Here, comparing with the results of Celikkanat et al. [4] and Dogan et al. [22] (the iRBM and diBM models), we see that our system converges to the correct number of contexts and yields the same level of entropy, where entropy is defined as follows (as in [4]):

H = ρ H(o|c) + (1 − ρ) H(c|s),   (2)

where the random variables o, c and s denote objects, contexts and scenes, respectively; H(·|·) denotes conditional entropy; and ρ is a constant (selected as 0.9) controlling the relative importance of the two terms. The first term measures the system's confidence in observing certain objects given a context, and the second promotes confidence in the context given a scene.
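A small sketch of computing this entropy from the model's conditional probability tables follows; weighting each conditioning value (context or scene) uniformly is an assumption of this sketch.

```python
# A sketch of Eq. (2): H = rho * H(o|c) + (1 - rho) * H(c|s), with each
# conditioning value weighted uniformly (assumption of this sketch).
import numpy as np

def cond_entropy(p):          # p[i, j] = p(X = j | Y = i), rows sum to 1
    q = np.clip(p, 1e-12, 1.0)
    return np.mean(-(q * np.log2(q)).sum(axis=1))   # mean row entropy

def system_entropy(p_o_given_c, p_c_given_s, rho=0.9):
    return rho * cond_entropy(p_o_given_c) + (1 - rho) * cond_entropy(p_c_given_s)
```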



Fig. 4: Probability of incrementing contexts for various states of an LDA model on the artificial data. Ground truth is respectively (a) 5, (b) 7, (c) 10, (d) 15 and (e) 20. Note that the network was trained for LDA models with up to 10 contexts.

[Fig. 5 plots: (top) # of contexts and (bottom) entropy vs. # of scenes encountered, for Celikkanat et al., Yu et al., diBM (total # of contexts), iRBM and CINet.]

Fig. 5: How the entropy of the system changes on the artificial dataset with respect to the change in the number of contexts. The graph is for the subset of the data with 5 contexts (arbitrarily selected from the dataset). Note that the model of [4] converges to the wrong number of contexts. iRBM and diBM are from [22].

2) Real Dataset: In this experiment, we evaluate the model trained on the artificial dataset on a real dataset. We selected 878 scenes with 8 main and 25 sub-categories¹ (the number of sub-categories gives us a baseline for the number of contexts) and 1000 objects from the SUN-RGBD dataset [21] – see Figure 6 for some samples. This dataset is generally used for scene segmentation and classification. We chose it since it presents a prominent challenge for robots continually interacting with different environments:

¹A main category is, e.g., office, and one of its sub-categories is, e.g., home-office.

Fig. 6: A few samples from the SUN-RGBD dataset [21].

The robot needs to learn the different types (contexts) of environments but does not know beforehand what and how many they are. The dataset provides objects labeled with bounding boxes. Although the objects being labeled simplifies the task for us, object detection can nowadays be performed with high accuracy using deep learning, and we leave it outside the focus of the current paper.

Figure 7 plots how confident the network is in adding a new context for various numbers of contexts. We see that, when the number is close to 25 (the number of sub-categories in the dataset), the probability decreases significantly, signaling the robot not to add a new context.

When we look at how the entropy of the system changes as new scenes are encountered, plotted in Figure 8, we see that our system gets closer to the number of sub-categories than the rule-based methods [4], [22]. Since the real dataset can be noisy due to labeling, and the scenes might contain more contexts than labeled, it is hard to expect exactly 25 contexts on the real dataset.

These results are very important since CINet was not trained on the real dataset.



Fig. 7: Probability of incrementing contexts for various states of the LDA model on the real data.

[Fig. 8 plots: (top) # of contexts and (bottom) entropy vs. # of scenes encountered, for Celikkanat et al., Yu et al., diBM (total # of contexts), iRBM and CINet.]

Fig. 8: How the entropy of the system changes on the real dataset with respect to the change in the number of contexts. There are 25 sub-categories (serving as a baseline for the number of contexts). Note that the rule-based methods [4], [22] diverged significantly. iRBM and diBM are from [22].

These results suggest that, although the network was trained on an artificially generated dataset, it generalizes well to problems with similar distributions, since the distribution of the artificial data matches that of the real data and since the network learned well to capture the relation between contexts and objects and when a new context is needed.

IV. CONCLUSION

We have proposed a learning-based approach to incremental model building for robots continually interacting with new types of environments (contexts). To the best of our knowledge, this is the first work formulating this as a learning problem. We used Latent Dirichlet Allocation to generate data with known numbers of contexts and trained Recurrent Neural Networks to estimate when to add a new context. We evaluated our system on artificial and real datasets, showing that the network performs well on test data (98% accuracy) and generalizes well to the real dataset.

Our work can be extended by, e.g., defining context on more detailed scene models (e.g., with spatial, temporal or categorical relations), adapting the model to different types of contexts (e.g., social, temporal, etc.), and formulating the construction of a latent hierarchy as a learning problem as well.

ACKNOWLEDGMENT

This work was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) through the project “Context in Robots” (project no. 215E133). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

REFERENCES

[1] W. Yeh and L. W. Barsalou, “The situated nature of concepts,” The American Journal of Psychology, pp. 349–384, 2006.

[2] L. W. Barsalou, “Simulation, situated conceptualization, and prediction,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1521, pp. 1281–1289, 2009.

[3] M. G. Ortiz and J.-C. Baillie, “Incremental training of restricted Boltzmann machines using information driven saccades,” in IEEE ICDL-EpiRob, 2014.

[4] H. Celikkanat, G. Orhan, N. Pugeault, F. Guerin, E. Sahin, and S. Kalkan, “Learning context on a humanoid robot using incremental latent Dirichlet allocation,” IEEE TCDS, vol. 8, no. 1, pp. 42–59, 2016.

[5] K. Canini, L. Shi, and T. Griffiths, “Online inference of topics with latent Dirichlet allocation,” in Artificial Intelligence and Statistics, 2009, pp. 65–72.

[6] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena, “Contextually guided semantic labeling and search for three-dimensional point clouds,” IJRR, vol. 32, no. 1, pp. 19–34, 2013.

[7] H. Celikkanat, G. Orhan, and S. Kalkan, “A probabilistic concept web on a humanoid robot,” IEEE TAMD, vol. 7, no. 2, pp. 92–106, 2015.

[8] Y. Sheikh and M. Shah, “Bayesian modeling of dynamic scenes for object detection,” IEEE PAMI, vol. 27, no. 11, pp. 1778–1792, 2005.

[9] X. Li, J.-F. Martínez, G. Rubio, and D. Gomez, “Context reasoning in underwater robots using MEBN,” arXiv preprint arXiv:1706.07204, 2017.

[10] X. Wang and E. Grimson, “Spatial latent Dirichlet allocation,” in NIPS, 2008, pp. 1577–1584.

[11] J. Philbin, J. Sivic, and A. Zisserman, “Geometric LDA: A generative model for particular object discovery,” in BMVC, 2008.

[12] F. Mastrogiovanni, A. Scalmato, A. Sgorbissa, and R. Zaccaria, “Robots and intelligent environments: Knowledge representation and distributed context assessment,” Automatika, vol. 52, no. 3, pp. 256–268, 2011.

[13] W. Hwang, J. Park, H. Suh, H. Kim, and I. H. Suh, “Ontology-based framework of robot context modeling and reasoning for object recognition,” in Conf. on Fuzzy Systems and Knowledge Discovery, 2006.

[14] S. Blumenthal and H. Bruyninckx, “Towards a domain specific language for a scene graph based robotic world model,” arXiv preprint arXiv:1408.0200, 2014.

[15] A. Saxena, A. Jain, O. Sener, A. Jami, D. K. Misra, and H. S. Koppula, “RoboBrain: Large-scale knowledge engine for robots,” arXiv preprint arXiv:1412.0691, 2014.

[16] M. Tenorth and M. Beetz, “KnowRob: Knowledge processing for autonomous personal robots,” in IEEE/RSJ IROS, 2009.

[17] J. Yu, J. Gwak, S. Lee, and M. Jeon, “An incremental learning approach for restricted Boltzmann machines,” in ICCAIS. IEEE, 2015.

[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.

[19] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan, “Nested hierarchical Dirichlet processes,” IEEE PAMI, vol. 37, no. 2, pp. 256–270, 2015.

[20] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” PNAS, vol. 101, no. Suppl 1, pp. 5228–5235, 2004.

[21] S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” in IEEE CVPR, 2015, pp. 567–576.

[22] F. I. Dogan, H. Celikkanat, and S. Kalkan, “A deep incremental Boltzmann machine for modeling context in robots,” in ICRA, 2018.

[23] D. Markovic and C. Gros, “Power laws and self-organized criticality in theory and nature,” Physics Reports, vol. 536, no. 2, pp. 41–74, 2014.

[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[25] K. Cho, B. Van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.

[26] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.