University of South Carolina
Scholar Commons
Theses and Dissertations

2018

Uncertainty Estimation of Deep Neural Networks

Chao Chen, University of South Carolina - Columbia

Follow this and additional works at: https://scholarcommons.sc.edu/etd
Part of the Computer Sciences Commons

Recommended Citation
Chen, C. (2018). Uncertainty Estimation of Deep Neural Networks. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/5035

This Open Access Dissertation is brought to you by Scholar Commons. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Scholar Commons. For more information, please contact [email protected].
Table 2.1  Average test RMSE of the proposed algorithm on different layers and ensemble sizes.
Table 2.2  Average test LL of the proposed algorithm on different layers and ensemble sizes.
Table 2.3  Significance tests of the average test RMSE among VI, PBP, and the proposed algorithm with different ensembles and layers.
Table 2.4  Significance tests of the average test RMSE among MC-dropout, Deep Ensembles, and the proposed algorithm with different ensembles and layers.
Table 2.5  Significance tests of the average test LL among VI, PBP, and the proposed algorithm with different ensembles and layers.
Table 2.6  Significance tests of the average test LL among MC-dropout, Deep Ensembles, and the proposed algorithm with different ensembles and layers.
Table 2.7  Number of parameters of selected networks.
Table 3.1  Basic information of the five events.
Table 4.1  Concepts used for the concept inventory.
Table 4.2  Summary of student answers for each question in each class (numbers of students who selected the correct choice are indicated by a circle).
Table 4.3  Conditional probabilities for a question related to two concepts.
Figure 3.1  An RNN with an input layer (blue), a hidden layer (red), and an output layer (green). Units within the dotted regions are optional.
Figure 3.2  Gated mechanism of an LSTM cell as described by [39].
Figure 3.3  Architecture of the network used in this study.
Figure 3.4  Predicted sub-events with the proposed algorithm for the 2013 Boston marathon event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the Boston bombing moment (identified=40, true=16, σ²ε=2.13).
Figure 3.5  Predicted sub-events with the proposed algorithm for the 2013 Superbowl event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the power outage (identified=33, true=18, σ²ε=2.19).
Figure 3.6  Predicted sub-events with the proposed algorithm for the 2013 OSCAR event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the OSCAR starting moment (identified=39, true=25, σ²ε=1.98).
Figure 3.7  Predicted sub-events with the proposed algorithm for the 2013 AllStar event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the AllStar starting moment (identified=33, true=22, σ²ε=1.56).
Figure 3.8  Predicted sub-events with the proposed algorithm for the 2013 Zimmerman trial news event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the verdict moment (identified=23, true=17, σ²ε=1.21).
Figure 3.9  Performance of the algorithm on the marathon event for different ensemble sizes.
Figure 3.10  Performance of the algorithm on the Boston marathon event for different sigma values.
Table 2.4: Significance tests of the average test RMSE among MC-dropout, Deep Ensembles, and the proposed algorithm with different ensembles and layers.

RMSE     |            MC Dropout              |            Deep Ensembles
Dataset  | EnKF-200-1  EnKF-1000-1  EnKF-1000-5 | EnKF-200-1  EnKF-1000-1  EnKF-1000-5
Table 2.6: Significance tests of the average test LL among MC-dropout, Deep Ensembles, and the proposed algorithm with different ensembles and layers.

LL       |            MC Dropout              |            Deep Ensembles
Dataset  | EnKF-200-1  EnKF-1000-1  EnKF-1000-5 | EnKF-200-1  EnKF-1000-1  EnKF-1000-5
where the sample mean and covariance matrix are obtained from the ensemble members propagated to the observable.
When the squared Mahalanobis distance passes the following test, the observation is considered not an outlier but a plausible outcome of the model. Here the degrees of freedom used to obtain χ²₀.₀₅ is q.

m²_d ≤ χ²₀.₀₅ (3.3)
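As a concrete sketch of this test (our own illustration, not the dissertation's code), the squared Mahalanobis distance of an observation from the propagated ensemble can be checked against the chi-square critical value; the function name and the use of SciPy are our choices:

```python
import numpy as np
from scipy.stats import chi2

def passes_outlier_test(observation, ensemble_obs, alpha=0.05):
    """Return (is_plausible, m2): chi-square test on the squared
    Mahalanobis distance of an observation from the ensemble.

    ensemble_obs: (N, q) array of ensemble members propagated through
    the observation operator; observation: length-q vector.
    """
    mean = ensemble_obs.mean(axis=0)
    cov = np.cov(ensemble_obs, rowvar=False)
    diff = observation - mean
    # Squared Mahalanobis distance: diff^T * cov^{-1} * diff
    m2 = float(diff @ np.linalg.solve(cov, diff))
    # Chi-square critical value with q degrees of freedom at level alpha
    threshold = chi2.ppf(1.0 - alpha, df=observation.shape[0])
    return m2 <= threshold, m2
```

An observation near the ensemble mean passes the test, while one far outside the ensemble spread is flagged as an outlier.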
3.4.2 Subevent Detection
An event is confined by space and time. Specifically, it consists of a set of subevents,
depicting different facets of an event [111]. As an event evolves, users usually post
new statuses to capture new states as subevents of the main issue [91]. Within an
event, some unexpected situations or results may occur and surprise users, such as
the bombing during the Boston Marathon and the power outage during the 2013
Superbowl. Subevent detection provides a deeper understanding of the threats to
better manage the situation within a crisis [112].
By formalizing it as an outlier detection task, we built dynamic models to detect
subevents based upon the retrieved Twitter data and the proposed window embedding
representation described in the following sections.
3.4.3 Data
We collected the data from Jan. 2, 2013 to Oct. 7, 2014 with the Twitter streaming
API and selected five national events for the outlier detection task. The five events
include the 2013 Boston Marathon event, the 2013 Superbowl event, the 2013 OSCAR
event, the 2013 NBA AllStar event, and the Zimmerman trial event. Each of these
events consists of a variety of subevents, such as the bombing for the marathon
event, the power outage for the Superbowl event, the nomination moment of the
best picture award, the ceremony for the NBA AllStar MVP, and the verdict of the
jury for the Zimmerman trial event.
For these case studies, we filtered the relevant tweets with event-related keywords
and hashtags, and preprocessed the data to remove URLs and user mentions. The basic
information of each event is provided in Table 3.1.
3.4.4 Window Embedding
In computational linguistics, distributed representations of words have shown some
advantages over raw co-occurrence counts, since they can capture the contextual
information of words. As categorized by Baroni et al. [84], distributed semantic models
can be termed count models or prediction models. On one hand, count models,
including LSA, HAL, and Hellinger PCA, can efficiently use the statistics of the
co-occurrence information but are limited in capturing complex patterns beyond word
similarities. On the other hand, prediction models, such as NNLM and word2vec, can
capture complex patterns of the words but make limited use of the corpus statistics.
To cope with the limits of each approach, Pennington et al. [108] proposed
a weighted least squares objective J, shown as follows:
J = ∑_{i,j=1}^{V} f(X_ij) (w_i^T w_j + b_i + b_j − log X_ij)² (3.4)

where X_ij is the number of times word j appears in the context of word i, w_i and b_i are
the word vector and bias of word i, w_j and b_j are the context word vector and bias
of word j, and f is a pre-defined weighting scheme as follows.
f(x) = (x/x_max)^α,  if x < x_max
f(x) = 1,            otherwise
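The weighting scheme can be written directly in code; the default x_max = 100 and α = 0.75 below are the values suggested in the GloVe paper, not ones stated here:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare co-occurrences and
    caps the influence of very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```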
Vector representations can be used as features and they have been successfully
applied in many natural language processing applications [108]. Through some
experiments, we decided to use the 100-dimensional GloVe vector representations that
were trained on 27 billion tokens of Twitter data. We further used Probabilistic PCA to reduce
the vector dimensionality to d latent components that capture at least 99% of the
variance of the original representation.
Here, we define a sentence embedding as the average of its word vectors. Given a
sentence of n words represented by vectors e_1^d, e_2^d, ..., e_n^d, the sentence
embedding s^d is defined as (1/n) ∑_{i=1}^{n} e_i^d. Furthermore, we define a window embedding
w_t^d as the average of its sentence vectors. A given time window is composed of
m sentence vectors s_1^d, s_2^d, ..., s_m^d, and the window embedding w_t^d is defined as (1/m) ∑_{i=1}^{m} s_i^d.
As we use a moving-window approach, we group every l consecutive windows w_1^d, w_2^d, ..., w_l^d
into a training input X, and use the following window w_{l+1}^d as the training label Y. Based upon the
grouped data, we can train our proposed multivariate EnKF-LSTM model. Through
some experiments, we chose 5 as the number of latent components d, 5 minutes as
the time window t, and 32 as the grouping size l.
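The averaging and grouping scheme just described can be sketched as follows (function names are hypothetical; the real pipeline operates on GloVe vectors reduced by PPCA):

```python
import numpy as np

def sentence_embedding(word_vectors):
    # s^d = (1/n) * sum of the word vectors e^d_1..e^d_n
    return np.mean(word_vectors, axis=0)

def window_embedding(sentence_vectors):
    # w^d_t = (1/m) * sum of the sentence vectors in the time window
    return np.mean(sentence_vectors, axis=0)

def make_training_pairs(windows, l=32):
    """Group every l consecutive window embeddings into an input X and
    use the following window as the label Y (moving-window scheme)."""
    X, Y = [], []
    for i in range(len(windows) - l):
        X.append(windows[i:i + l])
        Y.append(windows[i + l])
    return np.array(X), np.array(Y)
```

With d = 5 and l = 32, each training input X has shape (l, d) and each label Y has shape (d,).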
3.4.5 Implementation
The implemented two-layer network is shown in Figure 3.3. The input layer consists of
5 nodes, each hidden layer consists of 32 LSTM cells, and the output layer consists of
5 output nodes. In this implementation, we include the forget gate proposed by [37].
The implementation is based upon TensorFlow, and it can be easily extended to
deeper architectures or variants of LSTMs.
Figure 3.3: Architecture of the network used in this study.
3.5 Results
The outlier detection results are provided in Figures 3.4 to 3.8. In these results,
we observe 40, 33, 39, 33, and 23 identified sub-events, respectively. Of those
sub-events, 16, 18, 25, 22, and 17 are verified as true sub-events. We set the initial
sigma value of the noise covariance matrix in the EnKF update step to 1.0, and then
further optimized it to 2.13, 2.19, 1.98, 1.56, and 1.21 with Maximum Likelihood
Estimation.
To further evaluate our model, we compared it with Gaussian Process (GP) and
MC dropout [34]. The comparison result is provided in Table 3.2. The GP model
yielded the best recall value in three of the five events, indicating that it captured most
true sub-events. On the other hand, it also misidentified many normal time windows
as sub-events, thus yielding many false positives and low precision. Compared to
the GP model, our proposed enkf_lstm algorithm reliably captured many true
sub-events and yielded the best precision across the five events, though it
missed some true sub-events and had worse recall in four of the
five events. In terms of the F1 score, our proposed algorithm has the best performance
Figure 3.4: Predicted sub-events with the proposed algorithm for the 2013 Boston marathon event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the Boston bombing moment (identified=40, true=16, σ²ε=2.13).
in two of the five cases. The MC dropout model, however, has the worst performance
for this specific outlier detection task. Since MC dropout is mathematically equivalent
to variational inference, which underestimates the uncertainty, the model mislabels
many normal time windows as outliers.
For the proposed algorithm, the ensemble size N and the initial sigma value of the
noise covariance matrix σ²ε are two important hyper-parameters. To further evaluate
their effects on performance, we provide a sensitivity analysis of the
hyper-parameters for the 2013 Boston marathon event. Based upon Figure 3.9, the
algorithm yielded the best result with an ensemble size of 200. In general, the evaluation
metrics increase up to 200 and then slightly decrease, implying the proposed
algorithm can capture the dynamics of the posterior weights with a medium sample size.
Figure 3.5: Predicted sub-events with the proposed algorithm for the 2013 Superbowl event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the power outage (identified=33, true=18, σ²ε=2.19).
Table 3.2: Evaluation metrics on different algorithms.
Event   Model   Precision   Recall   F1 Score
        GP      37.5        64.3     47.4
Figure 3.6: Predicted sub-events with the proposed algorithm for the 2013 OSCAR event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the OSCAR starting moment (identified=39, true=25, σ²ε=1.98).
According to Figure 3.10, the evaluation metrics peaked at 0.05 and then slightly
decreased with larger values.
3.5.1 Discussion
In this work, we proposed a novel algorithm to estimate the posterior weights of
LSTMs, and we further developed a framework for outlier detection. We applied
the proposed algorithm and framework to five real-world outlier
detection tasks using Twitter streams. As shown in the above section, the proposed
algorithm can capture the uncertainty of the non-linear multivariate distribution and
outperforms the Gaussian process and MC dropout in terms of precision. We also
evaluated the sensitivity of the algorithm to different ensemble sizes and variance values
Figure 3.7: Predicted sub-events with the proposed algorithm for the 2013 AllStar event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the AllStar starting moment (identified=33, true=22, σ²ε=1.56).
of the prior distribution of the weights with the Boston marathon data. However,
the performance of the model is further affected by several other hyper-parameters,
including the batch size, the number of layers, the number of nodes in each layer,
and the choice of window size; the sensitivity to these hyper-parameters will be
evaluated in future research.
Figure 3.8: Predicted sub-events with the proposed algorithm for the 2013 Zimmerman trial news event. The distance lines above the blue threshold line indicate identified outliers, and the red color indicates the verdict moment (identified=23, true=17, σ²ε=1.21).
Figure 3.9: Performance of the algorithm on the marathon event for different ensemble sizes.
Figure 3.10: Performance of the algorithm on the Boston marathon event for different sigma values.
Chapter 4
Student Knowledge Estimation with Bayesian
Network
4.1 Student Knowledge Estimation
Intelligent Tutoring Systems (ITS) have been studied since the 1980s [113], and research
in this area is becoming more important because of advances in computation
and growing class sizes. As Butz et al. [15] explained, ITSs are computer-based
systems that can provide functionalities such as estimating students'
understanding and giving individualized instruction in a way similar to traditional one-to-one
tutoring. This is of particular importance when the enrollment of post-secondary
students keeps growing¹ while instructors have limited time for providing feedback.
Knowledge of the students is hidden from direct measurement; however, an ITS
can help us estimate the latent knowledge of students from quizzes. To date,
a number of ITSs have been used in different domains, such as BITS [15], Andes
[145], ViSMod [156], and KERMIT [48].
As Butz et al. [15] claimed, there exist four common components of traditional
ITSs: the knowledge domain, the student model, teaching strategies, and the user
interface. Specifically, student models are pre-defined user models that are used to
track the states and needs of each student. The teaching strategies are the
instruction styles of the system, such as the way of providing recommendations, while the
user interface of the system provides the capacity to interact with users.
1http://nces.ed.gov/fastfacts/display.asp?id=98
Of all components, student models are considered the key component of any adaptive
tutoring system [93] due to their capability of storing information (e.g., examples
and learning styles) about the students. Based upon student models, we can further
estimate the knowledge of each student and provide an individualized and optimal
learning path. As explained by Chrysafiadi and Virvou [24], student
models can be used to estimate the knowledge level and cognitive states of students,
identify learning styles and preferences, select proper learning methods (e.g.,
providing tutorials), and recognize weaknesses and strengths in order to recommend
individualized feedback.
Because model construction and initialization are difficult, many researchers
have recognized the complexity of the issue and put forward possible approaches [93].
Typical approaches to constructing student models and initializing their parameters
use expert knowledge, data-driven estimation, or a synthesis of the two.
In this study, we address three issues involved in student knowledge modeling:
estimating the knowledge level of students, identifying distractors or misconceptions,
and evaluating question design. We address the first issue by constructing
Bayesian student models to estimate the posterior of a student's knowledge of a
specific concept given assessment answers. We address the second issue
by proposing a novel optimization procedure, and the third by
designing a novel index to evaluate whether a question is well designed.
Concept inventories (CIs) are commonly used to make inferences about students’
knowledge. Concept inventories are used for different branches of science including,
but not limited to, Electromagnetics [102], Discrete Mathematics [3], Statistics [138],
Electric Circuits [104], Signals and Systems [147], Thermodynamics [92], Strength of
According to Jensen [61], a Bayesian Network (BN) provides a graphic and
mathematical depiction of the joint probability over a group of random variables. Before
we introduce the power of a BN, we need to illustrate the concept of the Joint
Probability Distribution (JPD). As discussed in [15], a JPD is defined as a function p over
a set of random discrete variables V = {v_1, v_2, ..., v_n} if it meets the following
properties:
• 0 ≤ p(v) ≤ 1, for each v ∈ dom(V)
• ∑_{v∈dom(V)} p(v) = 1
where dom(V) is the Cartesian product of the domains of the variables in V.
One of the main advantages of using BNs is to get a compact form of the joint
distribution on V. Obviously, obtaining the joint distribution directly
on V takes O(2^n) operations for n variables. However, BNs can speed up the
acquisition of the JPD based upon the notion of conditional independence [15]. Specifically,
random variables v_1 and v_3 are conditionally independent given v_2 if

p(v_1 | v_2, v_3) = p(v_1 | v_2) (4.5)
The compact form of the joint distribution is then acquired with the chain rule.
According to Jensen [61], the JPD p(U) is specified as follows based upon the chain
rule:

p(U) = ∏_{i=1}^{n} p(v_i | pa(v_i)) (4.6)

where pa(v_i) denotes the parents of v_i in the BN.
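To illustrate Eq. 4.6, a toy three-node chain shows how the joint distribution factors into small conditional tables; all probabilities here are made up for illustration, not taken from the dissertation's model:

```python
from itertools import product

# Chain-rule factorization p(U) = prod_i p(v_i | pa(v_i)) on a toy
# three-node chain v1 -> v2 -> v3, with hypothetical CPTs.
p_v1 = {True: 0.7, False: 0.3}
p_v2 = {True: {True: 0.95, False: 0.05},    # p(v2 | v1): outer key is v1
        False: {True: 0.05, False: 0.95}}
p_v3 = {True: {True: 0.9, False: 0.1},      # p(v3 | v2): outer key is v2
        False: {True: 0.2, False: 0.8}}

def joint(v1, v2, v3):
    # Three small table lookups instead of one table with 2^3 entries.
    return p_v1[v1] * p_v2[v1][v2] * p_v3[v2][v3]

# Sanity check: the 2^3 joint entries sum to 1, as a JPD must.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))
```

The compactness is in the parameter count: the chain needs 2 + 4 + 4 table entries instead of the 2³ entries of the full joint, and the gap grows exponentially with n.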
As Jensen [61] claimed, a BN consists of four basic elements: a set of variables
and the edges between them, a set of mutually exclusive states for each variable, a
Directed Acyclic Graph (DAG) constructed from the variables and edges, and a
conditional probability table attached to each variable. If node A has no parents,
the conditional probability table attached to it reduces to a prior. In this study,
all the basic elements of the network are pre-specified. In particular, we use domain
expertise to construct the network structure, and estimate the conditional tables as
well as the priors with an integration of expert knowledge and a stepwise procedure
introduced later in this section.
The Expectation Maximization (EM) algorithm is used to estimate the parameters
of a BN with missing data [61]. In general, the EM algorithm is an iterative
approach to maximum likelihood estimation of the parameters with an expectation
step and a maximization step. In the expectation step, we compute the expectation
of the data using the current parameters θ_old, and we then obtain an updated set of
parameters θ_new by maximizing the expectation with regard to each old parameter.
The algorithm then repeats these two steps until it converges
or reaches a pre-specified number of iterations.
A formal description of the EM algorithm can be found in Jensen [61] and is
also provided in Algorithm 3. In the algorithm, θ_ijk represents the conditional
probability of variable v_i being in its kth state given the jth configuration of its
parents pa(v_i), and sp(v_i) represents the state space of variable v_i.
Algorithm 3 EM algorithm for Bayesian Network [61]
  Choose initial parameters θ_old.
  Define stopping criterion ε > 0.
  Set t := 0.
  while | log₂ p(D|θ_t) − log₂ p(D|θ_{t−1}) | > ε do
    E step: Compute the expected counts of the likelihood function:
      E_{θ_t}[N(v_i, pa(v_i)) | D] = ∑_{d∈D} p(v_i, pa(v_i) | d, θ_t)
    M step: Estimate θ_ijk from the expected counts using maximum likelihood.
  end while
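A simplified EM sketch for a model of this kind, with a single hidden binary concept and observed multiple-choice answers (our own toy implementation, not the JSMILE-based procedure used in this study):

```python
import numpy as np

def em_single_concept(answers, n_choices=5, iters=50, seed=0):
    """EM for a toy model: hidden binary concept C -> observed answers.
    Estimates the prior p(C = known) and the conditionals p(Q_j = x | C).

    answers: (n_students, n_questions) array of answer indices.
    """
    rng = np.random.default_rng(seed)
    n, q = answers.shape
    prior = 0.5                                   # p(C = known)
    # cond[c, j, x] = p(Q_j = x | C = c), random initialization
    cond = rng.dirichlet(np.ones(n_choices), size=(2, q))
    for _ in range(iters):
        # E step: posterior responsibility p(C = c | answers of student i)
        log_lik = np.zeros((n, 2))
        for c in (0, 1):
            for j in range(q):
                log_lik[:, c] += np.log(cond[c, j, answers[:, j]])
        log_lik[:, 1] += np.log(prior)
        log_lik[:, 0] += np.log(1.0 - prior)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        post = np.exp(log_lik)
        post /= post.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the expected counts
        prior = float(np.clip(post[:, 1].mean(), 1e-6, 1 - 1e-6))
        for c in (0, 1):
            for j in range(q):
                counts = np.array([(post[:, c] * (answers[:, j] == x)).sum()
                                   for x in range(n_choices)]) + 1e-9
                cond[c, j] = counts / counts.sum()
    return prior, cond
```

The small additive constant in the M step plays the role of smoothing so that no conditional collapses to exactly zero.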
Inspired by the Force Concept Inventory [47, 46, 54], the Statics Concept Inventory
was developed to assess conceptual understanding and identify students'
misconceptions regarding basic concepts in statics. Four clusters of concepts were introduced by
Steif [133, 135] and are used in this paper; these clusters of concepts are summarized
in Table 4.1. The CI contains 27 questions, and to answer them correctly, a student
Table 4.1: Concepts used for the concept inventory

Concept  Definition
C1  Forces are always in equal and opposite pairs acting between bodies, which are usually in contact.
C2  Distinctions must be drawn between a force, a moment due to a force about a point, and a couple. Two combinations of forces and couples are statically equivalent to one another if they have the same net force and moment.
C3  The possibilities of forces between bodies that are connected to, or contact, one another can be reduced by virtue of the bodies themselves, the geometry of the connection, and/or assumptions on friction.
C4  Equilibrium conditions always pertain to the external forces acting directly on a chosen body, and a body is in equilibrium if the summation of forces on it is zero and the summation of moments on it is zero.
should know one or more of the mentioned concepts. The model used in this paper
is illustrated in Figure 4.1. The arrows show the logical connections between concepts
and questions. For example, we believe that to learn concept C2, students first need
to learn concept C3. Also, to answer some of the questions, students need to know more
than one concept.
Construction of the student model is based upon the integration of a BN and an
understanding of the curricular structure. In the knowledge tracking domain, the
knowledge of a student, i.e., the understanding of particular concepts (e.g., the
1st cluster of concepts and the 2nd cluster of concepts), is treated as a hidden variable
with a known state and an unknown state. The hidden variables are investigated
through observed variables, which are the answers to the concept inventory questions
(e.g., q_q1 in Figure 4.2).
An instance of our student model is provided in Figure 4.2. According to the
figure, each node represents either a concept or a question, and each edge of the graph
represents a connection between concepts and questions or a connection between
concepts. With the post-test data, we insert the answers to questions as evidence
Figure 4.1: Relationships of concepts and questions
into the model and estimate the posterior of each concept for each student. The blue
color of a concept node indicates the probability of knowing the concept given the
answers to the 27 questions.
Knowledge tracking based upon BNs is limited by three factors: the
selection of nodes, the structure of the network, and the initialization of the priors
and conditionals [87]. In this work, we use the concept inventory to select the
concepts for the model. The structure of the network is specified from
expert knowledge and the concept inventory. Because the BN factorizes the complex joint
distribution into local conditional distributions, we initialize the priors and conditionals over
all nodes and edges using our expertise. However, this way of initializing the model
neglects the data characteristics and fails to differentiate the students' performance
Figure 4.2: Bayesian Network Model of Student Knowledge for the Statics Concept Inventory
in each semester. Thus we adopt a novel optimization procedure, a data-driven
approach, to obtain optimized parameters for the prior of the first cluster
of concepts and for all the conditionals between the concepts and questions.
Specifically, the initial prior probability of knowing the first cluster of concepts,
p(c_c1 = known), is defined as 0.7, since it is a pre-requisite concept that is learned in the
Linear Algebra class before the Statics class. In this work, we also explore the rest of
the concepts, such as the second cluster of concepts and the third cluster of concepts,
which are developed based upon the first cluster of concepts. The transition
probability between pairs of concepts is determined by the capability of the instructors
to communicate the knowledge to the students. In this study, we assume the
instructor is capable of communicating those concepts effectively to the students, so
we assign a high probability (e.g., 0.95) to p(c_c2 = known | c_c1 = known). On the
other hand, if a student has no knowledge of c_c1, then the
probability of not knowing c_c2, p(c_c2 = unknown | c_c1 = unknown), is set to a high
value (e.g., 0.95) as well. In this case, we assume that the lack of understanding of a
basic concept impedes the understanding of a more advanced concept. Similarly,
we define all other conditional probabilities between the concepts and questions.
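The arithmetic behind the induced prior of a child concept is simple marginalization over its parent. A minimal sketch, assuming a single parent and the 0.95/0.05 conditionals described above (the exact values for concepts with other parents or conditionals differ):

```python
def child_prior(parent_prior, p_k_given_k=0.95, p_k_given_u=0.05):
    # p(child = known), marginalizing out the parent concept:
    #   p(known | parent known) * p(parent known)
    # + p(known | parent unknown) * p(parent unknown)
    return p_k_given_k * parent_prior + p_k_given_u * (1.0 - parent_prior)
```

For example, a parent prior of 0.7 yields a child prior of 0.95 * 0.7 + 0.05 * 0.3 = 0.68.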
According to our model, it should be kept in mind that the probability of knowing
a more advanced concept (e.g., the 3rd cluster of concepts) is lower in the absence of any
testing. Specifically, when the prior of knowing the 1st cluster of concepts is set to 0.7,
the priors of knowing the remaining clusters of concepts are 0.6485, 0.68, and 0.62015,
respectively. The objectives of this research can be discussed in three aspects. First, we
can estimate the level of understanding of each individual student for the instructed
concepts with a Bayesian approach by analyzing the evidence from the concept
inventory tests. Second, we can reveal the misconceptions behind the questions in the tests
when the concept is unknown to the students, using our proposed parameter learning
algorithm. The initial conditional probability between the concepts and questions is
equally distributed over the answers (e.g., p(q_q1 = X | unknown) = 0.2, where X is
one of A, B, C, D, or E) when the concept is unknown to the student. However,
this setting ignores the misconceptions of the students at the moment of assessment,
because some answers are more distracting than others. Based upon the proposed
parameter learning procedure, we can learn the optimized conditional probabilities
from the data and recognize the misconceptions of the students to develop remedial
interventions. Third, we can investigate the design of the questions from the
perspective of instructors, to explore whether a question is well designed or
badly designed (or too difficult). To address this goal, we use the likelihood
plots and identify the badly-designed questions as those with a p-value larger than a
pre-specified threshold such as 0.05. More detailed discussion of the second goal
and the third goal is provided in the following sections.
4.3.2 Misconception Identification
Conditional probabilities can be used to identify the distractors. When a student
does not know the concept, there is a higher chance of selecting a choice other than the
correct one.
To get more informative priors and conditionals, we learn the parameters from
experimental data with the EM algorithm provided in the JSMILE APIs. However,
the current JSMILE APIs only infer entire conditionals, including the probabilities
of answering the quizzes when the concept is known. This may result in logical
inconsistencies for extreme datasets; a typical case is when all students answer the
questions incorrectly. To cope with this issue and obtain proper parameters, we
establish a novel optimization procedure, shown in Algorithm 4, to estimate the prior
of the first cluster of concepts and the conditionals between concepts and questions.
Algorithm 4 Optimization procedure for parameter learning
  Set an initial model evidence log₂ p(D|θ⁰).
  Define stopping criterion ε > 0.
  Set t := 0.
  while | log₂ p(D|θ_C^t) − log₂ p(D|θ_C^{t−1}) | > ε do
    Estimate θ_Q with EM.
    Update p(Q = X | C = known) for question Q and concept C.
    Estimate θ_C with EM.
    Set θ_C^{t+1} := θ_C, t := t + 1.
  end while
  Update p(Q = X | C = known) for question Q and concept C.
4.3.3 Ill-designed Question Identification
To assess the validity of the optimized Bayesian network model M, we propose
a predictive validation metric evaluated on a hold-out dataset. Using the training
dataset D_{≠i,j}, consisting of a student's answers to 26 out of the 27 questions in the
concept inventory, we can infer the posterior probabilities of knowing the
four concepts for each student S_j. The student's answer x_ij to the validation question
Q_i is used to evaluate the following likelihood function, obtained from the posterior
predictive distribution corresponding to the ith question:

p(Q_i = x_ij | S_j, D_{≠i,j}, M) (4.7)
This likelihood function can be used to define a measure of how well the model
fits all N students’ answers for the ith question. This goodness-of-fit measure can be
expressed using the following expected log-likelihood under the assumption that all
students are equally likely.
E[log p(Q_i = x_i | S, D_{≠i}, M)] = Σ_{j=1}^{N} p(S_j) log p(Q_i = x_{ij} | S_j, D_{≠i,j}, M)
                                   = (1/N) Σ_{j=1}^{N} log p(Q_i = x_{ij} | S_j, D_{≠i,j}, M)    (4.8)
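Under the equal-weight assumption p(S_j) = 1/N, Eq. 4.8 reduces to the average log predictive likelihood over students. A minimal sketch (the four probability values are made up for illustration):

```python
import math

def goodness_of_fit(likelihoods):
    """Eq. 4.8: average log predictive likelihood of the held-out answers,
    assuming all N students are equally likely (p(S_j) = 1/N).
    likelihoods[j] stands for p(Q_i = x_ij | S_j, D_{!=i,j}, M)."""
    return sum(math.log(p) for p in likelihoods) / len(likelihoods)

# e.g., four students whose held-out answers the model predicts
# with these probabilities
score = goodness_of_fit([0.7, 0.4, 0.9, 0.6])
```

Higher (less negative) values mean the model assigns more probability mass to the answers the students actually gave.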
By itself, this goodness-of-fit measure is non-informative. A reference measure is
required to assess whether the model can explain the validation data. Note that by
generating a random answer r_{ij} from a discrete uniform distribution over the choices
of Q_i for each student S_j, we can compute a measure analogous to Eq. 4.8 for
how well the model fits random answers.
E[log p(Q_i = r_i | S, D_{≠i}, M)] = (1/N) Σ_{j=1}^{N} log p(Q_i = r_{ij} | S_j, D_{≠i,j}, M)    (4.9)
It is expected that a model with predictive capability will fit the actual answers
better than the synthetically generated random answers. Thus, the measure in
Eq. 4.8 is expected to be larger than the one in Eq. 4.9. However, the expected log-likelihood
in Eq. 4.9 is a random variable with a distribution induced by the discrete
uniform distribution over the answers to Q_i. As a result, the proposed predictive validation
metric takes the form of a standard hypothesis test, where the null hypothesis
is that the model has no predictive capability and the alternative is that the model
has predictive capability. The null hypothesis is rejected at the significance level α.
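The test can be carried out by Monte Carlo: repeatedly draw random answer sets, score each with Eq. 4.9, and take the p-value as the fraction of random scores at least as large as the actual Eq. 4.8 score. A sketch under stated assumptions; `model_prob_of` is a hypothetical hook for the model's posterior predictive, and the toy probabilities at the end are made up for illustration.

```python
import math
import random

def reference_score(model_prob_of, choices, n_students, rng):
    """One synthetic draw of Eq. 4.9: replace each student's answer with a
    uniform random choice and recompute the average log-likelihood.
    model_prob_of(j, answer) stands for p(Q_i = answer | S_j, D_{!=i,j}, M)."""
    total = sum(math.log(model_prob_of(j, rng.choice(choices)))
                for j in range(n_students))
    return total / n_students

def p_value(actual_score, model_prob_of, choices, n_students,
            n_sets=50_000, seed=0):
    """Monte Carlo p-value: fraction of random-answer scores at least as
    large as the actual Eq. 4.8 score. Small values reject the null
    hypothesis of no predictive capability."""
    rng = random.Random(seed)
    hits = sum(reference_score(model_prob_of, choices, n_students, rng)
               >= actual_score for _ in range(n_sets))
    return hits / n_sets

# Toy check: a model that puts probability 0.8 on choice 'a' for every
# student, in a class where every student actually answered 'a'.
probs = {"a": 0.8, "b": 0.05, "c": 0.05, "d": 0.05, "e": 0.05}
actual = math.log(0.8)
p = p_value(actual, lambda j, ans: probs[ans], list("abcde"),
            n_students=5, n_sets=2000)
```

Random answer sets only tie the actual score when every draw happens to be correct, so the resulting p-value is tiny, as expected for a model with predictive capability.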
friction coefficient, which is 0.5. Both questions have good distractors, because the
instructor can understand what the misconception is. In the first question, the misconception
concerns the behavior of the connection: a pinned connection cannot resist
Table 4.3: Conditional probabilities for a question related to two concepts
C1:      known                  unknown
C4:      known      unknown     known      unknown
a        0.05       0.6638      0.9982     0.0478
b        0.05       1.9e-6      0.0004     0.3965
c        0.05       1.9e-6      0.0004     0.1585
d        0.05       2.1e-6      0.0004     0.3176
e        0.8        0.3361      0.0004     0.0794
Question 19: The force F is known and the other loads on the plate are unknowns to be determined. Consider drawing a free body diagram of the plate, including the unknown reaction of the pin.
Figure 4.3: Question 19 related to C3
Table 4.4: Conditional probabilities for question 19
rotation. In the second question, the distractor represents one of the most common
misconceptions in friction questions [137]: the tangential force is taken to be equal to
the normal force times the friction coefficient, even though that magnitude is greater
than the force needed to maintain equilibrium.
4.4.3 Ill-designed Questions
The p-value for each question is calculated using 50,000 sets of random samples
corresponding to students' answers to the ith question. Table 4.6 summarizes the
Question 22: Three blocks are stacked on top of one another on a table. Then, the horizontal forces shown are applied. The friction coefficient is 0.5 between all contacting surfaces. (This is both the static and kinetic coefficient of friction.) Which of the following represents the horizontal component of the force acting on the lower face of the top (20 N) block?
Figure 4.4: Question 22 related to C3
Table 4.5: Conditional probabilities for question 22
p-values for all the questions in the concept inventory. Note that for 7 questions the
model does not exhibit predictive capability at the 0.05 significance level.
Here are some examples of the first type of design problem. For question 11,
shown in Figure 4.5, there is a 68% chance of selecting the correct choice when
Question 11: The platform is kept in the position shown by a roller, link and hydraulic cylinder. The coefficient of friction between the roller and the dump is 0.6. What is the direction of the roller on the platform at the point of interest?
Figure 4.5: Question 11 related to concept C3
Table 4.7: Conditional probabilities for question 11
a student does not know the related concept. Another example is question
20, shown in Figure 4.6. As shown in Table 4.8, the chance of selecting the correct
answer when a student does not know the concept is 61%. Another good example
of this kind of question is question 10.
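This first type of design problem can be screened for directly from the optimized conditional tables: a question is suspect when p(correct | concept unknown) is high. A minimal sketch using the probabilities quoted in the text for questions 11, 20, and 9; the value for question 7 is a made-up placeholder, and the 0.5 cutoff is an illustrative assumption rather than a threshold from this work.

```python
def flag_guessable(p_correct_given_unknown, threshold=0.5):
    """Return questions whose optimized conditional
    p(correct answer | concept unknown) exceeds the cutoff, i.e.,
    questions that can likely be answered correctly by guessing."""
    return sorted(q for q, p in p_correct_given_unknown.items()
                  if p > threshold)

# Questions 11 (0.68) and 20 (0.61) from the text exceed the cutoff;
# question 9 is 0.31, and 0.05 for question 7 is a placeholder.
flags = flag_guessable({11: 0.68, 20: 0.61, 9: 0.31, 7: 0.05})
```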
Here is an example of the second type of design problem. Question 9 is a good
example of a hard question: only 4 students among 34 answered it
correctly, and its p-value is around 0.84. This shows that the model is not able to make
Question 20: Part 1 and part 2 are welded to each other. Forces F and G are known, and the other loads are unknowns to be determined. Consider drawing a free body diagram of part 1, including the unknown reaction of part 2 on part 1. Unknowns A, B, and C could be positive, negative or zero. Which of the following is the correct free body diagram for the forces and/or moments exerted by part 2 on part 1 at the welded section?
Figure 4.6: Question 20 related to concept C3
predictions for students on this question. There are several possible reasons for this.
First, looking at the optimized conditional probabilities, it is one of the questions
where the chance of selecting the correct answer when a student does not know
the concept is higher than that of the other choices (p = 0.31). Furthermore, there are only three
questions related to concept C2, which may not be enough to predict students'
performance on this question.
A well-designed problem can be illustrated by question 7. Question 7, related
to concept C2, is one of the questions with a low p-value. Figure 4.10 shows the
corresponding histogram and p-value for the class of 2008, with the classes of 2007 and
2006 used for optimization. The p-value indicates strong evidence against
the null hypothesis that the model cannot predict the performance of students. In
other words, assuming the model's inference about student knowledge is acceptable,
the question and its related concept are designed in a way that allows inference
about student knowledge of concept C2.
Question 9: The two forces with magnitudes 7 N and 10 N act in the directions shown through points A and B, which are denoted with dots. These forces keep the member in equilibrium while it is subjected to other forces acting in the plane (shown at the right). Assuming the other forces stay the same, what load(s) could replace the 7 N and 10 N forces and maintain equilibrium?
Figure 4.7: Question 9 related to concept C2
4.5 Educational Component
Student knowledge estimation plays a core role in the process of student learning.
Proper and prompt estimation provides important information to both instructors
and students for remedial intervention. In this study, we explore using a
concept inventory for student knowledge estimation. Meanwhile, our system can help
each student identify the individualized misconception behind each question. By proposing
a question design metric, we can further measure the effectiveness of each question.
We implemented the backend system with Java and the JSMILE APIs. With the system,
we can construct student models with prior knowledge of the structure of a given
concept inventory. By feeding student exam data into the system, we obtain personalized
student models and provide individualized suggestions for intervention. The
implemented system is publicly available at https://bitbucket.org/uqlab/scilaf.
Figure 4.8: Histogram of blind-guessing scores in addition to the actual performance of students.
4.6 Conclusions
In this work, we first developed a data-driven approach to assess latent student
knowledge by constructing Bayesian student models. We then put forward a novel
algorithm to identify the misconceptions behind each quiz question, so that we could
provide individualized and remedial interventions for each student. Finally, we
proposed a novel index to evaluate the student models as well as to measure the design
of each question. Based on the results, we identified common distractors in
the concept inventory data. As the model is capable of discovering individualized
misconceptions, it can provide timely intervention after each test. Furthermore, the
measurement index showed that 20 of the 27 questions exhibit predictive capability
under the student model, while several improperly designed questions were discussed.
Question 7: A 200 N-mm couple acting counter-clockwise keeps the member in equilibrium while it is subjected to other forces acting in the plane (shown schematically at the left). The four dots denote equally spaced points along the member. Assuming the other forces stay the same, what load(s) could replace the 200 N-mm couple and maintain equilibrium?
Figure 4.9: Question 7 related to concept C2
Figure 4.10: Histogram of blind-guessing scores in addition to the actual performance of students.
Bibliography
[1] Hamed Abdelhaq, Christian Sengstock, and Michael Gertz. “EvenTweet: Online Localized Event Detection from Twitter”. In: Proc. VLDB Endow. 6.12 (Aug. 2013), pp. 1326–1329. issn: 2150-8097. doi: 10.14778/2536274.2536307. url: http://dx.doi.org/10.14778/2536274.2536307.
[2] James Allan. “Introduction to topic detection and tracking”. In: Topic detection and tracking. Kluwer Academic Publishers, 2002, pp. 1–16. isbn: 0-7923-7664-1.
[3] Vicki L Almstrum et al. “Concept inventories in computer science for the topic discrete mathematics”. In: ACM SIGCSE Bulletin. Vol. 38(4). ACM. 2006, pp. 132–145.
[4] Dianne L Anderson, Kathleen M Fisher, and Gregory J Norman. “Development and evaluation of the conceptual inventory of natural selection”. In: Journal of research in science teaching 39.10 (2002), pp. 952–978.
[5] Rebecca A Atadero et al. “Project-Based Learning in Statics: Curriculum, Student Outcomes, and On-going Questions”. In: age 24 (2014), p. 1.
[6] Farzindar Atefeh and Wael Khreich. “A Survey of Techniques for Event Detection in Twitter”. In: Comput. Intell. 31.1 (2015), pp. 132–164. issn: 0824-7935. doi: 10.1111/coin.12017.
[7] Janelle Margaret Bailey. Development of a Concept Inventory to Assess Students' Understanding and Reasoning Difficulties About the Properties and Formation of Stars. 2006. url: http://hdl.handle.net/10150/193643.
[8] Nilesh Bansal and Nick Koudas. “BlogScope: Spatio-temporal Analysis of the Blogosphere”. In: Proceedings of the 16th International Conference on World Wide Web. WWW '07. New York, NY, USA: ACM, 2007, pp. 1269–1270. isbn: 978-1-59593-654-7. doi: 10.1145/1242572.1242802. url: http://doi.acm.org/10.1145/1242572.1242802.
[9] H. Becker, M. Naaman, and L. Gravano. “Beyond trending topics: Real-world event identification on Twitter”. In: Fifth International AAAI Conference on Weblogs and Social Media. 2011.
[10] David M. Blei. “Probabilistic Topic Models”. In: Commun. ACM 55.4 (Apr. 2012), pp. 77–84. issn: 0001-0782. doi: 10.1145/2133806.2133826. url: http://doi.acm.org/10.1145/2133806.2133826.
[11] Charles Blundell et al. “Weight Uncertainty in Neural Networks”. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. ICML'15. Lille, France: JMLR.org, 2015, pp. 1613–1622. url: http://dl.acm.org/citation.cfm?id=3045118.3045290.
[12] Stacey Lowery Bretz and Kimberly J Linenberger. “Development of the enzyme–substrate interactions concept inventory”. In: Biochemistry and Molecular Biology Education 40.4 (2012), pp. 229–233.
[13] Thang D. Bui et al. “Deep Gaussian Processes for Regression using Approximate Expectation Propagation”. In: ICML. Vol. 48. JMLR Workshop and Conference Proceedings. JMLR.org, 2016, pp. 1472–1481.
[14] Gregoire Burel et al. “On Semantics and Deep Learning for Event Detection in Crisis Situations”. In: ESWC 2017. Portoroz, Slovenia, 2017.
[15] C. J. Butz, S. Hua, and R. B. Maguire. “A Web-based Bayesian Intelligent Tutoring System for Computer Programming”. In: Web Intelli. and Agent Sys. 4.1 (Jan. 2006), pp. 77–97. issn: 1570-1263. url: http://dl.acm.org/citation.cfm?id=1239784.1239789.
[16] SM Case and DB Swanson. Item writing manual: Constructing written test questions for the basic and clinical sciences. 2002.
[17] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. “Information Credibility on Twitter”. In: Proceedings of the 20th International Conference on World Wide Web. WWW '11. Hyderabad, India: ACM, 2011, pp. 675–684. isbn: 978-1-4503-0632-4. doi: 10.1145/1963405.1963500. url: http://doi.acm.org/10.1145/1963405.1963500.
[18] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. “Emerging Topic Detection on Twitter Based on Temporal and Social Terms Evaluation”. In: Proceedings of the Tenth International Workshop on Multimedia Data Mining. MDMKDD '10. Washington, D.C.: ACM, 2010, 4:1–4:10. isbn: 978-1-4503-0220-3. doi: 10.1145/1814245.1814249. url: http://doi.acm.org/10.1145/1814245.1814249.
[19] Junghoon Chae et al. “Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition”. In: 2012 IEEE Conference on Visual Analytics Science and Technology, VAST 2012, Seattle, WA, USA, October 14-19, 2012. 2012, pp. 143–152. doi: 10.1109/VAST.2012.6400557. url: https://doi.org/10.1109/VAST.2012.6400557.
[20] Deepayan Chakrabarti and Kunal Punera. “Event Summarization Using Tweets”. In: (2011).
[21] A. L. Chandrasegaran, David F. Treagust, and Mauro Mocerino. “The development of a two-tier multiple-choice diagnostic instrument for evaluating secondary school students' ability to describe and explain chemical reactions using multiple levels of representation”. In: Chem. Educ. Res. Pract. 8 (3 2007), pp. 293–307.
[22] Ling Chen and Abhishek Roy. “Event Detection from Flickr Data Through Wavelet-based Spatial Analysis”. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. CIKM '09. Hong Kong, China: ACM, 2009, pp. 523–532. isbn: 978-1-60558-512-3. doi: 10.1145/1645953.1646021. url: http://doi.acm.org/10.1145/1645953.1646021.
[23] Flavio Chierichetti et al. “Event Detection via Communication Pattern Analysis”. In: Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014. 2014. url: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8088.
[24] Konstantina Chrysafiadi and Maria Virvou. “Review: Student Modeling Approaches: A Literature Review for the Last Decade”. In: Expert Syst. Appl. 40.11 (Sept. 2013), pp. 4715–4729. issn: 0957-4174. doi: 10.1016/j.eswa.2013.02.007. url: http://dx.doi.org/10.1016/j.eswa.2013.02.007.
[25] Clyde H Coombs, John Edgar Milholland, and Frank Burton Womer. “The assessment of partial knowledge”. In: Educational and Psychological Measurement 16.1 (1956), pp. 13–37.
[26] J. E. Corter et al. “Bugs and biases: Diagnosing misconceptions in the understanding of diagrams”. In: Proceedings of the 31st Annual Conference of the Cognitive Science Society. Ed. by N. A. Taatgen and H. van Rijn. Austin, TX: Cognitive Science Society, 2009, pp. 756–761.
[27] Thomas Deane et al. “Development of the biological experimental design concept inventory (BEDCI)”. In: CBE-Life Sciences Education 13.3 (2014), pp. 540–551.
[28] John S. Denker and Yann LeCun. “Transforming Neural-Net Output Levels to Probability Distributions”. In: NIPS. Morgan Kaufmann, 1990, pp. 853–859.
[29] Marilu Dick-Perez et al. “A quantum chemistry concept inventory for physical chemistry classes”. In: Journal of Chemical Education 93.4 (2016), pp. 605–612.
[30] Jerome Epstein. “Development and validation of the Calculus Concept Inventory”. In: Proceedings of the ninth international conference on mathematics education in a global community. Vol. 9. Charlotte, NC. 2007, pp. 165–170.
[31] Geir Evensen. “The Ensemble Kalman Filter: theoretical formulation and practical implementation”. In: Ocean Dynamics 53 (2003), pp. 343–367. doi: 10.1007/s10236-003-0036-9.
[32] Tristan Fletcher. “The Kalman Filter Explained”. 2010.
[33] Yarin Gal and Zoubin Ghahramani. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks”. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 2016, pp. 1019–1027.
[34] Yarin Gal and Zoubin Ghahramani. “Dropout As a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML'16. New York, NY, USA: JMLR.org, 2016, pp. 1050–1059. url: http://dl.acm.org/citation.cfm?id=3045390.3045502.
[35] Kathy Garvin-Doxas and Michael W Klymkowsky. “Understanding randomness and its impact on student learning: lessons learned from building the Biology Concept Inventory (BCI)”. In: CBE-Life Sciences Education 7.2 (2008), pp. 227–233.
[36] Arthur Gelb. Applied Optimal Estimation. The MIT Press, 1974. isbn: 0262570483, 9780262570480.
[37] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. “Learning to Forget: Continual Prediction with LSTM”. In: Neural Computation 12 (1999), pp. 2451–2471.
[38] George Goguadze et al. “Evaluating a Bayesian Student Model of Decimal Misconceptions”. In: EDM. 2011.
[39] Alex Graves. “Generating Sequences With Recurrent Neural Networks”. In: CoRR (2014). url: https://arxiv.org/pdf/1308.0850.pdf.
[40] Alex Graves. “Practical Variational Inference for Neural Networks”. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. NIPS'11. Granada, Spain: Curran Associates Inc., 2011, pp. 2348–2356. isbn: 978-1-61839-599-3.
[41] Gary L Gray et al. “The dynamics concept inventory assessment test: A progress report and some results”. In: American Society for Engineering Education Annual Conference & Exposition. 2005.
[42] James H Hanson and Julia M Williams. “Using writing assignments to improve self-assessment and communication skills in an engineering statics course”. In: Journal of engineering education 97.4 (2008), p. 515.
[43] Habibah Norehan Haron et al. “Self-regulated learning strategies between the performing and non-performing students in statics”. In: Interactive Collaborative Learning (ICL), 2014 International Conference on. IEEE. 2014, pp. 802–805.
[44] Eric L. Haseltine and James B. Rawlings. “Critical Evaluation of Extended Kalman Filtering and Moving-Horizon Estimation”. In: Industrial & Engineering Chemistry Research 44.8 (June 2004), pp. 2451–2460. doi: 10.1021/ie034308l. url: http://dx.doi.org/10.1021/ie034308l.
[45] José Miguel Hernández-Lobato and Ryan P. Adams. “Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks”. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. ICML'15. Lille, France: JMLR.org, 2015, pp. 1861–1869. url: http://dl.acm.org/citation.cfm?id=3045118.3045316.
[46] David Hestenes and Ibrahim Halloun. “Interpreting the force concept inventory”. In: The Physics Teacher 33.8 (1995), pp. 502–506.
[47] David Hestenes, Malcolm Wells, Gregg Swackhamer, et al. “Force concept inventory”. In: The physics teacher 30.3 (1992), pp. 141–158.
[48] Randall W. Hill, Jr., and W. Lewis Johnson. “Designing an Intelligent Tutoring System for Database Modelling”. In: Proceedings of the world conference of artificial intelligence in education. 1993, pp. 273–281.
[49] Geoffrey Hinton et al. “Improving neural networks by preventing co-adaptation of feature detectors”. In: CoRR abs/1207.0580 (2012). url: http://arxiv.org/abs/1207.0580.
[50] Geoffrey E. Hinton and Drew van Camp. “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights”. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT '93. ACM, 1993, pp. 5–13.
[51] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In: Neural Comput. 9.8 (Nov. 1997), pp. 1735–1780. issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735. url: http://dx.doi.org/10.1162/neco.1997.9.8.1735.
[52] Anneke Hommels, Akira Murakami, and Nishimura Shin-Ichi. “Comparison of the Ensemble Kalman filter with the Unscented Kalman filter: application to the construction of a road embankment”. In: Proceedings of the 19th European Young Geotechnical Engineer Conference. Gyor, Hungary, 2009.
[53] Yuan Huang et al. “Understanding US regional linguistic variation with Twitter data analysis”. In: Computers, Environment and Urban Systems (2015). issn: 0198-9715. url: http://www.sciencedirect.com/science/article/pii/S0198971515300399.
[54] Douglas Huffman and Patricia Heller. “What Does the Force Concept Inventory Actually Measure?” In: Physics Teacher 33.3 (1995), pp. 138–43.
[55] Jonathan Hurlock and Max L. Wilson. “Searching Twitter: Separating the Tweet from the Chaff”. In: ICWSM. Ed. by Lada A. Adamic, Ricardo A. Baeza-Yates, and Scott Counts. The AAAI Press, 2011.
[56] Tommi S. Jaakkola and Michael I. Jordan. “Bayesian parameter estimation via variational methods”. In: Statistics and Computing 10 (Jan. 2000), pp. 25–37.
[57] Anthony Jacobi et al. “A concept inventory for heat transfer”. In: Frontiers in Education, 2003. FIE 2003 33rd Annual. Vol. 1. IEEE. 2003, T3D–12.
[58] Bernard J. Jansen et al. “Twitter Power: Tweets As Electronic Word of Mouth”. In: J. Am. Soc. Inf. Sci. Technol. 60.11 (Nov. 2009), pp. 2169–2188. issn: 1532-2882. doi: 10.1002/asi.v60:11. url: http://dx.doi.org/10.1002/asi.v60:11.
[59] Akshay Java et al. “Why We Twitter: Understanding Microblogging Usage and Communities”. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis. WebKDD/SNA-KDD '07. San Jose, California: ACM, 2007, pp. 56–65. isbn: 978-1-59593-848-0. doi: 10.1145/1348549.1348556. url: http://doi.acm.org/10.1145/1348549.1348556.
[60] Andrew H. Jazwinski. Stochastic Processes and Filtering Theory. Mathematics in Science and Engineering 64. New York, NY: Academic Press, 1970. isbn: 0123815509.
[61] Finn V. Jensen and Thomas D. Nielsen. Bayesian Networks and Decision Graphs. 2nd. Springer Publishing Company, Incorporated, 2007.
[62] Simon J. Julier and Jeffrey K. Uhlmann. “Unscented Filtering and Nonlinear Estimation”. In: Proceedings of the IEEE. 2004, pp. 401–422.
[63] Pamela Kalas et al. “Development of a meiosis concept inventory”. In: CBE-Life Sciences Education 12.4 (2013), pp. 655–664.
[64] Andrej Karpathy and Li Fei-Fei. “Deep Visual-Semantic Alignments for Generating Image Descriptions”. In: IEEE Trans. Pattern Anal. Mach. Intell. 39.4 (Apr. 2017), pp. 664–676. issn: 0162-8828. doi: 10.1109/TPAMI.2016.2598339. url: https://doi.org/10.1109/TPAMI.2016.2598339.
[65] Matthias Katzfuss, Jonathan R. Stroud, and Christopher K. Wikle. “Understanding the Ensemble Kalman Filter”. In: The American Statistician 70.4 (2016), pp. 350–357. doi: 10.1080/00031305.2016.1141709.
[66] Yoon Kim et al. “Character-aware Neural Language Models”. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI'16. Phoenix, Arizona: AAAI Press, 2016, pp. 2741–2749. url: http://dl.acm.org/citation.cfm?id=3016100.3016285.
[67] Duane Knudson et al. “Development and evaluation of a biomechanics concept inventory”. In: Sports Biomechanics 2.2 (2003), pp. 267–277.
[68] Fantian Kong et al. “Mobile Robot Localization Based on Extended Kalman Filter”. In: 2006 6th World Congress on Intelligent Control and Automation. Vol. 2. 2006, pp. 9242–9246. doi: 10.1109/WCICA.2006.1713789.
[69] Stephen Krause et al. “Development, testing, and application of a chemistry concept inventory”. In: Frontiers in Education, 2004. FIE 2004. 34th Annual. IEEE. 2004, T1G–1.
[70] John Krumm and Eric Horvitz. “Eyewitness: Identifying Local Events via Space-time Signals in Twitter Feeds”. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems. GIS '15. Bellevue, Washington: ACM, 2015, 20:1–20:10. isbn: 978-1-4503-3967-4. doi: 10.1145/2820783.2820801. url: http://doi.acm.org/10.1145/2820783.2820801.
[71] Haewoon Kwak et al. “What is Twitter, a Social Network or a News Media?” In: Proceedings of the 19th International Conference on World Wide Web. WWW '10. Raleigh, North Carolina, USA: ACM, 2010, pp. 591–600. isbn: 978-1-60558-799-8. doi: 10.1145/1772690.1772751. url: http://doi.acm.org/10.1145/1772690.1772751.
[72] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles”. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. 2017, pp. 6405–6416. url: http://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles.
[73] Norm G. Lederman et al. “Views of nature of science questionnaire: Toward valid and meaningful assessment of learners' conceptions of nature of science”. In: Journal of Research in Science Teaching 39.6 (2002), pp. 497–521. issn: 1098-2736. doi: 10.1002/tea.10034. url: http://dx.doi.org/10.1002/tea.10034.
[74] Kathy Lee et al. “Adverse Drug Event Detection in Tweets with Semi-Supervised Convolutional Neural Networks”. In: Proceedings of the 26th International Conference on World Wide Web. WWW '17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 705–714. isbn: 978-1-4503-4913-0.
[75] Kyumin Lee, Brian David Eoff, and James Caverlee. “Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter”. In: ICWSM. Ed. by Lada A. Adamic, Ricardo A. Baeza-Yates, and Scott Counts. The AAAI Press, 2011. url: http://dblp.uni-trier.de/db/conf/icwsm/icwsm2011.html#LeeEC11.
[76] Ryong Lee, Shoko Wakamiya, and Kazutoshi Sumiya. “Discovery of Unusual Regional Social Activities Using Geo-tagged Microblogs”. In: World Wide Web 14.4 (July 2011), pp. 321–349. issn: 1386-145X.
[77] Richard B Lewis. “Creative Teaching and Learning in a Statics Class”. In: Engineering Education 81.1 (1991), pp. 15–18.
[78] Rui Li et al. “TEDAS: A Twitter-based Event Detection and Analysis System”. In: Proceedings of the 2012 IEEE 28th International Conference on Data Engineering. ICDE '12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 1273–1276. isbn: 978-0-7695-4747-3. doi: 10.1109/ICDE.2012.125. url: http://dx.doi.org/10.1109/ICDE.2012.125.
[79] Julie C Libarkin and Steven W Anderson. “Development of the geoscience concept inventory”. In: Proceedings of the National STEM Assessment Conference, Washington DC. 2006, pp. 148–158.
[80] Xiao Lin and Gabriel Terejanu. “Fast Approximate Data Assimilation for High-Dimensional Problems”. In: 2017. url: https://arxiv.org/abs/1708.02340.
[81] Thomas A Litzinger et al. “A cognitive study of problem solving in statics”. In: Journal of Engineering Education 99.4 (2010), pp. 337–353.
[82] Ran Liu, Rony Patel, and Kenneth R. Koedinger. “Modeling Common Misconceptions in Learning Process Data”. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. LAK '16. Edinburgh, United Kingdom: ACM, 2016, pp. 369–377. isbn: 978-1-4503-4190-5. doi: 10.1145/2883851.2883967. url: http://doi.acm.org/10.1145/2883851.2883967.
[83] David J. C. MacKay. “A Practical Bayesian Framework for Backpropagation Networks”. In: Neural Comput. 4.3 (May 1992), pp. 448–472. issn: 0899-7667. doi: 10.1162/neco.1992.4.3.448. url: http://dx.doi.org/10.1162/neco.1992.4.3.448.
[84] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. “Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”. In: 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference 1 (2014), pp. 238–247.
[85] A. Marcus et al. “TwitInfo: Aggregating and visualizing microblogs for event exploration”. In: Proceedings of the 2011 annual conference on Human factors in computing systems. ACM. 2011, pp. 227–236.
[86] Adam Marcus et al. “Processing and Visualizing the Data in Tweets”. In: SIGMOD Record 40.4 (Dec. 2011), pp. 21–27.
[87] Dimitris Margaritis. “Learning Bayesian Network Model Structure From Data”. PhD thesis. School of Computer Science, Carnegie-Mellon University, 2003.
[88] Jay Martin, John Mitchell, and Ty Newell. “Development of a concept inventory for fluid mechanics”. In: Frontiers in Education, 2003. FIE 2003 33rd Annual. Vol. 1. IEEE. 2003, T3D–23.
[89] Jay Mathews. “Just whose idea was all this testing”. In: The Washington Post 14 (2006).
[90] Michael Mathioudakis and Nick Koudas. “TwitterMonitor: Trend Detection over the Twitter Stream”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD '10. Indianapolis, Indiana, USA: ACM, 2010, pp. 1155–1158. isbn: 978-1-4503-0032-2. doi: 10.1145/1807167.1807306. url: http://doi.acm.org/10.1145/1807167.1807306.
[91] Polykarpos Meladianos et al. “Degeneracy-Based Real-Time Sub-Event Detection in Twitter Stream”. In: ICWSM. Ed. by Meeyoung Cha, Cecilia Mascolo, and Christian Sandvig. AAAI Press, 2015, pp. 248–257. isbn: 978-1-57735-733-9.
[92] K Clark Midkiff, Thomas A Litzinger, and DL Evans. “Development of engineering thermodynamics concept inventory instruments”. In: Frontiers in Education Conference, 2001. 31st Annual. Vol. 2. IEEE. 2001, F2A–F23.
[93] Eva Millán and José-Luis Pérez de-la Cruz. “A Bayesian Diagnostic Algorithm for Student Modeling and its Evaluation”. In: User Model. User-Adapt. Interact. 12.2-3 (2002), pp. 281–330.
[94] Multiple-Choice Test Preparation Manual.
[95] Mor Naaman, Hila Becker, and Luis Gravano. “Hip and trendy: Characterizing emerging trends on Twitter”. In: JASIST 62.5 (2011), pp. 902–918.
[96] Radford M. Neal. Bayesian Learning for Neural Networks. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 1996. isbn: 0387947248.
[97] Radford M. Neal and Geoffrey E. Hinton. “Learning in Graphical Models”. In: ed. by Michael I. Jordan. Cambridge, MA, USA: MIT Press, 1999. Chap. A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants, pp. 355–368. isbn: 0-262-60032-3. url: http://dl.acm.org/citation.cfm?id=308574.308679.
[98] Jeffrey L Newcomer. “Inconsistencies in students' approaches to solving problems in Engineering Statics”. In: 2010 IEEE Frontiers in Education Conference (FIE). IEEE. 2010, F3G–1.
[99] Jeffrey L Newcomer and Paul S Steif. “Student explanations of answers to concept questions as a window into prior misconceptions”. In: Proceedings. Frontiers in Education. 36th Annual Conference. IEEE. 2006, pp. 6–11.
[100] Jeffrey L Newcomer and Paul S Steif. “Student thinking about static equilibrium: Insights from written explanations to a concept question”. In: Journal of Engineering Education 97.4 (2008), pp. 481–490.
[101] Jeffrey Nichols, Jalal Mahmud, and Clemens Drews. “Summarizing Sporting Events Using Twitter”. In: Proceedings of the 2012 ACM International Conference on Intelligent User Interfaces. IUI '12. Lisbon, Portugal: ACM, 2012, pp. 189–198. isbn: 978-1-4503-1048-2. doi: 10.1145/2166966.2166999. url: http://doi.acm.org/10.1145/2166966.2166999.
[102] Branislav M Notaros. “Concept inventory assessment instruments for electromagnetics education”. In: Antennas and Propagation Society International Symposium, 2002. IEEE. Vol. 1. IEEE. 2002, pp. 684–687.
[103] Brendan O'Connor, Michel Krieger, and David Ahn. “TweetMotif: Exploratory Search and Topic Summarization for Twitter”. In: ICWSM. Ed. by William W. Cohen and Samuel Gosling. The AAAI Press, 2010. url: http://dblp.uni-trier.de/db/conf/icwsm/icwsm2010.html#OConnorKA10.
[104] Tokunbo Ogunfunmi and Mahmudur Rahman. “A concept inventory for an electric circuits course: Rationale and fundamental topics”. In: Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE. 2010, pp. 2804–2807.
[105] Levent Ozbek and Umit Ozlale. “Employing the extended Kalman filter in measuring the output gap”. In: Journal of Economic Dynamics and Control 29.9 (Sept. 2005), pp. 1611–1622.
[106] Leysia Palen et al. “Twitter-based Information Distribution during the 2009 Red River Valley Flood Threat”. In: Bulletin of the American Society for Information Science and Technology (2010).
[107] Bo Pang and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales”. In: Proceedings of ACL. 2005, pp. 115–124.
[108] Jeffrey Pennington, Richard Socher, and Christopher D Manning. “GloVe: Global Vectors for Word Representation”. In: EMNLP. Vol. 14. 2014, pp. 1532–1543.
[109] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. “Streaming First Story Detection with application to Twitter”. In: HLT-NAACL. The Association for Computational Linguistics, 2010, pp. 181–189. url: http://dblp.uni-trier.de/db/conf/naacl/naacl2010.html#PetrovicOL10.
[110] Timothy A Philpot et al. “Using games to teach statics calculation procedures: Application and assessment”. In: Computer Applications in Engineering Education 13.3 (2005), pp. 222–232.
[111] Daniela Pohl, Abdelhamid Bouchachia, and Hermann Hellwagner. “Automatic Sub-event Detection in Emergency Management Using Social Media”. In: Proceedings of the 21st International Conference on World Wide Web. WWW ’12 Companion. Lyon, France: ACM, 2012, pp. 683–686. isbn: 978-1-4503-1230-1.
[112] Daniela Pohl, Abdelhamid Bouchachia, and Hermann Hellwagner. “Social Media for Crisis Management: Clustering Approaches for Sub-Event Detection”. In: Multimedia Tools and Applications (2013).
[113] M. C. Polson and J. J. Richardson, eds. Foundations of Intelligent Tutoring Systems. Hillsdale, NJ, USA: L. Erlbaum Associates Inc., 1988. isbn: 0-805-80053-0.
[114] Ana-Maria Popescu and Marco Pennacchiotti. “Detecting Controversial Events from Twitter”. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. CIKM ’10. Toronto, ON, Canada: ACM, 2010, pp. 1873–1876. isbn: 978-1-4503-0099-5. doi: 10.1145/1871437.1871751. url: http://doi.acm.org/10.1145/1871437.1871751.
[115] Kevin Rawson and Tom Stahovich. “Predicting course performance from homework habits”. In: Proceedings of the 2013 American Society for Engineering Education Annual Conference and Exposition. 2013.
[116] Jim Richardson et al. “Development of a concept inventory for strength of materials”. In: Frontiers in Education, 2003. FIE 2003 33rd Annual. Vol. 1. IEEE. 2003, T3D–29.
[117] Robert Rippey. “Probabilistic testing”. In: Journal of Educational Measurement 5.3 (1968), pp. 211–215.
[118] Isabelle Rivals and Léon Personnaz. “A recursive algorithm based on the extended Kalman filter for the training of feedforward neural models”. In: Neurocomputing 20.1-3 (1998), pp. 279–294.
[119] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. “Long short-term memory recurrent neural network architectures for large scale acoustic modeling”. In: INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014. 2014, pp. 338–342.
[120] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. “Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors”. In: Proceedings of the 19th International Conference on World Wide Web. WWW ’10. Raleigh, North Carolina, USA: ACM, 2010, pp. 851–860. isbn: 978-1-60558-799-8.
[121] Hanan Samet et al. “Reading News with Maps by Exploiting Spatial Synonyms”. In: Commun. ACM 57.10 (Sept. 2014), pp. 64–77. issn: 0001-0782. doi: 10.1145/2629572. url: http://doi.acm.org/10.1145/2629572.
[122] Jagan Sankaranarayanan et al. “TwitterStand: News in Tweets”. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. GIS ’09. Seattle, Washington: ACM, 2009, pp. 42–51. isbn: 978-1-60558-649-6. doi: 10.1145/1653771.1653781. url: http://doi.acm.org/10.1145/1653771.1653781.
[123] Antti Savinainen and Philip Scott. “Using the Force Concept Inventory to monitor student learning and to plan teaching”. In: Physics Education 37.1 (2002), p. 53. url: http://stacks.iop.org/0031-9120/37/i=1/a=307.
[124] Erich Schubert, Michael Weiler, and Hans-Peter Kriegel. “SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds”. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’14. New York, New York, USA: ACM, 2014, pp. 871–880. isbn: 978-1-4503-2956-9. doi: 10.1145/2623330.2623740. url: http://doi.acm.org/10.1145/2623330.2623740.
[125] David A. Shamma, Lyndon Kennedy, and Elizabeth F. Churchill. “Tweet the Debates: Understanding Community Annotation of Uncollected Sources”. In: Proceedings of the First SIGMM Workshop on Social Media. WSM ’09. Beijing, China: ACM, 2009, pp. 3–10. isbn: 978-1-60558-759-2. doi: 10.1145/1631144.1631148. url: http://doi.acm.org/10.1145/1631144.1631148.
[126] Chao Shen et al. “A Participant-based Approach for Event Summarization Using Twitter Streams”. In: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA. 2013, pp. 1152–1162. url: http://aclweb.org/anthology/N/N13/N13-1135.pdf.
[127] Emir H Shuford Jr, Arthur Albert, and H Edward Massengill. “Admissible probability measurement procedures”. In: Psychometrika 31.2 (1966), pp. 125–145.
[128] Sharad Singhal and Lance Wu. “Advances in Neural Information Processing Systems 1”. In: ed. by David S. Touretzky. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989. Chap. Training Multilayer Perceptrons with the Extended Kalman Algorithm, pp. 133–140. isbn: 1-558-60015-9.
[129] Edward Snelson and Zoubin Ghahramani. “Variable Noise and Dimensionality Reduction for Sparse Gaussian processes”. In: UAI ’06, Proceedings of the 22nd Conference in Uncertainty in Artificial Intelligence, Cambridge, MA, USA, July 13-16, 2006. 2006.
[130] Daniel Soudry, Itay Hubara, and Ron Meir. “Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights.” In: NIPS. Ed. by Zoubin Ghahramani et al. 2014, pp. 963–971. url: http://dblp.uni-trier.de/db/conf/nips/nips2014.html#SoudryHM14.
[131] Nitish Srivastava et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 1929–1958.
[132] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. “Training Very Deep Networks”. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’15. Montreal, Canada: MIT Press, 2015, pp. 2377–2385. url: http://dl.acm.org/citation.cfm?id=2969442.2969505.
[133] Paul S Steif. “An articulation of the concepts and skills which underlie engineering statics”. In: Frontiers in Education, 2004. FIE 2004. 34th Annual. IEEE. 2004, F1F–5.
[134] Paul S Steif. “Comparison between performance on a concept inventory and solving of multifaceted problems”. In: Frontiers in Education, 2003. FIE 2003 33rd Annual. Vol. 1. IEEE. 2003, T3D–17.
[135] Paul S Steif. “Initial data from a statics concept inventory”. In: Proceedings of the 2004 American Society of Engineering Education Conference and Exposition, Salt Lake City, UT. 2004.
[136] Paul S Steif and John A Dantzler. “A statics concept inventory: Development and psychometric analysis”. In: Journal of Engineering Education 94.4 (2005), p. 363.
[137] Paul S Steif and Mary A Hansen. “New practices for administering and analyzing the results of concept inventories”. In: Journal of Engineering Education 96.3 (2007), p. 205.
[138] Andrea Stone et al. “The statistics concept inventory: A pilot study”. In: Frontiers in Education, 2003. FIE 2003 33rd Annual. Vol. 1. IEEE. 2003, T3D–1.
[139] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence Learning with Neural Networks”. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. NIPS’14. Montreal, Canada: MIT Press, 2014, pp. 3104–3112. url: http://dl.acm.org/citation.cfm?id=2969033.2969173.
[140] Pinchas Tamir. “Some issues related to the use of justifications to multiple-choice answers”. In: Journal of Biological Education 23.4 (1989), pp. 285–292. doi: 10.1080/00219266.1989.9655083.
[141] Gabriel A. Terejanu. “Unscented Kalman filter tutorial”. In: Workshop on Large-Scale Quantification of Uncertainty. Sandia National Laboratories. 2009, pp. 1–6.
[142] Michael E. Tipping and Chris M. Bishop. “Probabilistic Principal Component Analysis”. In: Journal of the Royal Statistical Society, Series B 61 (1999), pp. 611–622.
[143] Andranik Tumasjan et al. “Election Forecasts With Twitter”. In: Social Science Computer Review 29.4 (Nov. 2011), pp. 402–418. issn: 1552-8286. doi: 10.1177/0894439310386557.
[144] George Valkanas and Dimitrios Gunopulos. “How the Live Web Feels About Events”. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. CIKM ’13. San Francisco, California, USA: ACM, 2013, pp. 639–648. isbn: 978-1-4503-2263-8. doi: 10.1145/2505515.2505572.
[145] Kurt Vanlehn et al. “The Andes Physics Tutoring System: Lessons Learned”. In: Int. J. Artif. Intell. Ed. 15.3 (Aug. 2005), pp. 147–204. issn: 1560-4292. url: http://dl.acm.org/citation.cfm?id=1434930.1434932.
[146] Stella Vosniadou and William F. Brewer. “Mental models of the earth: A study of the conceptual change in childhood”. In: Cognitive Psychology (1992), pp. 535–585. doi: 10.1016/0010-0285(92)90018-W.
[147] Kathleen E Wage et al. “The signals and systems concept inventory”. In: IEEE Transactions on Education 48.3 (2005), pp. 448–461.
[148] Eric A. Wan and Rudolph Van Der Merwe. “The Unscented Kalman Filter for Nonlinear Estimation”. In: 2000, pp. 153–158.
[149] Hao Wang et al. “A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle”. In: Proceedings of the ACL 2012 System Demonstrations. ACL ’12. Jeju Island, Korea: Association for Computational Linguistics, 2012, pp. 115–120. url: http://dl.acm.org/citation.cfm?id=2390470.2390490.
[150] Xiaofeng Wang, Donald E. Brown, and Matthew S. Gerber. “Spatio-temporal modeling of criminal incidents using geographic, demographic, and twitter-derived information.” In: ISI. Ed. by Daniel Zeng et al. IEEE, 2012, pp. 36–41. isbn: 978-1-4673-2105-1.
[151] Rik Warren, Robert E. Smith, and Anne K. Cybenko. Use of Mahalanobis Distance for Detecting Outliers and Outlier Clusters in Markedly Non-normal Data: A Vehicular Traffic Example. Tech. rep. Air Force Materiel Command, 2011.
[152] Jianshu Weng and Bu-Sung Lee. “Event Detection in Twitter”. In: Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain. 2011. url: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2767.
[153] Yiming Yang, Tom Pierce, and Jaime Carbonell. “A Study of Retrospective and On-line Event Detection”. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’98. Melbourne, Australia: ACM, 1998, pp. 28–36. isbn: 1-58113-015-5. doi: 10.1145/290941.290953. url: http://doi.acm.org/10.1145/290941.290953.
[154] J. S. Yedidia, W. T. Freeman, and Y. Weiss. “Constructing Free-energy Approximations and Generalized Belief Propagation Algorithms”. In: IEEE Trans. Inf. Theor. 51.7 (July 2005), pp. 2282–2312. issn: 0018-9448. url: http://dx.doi.org/10.1109/TIT.2005.850085.
[155] Zhijun Yin et al. “Geographical topic discovery and comparison”. In: Proceedings of the 20th international conference on World wide web. ACM. 2011, pp. 247–256.
[156] Juan Diego Zapata Rivera. “Learning Environments Based on Inspectable Student Models”. AAINQ83573. PhD thesis. Saskatoon, Canada: University of Saskatchewan, 2003. isbn: 0-612-83573-1.
[157] Chao Zhang et al. “GeoBurst: Real-Time Local Event Detection in Geo-Tagged Tweet Streams.” In: SIGIR. Ed. by Raffaele Perego et al. ACM, 2016, pp. 513–522. isbn: 978-1-4503-4069-4. url: http://dblp.uni-trier.de/db/conf/sigir/sigir2016.html#ZhangZYZZKWH16.
[158] Chao Zhang et al. “TrioVecEvent: Embedding-Based Online Local Event Detection in Geo-Tagged Tweet Streams”. In: Proceedings of the 2017 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD 2017. Halifax, Nova Scotia, Canada: ACM, 2017.
[159] Siqi Zhao et al. “Human as Real-Time Sensors of Social and Physical Events: A Case Study of Twitter and Sports Games”. In: CoRR abs/1106.4300 (2011).
[160] Xiangmin Zhou and Lei Chen. “Event Detection over Twitter Social Media Streams”. In: The VLDB Journal 23.3 (June 2014), pp. 381–400. issn: 1066-8888. doi: 10.1007/s00778-013-0320-3. url: http://dx.doi.org/10.1007/s00778-013-0320-3.
[161] Arkaitz Zubiaga et al. “Towards Real-time Summarization of Scheduled Events from Twitter Streams”. In: Proceedings of the 23rd ACM Conference on Hypertext and Social Media. HT ’12. Milwaukee, Wisconsin, USA: ACM, 2012, pp. 319–320. isbn: 978-1-4503-1335-3. doi: 10.1145/2309996.2310053. url: http://doi.acm.org/10.1145/2309996.2310053.