NATIONAL UNIVERSITY OF SINGAPORE
Department of Physics
Word Vectorisation for Long-Short-Term-Memory
(LSTM) model on Chatbot and Analysis of Model’s
Dynamical Patterns
Lim Yanxiang Louis
A0140537L
Supervisors:
Orkan Arkan (Director of EY Data & Analytics)
Dr Hong Cao (Head of Data Science, EY)
Dr Feng Ling (Assistant Professor in NUS)
5th April 2019
Abstract
Word Vectorization for Long-Short-Term-Memory (LSTM) model on Chatbot and Analysis of Model's Dynamical Patterns
Lim Yanxiang Louis, National University of Singapore (Singapore)
Chat services are needed in almost every business. Industries are using rule-based chatbots to automate chat services; however, these are faced with limitations. In this study, we build a generative-based chatbot using the Ubuntu Dialogue Corpus. This type of chatbot has the potential to answer all technical questions about the Ubuntu operating system. We first analysed the corpus and sent it through a Natural Language Processing (NLP) pipeline. We applied 3 different pre-trained word embedding models (word2vec, GloVe and fastText), which vectorize the words in the corpus, and trained the result on a Long Short-Term Memory (LSTM) model. We studied the results of the embeddings and found that the weight distribution became more heterogeneous during the training process. GloVe performed the best in terms of accuracy in both the fundamental and technical analysis. We also tried to draw analogies between neural activations and Ising spins by analysing the distribution of the models' activations.
Acknowledgements
An end to my formal education leads to the start of my career. It is at this moment that life begins.
It was an insane final year.
Before the semester started, I had to go through a lot of administrative work to coordinate between NUS and EY and to understand the criteria of the final year project, since the scope of this project is unconventional for the physics department. All this was done while I was still in Munich, Germany on my exchange program. I had to consider the time difference when making calls back to Singapore to discuss the project.
In the first semester, more than 12 hours were spent almost every weekday in the science library to
learn more about data science. I went from thinking that “pandas” referred to an adorable animal in
China to using this library almost every day.
In the second semester, I was fortunate enough to find an internship at an e-commerce company, Castlery, where I worked 3 days a week in the business intelligence/data analyst team, while balancing school and this final year project.
The decision to learn more about data science put me out of my comfort zone and made life a lot tougher than it already was. However, I am very thankful that I did it, broadening my knowledge of this field.
To my supervisors at EY,
Orkan, thank you for introducing the world of data science to me, advising me on the initial steps and
planning such an interesting and relevant industrial project for me.
Dr Hong Cao, thank you for voluntarily joining the project. You got me to think about the implications of
data science from a business perspective and how to make my project valuable to both the academic
and industrial setting.
To my co-supervisor, Dr Feng Ling, thank you for spending so much time and effort on top of your busy schedule to make sure that the project was on track. I really enjoyed the times when we brainstormed ideas to make the project more interesting by including physics elements.
A special thanks to my physics and Sheares Hall senior, Jia Hui. Although you graduated even before I joined NUS, you still came back to help the juniors, imparting your knowledge of data science, patiently guiding me through the theories I applied in this project and assisting me whenever I had any problem with my code.
To my mentor at Castlery, Manuel, thank you for making the internship such a valuable one and for imparting relevant knowledge to me. I appreciate the small talks we had during lunch and after work, discussing the project and sharing your thoughts and feedback. You have been an inspiration for me to learn more about data science.
My friends in physics, notably Sherman, Abby and Jasper (semi physics), thank you for camping in the
library with me for crazy hours and finishing off the day with Bishan chicken rice. Not to forget Waxin
and Shouzan who were always there to entertain my nonsense.
I would like to thank all my friends and family who have been understanding as I disappeared to work on what is important to me. All this would not have been possible without all of you. Special mention to Edward: you were always so willing to understand more about my project, even though you did not really know what was going on, just so that you could push me on to achieve more.
Finally, thank you Min Jee for taking care of me during this period. You looked out for my well-being, planning exercise classes and ensuring I was well nourished during the busiest months. The thought of going out with you during the weekends was what kept me motivated to complete the tasks at hand.
Contents
Abstract
Acknowledgements
2.3.4 Natural Language Processing in this study
3 Vectorising of text
4.3 Long Short-Term Memory (LSTM)
4.4 Training
• Term Frequency-Inverse Document Frequency (TF-IDF)
A statistic that reflects the importance of a word to a document in a collection of texts. The TF-IDF statistic increases the more frequently a word is found in a text, but the word loses its importance if it is frequently found in all texts, suggesting that it may be a stop word.
$$\mathrm{TF\text{-}IDF} = t \,\log\frac{D}{d}$$
➢ t – number of times the word appears in that particular document (term frequency in the input)
➢ d – number of text documents the term appears in
➢ D – total number of text documents
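As an illustration, a minimal sketch of this TF-IDF weighting in Python (the toy corpus and tokenisation here are hypothetical examples, not the study's actual pipeline):

```python
import math

def tf_idf(term, document, corpus):
    """Compute t * log(D / d) for one term in one tokenised document."""
    t = document.count(term)                          # term frequency in this document
    d = sum(1 for doc in corpus if term in doc)       # documents containing the term
    D = len(corpus)                                   # total number of documents
    return 0.0 if d == 0 else t * math.log(D / d)

corpus = [
    ["how", "do", "i", "install", "ubuntu"],
    ["ubuntu", "boot", "error", "after", "install"],
    ["purge", "and", "reinstall", "ubuntu", "packages"],
]
print(tf_idf("install", corpus[0], corpus))   # informative word -> non-zero weight
print(tf_idf("ubuntu", corpus[0], corpus))    # appears in every document -> weight 0
```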
2.3.3 Semantic Analysis
Semantic analysis assigns meaning to words, sentences and texts. Structures are created to represent the meaning of words and phrases; however, there is no optimal solution for automatically deriving meaning from text, despite intensive research.
2.3.4 Natural Language Processing in this study
Natural Language Processing is itself a field that many data scientists spend their careers researching. To keep the project manageable, we only applied morphological analysis to our text data. Due to the complexity of many of the other techniques, we performed only the following NLP processes:
We tokenized the text ("Hello", "I", "am", "Louis") and removed stop words ("a", "the", "and") and special characters ("&", "@", ",") since we only want to keep the context of the sentence.
We imported a Python library called the Natural Language Toolkit (NLTK) and applied these processes to our data; a minimal sketch is shown below.
A snapshot of sentences after going through the NLP pipeline can be seen in Figure 2.3.4.1
Figure 2.3.4.1 Section of the Ubuntu Dialogue Corpus after tokenising, removing of stop words and special characters.
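A minimal sketch of such a pipeline using NLTK (the exact token filters used in the study may differ slightly):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and stop word list.
nltk.download("punkt")
nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(sentence):
    """Tokenise, lowercase, and drop stop words and special characters."""
    tokens = word_tokenize(sentence.lower())
    return [tok for tok in tokens
            if tok not in STOP_WORDS and tok not in string.punctuation]

print(preprocess("Hello, I am Louis & I use Ubuntu!"))
# e.g. ['hello', 'louis', 'use', 'ubuntu']
```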
3 Vectorising of text
We need to transform our text data into numbers for our computer to process.
3.1 Introduction
Words are not naturally understood by computers. By transforming words into a numerical form, we can apply mathematical rules and perform matrix operations on them to obtain an output.
The most basic way to numerically represent words is through one-hot encoding. This means that every unique word in the dataset is represented by a vector with a 1 in the position assigned to that word and 0s everywhere else. The dimension of the vector is then the number of unique words. This results in an enormous vector that captures no relational information [24][26][28].
Figure 3.1.1 Visualisation of one-hot encoded vector
As seen in the diagram, every word is equidistant from every other word; hence, synonyms and antonyms are treated the same.
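A minimal sketch of one-hot encoding (the toy vocabulary is illustrative):

```python
import numpy as np

vocabulary = ["ubuntu", "install", "boot", "error", "thanks"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with a 1 in the word's position and 0s elsewhere."""
    vec = np.zeros(len(vocabulary))
    vec[word_to_index[word]] = 1.0
    return vec

# Every pair of distinct words has the same (zero) dot product,
# so no relational information is captured.
print(one_hot("ubuntu"))                          # [1. 0. 0. 0. 0.]
print(one_hot("install") @ one_hot("error"))      # 0.0
```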
3.2 Word Embeddings
A word embedding is a real-valued vector representation of a word. Ideally, words with similar meaning will be close together when represented in the vector space; the goal is to capture the relationships between words in that space. With words densely populated in the space, we can represent word vectors in a much smaller space compared to one-hot encoded vectors, which could go up to millions of dimensions [26][27][28].
In this study we will be implementing 3 popular word embedding techniques, namely, word2vec, GloVe
and fastText.
3.2.1 Word2vec
Word2vec was created by a team of researchers at Google, led by Tomáš Mikolov. It is the most popular method for training word embeddings [26][27][28][29][30][32][33].
The model uses a statistical computation to learn from a text corpus. It is a predictive model that learns word vectors so as to improve its predictive ability by reducing its loss function. It is also the first model to consider the closeness of word meanings in a vector space.
There are 2 methods that this model can take during training.
1. Continuous Bag-of-Words (CBOW)
As briefly explained in 2.3.2, a bag-of-words represents text by its words without order; this method determines the context of a word from the surrounding words, hence "continuous bag-of-words". It then learns an embedding by predicting the current word from its context.
2. Continuous Skip-Gram
This method also learns an embedding, but by predicting the surrounding words given the current word. In continuous skip-gram, the model uses the current word to predict the surrounding window of context words.
According to the Google team, CBOW is faster than skip-gram; however, skip-gram performs better on infrequent words.
In this study, we used Google's pre-trained model. Its word vectors embed a vocabulary of 3 million words and phrases, trained on approximately 100 billion words from the Google News dataset. There was no explicit detail on whether Google used CBOW or skip-gram to train the model. Each vector has 300 dimensions.
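A minimal sketch of loading such a pre-trained model with the gensim library (the file name is the commonly distributed one and is an assumption about the exact file used here):

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (several GB on disk).
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

vector = model["hello"]                  # 300-dimensional numpy array
print(vector.shape)                      # (300,)
print(model.most_similar("hello", topn=5))
print(model.similarity("hello", "bye"))  # cosine similarity between two words
```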
3.2.2 GloVe
GloVe, derived from Global Vectors, is another model for word embedding. It was created by a team of researchers from the Stanford Artificial Intelligence Laboratory in the computer science department of Stanford University [26][27][28][31][33].
An extension of word2vec, GloVe is a count-based rather than a predictive model.
Initially, a sparse matrix (a large matrix with mostly zero entries) of words × contexts is constructed, holding the co-occurrence counts of words in the corpus. Context refers to the words next to (before or after) the word of interest. For example, the sentence "The boy ate at the table" with a window size of 2 would give the co-occurrence matrix seen in Table 3.2.2.1.
Table 3.2.2.1 Word-context co-occurrence matrix for the sentence "The boy ate at the table"

        the  boy  ate  at  table
the      2    1    2   1    1
boy      1    1    1   1    0
ate      2    1    1   1    0
at       1    1    1   1    1
table    1    0    0   1    1
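A minimal sketch of how such a co-occurrence matrix can be built (this sketch counts only off-diagonal context pairs; the handling of the diagonal entries in the table above may differ):

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[word][tokens[j]] += 1
    return counts

tokens = "the boy ate at the table".split()
matrix = cooccurrence(tokens)
print(dict(matrix["ate"]))   # {'the': 2, 'boy': 1, 'at': 1}
```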
When many sentences are added together, this matrix grows into a sparse matrix with many 0 entries. The matrix is manipulated based on the hyperparameters set, shifting the weights on certain words. The word-context co-occurrence matrix is then factorised into a word-feature matrix and a feature-context matrix, as shown in Figure 3.2.2.1.
Figure 3.2.2.1 Matrix illustration for the construction of a GloVe embedding [28]
Each row of the word-feature matrix is the GloVe vector representation of the corresponding word.
GloVe vectors capture global co-occurrence statistics well, but do not perform as well at capturing the meanings of individual words.
In this study, we used GloVe's pre-trained model. Its word vectors embed a vocabulary of 2.2 million words and phrases, trained on approximately 840 billion words from the Common Crawl dataset. Each vector has 300 dimensions.
The Common Crawl dataset consists of text gathered from all over the web by the non-profit organization Common Crawl.
3.2.3 fastText
fastText is another word embedding method, created by Facebook's AI Research (FAIR) lab. Like GloVe, it is another extension of the word2vec model [26][27][28][32].
Unlike the previous 2 models, which treat words as the smallest unit to train on, fastText treats each word as being composed of character N-grams, as explained in 2.3.2 but applied at the character level. For example, the word vector for "hello" is a sum of the vectors for: "he", "hel", "hell", "hello", "ello", "llo", "lo", "ell", "el", "ll".
This feature allows fastText to support words that are not in its training vocabulary. For instance, if the model has been trained on the word 'apple' but not on the word 'pineapple', it will still be able to form a relationship between those 2 words and assign a meaning to the new word. Hence, fastText is known to handle rare words best.
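A minimal sketch of the character n-gram decomposition described above (here n runs from 2 up to the word length to reproduce the "hello" example; the actual fastText implementation uses word-boundary markers and n-grams of length 3 to 6 by default):

```python
def char_ngrams(word, n_min=2, n_max=None):
    """Return all character n-grams of the word, including the word itself."""
    n_max = n_max or len(word)
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

print(sorted(char_ngrams("hello")))
# ['el', 'ell', 'ello', 'he', 'hel', 'hell', 'hello', 'll', 'llo', 'lo']
```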
In this study, we used fastText's pre-trained model. Its word vectors embed a vocabulary of 2 million words and phrases, trained on approximately 600 billion words from the Common Crawl dataset. Each vector has 300 dimensions.
3.2.4 Embedding Comparison
Cosine similarity, or cosine proximity, is a measure of closeness between 2 non-zero vectors. It is computed from the inner product and measures the cosine of the angle between the 2 vectors.
$$\text{cosine similarity} = \cos\theta = \frac{A\cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^2}\,\sqrt{\sum_{i=1}^{n}B_i^2}}$$
➢ A and B – first and second vector respectively
➢ θ – angle between the vectors
For each word embedding model, we searched for the words closest to the word 'hello' in terms of cosine similarity. The similar words and cosine similarity values for each model are presented below.
We also measured the similarity between the words 'Hello' and 'Bye'. Given their opposite meanings, one might expect these 2 vectors to point in nearly opposite directions and hence have a cosine similarity close to -1.
Cosine similarity values are close to 0 when 2 vectors are orthogonal and have no relationship with each other.
In practice, the measured values are positive (see Table 3.2.4.1); hence, cosine similarity is better understood as a measure of how related 2 words are rather than how similar 2 words are.
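A minimal numpy sketch of this cosine similarity computation:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between two non-zero vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a,  a))                           #  1.0 (parallel)
print(cosine_similarity(a, -a))                           # -1.0 (opposite directions)
print(cosine_similarity(a, np.array([3.0, 0.0, -1.0])))   #  0.0 (orthogonal)
```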
Figure 3.2.4.3 fastText top 10 most related words to 'Hello'
Similarity between 'Hello' and 'Bye' (cosine similarity): 0.5045497
Table 3.2.4.1 summarizes these results and compares the cosine similarities of the word representations across the 3 word embedding models.
Table 3.2.4.1 Cosine similarity between the vector representation of the word 'Hello' and the words 'Bye' and 'Hi' across the 3 word embedding models

                         word2vec      GloVe         fastText
'Hello' against 'Bye'    0.34806252    0.53188723    0.5045497
'Hello' against 'Hi'     0.618879318   0.800006747   0.883805335
3.3 Implementation
In the preprocessing stage, the list was grouped into a question and answer format. This data was fed into the 3 word embedding models mentioned. Since the models were already trained on other sources, this process of applying a model trained on one dataset to another dataset is referred to as transfer learning. For all 3 models, each word is vectorized into a 300-dimension vector. Each question and answer was truncated to a length of 14 words; the reason for this will be explained in section 4.2 on the vanishing and exploding gradient problem. Finally, to indicate to the model that the sentence has ended, we filled the last vector with a sentend (short for sentence end) vector. This vector is a 300-dimension vector filled with the value 1. For sentences with fewer than 14 words, we padded the shortfall with the sentend vector, such that every question and answer input is of length 15.
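A minimal sketch of this truncation and padding step (the embedding lookup is represented by a hypothetical `embed` function):

```python
import numpy as np

MAX_WORDS = 14
SENTEND = np.ones(300)          # 300-dimension "sentence end" marker

def to_sequence(tokens, embed):
    """Embed, truncate to 14 words, then pad with SENTEND to length 15."""
    vectors = [embed(tok) for tok in tokens[:MAX_WORDS]]
    while len(vectors) < MAX_WORDS + 1:
        vectors.append(SENTEND)
    return np.stack(vectors)    # shape (15, 300)

# Example with a dummy embedding that returns random 300-d vectors.
def dummy_embed(tok):
    return np.random.randn(300)

seq = to_sequence(["how", "is", "nus"], dummy_embed)
print(seq.shape)                # (15, 300)
```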
An example of what the word "thanks" looks like through a 300-dimension GloVe embedding:
Figure 3.3.1 Histogram of the GloVe embedding of the word 'thanks'
This is what the sentend vector looks like:
Figure 3.3.2 The sentend vector visualised in JupyterLab
4 Deep Learning Framework
Before we dive into the deep learning framework used in this study, we need to understand the problem setting. The chatbot is a sequence-to-sequence problem. We will be using a recurrent neural network (RNN), or more specifically a long short-term memory (LSTM) model, for this problem.
4.1 Sequence-to-sequence (Seq2Seq)
We first look at the problem setting of this project. In language, the placement of words affects the meaning of a sentence. For instance, the sentences "you are happy" and "are you happy" contain the same words but have different meanings. Even though we have removed the special character '?' from the latter statement, we can still interpret the second statement as a question asking whether the second person is feeling happy, while the first statement indicates that the second person is feeling happy. From this example, we see that we need a model that accounts for where a token is inserted into the sentence. Hence this is a time series problem that requires a model to accept a time series input and return a time series output. This kind of problem is referred to as sequence-to-sequence learning. The input sequence is fed into an encoder, which accounts for the order of the input; the encoded sequence is passed through the model, and the decoder returns the output in the correct word order. This is illustrated in Figure 4.1.1, where the sentence "how is nus", followed by the sentend vector mentioned in chapter 3.3 to indicate the end of the sentence, is loaded into the encoder. The decoder then gives the sequential output "it is awesome" followed by SENTEND after passing through the black box, which represents the model we will be using. This is a simplified picture; in our study, as mentioned in chapter 3, we use an encoder and decoder that handle 15 vectors for their input and output [34].
Figure 4.1.1 An illustration of the sequence-to-sequence problem setting, starting with an encoder and ending with a decoder
4.2 Recurrent Neural Network (RNN)
We will now look at the model depicted by the black box in section 4.1.
Before heading to the specific model that we will be using, we first describe the kind of network we will be dealing with. The most common way of modelling a seq2seq problem is with a recurrent neural network. Deep learning and neural networks were introduced in section 1.2; we now look at a type of neural network called the recurrent neural network (RNN). Whereas most networks are feed-forward, this kind of neural network is recurrent: there are loops in the network, and the output of one unit may go back to one of the already visited units. This is illustrated in Figure 4.2.1, where there is an arrow at the hidden layer that loops back into the hidden layer. This loop is not present in a typical feed-forward neural network. The loop can be "unfolded" into what we see in Figure 4.2.1; "unfolded" is in quotation marks because we cannot literally unfold the network, and the unfolding is only drawn for visualization purposes. This allows us to draw a closer comparison with the seq2seq problem statement we had before. The black box we described earlier is now replaced by a hidden layer. We call each of these hidden layers a cell [34][35].
The biggest problem with RNNs is the vanishing and exploding gradient problem, first identified by Sepp Hochreiter in 1991. Following this, many papers were written targeting this problem. We cover it briefly now.
To simplify, an RNN can be caricatured by the equation:
$$O = W^{n} I$$
➢ I – the input
➢ O – the output
➢ W – the weight
➢ n – the number of sequence inputs
If we do not limit the input and output sequence length, n can be any number 0 < n < ∞. When the number of inputs during training becomes large, the value of $W^n$ tends towards infinity if W is greater than 1, or towards 0 if W is less than 1. Hence, the output value will be either huge or vanishingly small. Due to this limitation of the model, and to reduce the computational cost of our model, we have limited the input length to 15, as mentioned in section 3.3. However, we found a specific type of RNN which addresses the first consideration; hence, our truncation mainly serves to deal with computational cost.
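A quick numerical illustration of why repeated multiplication by a weight causes the signal to vanish or explode (the weight values 0.9 and 1.1 are arbitrary examples):

```python
# Repeatedly applying a weight slightly below or above 1 over a long sequence.
w_small, w_large, n_steps = 0.9, 1.1, 100

print(w_small ** n_steps)   # ~2.7e-05 -> the signal/gradient vanishes
print(w_large ** n_steps)   # ~1.4e+04 -> the signal/gradient explodes
```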
4.3 Long Short-Term Memory (LSTM)
We will now explore the specific model used in this study. Long short-term memory is a special type of RNN in which the units are connected in a specific way that avoids the vanishing and exploding gradient problem which arises in a typical RNN [36].
LSTM was introduced by Hochreiter (the same scientist who identified the vanishing and exploding gradient problem in RNNs) and Schmidhuber in 1997. Over the years, many other scientists refined and popularised this model. It is a model that works on a wide variety of problems, some of which were mentioned earlier in chapter 1.
To understand how an LSTM differs from a typical RNN, we study the architecture of each individual cell for both models.
Figure 4.2.1 An illustration of a recurrent neural network (RNN) model, “unfolded” to demonstrate the recurrent process
Figure 4.3.1 Illustration of a repeating module in a standard RNN model, containing a single tanh layer [36].
Figure 4.3.2 Illustration of a repeating module in an LSTM model, containing 3 sigmoid and 1 tanh layer [36].
In an artificial neural network such as this LSTM, all the "memory" of the network is in the form of the vectors that we created in chapter 3. To "remember", "learn" or "forget" is analogous to a mathematical operator acting on the "memory" vector to retain, alter or remove its values.
Figure 4.3.1 represents the cell of a standard RNN, while Figure 4.3.2 represents the cell of an LSTM. A standard RNN cell consists only of a single tanh neural network layer. In an LSTM, there are four neural network layers interacting in a unique way that allows the model to "remember" in its long-term memory and "forget" in its short-term memory, hence the name long short-term memory. The long-term "memory" is embedded in the cell state, represented by C, while the short-term memory, the memory of the previous (t-1) output, is embedded in the hidden state, represented by h in the figures.
Each line in the diagram refers to a "memory", which in our study is represented by a 300-dimension vector. The pink circles with an operator inside are pointwise operators. These occur at intersections between 2 vector lines and force the 2 vectors to undergo an operation. The pointwise operator with a '+' performs a vector addition, while those with 'x' perform a pointwise product (i.e., [u1,…,un] × [v1,…,vn] = [u1v1,…,unvn]). These pointwise functions are also referred to as gates, as they decide what information is retained, added or removed from the system. The yellow boxes represent the neural network layers. These layers comprise weights and biases which are updated through backpropagation during training. Merging lines without the pink circles refer to concatenation of vectors, while splitting lines represent vectors being copied and going on separate paths.
We will explore the key idea behind the LSTM model.
In Figure 4.3.3 we have the cell state. This contains the "memory" from the previous sequence. We see that the cell state passes through the cell without going through any neural network layer; it only passes through 2 pointwise operations, so the information in this stream can only be altered at those pointwise operations. These structures are called gates. The vectors in this stream are often unchanged and are thus responsible for the long-term memory.
Figure 4.3.3 Illustration of the path of a “cell” through a repeating unit, responsible for keeping the “memory” of the model [36].
Before we explore the function of each neural network layer, we introduce the two functions that govern them. The first is the sigmoid function, which takes any value and compresses it to an output between 0 and 1. The second is the hyperbolic tangent, also referred to as the tanh function, which squeezes any value to between -1 and 1.
The sigmoid and tanh functions are defined by the following formulas respectively:
$$S(x) = \frac{1}{1+e^{-x}} = \frac{e^{x}}{e^{x}+1}, \qquad \tanh x = \frac{\sinh x}{\cosh x} = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} = \frac{e^{2x}-1}{e^{2x}+1}$$
The first layer we look at is called the "forget gate layer", as shown in Figure 4.3.4. The input to this layer is a concatenation of the hidden state ht-1 and the new input vector xt. This concatenated vector consists of two 300-dimension vectors, where xt is fed into the kernel while ht-1 is fed into the recurrent kernel. More about the kernel and recurrent kernel is discussed in section 4.4. The output is governed by the sigmoid function and is a vector with values between 0 and 1. This vector undergoes a pointwise multiplication with the cell state as shown in Figure 4.3.3, which allows the cell state to "remember", "learn" or "forget" information from the previous cell. For instance, when the word fed in is a new subject, we want the model to forget the gender of the old subject.
Figure 4.3.4 Illustration of the vectors path through the "forget gate" [36].
In the next step, the model decides whether any new information needs to be added to the "memory" of the cell state.
This is a 2-part process, as shown in Figure 4.3.5. The same concatenated vector that went through the first neural network layer is duplicated: one copy goes through a sigmoid layer to decide which values to update, producing a vector it, while the second copy goes through a tanh layer to create a new vector referred to as the candidate, C̃t. A pointwise product is performed on the 2 new vectors it and C̃t to create the vector that is added into the cell state, as shown in Figure 4.3.3.
An application of this is that when the forget gate causes the system to "forget" the gender of the old subject, the gender of the new subject is updated into the cell state.
Figure 4.3.5 Illustration of the vectors path through the "input gate" [36].
Figure 4.3.6 Illustration of the vectors through the "forget gate" and "input gate" affecting the cell state [36].
In the final process, the same concatenated vector passes through a sigmoid layer. This output vector decides which parts of the cell state will be output. The cell state goes through a tanh function to have its values compressed between -1 and 1; note that this is not a neural network layer but simply a tanh operator acting on the vector, without the influence of weights and biases. These 2 vectors undergo a pointwise multiplication to produce the new hidden state ht.
Figure 4.3.7 Illustration of the vectors path through the "output gate" [36].
In our setup, we stacked 4 LSTMs on top of each other, forming 4 layers. Note that there are 4 LSTM layers and, within each layer, 4 neural network layers. Figure 4.3.8 illustrates the setup used in this study.
Figure 4.3.8 Illustration of a 4-layer LSTM that we use in this study.
4.4 Training
Having understood the model, we trained our data on it.
The unit of training duration is called an epoch: one epoch is one full pass of the entire dataset forward and backward through the neural network.
We sent each of our vectorized data points into the LSTM network. From the 50,000 lines that had been combined earlier in the preprocessing stage mentioned in chapter 2, our training dataset contains 14,938 pairs of questions and answers. Training is set to 1000 epochs through 4 LSTM layers, with 300 neurons in each neural network layer.
In general, there is no rule on how many layers or epochs to use: too many might result in overfitting, and too few might not represent the data well enough. The choice is based on experience, the processing power of the computer and the data size. The parameters chosen in this study were taken with reference to chatbot models used by practitioners; a sketch of a model with this structure is shown below.
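For illustration, a minimal sketch of a model with this structure using the Keras API (this is an assumption about the setup, not the exact training script used in the study; older Keras versions expose the cosine proximity loss under the name "cosine_proximity", and the optimizer choice here is illustrative):

```python
from keras.models import Sequential
from keras.layers import LSTM

# 15 time steps of 300-dimension word vectors in, same shape out.
model = Sequential([
    LSTM(300, return_sequences=True, input_shape=(15, 300)),
    LSTM(300, return_sequences=True),
    LSTM(300, return_sequences=True),
    LSTM(300, return_sequences=True),
])
model.compile(optimizer="adam", loss="cosine_proximity", metrics=["accuracy"])
model.summary()  # per layer: 300x1200 kernel + 300x1200 recurrent kernel + 1200 biases

# questions, answers: arrays of shape (14938, 15, 300) built as in section 3.3
# model.fit(questions, answers, batch_size=32, epochs=1000)
```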
Figure 4.4.1 Snapshot of the LSTM training process: at the first mini-batch of the first epoch, the estimated time of completion (ETA) is 6.33 min, the loss (cosine proximity) is -0.4829 and the accuracy is 0.0083. The 14,938 training pairs are split into mini-batches of size 32.
We now look at the parameters of the training.
In each neural network layer in each cell, there are 300x300 weights in the kernel and 300x300 weights in the recurrent kernel (a 300-dimension input onto 300 neurons). All the weights in each LSTM layer's kernel and recurrent kernel are concatenated; we can see how this is broken down in Figure 4.4.2. The total number of weights and recurrent weights in the LSTM is 2,880,000. Each neural network layer contains 300 biases, and each LSTM layer therefore contains 300x4 = 1,200 biases, for a total of 4,800, as seen in Figure 4.4.2. More details on the weights and biases can be found in section 5.1.2.
An activation is the output of each neuron for a given input, defined by an activation function.
In general, each neural network layer in the LSTM, as described in 4.3, follows the equation:
𝑎𝑐𝑡𝑖𝑣𝑎𝑡𝑖𝑜𝑛 = function (𝑊𝑘 × 𝑥𝑡 + 𝑊𝑟 × ℎ𝑡−1 + b)
➢ activation – output, (300, 1) matrix
➢ function – the activation function, given by a sigmoid or tanh function
➢ Wk – weight of kernel, (300, 300) matrix
➢ xt – input at time t, (300, 1) matrix
➢ Wr – weight of recurrent kernel, (300, 300) matrix
➢ ht-1 – hidden layer from time t-1, (300, 1) matrix
➢ b – bias, (300, 1) matrix
The size of the matrices is specific to this study, where (m, n) represents m number of rows and n
number of columns.
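A minimal numpy sketch of this gate computation for one sigmoid layer (randomly initialised arrays stand in for the trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_k = np.random.randn(300, 300)   # kernel weights
W_r = np.random.randn(300, 300)   # recurrent kernel weights
b   = np.zeros(300)               # biases

x_t    = np.random.randn(300)     # input word vector at time t
h_prev = np.zeros(300)            # hidden state from time t-1

activation = sigmoid(W_k @ x_t + W_r @ h_prev + b)
print(activation.shape)           # (300,) with values between 0 and 1
```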
Figure 4.4.2 Distribution of weights in kernel and recurrent kernel and its biases in each layer
5 Results
In this chapter, we discuss the results obtained from the 3 models we built. We break the analysis down into fundamental analysis, studying intrinsic results such as the theoretical accuracy, loss functions, weights and biases of the models, and technical analysis, studying the actual responses from the chatbot models.
5.1 Fundamental Analysis
We study, from a mathematical point of view, the values of the accuracy and the loss function at each epoch obtained from training the models, and the weights and biases at every 100 epochs.
5.1.1 Accuracy and loss
In this section, we discuss the accuracy and loss recorded during the training of each model. Accuracy here refers to the theoretical accuracy obtained during training: the fraction of output words that match the theoretical "correct" answer. This is not to be confused with the experimental accuracy given in section 5.2.1. The accuracy has a maximum score of 1 when all the words in the validation output are equal to the expected output. In this study, since our input and output are 15 words long, if the model manages to predict 9 words in the correct positions of the sentence, it achieves a score of (9 ÷ 15) = 0.6. We now analyse the results we achieved.
Figure 5.1.1 Graph of accuracy against epoch across the 3 models during the training
From the accuracy graph, we gather that GloVe obtained the best accuracy. Initially, the accuracy increased almost exponentially until around the 150th epoch. The accuracy then spiked from approximately 0.3 to approximately 0.7 between the 150th and 200th epochs. After around 200 epochs, the accuracy slowly increased to slightly below 0.8, where it started to plateau. This means that out of the 15 words fed into the model, the model was able to predict approximately (0.8 x 15) = 12 words of the output correctly after training, which is a very high accuracy. The model with the next highest accuracy is word2vec. Initially, the accuracy of the word2vec model was very low, close to 0; its gradient was almost flat and the accuracy was not improving despite the training. From the 255th epoch, its accuracy spiked from 0.067 to 0.404 at the 266th epoch. Following this, it plateaus at around 0.5 for the rest of the training. This model was able to predict approximately (0.5 x 15) = 7.5 words of the output correctly after training. FastText fared the worst at the end of the training in terms of training accuracy. Despite having a higher initial accuracy than word2vec, its training was not as efficient. Unlike the first 2 models, no large jump in accuracy was found during the fastText training. Accuracy only improved after the 65th epoch, and its rate of increase slowed down after the 200th epoch. Towards the end of the training, from around the 800th to the 1000th epoch, a lot of noise was observed. After 1000 epochs, its training accuracy ended at 0.26. This means that it could only predict approximately (0.26 x 15) ≈ 4 words of the output correctly.
The loss function is calculated from the cosine proximity/similarity formula:
$$L = -\,\text{cosine similarity} = -\cos\theta = -\frac{A\cdot B}{\|A\|\,\|B\|} = -\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^2}\,\sqrt{\sum_{i=1}^{n}B_i^2}}$$
where L represents the loss function and A and B are the predicted and target output vectors.
In the usual cosine similarity formula, the value 1 is obtained when 2 vectors are completely similar (parallel) and 0 when they are dissimilar (orthogonal). Here, however, a negative sign is included, so that the loss decreases towards -1 as the predicted answers become more similar to the targets.
Figure 5.1.2 Graph of loss against epoch across the 3 models during the training
We observe that the losses of the 3 models decrease with a similar shape. As in the accuracy graph, the GloVe model performed the best (closest to -1). FastText performed better than word2vec in terms of loss. This means that the vectors predicted by fastText were closer to the target vectors than those of word2vec (they had a closer vector representation to the targeted answer); however, the final vectors often did not resolve to the correct word, hence fastText scored a lower accuracy.
5.1.2 Weights and Biases
In this section, we analyse how the weights and biases evolve in each layer during the training, and compare the final trained weights and biases across all 3 models.
The weights applied to the input vectors are referred to as the weights of the kernel, while the weights applied to the hidden state vector, which stores the memory from the previous sequence's output, are referred to as the weights of the recurrent kernel.
All the weights of the kernel and recurrent kernel are initialized with the Glorot normal initializer (also known as Xavier normal initialization). This initializer draws from a truncated normal distribution with mean 0 and standard deviation (σ) given by:
$$\sigma = \sqrt{\frac{2}{in + out}}$$
➢ in – number of input units in the weight tensor
➢ out – number of output units in the weight tensor
The Glorot normal initializer is a common initializer used in neural networks [37]. If the weights of the network start too small, the signal shrinks as it passes through each layer and becomes too small to be useful; if the weights start too large, the signal grows as it passes through each layer and becomes too massive. The Glorot normal initializer aims to set the right scale for the weight distribution, keeping the signal in a reasonable range of values through the layers.
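A minimal sketch of this initialization scheme (for simplicity a plain normal distribution is used here; the Glorot normal initializer in Keras additionally truncates samples beyond two standard deviations):

```python
import numpy as np

def glorot_normal(fan_in, fan_out):
    """Draw a (fan_in, fan_out) weight matrix with std sqrt(2 / (fan_in + fan_out))."""
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(loc=0.0, scale=sigma, size=(fan_in, fan_out))

W = glorot_normal(300, 300)
print(W.std())   # close to sqrt(2/600) ~= 0.0577
```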
For all weights in the kernel, recurrent kernel and all the biases, we plot the quantity on a log scale. The
first reason for doing so is to deal with the skewness of values from the initialization. The second reason
is to observe if the data follows a power law distribution.
In the following observations, the weights and biases distribution are studied on a macro perspective.
5.1.2.1 Distributions
We plotted and analysed how the weights and biases for each layer in each model evolved over the 1000 epochs. In general, we observed that over the epochs, the weights, in both the kernels and recurrent kernels, and the biases spread outwards away from their initialization. The Glorot normal initialization of the kernels and recurrent kernels flattens outwards, with some values drifting towards the negative side and others towards the positive side. The 2 peaks from the initialization of the biases flatten out as well and converge towards each other.
Figure 5.1.2.1.1 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec model first LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the first layer of the word2vec network, we observe the weights redistributing away from the centre, where the values of the weights were close to 0.
The weights were most volatile in the first 100 epochs, and the redistribution becomes less vigorous as the epochs increase. For kernel 1, the weight distribution is not symmetrical and is biased towards negative values. For recurrent kernel 1, the weight distribution is more symmetrical on both sides. We observed that the movement of weights in the recurrent kernel is greater than in the kernel.
The biases are initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute outwards, with the first 100 epochs being the most volatile, and the 2 peaks merge as the biases spread out.
A detailed study on the evolution of its distribution can be found in appendix 9.1.1.
We went on to compare the weight and bias distributions after 1000 epochs across the 3 models. In each layer, the weights and biases spread out differently for each model. There are no clear qualitative differences between the models' weights and biases.
Figure 5.1.2.1.2 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec, GloVe and fastText model first LSTM layer, at the 1000 epochs, with its quantity plotted on a log scale
In the first layer of the 3 models, we observed that the weights of the word2vec model are the most heterogeneous and those of the GloVe model the least heterogeneous, for both the kernel and the recurrent kernel. However, the biases of the GloVe model had the widest distribution, while fastText had the narrowest.
A detailed study across each model after 1000 epochs can be found in appendix 9.1.2.
5.1.3 Fundamental Insights
From the fundamental analysis, the most important insight was drawn from the accuracy graph. The GloVe model responded best to the LSTM training, obtaining the best result. The most interesting observation was the similar trend in the word2vec and GloVe models, but not in the fastText model: we observed a jump in accuracy for the first 2 models but not for the third. This is an interesting phenomenon that we try to study and model in chapter 6.
5.2 Technical Analysis
In this section, we discuss the results of the actual output which users experience when they use the chatbot.
5.2.1 Technical Insights
At this stage, we want to test the accuracy of our models. Since we are building a model that mimics intelligence, judging the accuracy of the model based purely on the numerical prediction would not be the most meaningful. The accuracy in this section is different from the theoretical accuracy mentioned in section 5.1.1: the accuracy here (experimental accuracy) measures the accuracy of the response in context, rather than the accuracy of individual words.
The best way to evaluate an artificial intelligence is to expose it to humans and gather real feedback. However, since the chatbot was fed too little data, it is still at an infant stage of development. Hence, we evaluated the chatbot ourselves.
We loaded a list of 100 questions into the chatbot. The first 50 are questions that the chatbot was trained on, while the next 50 are questions that the chatbot never encountered in its training. We obtained the response to each question from the 3 models and gave each a score from 1 to 3: 3 for the best response out of the 3 chatbots and 1 for the worst. The chatbots were evaluated equally on 2 criteria:
1. The accuracy of the response
2. How human the response is
The first column gives the index of the table: 'Q' represents a question and 'A' represents an answer. For instance, 'Q1' is the index for question 1 and 'A1' is the index for answer 1. The second column contains the questions and answers; the table is structured such that each question is followed, in the next row, by the actual answer given by the dialogue corpus.
The next 3 columns are the results from the 3 models: word2vec, GloVe and fastText respectively. The top row of each pair, alongside the question, gives the response from the model, while the bottom row, alongside the answer, gives the score of each model.
Table 5.2.1.1 Truncated table showing the first question and answer pair from the trained models and the answer provided by the Ubuntu Dialogue Corpus

Q1 | do think refer | (responses truncated; the visible fastText response is the word "porn.The" repeated)
A1 | usb to ps2 converter came with the mouse in the box | scores: 1 (word2vec), 2 (GloVe), 3 (fastText)
The full results of the responses and scoring are in the appendix, section 9.2.
Recalling the sentend vector implemented in section 3.3, this vector appears at the end of the sentences. We could remove or replace these vectors in the chatbot integration; however, as mentioned in 1.3.2, that is not the focus of this project. As observed from each model, the sentend vector maps back to the word:
● l'_Affaire, in the word2vec model
● post-exertional, in the GloVe model
● porn.The, in the fastText model
The 50,000 training inputs were not able to produce a sufficiently well-trained model. This made the grading very difficult, as most of the answers were incorrect; this was especially so in the second half, which consisted of unseen questions. We also noticed that for the fastText model, sentend was found not just at the end of the sentence, but also within the sentence. This is reflected in the training accuracy, where fastText achieved the lowest score of the 3 models.
One positive result was that the models clearly learnt from the training data. The outputs were not completely random but showed signs of learning: the replies contained words within the domain of the Ubuntu Dialogue Corpus, for example 'Ubuntu', 'BIT' and other computing-related terms.
Despite the difficulty of grading the chatbots, we managed to rank the models. More details on the grading results can be found in appendix 9.2. In general, we felt that GloVe did the best in terms of replying with the most humanly logical sentences. The scores are summarized in Table 5.2.1.2.
Table 5.2.1.2 Technical analysis results for the 3 word embedding models on 100 sample question and answer pairs

        word2vec   GloVe   fastText
Score   129        250     215
6 Neuron Activation and Ising Spins
We observed an interesting phenomenon during training, whereby there was a spike in accuracy during the training of the word2vec and GloVe models. In this chapter, we try to interpret this phenomenon by drawing an analogy with a phase transition and modelling the neuron activations as Ising spins.
6.1 Phase Transition and Ising Model
A phase transition refers to a change of state. Common phase transitions we are familiar with include vaporisation and melting [38].
We explore a familiar phase transition, the melting of ice.
The Helmholtz free energy of a system is given by the following equation:
F ≡ U − TS
where F is the Helmholtz free energy, U is the internal energy of the system, T is the absolute temperature of the surroundings and S is the entropy of the system.
At very low temperature, water is in its solid form, ice. As heat is applied to the ice, the temperature increases while the molecules held within the crystal lattice vibrate faster and become more energetic. Once the temperature reaches 0°C (273.15 K), a critical temperature is reached at which the phase transition occurs. At this point, upon further heating, the temperature of the ice does not change: the heat causes the entropy and internal energy to increase while the temperature remains constant as the ice melts.
Another famous model, the Ising model, named after physicist Ernst Ising, is a popular model of magnetic solids. In this model, each atom has an intrinsic magnetic moment called spin. The spin can be "up", conventionally equal to +1, or "down", equal to -1. The Ising model is widely used as a toy model for studying phase transition behaviour in statistical mechanics. Close to the critical temperature, the correlation length of the system approaches infinity, and the distribution of spin cluster sizes becomes highly heterogeneous, with a diverging variance.
6.2 Model and Results
In Figure 5.1.1, we saw that there was a spike in accuracy over the epochs during training. Drawing an analogy to phase transitions, we hypothesize that before the training starts, the neural network is in a disordered phase, as the weights are randomly initialized. It then evolves towards an "ordered" phase in which the model has learned the patterns in the data and is able to make predictions. We see that an abrupt change in accuracy exists for word2vec and GloVe, but not for fastText. If the analogy is accurate, the word2vec and GloVe models have "critical epochs" of around 250 and 150 respectively.
Since neural activations do not have a clear spatial arrangement, unlike the Ising model, we look at the distribution of activations to explore the dynamical evolution of the model during training.
We investigate the activations of the neurons by feeding in the sentend vector. After initialization, the weights and biases start spreading outwards. As these neural network layers are governed by sigmoid and tanh functions, the activations tend towards 0 and 1 for the sigmoid function, and towards -1 and 1 for the tanh function.
In Figure 6.2.1, we plot the reverse cumulative distribution function (CDF) of the word2vec activations at the first gate. We observe that after the zeroth epoch, the activations are concentrated at the extreme values of 0 and 1.
Figure 6.2.1 Reverse CDF plot for the word2vec activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100-epoch intervals.
Refer to appendix 9.3.1 for the reverse CDF plots for all the other layers and models.
The 2 states of the activations are similar to the up and down spins of the Ising model; hence, we try to draw an analogy between the activations and the Ising model. To verify whether a phase transition occurs, we plot the variance of the activations on the sentend vector over 1000 epochs.
From Figure 6.2.2, we observe that the maximum variance occurs in the region of the "critical epoch" for LSTM 2 gate 1 in the word2vec model. The same observation can be made for the same layer and gate at the "critical epoch" of the GloVe model, illustrated in Figure 6.2.4. This suggests that a phase transition could be occurring in LSTM 2 gate 1.
Instead of computing the activations on the sentend vector, we also fed the vector for the word 'java_14' into the word2vec and GloVe models. Figure 6.2.3 and Figure 6.2.5 show that, in this case, a maximum does not occur at the "critical epoch".
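A minimal sketch of this variance computation for one sigmoid gate (the per-epoch weight arrays and their file names are hypothetical placeholders for however the checkpoints were saved):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

SENTEND = np.ones(300)

def gate_activation_variance(W_k, W_r, b, h_prev=np.zeros(300)):
    """Variance of the first-gate sigmoid activations for the sentend input."""
    activation = sigmoid(W_k @ SENTEND + W_r @ h_prev + b)
    return activation.var()

# Hypothetical loop over saved checkpoints at 100-epoch intervals.
for epoch in range(0, 1001, 100):
    data = np.load(f"lstm2_gate1_epoch{epoch}.npz")          # placeholder file names
    W_k, W_r, b = data["kernel"], data["recurrent_kernel"], data["bias"]
    print(epoch, gate_activation_variance(W_k, W_r, b))
```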
Figure 6.2.2 Variance plot for the word2vec activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100-epoch intervals.
Figure 6.2.3 Variance plot for the word2vec activations in the first gate (sigmoid), second LSTM layer, over 1000 epochs at 100-epoch intervals, when the word 'java_14' is fed into the activation.
Figure 6.2.4 Variance plot for the GloVe activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100-epoch intervals.
Figure 6.2.5 Variance plot for the GloVe activations in the first gate (sigmoid), second LSTM layer, over 1000 epochs at 100-epoch intervals, when the word 'java_14' is fed into the activation.
Plots of variance against epoch for each gate in all the other layers can be found in appendix 9.3.2.
From the results, we found that the variance of these activations was not at its maximum during the abrupt change in accuracy. This was consistent for both the word2vec and GloVe models. We therefore do not have enough evidence to support the hypothesis that the abrupt increase was due to a phase transition.
7 Future work and Conclusion
As natural language processing is a relatively new field of study, much about it remains poorly understood, and there are still many open questions for experts in the field. Researchers spend their careers understanding this branch of data science. As someone new to data science, given 1 year for this project on top of school and a part-time internship, time was a major limitation. Hence, there are some major improvements that could be made to the results of this study.
Firstly, more research could be done on the natural language processes applied to the corpus. In chapter 2.3, we discussed the various forms of analysis and mentioned that we only applied the most basic natural language processes: tokenization, stop word removal and special character removal. There are many more methods mentioned in chapter 2.3 which may improve the quality of the output.
Another potential improvement concerns the word embeddings used. Currently, we are doing transfer learning, using word embeddings trained on other contexts: Google News for word2vec and Common Crawl for GloVe and fastText. These are general-purpose embeddings trained on large corpora and are not specific to the topic of interest. For instance, many irrelevant words such as 'samantharonson_@', 'porn.the' and 'AP_HOCKEY_NEWS' appeared in the replies of the chatbot. Training the word embeddings on the corpus itself, with only relevant words, might produce an improved result; however, this would require a lot more time and computational power.
One of the most critical improvements, applicable to all data science problems, is to train on a larger dataset. The luxury of a supercomputing lab was not available, and thus access to computational resources was limited. We trained the models on a home computer with only 8 GB of RAM. Each model took approximately 5 days to train, and initially we had to keep retraining the models as we were not sure what data we wanted to record. Attempts were made to use the National Supercomputing Centre Singapore (NSCC) computers, a resource freely available to NUS students. However, at the start of the project there was a water pipe leakage at their centre, damaging their infrastructure and making most services unavailable. Subsequently, when it was back online, we tried again; however, because of the complexity of the Python libraries involved (older, discontinued libraries such as Theano were used in the code), we did not have enough time to either change the code or figure out how to get the libraries installed. We were therefore only able to train on 50,000 rows of the full corpus. Training on the full corpus would allow the weights to better generalize the model, possibly giving a better result.
The dataset used in this study, the Ubuntu Dialogue Corpus, is meant for users to discuss technical issues. Many of the responses are specific to a particular problem, with specific steps to solve a particular issue. This is very difficult to generalize, and users require the exact steps to solve their specific problem. A rule-based chatbot might be more suitable for this problem setting; at this stage, a generative-based chatbot might be more suitable for a more general question-and-answer setting.
In this study, we made only the most basic implementation of a user interface: we had to run the code and talk to the chatbot in the Python terminal. In the future, when the accuracy of the chatbot is improved with the suggestions mentioned above, or with other modelling techniques, we could complete the chatbot by integrating it into a chatbot user interface, such as Telegram, or even building an entire interface for it. This might benefit Ubuntu users by providing them the convenience of instant customer service.
The physics intuition used to analyse the deep learning models led us to explore possible evidence of a critical phase transition in the training process: that is, whether the transition of the model from 'unlearned' to 'learned' is a critical phase transition analogous to many physical processes. From the limited analysis in our work, such evidence is lacking. It could be the genuine absence of a 'critical' phase transition, or it could be that the models we employed simply do not perform well enough, meaning they have not really reached the 'learned' phase. Other models or numerical methods could be applied to further explore the nature of the evolution during model training, and that could be future work.
We hope that in the future, as we conduct more research on the tools required to build the chatbot, we will gain a better understanding of data science and neural networks, with the possibility of marrying physics theories to accelerate the advancement of artificial intelligence. With a fully functioning generative-based chatbot connected to an open domain, we will be able to achieve what was discussed in section 1.1 and even more!
8 Bibliography
[1] EY - Analytics. Retrieved from https://www.ey.com/gl/en/issues/business-environment/ey-analytics
[2] Wetstein, S. (2017). Designing a Dutch financial chatbot (Masters). VU University Amsterdam.
[3] Schneider, C. (2017). 10 reasons why AI-powered, automated customer service is the future. Retrieved from https://www.ibm.com/blogs/watson/2017/10/10-reasons-ai-powered-automated-customer-service-future/
[4] AI for Customer Service | IBM Watson. Retrieved from https://www.ibm.com/watson/ai-
[20] Ray, S. (2017). Understanding and coding Neural Networks From Scratch in Python and R. Retrieved from https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/
[21] Ubuntu IRC Logs. Retrieved from https://irclogs.ubuntu.com/2007/12/12/%23ubuntu.html
[22] Lowe, R., Pow, N., Serban, I. V., & Pineau, J. (2015). The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. School of Computer Science, McGill University, Montreal, Canada.
[23] Tatman, R. (2017). Ubuntu Dialogue Corpus. Retrieved from https://www.kaggle.com/rtatman/ubuntu-dialogue-corpus
[24] Parrish, A. (2018). Understanding word vectors: A tutorial for "Reading and Writing Electronic Text," a class I teach at ITP. (Python 2.7) Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. Retrieved from http://www.cs.cornell.edu/courses/cs2112/2018fa/lectures/lecture.html?id=parsing
[26] NSS. (2017). Intuitive Understanding of Word Embeddings: Count Vectors to Word2Vec. Retrieved from https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
[27] Ruder, S. (2016). An overview of word embeddings and their connection to distributional semantic models - AYLIEN. Retrieved from http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/
[28] Heidenreich, H. (2018). Introduction to Word Embeddings | Hunter Heidenreich. Retrieved from
9.1 Weight and Bias Results
As discussed in section 5.1.2 on the weights and biases, we look at the plots of the results in detail.
9.1.1 Evolution over 1000 epochs
In this section, we plot the kernel, recurrent kernel and bias of the individual LSTM layers for each model
across 1000 epochs, allowing us to study how the weights and biases evolve in each layer.
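Below is a minimal sketch of how such histograms could be generated, assuming the trained networks are Keras models whose weights were saved at checkpoint epochs under hypothetical file names such as word2vec_epoch_100.h5; only the kernel of one LSTM layer is shown, and the recurrent kernel and bias are handled in the same way.

import matplotlib.pyplot as plt
from tensorflow.keras.models import load_model

lstm_layer_index = 0                    # index of the LSTM layer being inspected (assumed)
epochs_to_plot = [0, 100, 500, 1000]    # checkpoint epochs to overlay

plt.figure()
for epoch in epochs_to_plot:
    # hypothetical checkpoint file name; the actual naming scheme may differ
    model = load_model("word2vec_epoch_{}.h5".format(epoch))
    kernel, recurrent_kernel, bias = model.layers[lstm_layer_index].get_weights()
    # histogram of the kernel weights, with counts shown on a log scale
    plt.hist(kernel.flatten(), bins=100, histtype="step", log=True, label="epoch {}".format(epoch))
plt.xlabel("kernel weight value")
plt.ylabel("count (log scale)")
plt.legend()
plt.show()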
9.1.1.1 Word2vec
Figure 9.1.1.1.1 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec model first LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the first layer of the word2vec network, we observe the weights redistributing away from the
center, where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 1, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 1, the weight distribution is more symmetrical on both
sides. We observed that the movement of weights in the recurrent kernel is greater than in the kernel.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.1.2 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec model second LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the second layer of the word2vec network, we observe the weights redistributing away from the
center, where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 2, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 2, the weight distribution is also not symmetrical and is likewise skewed
towards the negative values. We observed that the movement of weights in the recurrent kernel is
greater than in the kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the second layer than
in the first layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.1.3 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec model third LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the third layer of the word2vec network, we observe the weights redistributing away from the
center, where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 3, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 3, the weight distribution is also not symmetrical and is likewise skewed
towards the negative values. We observed that the movement of weights in the recurrent kernel is
greater than in the kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the third layer than
in the second layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.1.4 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec model fourth LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the fourth layer of the word2vec network, we observe the weights redistributing away from the
center, where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 4, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 4, the weight distribution is also not symmetrical; however, the
weights are skewed more towards the positive values. We observed that the movement of weights
in the recurrent kernel is greater than in the kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the fourth layer than
in the third layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out; the distinction between the 2 peaks was no longer observable by the end of training.
9.1.1.2 GloVe
Figure 9.1.1.2.1 Graphs with the distribution of kernel, recurrent kernel and biases for the GloVe model first LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the first layer of the GloVe network, we observe the weights redistributing away from the center,
where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 1, the weight distribution is rather symmetrical on both the positive and
negative sides. For recurrent kernel 1, the weight distribution is not symmetrical and is skewed
towards the negative values. We observed that the movement of weights in the recurrent kernel is
greater than in the kernel.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.2.2 Graphs with the distribution of kernel, recurrent kernel and biases for the GloVe model second LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the second layer of the GloVe network, we observe the weights redistributing away from the center,
where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 2, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 2, the weight distribution is also not symmetrical and is likewise skewed
towards the negative values. We observed that the movement of weights in the kernel is greater
than in the recurrent kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the second layer than
in the first layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.2.3 Graphs with the distribution of kernel, recurrent kernel and biases for the GloVe model third LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the third layer of the GloVe network, we observe the weights redistributing away from the center,
where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 3, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 3, the weight distribution is also not symmetrical; however, the
weights are skewed more towards the positive values. We observed that the rate at which the
weights redistribute is similar for the kernel and the recurrent kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the third layer than
in the second layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.2.4 Graphs with the distribution of kernel, recurrent kernel and biases for the GloVe model fourth LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the fourth layer of the GloVe network, we observe the weights redistributing away from the center,
where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 4, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 4, the weight distribution is also not symmetrical; however, the
weights are skewed more towards the positive values. We observed that the movement of weights
in the recurrent kernel is greater than in the kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the fourth layer than
in the third layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out; the distinction between the 2 peaks was no longer observable by the end of training.
9.1.1.3 fastText
Figure 9.1.1.3.1 Graphs with the distribution of kernel, recurrent kernel and biases for the fastText model first LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the first layer of the fastText network, we observe the weights redistributing away from the center,
where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 1, the weight distribution is rather symmetrical on both the positive and
negative sides. For recurrent kernel 1, the weight distribution is also symmetrical on both the positive and
negative sides. We observed that the movement of weights in the recurrent kernel is greater than
in the kernel.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.3.2 Graphs with the distribution of kernel, recurrent kernel and biases for the fastText model second LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the second layer of the fastText network, we observe the weights redistributing away from the
center, where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 2, the weight distribution is not symmetrical and is skewed towards the
negative values. For recurrent kernel 2, the weight distribution is also not symmetrical and is likewise skewed
towards the negative values. We observed that the movement of weights in the kernel is greater
than in the recurrent kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the second layer than
in the first layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.3.3 Graphs with the distribution of kernel, recurrent kernel and biases for the fastText model third LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the third layer of the fastText network, we observe the weights redistributing away from the
center, where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 3, the weight distribution is not symmetrical and is skewed towards the
positive values. For recurrent kernel 3, the weight distribution is rather symmetrical on both
sides. We observed that the movement of weights in the recurrent kernel is greater than in the kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the third layer than
in the second layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out.
Figure 9.1.1.3.4 Graphs with the distribution of kernel, recurrent kernel and biases for the fastText model fourth LSTM layer, over the evolution of 1000 epochs, with its quantity plotted on a log scale
For the fourth layer of the fastText network, we observe the weights redistributing away from the
center, where the values of the weights were close to 0.
The weights were most volatile within the first 100 epochs, and the redistribution becomes less vigorous as the
epochs increase. For kernel 4, the weight distribution is not symmetrical and is skewed towards the
positive values. For recurrent kernel 4, the weight distribution is also not symmetrical, and the weights are
skewed more towards the positive values. We observed that the movement of weights in the
recurrent kernel is greater than in the kernel.
The weights for both the kernel and the recurrent kernel spread out more rapidly in the fourth layer than
in the third layer.
The bias is initialized with 2 peaks, at 0 and 1. Similar to the weights, the biases redistribute
outwards, with the first 100 epochs being the most volatile. We observed the two peaks merging as the
biases spread out; the distinction between the 2 peaks was no longer observable by the end of training.
9.1.2 Model Comparison
In this section, we plot the kernel, recurrent kernel and bias of the individual LSTM layers at the 1000th
epoch across the 3 different models, allowing us to study how the weights and biases differ between the
models at the end of training.
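As a rough numerical complement to the visual comparison, the spread of each model's final-epoch weights could also be quantified, for example by their standard deviation; the sketch below assumes the same hypothetical Keras checkpoints as above.

import numpy as np
from tensorflow.keras.models import load_model

for name in ["word2vec", "glove", "fasttext"]:
    # hypothetical final-epoch checkpoint file for each embedding model
    model = load_model("{}_epoch_1000.h5".format(name))
    kernel, recurrent_kernel, bias = model.layers[0].get_weights()
    print(name,
          "kernel std:", np.std(kernel),
          "recurrent kernel std:", np.std(recurrent_kernel))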
Figure 9.1.2.1 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec, GloVe and fastText model first LSTM layer, at the 1000th epoch, with its quantity plotted on a log scale
In the first layer of the 3 models, we observed that the weights from the word2vec model are the most
heterogeneous while those from the GloVe model are the least heterogeneous, in both the kernel and the
recurrent kernel. However, the biases of the GloVe model had the widest distribution while those of the
fastText model had the narrowest distribution.
Figure 9.1.2.2 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec, GloVe and fastText model second LSTM layer, at the 1000th epoch, with its quantity plotted on a log scale
In the second layer of the 3 models, we observed that the kernel weights of all models are distributed
almost identically. The recurrent kernel distributions of GloVe and fastText are similar, whereas the
word2vec model is slightly less spread out over the positive values. The bias distributions of all 3
models are very similar.
Figure 9.1.2.3 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec, GloVe and fastText model third LSTM layer, at the 1000th epoch, with its quantity plotted on a log scale
In the third layer of the 3 models, we observed that the weights of all models are distributed almost
identically in both the kernel and the recurrent kernel. The bias distributions of all 3 models are also very similar.
Figure 9.1.2.4 Graphs with the distribution of kernel, recurrent kernel and biases for the word2vec, GloVe and fastText model fourth LSTM layer, at the 1000th epoch, with its quantity plotted on a log scale
In the fourth layer of the 3 models, we observed that the kernel weights of the fastText model are the
most heterogeneous over the positive values, while those of the word2vec model are the most heterogeneous
over the negative values. GloVe had the narrowest distribution on both the positive and negative sides. The
weight distribution in the recurrent kernel is the widest for the word2vec model and the narrowest for the
GloVe model. The bias distribution of the GloVe model was the widest over the positive values, and that of
the fastText model was the widest over the negative values.
9.2 Technical Results
In this section, we present the responses from the 3 different chatbot models to 100 questions. The first
50 questions are questions that the chatbot was trained on, while the next 50 are questions from the
Ubuntu Dialogue Corpus that the chatbot was not trained on. These 100 questions and answers were
randomized.
The first column gives the index of the table: 'Q' represents a question and 'A' represents an answer. For
instance, 'Q1' is the index for question 1 and 'A1' is the index for answer 1. The second column contains
the questions and answers; the table is structured such that each question is followed in the next row by
the actual answer given by the dialogue corpus.
The next 3 columns are the results from the 3 models: word2vec, GloVe and fastText respectively. The
row containing the question gives each model's response, while the row containing the answer gives
each model's score. This score is evaluated based on the accuracy of the answer and how human-like the
sentence is, with 3 being the best and 1 being the worst score.
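The total scores reported at the bottom of the table could be tallied with a short script; the sketch below assumes the manual scores have been recorded in a hypothetical CSV file with one row per question and one column per model.

import pandas as pd

# hypothetical file of 100 rows with columns "word2vec", "glove", "fasttext"
scores = pd.read_csv("chatbot_scores.csv")
totals = scores[["word2vec", "glove", "fasttext"]].sum()
print(totals)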
Table 9.2.1 Questions and answers from trained models and answer provided by the Ubuntu Dialogue Corpus
[Table 9.2.1 (full question-and-answer listing omitted). Total score over the 100 questions: word2vec 129, GloVe 250, fastText 215.]
9.3 Activation Analysis
We analyse the reverse cumulative distribution function (CDF) and the variance of the activations of the 3
models when the sentend vector is fed into the LSTM network. The peak occurred at approximately the
260th epoch for word2vec and the 150th epoch for GloVe and fastText.
9.3.1 Reverse-CDF
We plotted the reverse CDF for the 3 models for each gate in each layer at intervals of 100 epochs.
Similar results were found in all 3 models.
From the graphs, apart from the 0th epoch, all the curves are close to flat across the range from 0 to 1
(sigmoid gates) or from -1 to 1 (tanh gate). The probability that an activation takes a value of either 0 or 1
(sigmoid gates), or of -1 or 1 (tanh gate), is almost 1; in other words, almost all activation values lie close
to these extremes.
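The reverse CDF, i.e. the probability that an activation is greater than or equal to a given value, could be computed as in the sketch below (not the exact code used here); the random array merely stands in for the activations collected at one gate and one epoch.

import numpy as np
import matplotlib.pyplot as plt

def reverse_cdf(values):
    # sort the activation values and return P(activation >= x) for each x
    x = np.sort(values)
    p = 1.0 - np.arange(len(x)) / len(x)
    return x, p

activations = np.random.uniform(0.0, 1.0, size=512)   # placeholder for collected activations
x, p = reverse_cdf(activations)
plt.step(x, p, where="post")
plt.xlabel("activation value")
plt.ylabel("P(activation >= x)")
plt.show()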
word2vec
Figure 9.3.1.1 Reverse CDF plot for the word2vec activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.2 Reverse CDF plot for the word2vec activations in the second gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.3 Reverse CDF plot for the word2vec activations in the third gate (tanh) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.4 Reverse CDF plot for the word2vec activations in the fourth gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
GloVe
Figure 9.3.1.5 Reverse CDF plot for the GloVe activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.6 Reverse CDF plot for the GloVe activations in the second gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.7 Reverse CDF plot for the GloVe activations in the third gate (tanh) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.8 Reverse CDF plot for the GloVe activations in the fourth gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
fastText
Figure 9.3.1.9 Reverse CDF plot for the fastText activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.10 Reverse CDF plot for the fastText activations in the second gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.11 Reverse CDF plot for the fastText activations in the third gate (tanh) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.1.12 Reverse CDF plot for the fastText activations in the fourth gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
9.3.2 Activation Variance
For each model, we plot the variance of the neuron activations against the epoch for all gates and layers.
The activations are calculated by feeding the sentend vector into the activation equations.
If a phase transition were to occur, the maximum variance should appear approximately at the "critical
epoch" of 260 for word2vec and of 150 for GloVe. From figures 9.3.2.1 and 9.3.2.6, LSTM 2 Gate 1 for
both the word2vec and GloVe models showed signs of a phase transition, as the maximum occurs around
the "critical epoch" region.
We then changed the vector fed into the activation equations to the embedding of the word 'java_14'.
From figures 9.3.2.2 and 9.3.2.7, the maximum variance no longer occurs near the "critical epoch" region.
This suggests that having the maximum variance at the critical epoch was coincidental and that there is
no indication of a phase transition.
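A minimal sketch of how the gate activations could be evaluated from a layer's saved weights is given below; it assumes the Keras LSTM weight layout, in which the kernel, recurrent kernel and bias are each split into four gate blocks ordered input, forget, cell candidate and output, and it takes the previous hidden state to be zero, with random arrays standing in for the actual embedding vector and saved weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_activations(x, kernel, recurrent_kernel, bias, h_prev=None):
    # pre-activations for all four gates, assuming Keras' (input_dim, 4*units) layout
    units = kernel.shape[1] // 4
    if h_prev is None:
        h_prev = np.zeros(units)
    z = x @ kernel + h_prev @ recurrent_kernel + bias
    z_i, z_f, z_c, z_o = np.split(z, 4)
    return sigmoid(z_i), sigmoid(z_f), np.tanh(z_c), sigmoid(z_o)

# x stands in for the embedding of the sentend token (or of 'java_14' in the
# second experiment); random weights are placeholders for a saved checkpoint.
x = np.random.randn(300)
kernel = np.random.randn(300, 4 * 256)
recurrent_kernel = np.random.randn(256, 4 * 256)
bias = np.zeros(4 * 256)

gates = gate_activations(x, kernel, recurrent_kernel, bias)
variances = [gate.var() for gate in gates]   # one variance per gate at this epoch
print(variances)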
word2vec (spike at approximately epoch 260)
Figure 9.3.2.1 Variance plot for the word2vec activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.2 Variance plot for the word2vec activations in the first gate (sigmoid), second LSTM layer, over 1000 epochs at 100 epoch intervals, when the word 'java_14' is fed into the activation.
Figure 9.3.2.3 Variance plot for the word2vec activations in the second gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.4 Variance plot for the word2vec activations in the third gate (tanh) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.5 Variance plot for the word2vec activations in the fourth gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
GloVe (spike at approximately epoch 150)
Figure 9.3.2.6 Variance plot for the GloVe activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.7 Variance plot for the GloVe activations in the first gate (sigmoid), second LSTM layer, over 1000 epochs at 100 epoch intervals, when the word 'java_14' is fed into the activation.
Figure 9.3.2.8 Variance plot for the GloVe activations in the second gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.9 Variance plot for the GloVe activations in the third gate (tanh) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.10 Variance plot for the GloVe activations in the fourth gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
fastText (no spike)
Figure 9.3.2.11 Variance plot for the fastText activations in the first gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.12 Variance plot for the fastText activations in the second gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.13 Variance plot for the fastText activations in the third gate (tanh) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
Figure 9.3.2.14 Variance plot for the fastText activations in the fourth gate (sigmoid) across the 4 LSTM layers, over 1000 epochs at 100 epoch intervals.
9.4 Codes
9.4.1 Chatbot Building
Data Preparation
1. import csv
2. import numpy as np
3. from pprint import pprint
4.
5.
6. # read the csv file of the dialogue corpus
7. filepath = "dialogueText_301.csv"
8.
9.
10. with open(filepath, "r", encoding='latin1') as f:
11.     reader = csv.reader(f)
12.     data = [line for line in reader]
13.
14. # skip the header row and keep roughly the first 50,000 rows
15. df = data[1:50000]
16.
17.
18. # grouping all the next neighbour text together