ICMSIT 2017: 4th International Conference on Management Science, Innovation, and Technology 2017 Faculty of Management Science, Suan Sunandha Rajabhat University (http://www.icmsit.ssru.ac.th)


Analysis and Evaluation of the Technique Applied in Word Representation Using Word2vec Algorithm

ANGELICA M. AQUINO, JASMIN D. NIGUIDULA

ABSTRACT

In this paper, the technique used in word representation is further studied and explained, since language representations have been an interest in the field of Natural Language Processing (NLP), a discipline concerned with how semantic knowledge is acquired, organized, and ultimately used in language processing and understanding. [5] Two different approaches are available for representing words: the count-based approach and the predict-based approach. Both are widely used and rely on the same linguistic theory. The primary purpose of this study is to analyze and evaluate the performance of the algorithm used in word representation, namely the word2vec algorithm. Since representations for language are essential in the field of NLP, it is expedient to study how computers process natural language. Hence, the next objective of this research is to study the methods or models used in creating representations for the language. The Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models were used as the architectures or engines designed for the word2vec algorithm. The use of wevi (Word Embedding Visual Inspector) is also explained in this study, emphasizing the word vectors generated from the neurons. The idea of creating representations is realized using word embeddings that preserve the similarity relationships of words. As a result of the analysis made in this study, when a pair of words is converted to a low-dimensional representation, the words are expected to be close to each other, and neural embedding, which continuously recreates a representation, eventually happens. Thus, the engines of the word2vec algorithm show a significant contribution to language representation by feeding contexts to the neural networks.

Technological Institute of the Philippines - Manila, [email protected]

Colegio de San Juan de Letran, Calamba City, [email protected]


INTRODUCTION

Word representations are objects that capture a word's meaning and its grammatical properties in a way that can be read and understood by computers. They map words into equivalence classes, such that words that share similar properties belong to the same equivalence class. Representations are either constructed manually by humans (in the form of word lexicons, dictionaries, etc.) or obtained automatically using unsupervised learning algorithms. One example of a word representation is a word lexicon, which can be used to construct sparse, interpretable word vectors that are competitive with current state-of-the-art models of distributional word representations.
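To make the idea of a lexicon-based representation concrete, here is a minimal sketch (the words and features below are invented for illustration, not taken from any actual lexicon): each word is mapped to a sparse, interpretable 0/1 vector over hand-chosen features, and words with the same feature pattern fall into the same equivalence class.

```python
# A toy, hand-built lexicon: each word is described by a few interpretable
# binary features (all invented here for illustration only).
LEXICON = {
    "song":  {"noun": 1, "music": 1},
    "poem":  {"noun": 1, "literature": 1},
    "write": {"verb": 1, "creation": 1},
    "sing":  {"verb": 1, "music": 1},
}

# The feature inventory defines the dimensions of the sparse vectors.
FEATURES = sorted({f for feats in LEXICON.values() for f in feats})

def to_sparse_vector(word):
    """Map a word to an interpretable 0/1 vector over the lexicon features."""
    feats = LEXICON.get(word, {})
    return [feats.get(f, 0) for f in FEATURES]

print(FEATURES)                        # dimension labels
print("sing ->", to_sparse_vector("sing"))
print("song ->", to_sparse_vector("song"))
```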

Feng and Lapata (2010) describe how words that are similar in meaning tend to behave similarly in terms of how they are distributed across different contexts. Latent Semantic Analysis (LSA) is the best-known semantic space model, and it operationalizes this idea by capturing word meaning quantitatively in terms of simple co-occurrence. Moreover, a language representation needs to represent words by the contexts they appear in, yet many NLP systems still use atomic word representations. This motivated the study of word embedding, which is significant in understanding how a computer processes natural language. There are several approaches to creating representations for language, and this study is structured around the performance of the word2vec algorithm. Word2vec is an algorithm used for word representation; it utilizes two model architectures to produce distributed representations of words: the Continuous Bag-of-Words (CBOW) and Continuous Skip-gram models.

Different methods have been used as learning algorithms in word representation, and these have contributed to the field of natural language processing (NLP), since understanding how a computer processes natural language is important. In order for a computer to process natural language, we need to create representations for the language. By means of unsupervised learning algorithms, producing word representations through word embedding over a large set of text has become practical, and word meaning can be learned from the linguistic environment. Moreover, word representations are supposed to capture a word's meaning in a language. However, the concept of "meaning" is usually abstract and difficult to interpret, and different forms of word representations capture different aspects of meaning. The research focus is to see how this difficulty has been resolved by analyzing the models used to obtain word vector representations and to find out how word vectors are produced accurately from the set of neurons, irrespective of the models that produced them. The applied algorithm shows the technique of producing word embeddings: it simply takes a large input of text and produces a vector space. Word vectors are placed in the vector space so that words which share common contexts in the body of text are positioned in close proximity to each other.
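As a minimal sketch of this behaviour, assuming the third-party gensim library (version 4 or later) is installed and using a toy corpus that is far too small to learn reliable similarities, word2vec can be trained and its vectors inspected as follows:

```python
from gensim.models import Word2Vec

# Toy corpus: each inner list is one "sentence" of co-occurring words.
sentences = [["write", "song"], ["write", "poem"], ["sing", "song"], ["sing", "loud"]]

# sg=0 selects the CBOW architecture; sg=1 would select Skip-gram instead.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0,
                 epochs=200, seed=1)

print(model.wv["song"])                       # the learned vector for "song"
print(model.wv.similarity("write", "sing"))   # cosine similarity between two word vectors
print(model.wv.most_similar("song", topn=2))  # words positioned closest to "song"
```

Switching sg=0 to sg=1 changes the same call from the CBOW engine to the Skip-gram engine.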


The rest of the paper explains further the concept of distributed word representation in a vector space, which supports learning algorithms in achieving better performance in natural language processing. Word vectors are represented by neurons, and this neuron representation contains an input layer, an output layer, and a hidden layer (see Figure 1).

Figure 1. Neurons Representation

In the layers shown above, the vectors are actually built. The input layer contains the input vectors, which are the weights between the input layer and the hidden layer, while the output layer has output vectors as the weights between the hidden layer and the output layer. The hidden layer simply acts as the inner neurons that communicate with the other neurons in the input and output layers. This further means that there is a propagation of data, and this is the distributed representation of a word, where a given word is represented as a continuous level of activation. That is why neural embedding of a word following a model like word2vec computationally recreates representations of a word.
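To make the role of these two weight matrices concrete, the short NumPy sketch below (with arbitrary random weights, purely for illustration) shows how a word's input vector is a row of the input-to-hidden matrix and its output vector is a column of the hidden-to-output matrix:

```python
import numpy as np

V, N = 5, 3                       # vocabulary size, hidden-layer (embedding) size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # weights between input layer and hidden layer
W_out = rng.normal(size=(N, V))   # weights between hidden layer and output layer

k = 2                             # index of some word in the vocabulary
input_vector = W_in[k]            # the word's "input vector": row k of W_in
output_vector = W_out[:, k]       # the word's "output vector": column k of W_out
print(input_vector, output_vector)
```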

RELATED WORKS

Several studies have been conducted to analyse and explain the models used for producing word embeddings. In fact, machine learning work on natural language has contributed substantially to language representation. Traditionally, words are treated as atomic units, meaning there is no notion of similarity between words, as they are represented as indices in a vocabulary. Word vectors with semantic relationships, by contrast, can be used to improve the capability of machine translation, information retrieval and question answering systems. Below are works that describe words and how they are represented in a computer.

The study entitled "Distributed Representations of Words and Phrases and their Compositionality" [Mikolov et al., 2013] stated that grouping similar words is a method of providing distributed representations of words in a vector space, which are computed using neural networks.


In one study, Altszyler et al. (2016) proposed that LSA (Latent Semantic Analysis) can be used to explore word associations. Corpus-based semantic representations exploit properties of textual structure to embed words in a vectorial space. In this space, terms with similar meanings tend to be located close to each other. These methods rely on the idea that words with similar meanings tend to occur in similar contexts. This proposition is called the distributional hypothesis and provides a practical framework to understand and compute the semantic relationship between words. Word embedding has been used in different applications such as sentiment analysis, psychiatry, psychology, philology, cognitive science and social science.

Moreover, in the study entitled "Diverse Context for Learning Word Representations", Faruqui (2016) explains the importance of understanding the meaning of words for effective natural language processing. For example, the word playing can mean taking part in a sport, participating in an activity, etc.; its part of speech is a verb with a gerund verb form. These are some of the aspects of word meaning which help us identify the word's role in a language. Thus, for developing natural language processing models that can understand languages, it is essential to develop a structure that can capture aspects of word meaning in a form that is readable and understandable by computers.

According to Oates [Oates et al., 1999], using mutual information to cluster words that have similar syntactic structure affects their meaning. Meaning is attributed to the clusters of words by estimating the probability that one cluster of words co-occurs with another. Words are mapped to raw sensor data (via clusters or other abstractions) for the purposes of recognizing objects and scenes. Partially this is a result of a focus in previous work on learning nouns, where the goal has been discrimination of the environment. Machine learning of linguistic constructs requiring richer representational structure, such as that described in the mental-model literature, has not yet been addressed.

In Feng and Lapata's [Feng & Lapata, 2010] "Visual Information in Semantic Representation", the representation and modelling of word meaning is described as a central problem in cognitive science and natural language processing. Both disciplines are concerned with how semantic knowledge is acquired, organized, and ultimately used in language processing and understanding.

Moreover, in the study of Mikolov et al. [Mikolov et al., 2013] entitled "Efficient Estimation of Word Representations in Vector Space", with the progress of machine learning techniques in recent years it has become possible to train more complex models on much larger datasets, and they typically outperform the simple models. Probably the most successful concept is the use of distributed representations of words. The authors explained that the first proposed architecture is the Continuous Bag-of-Words model, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged). The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, each current word is used as an input to a log-linear classifier with a continuous projection layer, and words within a certain range before and after the current word are predicted.

In relation to the two architecture models stated above, the study entitled "Distributed Representations of Words and Phrases and their Compositionality" also explains the same architectures; however, the authors stress that word representations have an inherent limitation, which can be seen in their indifference to word order and their inability to represent idiomatic phrases. Addressing this challenge, they present a simple method for finding phrases in text and show that learning good vector representations for millions of phrases is also possible.

As to the two models mentioned above as the learning algorithms of word2vec, the book entitled "word2vec: From Theory to Practice" by Heuer presents the differences between the two models: CBOW and skip-gram work differently.

Skip-gram works well with a small amount of training data and represents even rare words or phrases well, while CBOW is several times faster to train than skip-gram and gives slightly better accuracy for frequent words (Mikolov, 2013).

In addition, the output of training the data is represented as a word vector, a finite-dimensional vector of real numbers that represents a word in the vector space. The dimensions stand for context items (for example, co-occurring words), and the coordinates depend on the co-occurrence counts. The similarity between two words can then be measured in terms of proximity in the vector space (Faruqui, 2016). This is the process of word embedding, which has been experimented with on large sets of text and which uses different methods to analyse semantic representations. Corpus-based semantic representations (i.e. embeddings) exploit statistical properties of textual structure to embed words in a vectorial space. In this space, terms with similar meanings tend to be located close to each other. These methods rely on the idea that words with similar meanings tend to occur in similar contexts (Altszyler, Sigman, & Slezak, 2016).
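The count-based view described above can be sketched in a few lines of NumPy (the corpus and the whole-sentence context window are chosen only for illustration): the dimensions are context words, the coordinates are co-occurrence counts, and similarity is proximity, here measured as cosine similarity, in that space.

```python
import numpy as np
from itertools import combinations

# Toy corpus: each sentence is treated as one context window.
corpus = [["write", "song"], ["write", "poem"], ["sing", "song"], ["sing", "loud"]]
vocab = sorted({w for sentence in corpus for w in sentence})
idx = {w: i for i, w in enumerate(vocab)}

# Count-based representation: dimensions are context words,
# coordinates are co-occurrence counts.
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for a, b in combinations(sentence, 2):
        counts[idx[a], idx[b]] += 1
        counts[idx[b], idx[a]] += 1

def cosine(u, v):
    """Proximity in the vector space, measured as cosine similarity."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

print(vocab)
print("cos(write, sing) =", round(cosine(counts[idx["write"]], counts[idx["sing"]]), 3))
```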

Similarly, word clustering is a process of grouping words that ideally captures syntactic, semantic, and distributional regularities among the words belonging to the group. Representing a word by its cluster id helps map many words to the same point, and hence leads to a reduction in the number of parameters (Faruqui, 2016).

According to Rong, there is a lack of material that comprehensively explains the parameter learning process of word embedding models in detail, which leads to confusion in understanding the working mechanism of such models. As a result, Rong explains the parameter learning through a visual inspector named Word Embedding Visual Inspector, or wevi for short. Wevi allows one to visually examine the movement of the input vectors and output vectors as each training instance is consumed.

METHODOLOGY

An experimental approach was used in this study. The results were obtained through the use of a tool named wevi to check and validate the performance of the word2vec algorithm in doing word embedding. Through the use of this tool, the stated problem or challenge was understood and the goals of the study were met. Figure 2 shows the window panes of wevi. Wevi, which uses the word2vec algorithm, runs in a browser, and its window has four panes: the control panel (top left), neurons (top right), weight matrices (bottom left), and vectors (bottom right).

Figure 2. Word Embedding Visual Inspector

The figure above shows how a given input in the control panel is processed by training the data to predict the output. We can specify what pair of words will be generated after processing the data. Basically, wevi, as a tool used in word representation, simply takes an input and then does vocabulary building and context building. Below is an example of test data (training data) entered in the control panel and the prediction for the output. Test data consist of context|target pairs. See Figure 3 in the results and discussion section.
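A minimal sketch of that first step, building the vocabulary and the (context, target) pairs from wevi-style context|target training data (the helper name parse_training_data is our own and not part of wevi), could look like this:

```python
def parse_training_data(text):
    """Split wevi-style training data 'context|target,context|target,...'
    into (context, target) pairs and build the vocabulary."""
    pairs = []
    for item in text.split(","):
        context, target = item.split("|")
        pairs.append((context.strip(), target.strip()))
    vocab = sorted({word for pair in pairs for word in pair})
    return pairs, vocab

pairs, vocab = parse_training_data("write|song,write|poem,sing|song,sing|loud")
print(pairs)  # [('write', 'song'), ('write', 'poem'), ('sing', 'song'), ('sing', 'loud')]
print(vocab)  # ['loud', 'poem', 'sing', 'song', 'write']
```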


RESULTS AND DISCUSSION

To analyze the output of word embedding using wevi, sample training data was run in the engine of word2vec (see the example below).

Training data: write|song, write|poem, sing|song, sing|loud
Predicted output: (shown in the neuron pane)

Figure 3. Sample Training Data and Predicted Output (Without Mapping of Neurons)

Below are the details of the neuron panel (with labels):

Figure 4. Neuron Panel

Legends:
X1 to X5: Word 1 to Word 5
h1 to h3: hidden neuron 1 to hidden neuron 3
O1 to O5: Output (target) 1 to Output (target) 5

INPUT LAYER: contains five (5) neurons for the 5 given words. These neurons (words) will be mapped and connected to the hidden layer.
HIDDEN LAYER: contains three (3) neurons. It transforms the inputs into something that the output layer can use to predict the output.
OUTPUT LAYER: contains five (5) neurons. It transforms the hidden-layer activations into whatever scale the output is wanted in.


As shown in Figure 4, the training data set given as the input is represented in the neurons pane, where the three (3) layers represent the input, hidden, and output layers respectively. As illustrated in the neuron panel (see Figure 4), the word2vec algorithm uses a single hidden layer in a fully connected neural network. The neurons in the hidden layer are all linear neurons. The input layer is set to have as many neurons as there are words in the vocabulary of the training data set, the hidden layer size is set to the dimensionality of the resulting word vectors, and the size of the output layer is the same as that of the input layer. Thus, assuming that the vocabulary for learning word vectors consists of V words and N is the dimension of the word vectors, the connections from the input layer to the hidden layer can be represented by a matrix WI of size V x N, with each row representing a vocabulary word. In the same way, the connections from the hidden layer to the output layer can be described by a matrix WO of size N x V. Below is the mathematical explanation of the CBOW model:

Given the vocabulary of V words and the N dimensions between the input layer and the hidden layer, a single context word is presented to the input layer as a one-hot vector x of length V:

x_k = 1 and x_k' = 0 for all k' ≠ k,

that is, only the unit for the k-th word (the given context word) is active and every other input unit is zero. The hidden layer is then

h = WI^T x,

which simply copies the k-th row of WI, so h holds the N-dimensional vector of the given context word and of no other word.

To show how the words in the input layer are mapped to the neurons in the hidden layer, and how the hidden layer connects to the output layer, the weight matrices are shown and explained below (see Figure 4 to cross-check).

Weight matrix between the input layer and the hidden layer (5 neurons in the input layer as rows, 3 neurons in the hidden layer as columns):

      h1   h2   h3
X1   X11  X12  X13
X2   X21  X22  X23
X3   X31  X32  X33
X4   X41  X42  X43
X5   X51  X52  X53

Rows 1 to 5 represent the words given in the input layer (word 1 to word 5); each word (context) is mapped to the hidden layer, which transforms the inputs into something that the output layer can use to predict the output.

Weight matrix between the hidden layer and the output layer (3 neurons in the hidden layer as rows, 5 neurons in the output layer as columns):

      O1   O2   O3   O4   O5
h1   O11  O12  O13  O14  O15
h2   O21  O22  O23  O24  O25
h3   O31  O32  O33  O34  O35


Columns 1 to 5 represent the words given in the output layer; the neurons in the hidden layer connect to the word (target) in the output layer. Another figure below shows the example mapping of words in the neuron panel. As per the example, sing is matched (embedded) with the output word loud, meaning the observed similarity of the two words is an example of word embedding with similar context. The similarity of words can be observed not only in similar contexts but also in synonyms, antonyms, hyponyms and co-hyponyms. Word embedding is important in order for a computer to process natural language, and in order to process natural language, the creation of a representation for the language is a necessity. [8] Word embedding computes the similarities between words by mapping input words to output words in the neuron representation. Using CBOW as the engine of word2vec, the target word is predicted based on the context.
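The NumPy sketch below (using random, untrained weights, and intended only to illustrate the data flow rather than wevi's internals) traces one such CBOW forward pass with the matrices WI and WO described earlier: a one-hot context word activates one row of WI, and the output layer then scores every vocabulary word as a candidate target.

```python
import numpy as np

vocab = ["write", "sing", "song", "poem", "loud"]
V, N = len(vocab), 3                       # vocabulary size V, hidden-layer size N
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
W_I = rng.normal(scale=0.1, size=(V, N))   # input -> hidden weights (V x N)
W_O = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights (N x V)

def predict(context_word):
    """One CBOW forward pass for a single context word, as in the wevi example."""
    x = np.zeros(V)
    x[idx[context_word]] = 1.0             # one-hot input: x_k = 1, all other entries 0
    h = W_I.T @ x                          # hidden layer = row k of W_I (linear neurons)
    u = W_O.T @ h                          # one score per word in the output layer
    p = np.exp(u) / np.exp(u).sum()        # softmax: probability of each candidate target
    return dict(zip(vocab, np.round(p, 3)))

print(predict("sing"))                     # weights are untrained, so output is near uniform
```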

Using the training data below, Figure 5 shows the neuron panel of the wevi window where the mapping of neurons occurs.

Training data: write|song, write|poem, sing|song, sing|loud
Predicted output: (shown in the neuron panel)

Figure 5. Sample Training Data and Predicted Output (With Mapping of Neurons)


Figure 5 shows an example of a word input as the context and a word output as the target, as well as the process of embedding them. The context and target are entered in the training box of the control panel. The neurons pane shows the interactions between the three layers of the neuron representation. The arrow signifies the next neuron to be matched, according to the sequence given in the training data. The presets box gives the option of testing the training data using either the CBOW or the Skip-gram engine. While the training data is being learned, the vector pane shows how the interactions of vectors occur; visually, the vectors move while the data is being learned. Regarding the quality of the different models used to produce word vectors, this paper provides a table showing an example of learning data consisting of the given words together with their most similar words (see Table 2). It was shown in wevi that the word write is similar (related) to song, and sing is similar (related) to song. The study also found that there were other similarities between words. For a better understanding of the engines of word2vec, Table 1 shows their differences.

Table 1. General Comparisons of word2vec Engines

Engines used by word2vec for word representation (learning algorithms):

Continuous Bag-of-Words (CBOW):
- Faster to train and more appropriate for a large amount of data
- Predicts the current word based on its context

Skip-gram:
- Slower to train but does a better job for infrequent (rare) words
- Exhausts the possibilities of classifying a word based on another word in the same sentence

The CBOW model was used in testing the data, and it was observed that word similarities can be determined using this model. To show the example and analysis for the input data being trained, Table 2 presents the parameters used to compare the word vectors produced from learning the data.


Table 2. Analysis of the vectors produced from the size of the hidden layer

Training data, context|target (input data): write|song, write|poem, sing|song, sing|loud
Parameters used: hidden size (for the hidden layer), vectors (structure produced), interpretation

Hidden size 3. Vectors (structure produced): shown in the wevi vector pane. Interpretation: Using 3 as the hidden size, there are 3 neurons in the hidden layer, which affects how the other neurons are positioned in the neural space; the proximity is much closer than with 5 as the hidden size.

Hidden size 5. Vectors (structure produced): shown in the wevi vector pane.
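A rough analogue of this comparison can be run outside wevi (again assuming the third-party gensim library, version 4 or later; on such a tiny corpus the similarity values are not meaningful, and the point is only that the hidden size fixes the dimensionality of the produced vectors):

```python
from gensim.models import Word2Vec

pairs = [["write", "song"], ["write", "poem"], ["sing", "song"], ["sing", "loud"]]

for hidden_size in (3, 5):
    model = Word2Vec(pairs, vector_size=hidden_size, window=1, min_count=1, sg=0,
                     epochs=500, seed=7)
    print(hidden_size,
          model.wv["song"].shape,                         # dimensionality = hidden size
          round(model.wv.similarity("write", "sing"), 3))  # proximity of two words
```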

This proves the validity of the information collected and analyzed in the research study. Furthermore, with word embedding as the language model used in natural language processing (NLP), where words or phrases from the vocabulary are mapped to vectors, the data are trained to reconstruct the linguistic contexts of words. It simply takes a large amount of text and produces a vector space with several dimensions, and each unique word in the text is assigned a matching vector in that space. The word vectors are placed in the vector space in such a way that words or groups of words that share a common context in the text are located near one another in the space. Furthermore, to achieve better performance of a learning algorithm in natural language processing, similar words are grouped and represented in a vector space. A word lexicon uses an algorithm for representing a word, and it is a kind of unsupervised learning to obtain vector representations of words; training is done on combined global word-word co-occurrence statistics from a corpus to obtain the resulting representations.

CONCLUSION

Learning algorithms such as the CBOW model have contributed to the field of natural language processing (NLP), as they are used as features for NLP tasks and machine learning algorithms. Representing words using an artificial neural network (ANN) as the information-processing paradigm projects the power of information theory onto the field of NLP, as it can handle and process large amounts of data. Neural networks address the gap in categorizing text much as human brains do, especially for large data sets. It was established in this study that the word2vec algorithm can perform word representation and that mapped pairs of words can determine the similarity of words following a method called CBOW. It was an interesting result of this study that the use of word vectors can definitely support how data is to be represented. Utilizing the Continuous Bag-of-Words (CBOW) model as the tool to validate the result of word embedding provides word representation, resulting in a powerful technique in the field of NLP. Thus, it was verified that word2vec is an effective technique for word representation that can perform word embedding and can be applied in NLP tasks like sentiment analysis, syntactic parsing, text classification and document clustering. Furthermore, the techniques mentioned in this paper can be further explored to enhance the capabilities of the applied algorithm in word representations.

REFERENCES

Altszyler, E., Sigman, M., & Slezak, D. (2016). Comparative Study of LSA vs Word2vec Embeddings in Small Corpora: A Case Study in Dreams Database. Laboratorio de Inteligencia Artificial Aplicada, Depto. de Computación.

Burns, B., Sutton, C., Morrison, C., & Cohen, P. (2000). Information Theory and Representation in Associative Word Learning. University of Massachusetts Amherst, Amherst, MA 01002.

Colyer, A. (2016, April 21). The amazing power of word vectors. Retrieved from https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

Faruqui, M. (2016). Diverse Context for Learning Word Representations. Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213.

Feng, Y., & Lapata, M. (2010). Visual Information in Semantic Representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL (Los Angeles, California), Association for Computational Linguistics, 91-99.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop.

Oates, T. (2001). Grounding Knowledge in Sensors: Unsupervised Learning for Language and Planning. PhD thesis, University of Massachusetts, Amherst.

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Computer Science Department, Stanford University, Stanford, CA 94305.

Rong, X. (2015). word2vec Parameter Learning Explained. Ann Arbor, Michigan.

Selamat, A., & Akosu, N. Word-length Algorithm for Language Identification of Under-resourced Languages. 28, 457-469.