LexBank: A Multilingual Lexical Resource for Low-Resource Languages by Feras Ali Al Tarouti M.S., King Fahd University of Petroleum & Minerals, 2008 B.S., University of Dammam, 2001 A Dissertation submitted to the Graduate Faculty of the University of Colorado at Colorado Springs in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Computer Science 2016
A lexical resource is a classified group of lexical units that provides some linguistic
information. The lexical units can be morphemes, words or multi-word phrases. The basic
unit of a lexical resource is usually called a lexical entry. Some lexical resources can
be used by humans directly, while others are machine readable. Lexical
resources are the foundation of most Natural Language Processing (NLP) applications.
There are many types of lexical resources. Depending on its type, a lexical resource
can provide syntactic, morphological, phonological or semantic information. Lexicons,
unilingual dictionaries, bilingual dictionaries and wordnets are examples of lexical
resources. A few fortunate languages, such as English and Chinese, have a
relatively large number of high-quality lexical resources. These languages are usually
called resource-rich. Most of the lexical resources of the resource-rich languages
have been painstakingly created by hand by researchers over many years. Unfortunately,
most other languages lack many of those lexical resources. Languages
that lack lexical resources are called resource-low or resource-poor languages.
While some of those languages have some resources, others barely have
any lexical resources. Especially poor in this context are the endangered languages around
the world.
One important resource that is very helpful in computational processing and in human
language learning is a thesaurus providing synonyms and antonyms of words. An extended
version of a thesaurus that provides additional relations among words in the computational
context is usually called a wordnet. A wordnet is a structured lexical ontology of words
that groups words based on their meaning using sets that are called synsets. For example,
the words helicopter, chopper, whirlybird and eggbeater are grouped in one synset that
has the gloss: an aircraft without wings that obtains its lift from the rotation of overhead
blades. The wordnet connects synsets with each other based on semantic relations. Word-
nets are used in many applications such as word sense disambiguation, machine translation,
information retrieval, text classification and text summarization.
The Princeton WordNet (PWN) is the original English version of such a wordnet and
has been painstakingly produced with diligent manual work augmented by the development
of computational tools, over several decades at Princeton University. Similar complete
wordnets have also been produced for a small number of additional languages such as
French (Sagot and Fišer, 2008), Finnish (Lindén and Carlson, 2010) and Japanese (Kaji and
Watanabe, 2006). Efforts to produce wordnets for a variety of other languages have been
proposed, but most are moving slowly, such as the efforts to construct the Asian wordnets
(Charoenporn et al., 2008) and Indian wordnets (Bhattacharyya, 2010).
Another important type of resource is the bilingual dictionary, an essential tool for
human language learners. Most existing (online) bilingual dictionaries are between two
resource-rich languages or between a resource-rich language and a resource-poor language.
It is fortunate that many endangered languages have one bilingual dictionary, created usu-
ally by explorers, evangelists or other scholars. However, dictionaries or translators for
translations between two resource-poor languages do not really exist. Wiktionary1, a
dictionary created by volunteers, supports over 171 languages, although coverage is poor for
many of them.
many of them. The online translation machines developed by Google2 and Microsoft3 pro-
vide pairwise translations, including translations for single words, for 90 and 51 languages,
respectively. While this is a wide range of languages, these machine translators still leave
out many widely-spoken languages, not to mention endangered ones.
In previous work we focused on developing new techniques that leverage existing
resources for resource-rich languages to build bilingual dictionaries, core wordnets
and other resources such as simple translators for resource-poor languages, including a few
endangered ones (Lam et al., 2014a,b, 2015b). In this thesis work, we take these resources
to the next level by improving their functionality, quality and coverage.
We present several new techniques that we did not use in our previous work. Our ultimate
goal is to produce an integrated multilingual lexical resource available online, one that
includes several important individual resources for several languages. We believe that our
resources will help researchers, speakers, learners and other users of these languages.
1.2 Research Focus
The goal of this dissertation is to create and make available multilingual lexical re-
sources for several languages by bootstrapping from a limited number of existing resources.
Our study has the potential not only to construct new lexical resources, but also to provide
support for communities using languages with limited resources. Additionally, our re-
1 http://en.wiktionary.org/wiki/Wiktionary:Main_Page
2 http://translate.google.com/
3 http://www.bing.com/translator
export of functionalities for the Serbian wordnet (Christodoulakis et al., 2002). Another
recent work develops a Java library, called JWI, for accessing the PWN and
compares it with eleven other libraries (Finlayson, 2014). The comparison between the
libraries was based on five features: special requirements, the similarity metrics used, the
ability to edit the wordnet, whether they need to work with the Maven project or not, and
forward compatibility with Java. Table 3.1 shows the tested libraries and Table 3.2 shows a
summary of the comparison.
Library      Standalone  Similarity Metrics  Editing  Maven  Minimum Java
CICWN        Yes         No                  No       No     1.6
extJWNL      No          No                  Yes      Yes    1.6
Javatools    Yes         Yes                 No       No     1.6
Jawbone      Yes         Yes                 No       No     1.6
JawJaw       Yes         Yes                 No       No     1.5
JAWS         Yes         No                  No       No     1.4
JWI          Yes         Yes                 No       No     1.5
JWNL         No          Yes                 No       Yes    1.4
URCS         Yes         No                  No       No     1.6
WNJN         No          No                  No       No     1.5
WNPojo       No          No                  No       No     1.6
WordnetEJB   No          No                  No       No     1.6
Table 3.2. A comparison between some of the Java libraries for accessing the PWN.
Another wordnet management tool was also presented recently for the IndoWordNet2
(Nagvenkar et al., 2014). The tool, which is called the Concept Space Synset Management
Tool3 (CSS), provides an interactive user interface for creating new language synsets and
2 http://www.cfilt.iitb.ac.in/indowordnet/
3 http://indradhanush.unigoa.ac.in/conceptspace
The core idea behind wordnet is to group words which are synonyms, or roughly syn-
onymous, into lexical categories that are called synsets. Then, semantic relations between
these synsets are established in a hierarchical manner. In this chapter, we present a method
to automatically construct the wordnet semantic relations such as Hypernyms, Hyponyms,
Member Meronyms, Part Meronyms and Part Holonyms using PWN.
4.1 Constructing Core Wordnets
In (Lam et al., 2014b) we introduced an approach, which we refer to as the IWND
approach, that creates wordnet synsets with relatively high coverage. As Figure 4.1 shows,
in IWND, to create wordnet synsets for a target language T we used existing wordnets and
a machine translator (MT) and/or a single bilingual dictionary. First, we extracted every
synset in Princeton WordNet (PWN) using the unique offset-POS key, which refers to the
offset for a synset with a particular part-of-speech (POS). Notice here that each synset
may have one or more words, each of which may be in one or more synsets. Words in a
synset have the same sense. Then, we extracted the corresponding synsets for each offset-
POS from existing wordnets linked to PWN, in several languages. Next, we translated
the extracted synsets in each language to T to produce synset candidates using MT or a
dictionary. Then, we applied a ranking method on these candidates to find the correct
words for a specific offset-POS in T.
Figure 4.1: Creating wordnet synsets using the IWND algorithm (Lam et al., 2014b).
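The IWND pipeline described above can be sketched in Python. The wordnet layout and the translate callable below are simplifying assumptions for illustration, not the dissertation's actual data structures; translate stands in for an MT system or a bilingual dictionary.

```python
from collections import Counter

def iwnd_candidates(offset_pos, intermediate_wordnets, translate):
    """Collect translation candidates for one PWN offset-POS (IWND sketch).

    intermediate_wordnets -- {name: {offset_pos: [words]}}, wordnets linked to PWN
    translate             -- translate(word, source) -> word in target T, or None
    Returns occurrence counts of each candidate and the number of distinct
    intermediate wordnets that contributed each candidate.
    """
    occur = Counter()
    sources = {}
    for name, wn in intermediate_wordnets.items():
        for word in wn.get(offset_pos, []):
            t = translate(word, name)
            if t:
                occur[t] += 1
                sources.setdefault(t, set()).add(name)
    return occur, {w: len(s) for w, s in sources.items()}

# Toy example: two intermediate wordnets both translate synset 02691156-n
# (hypothetical words and target form) to the same target-language word.
wns = {"fr": {"02691156-n": ["avion"]}, "fi": {"02691156-n": ["lentokone"]}}
tr = {"avion": "plane_T", "lentokone": "plane_T"}
occur, dst = iwnd_candidates("02691156-n", wns, lambda w, s: tr.get(w))
print(occur["plane_T"], dst["plane_T"])  # 2 2
```

Candidates confirmed by several distinct intermediate wordnets accumulate both a higher occurrence count and a higher distinct-source count, which the ranking step then exploits.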
The ranking method we used in (Lam et al., 2014b) is based on the occurrence count
of a candidate. Specifically, the rank of a word w, the so-called rank_w, is computed as
below.

rank_w = (occur_w / numCandidates) * (numDstWordNets / numWordNets)

where:
- numCandidates is the total number of translation candidates of an offset-POS,
- occur_w is the occurrence count of the word w among the numCandidates candidates,
- numWordNets is the number of intermediate wordnets used, and
- numDstWordNets is the number of distinct intermediate wordnets that have words
translated to the word w in the target language.
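As a sketch, with variable names taken from the definitions above:

```python
def rank(occur_w, num_candidates, num_dst_wordnets, num_wordnets):
    """IWND ranking of a candidate word w for one offset-POS.

    occur_w           -- occurrence count of w among the candidates
    num_candidates    -- total number of translation candidates
    num_dst_wordnets  -- distinct intermediate wordnets that yielded w
    num_wordnets      -- number of intermediate wordnets used
    """
    return (occur_w / num_candidates) * (num_dst_wordnets / num_wordnets)

# A candidate produced by 3 of 4 intermediate wordnets and occurring 3 times
# among 10 candidates outranks one produced by a single wordnet once.
print(round(rank(3, 10, 3, 4), 3))  # 0.225
print(round(rank(1, 10, 1, 4), 3))  # 0.025
```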
4.2 Constructing Wordnet Semantic Relations
Synsets in wordnet are linked in a hierarchical fashion. The hierarchy in wordnet is
established using the super-subordinate relation between synsets. For example, nouns are
linked using hyperonymy, which is a relation between general synsets and specific ones. An
example of a hyperonymy relation is the relation between the synsets {food, solid_food}
and {baked_goods}. The hyperonymy relation is transitive; for example, the synset {bread},
which is a hyponym of the synset {baked_goods}, is also a hyponym of the synset {food,
solid_food}. Table 4.1 shows the semantic relations available in wordnet (Wikipedia, 2015).
In (Lam et al., 2014b), we constructed core wordnets, which essentially means that
we created synsets with no connections between them. As Figure 4.2 shows, our goal is to
recover the taxonomy of synsets. To establish the semantic relations between the synsets
Phrase Type  Relation          Definition
Nouns        Hypernyms         Y is a hypernym of X if every X is a (kind of) Y
             Hyponyms          Y is a hyponym of X if every Y is a (kind of) X
             Coordinate terms  Y is a coordinate term of X if X and Y share a hypernym
             Meronyms          Y is a meronym of X if Y is a part of X
             Holonyms          Y is a holonym of X if X is a part of Y
Verbs        Hypernyms         The verb Y is a hypernym of the verb X if the activity X is a (kind of) Y
             Troponyms         The verb Y is a troponym of the verb X if the activity Y is doing X in some manner
             Entailments       The verb Y is entailed by X if by doing X you must be doing Y
             Coordinate terms  Those verbs sharing a common hypernym
Table 4.1. Wordnet semantic relations.
Figure 4.2: Core wordnet mapping to structured wordnet.
we created in (Lam et al., 2014b), we rely on Princeton WordNet (Fellbaum, 2005) as an
intermediate resource.
As Figure 4.3 shows, to construct the links between synsets in our wordnet for language
T, we extract each synset_i from wordnet_t and map it to synset_j, the corresponding
synset in the Princeton WordNet. Then, for each synset_j in Princeton WordNet,
we extract each semantic relation r_j and the linked synset_k. Next, we check the
availability of synset_k in wordnet_t. Finally, if synset_k is available in wordnet_t, we
add a relation between synset_i and synset_k to wordnet_t.
Figure 4.3: Creating wordnet semantic relations using intermediate wordnet.
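The four steps above can be sketched as follows; the dictionary shapes are simplifying assumptions for illustration, not the actual storage format:

```python
def recover_relations(wordnet_t, pwn_relations):
    """Recover semantic links for an automatically built core wordnet (sketch).

    wordnet_t     -- {offset_pos: [words]}, core wordnet for target language T,
                     keyed by the PWN offset-POS of each synset
    pwn_relations -- {offset_pos: [(relation, linked_offset_pos), ...]} from PWN
    A link is added only when both endpoints exist in wordnet_t, so missing
    synsets simply leave gaps (the fragmentation discussed below).
    """
    links = []
    for offset_pos in wordnet_t:
        for relation, linked in pwn_relations.get(offset_pos, []):
            if linked in wordnet_t:
                links.append((offset_pos, relation, linked))
    return links

# Toy example: "c-n" is missing from the target wordnet, so only one of the
# two PWN relations is recovered.
wn_t = {"a-n": ["x"], "b-n": ["y"]}
pwn = {"a-n": [("hypernym", "b-n"), ("hyponym", "c-n")]}
print(recover_relations(wn_t, pwn))  # [('a-n', 'hypernym', 'b-n')]
```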
We must note here that although we used some disambiguation methods when we
created the core wordnets, there are still words that are misplaced. This will cause some
false classification of synset relations. Another challenge is that translation leads to the loss
of some information. For example, it is very important to distinguish between classes and
instances in wordnets (Miller and Hristea, 2006). There is no guarantee that an instance
will not be translated into the target language as a class and vice versa. Furthermore, as
Figure 4.4 shows, since the core wordnets are automatically created, there will be some
missing synsets that are not available in the target languages. That will lead to
fragmentation of the recovered links. All of the above needs to be observed and dealt with to
obtain acceptable accuracy.
Figure 4.4: The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.
4.3 Experiment and Evaluation
In this section, we generate the semantic relations between synsets in three wordnets:
Arabic, Assamese and Vietnamese. We start by creating the core wordnets using the
algorithms we described in Section 4.1. Table 4.2 shows the result of creating the core wordnets
for the three languages. Next we apply our method, which is presented in Section 4.2,
to link the synsets. The algorithm was able to recover a total of 206,766 relations
Table 5.7. Precision of the Vietnamese wordnet we create.
Table 5.8. Examples of related words and their cosine similarity from our Arabic wordnet.
The precision of the synonyms, hypernyms, part-holonyms, and member-meronyms
we produce is 78.4%, 84.4%, 90.4%, and 79.6% respectively, with the threshold set to
0.288. This is higher than the precision obtained by (Lam et al., 2014b), which produces
synonyms with 76.4% precision when using just PWN. Furthermore, the precisions of the
Assamese and Vietnamese wordnets are shown in Tables 5.6 and 5.7. As shown in Tables
5.8, 5.9 and 5.10, our results suggest that lower precision in producing synsets reduces
the quality of the other created semantic relations. Our results clearly show that pairs with
higher cosine similarity are more likely to be semantically related. It confirms the benefit
of combining the translation method with word embeddings in the process of automatically
generating new wordnets.
5.9 Summary
In this chapter, we discuss an approach for enhancing the automatically generated
wordnets we create for low-resource languages. Our approach takes advantage of word
embeddings to enhance the translation method for automatic wordnet creation. We present
Table 5.9. Examples of related words and their cosine similarity from our Assamese wordnet.
Table 5.10. Examples of related words and their cosine similarity from our Vietnamese wordnet.
an application of our approach to producing a new Arabic wordnet. Our method
automatically produces Arabic synonyms with 78.4% precision and semantically related pairs of
words with up to 90.4% precision.
Chapter 6
SELECTING GLOSSES FOR WORDNET SYNSETS USING WORD EMBEDDINGS
Word embedding is a way to represent words as vectors in a multi-dimensional space
such that related words are represented as vectors with similar directions. It has been shown
that this model can be used to discover relations between words effectively. In this chapter,
we introduce a method to represent wordnet synsets in a similar way. A wordnet synset is
a group of synonymous words grouped together because they all represent the same concept.
Our proposed method can be used in several NLP applications such as word-sense
disambiguation and automatic wordnet construction. To test our method, we use it in the task of
selecting glosses for wordnet synsets of several languages.
6.1 Creating Language Model Using Word Embedding
We start by creating word embeddings using a corpus and the word2vec software
(Mikolov et al., 2013). word2vec is a two-layer feedforward neural-network learning
model that produces multi-dimensional vector representations of words. There are two
implementations of this learning model: the Skip-Gram (SG) implementation and the Continuous
Bag-Of-Words (CBOW) implementation. In the SG implementation, the model learns to predict the
words around a given word, while in the CBOW implementation the model learns to predict a word
within a given sequence of words.
6.2 Generating Vector Representation of Wordnet Synsets
In this section, we present our method to produce vector representations of wordnet
synsets. We build our method based on the vectors of the synonymous words produced by
the word embedding method. We believe that combining the vectors of synonymous words
into one vector can produce a way to represent meaning. Next, we describe our proposed
method to build the vector representation of synsets, which we call synset2vec.
Let
synset_i = {word_1, word_2, ..., word_j} be a synset in wordnet_x,
{n_1, n_2, ..., n_j} be the number of synsets for each word in synset_i, and
{V_1, V_2, ..., V_j} be the corresponding vectors for {word_1, word_2, ..., word_j} in the word
embedding model.
We identify two cases:
1. The first case is when a word which does not have any synonyms represents several
synsets, i.e., has more than one meaning. In this case, the vector produced by the
word embedding actually represents the combined meanings of the word. For
example, in PWN, the word “abduction” is the only word in both synset 00775460-
n, “the criminal act of capturing and carrying away by force a family member”, and
synset 00333037-n, “moving of a body part away from the central axis of the body”.
Hence, the vector for “abduction” actually represents both meanings.
2. The second case is when a word which has one or more synonyms has
one or more meanings. In this case, the synonyms might or might not have other
meanings as well. For example, the noun “spill” has four meanings in PWN and it
has six synonyms. Table 6.1 shows all the meanings of the noun “spill” and all its
synonyms in PWN.
Obviously, to generate a combined vector for a synset, we need a way to limit the
effect of the other meanings that the synonyms might hold. To do so, we start by solving
the second case, where the synsets have more than one word. In this case, we normalize
the vector of each word by dividing its coordinates by the number of synsets that the word
Synset Key   Gloss                                                                          Synonyms
00076884-n   a sudden drop from an upright position                                         {spill, tumble, fall}
00329619-n   the act of allowing a fluid to escape                                          {spill, spillage, release}
04277034-n   a channel that carries excess water over or around a dam or other obstruction  {spill, spillway, wasteweir}
15049594-n   liquid that is spilled                                                         {spill}
Table 6.1. Meanings of the noun “spill” and its synonyms.
belongs to. This reduces the noise when generating the synset vector caused by the other
meanings that a word can hold. We define the vector of synset_i, denoted V_si, as follows:

V_si = (1/j) * (V_1 * (1/n_1) + V_2 * (1/n_2) + ... + V_j * (1/n_j)).
Figure 6.1 shows an example of creating a vector for the synset 00076884-n, which includes
three words: spill, tumble and fall.
Figure 6.1: An example of creating a vector for a wordnet synset that includes more than one word.
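A minimal sketch of this normalization, assuming embedding vectors are plain Python lists and the synset count of each synonym is known:

```python
def synset_vector(word_vectors, synset_counts):
    """synset2vec for a multi-word synset (sketch).

    word_vectors  -- list of embedding vectors, one per synonym in the synset
    synset_counts -- n_k for each synonym: how many synsets it belongs to
    Each vector is down-weighted by 1/n_k before averaging, so the extra
    senses a polysemous synonym carries contribute less noise.
    """
    j = len(word_vectors)
    dim = len(word_vectors[0])
    vs = [0.0] * dim
    for vec, n in zip(word_vectors, synset_counts):
        for i, x in enumerate(vec):
            vs[i] += x / n
    return [x / j for x in vs]

# Two-synonym synset: the first word belongs to 2 synsets, the second to 1.
print(synset_vector([[2.0, 4.0], [4.0, 8.0]], [2, 1]))  # [2.5, 5.0]
```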
Next, we produce vectors for the synsets that share a single word, i.e., words that do
not have any synonyms and have more than one meaning. In this case, for each synset,
we produce the synset vector by combining the word vector with the vector of a word in a
related synset, e.g., a hypernym, a hyponym, or a meronym. For example, let synset_i and
synset_k be synsets that both include the same single word w. And let h1 be a word from the
hypernym of synset_i and h2 be a word from the hypernym of synset_k. We define the vector
of synset_i, denoted V_si, as follows:

V_si = (1/2) * (V_w * (1/n_w) + V_h1 * (1/n_h1)).

Similarly, we define the vector of synset_k, denoted V_sk, as follows:

V_sk = (1/2) * (V_w * (1/n_w) + V_h2 * (1/n_h2)).
Figure 6.2 shows an example of creating vectors for the two synsets of the word “abduction”.
In Appendix A.2 we list a Python implementation of the procedure.
Figure 6.2: An example of creating vectors for wordnet synsets that share a single word.
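Under the same list-based vector assumption, the two-synset case can be sketched as follows; the parameter names mirror the symbols above and are otherwise hypothetical:

```python
def single_word_synset_vectors(vw, nw, vh1, nh1, vh2, nh2):
    """Vectors for two synsets that share one word w (sketch).

    vw, nw    -- vector of w and the number of synsets w belongs to
    vh1, nh1  -- vector and synset count of a hypernym word of the first synset
    vh2, nh2  -- vector and synset count of a hypernym word of the second synset
    Mixing the shared word with each synset's own hypernym pulls the two
    synset vectors apart even though they share the word w.
    """
    def mix(va, na, vb, nb):
        return [(a / na + b / nb) / 2 for a, b in zip(va, vb)]
    return mix(vw, nw, vh1, nh1), mix(vw, nw, vh2, nh2)

# Shared word in 2 synsets; each hypernym word is monosemous here.
v1, v2 = single_word_synset_vectors([2.0, 2.0], 2, [4.0, 0.0], 1, [0.0, 4.0], 1)
print(v1, v2)  # [2.5, 0.5] [0.5, 2.5]
```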
6.3 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
In this section, we give one usage example of our model. We show how our proposed
model can be used in the automatic selection of glosses for wordnet synsets. The automatic
selection of a synset gloss is a word-sense disambiguation problem. A gloss is a short sentence
which is usually manually attached to a synset to clarify the meaning of the synset. This
short sentence can be a definition or an example sentence using one of the members of the
synset. We test our method using PWN and then apply it to automatically add glosses to
wordnets created in (Lam et al., 2014b).
In the following steps, we present our method to select a gloss for the synset_i we defined
in Section 6.2.
• Let G = {g_1, g_2, ..., g_v} be the set of candidate glosses that include a word belonging to
synset_i.
• To select the closest gloss to synset_i from G, we generate a vector for each gloss g_z in
G. We list a Python function for this step in Appendix A.3.
• Assume that the gloss g_z consists of the words {w_1, w_2, ..., w_d},
{m_1, m_2, ..., m_d} is the number of synsets for each word in g_z, and
{V_w1, V_w2, ..., V_wd} are the corresponding vectors for {w_1, w_2, ..., w_d}.
• We compute the vector of gloss g_z as follows:

V_gz = (1/d) * (V_w1 * (1/m_1) + V_w2 * (1/m_2) + ... + V_wd * (1/m_d)).

• Then, we compute the cosine similarity between the vector of each gloss g_z and V_si.
We present a Python implementation for this step in Appendix A.4.
• Finally, we select the gloss with the highest cosine similarity with V_si.
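The selection procedure just listed can be sketched end to end in pure Python. The dissertation's own implementations are in the appendices; the data-structure shapes here (token lists, dict lookups) are assumptions for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_gloss(synset_vec, glosses, word_vecs, word_synset_counts):
    """Pick the candidate gloss closest to a synset vector (sketch).

    glosses            -- list of candidate glosses, each a list of tokens
    word_vecs          -- {word: embedding vector}
    word_synset_counts -- {word: number of synsets the word belongs to}
    Gloss vectors use the same 1/m_k down-weighting as synset vectors.
    """
    best, best_sim = None, -1.0
    for gloss in glosses:
        words = [w for w in gloss if w in word_vecs]
        if not words:
            continue
        dim = len(next(iter(word_vecs.values())))
        vg = [0.0] * dim
        for w in words:
            for i, x in enumerate(word_vecs[w]):
                vg[i] += x / word_synset_counts.get(w, 1)
        vg = [x / len(words) for x in vg]
        sim = cosine(synset_vec, vg)
        if sim > best_sim:
            best, best_sim = gloss, sim
    return best, best_sim

# Toy example: the gloss whose words point in the synset's direction wins.
best, sim = select_gloss([1.0, 0.0], [["a"], ["b"]],
                         {"a": [1.0, 0.0], "b": [0.0, 1.0]},
                         {"a": 1, "b": 1})
print(best)  # ['a']
```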
For instance, as shown in Table 6.2, if we consider the word “abduction”, which belongs
to two synsets and does not have any synonyms, we notice that our algorithm was able to
distinguish between the two meanings and select the right gloss for both synsets.
Synset Key   Gloss                                                                      Cosine Similarity
00333037-n   the criminal act of capturing and carrying away by force a family member   0.172
             moving of a body part away from the central axis of the body               0.214
00775460-n   the criminal act of capturing and carrying away by force a family member   0.204
             moving of a body part away from the central axis of the body               0.189

Table 6.2. Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.4 Evaluation
In this section, we introduce two forms of evaluation. First, we apply our method
to select glosses for the PWN synsets. In this case, we directly compare our results to the
actual manually attached glosses in PWN. Then, we apply our method to attach glosses to
wordnet synsets generated by (Lam et al., 2014b). In this case, we ask human judges to
evaluate the resulting glosses for three languages: Arabic, Assamese and Vietnamese.
6.4.1 Using Synset2vec to Select Glosses for PWN Synsets
In order to evaluate our synset vector representation in the task of selecting glosses for
wordnets, we use it in the process of gloss selection for PWN synsets. We take advantage of
the glosses manually added to the synsets in PWN to automatically measure the precision of
our synsets representation. The following steps describe the evaluation process of selecting
glosses for PWN synsets.
• For each synset_i in PWN, we construct a set of candidate glosses. The candidate
glosses are extracted from PWN using the following method. First, the gloss attached
to synset_i in PWN is added to the candidate set of glosses. Next, to generate negative
glosses for synset_i, we extract words which belong to synset_i and other synsets, i.e.,
words that have the meaning of synset_i and one or more other meanings. This allows us to
examine the ability of the algorithm to differentiate between the different meanings
of synsets.
• We randomly select two types of synsets from PWN: synsets that have single words,
i.e., synsets that are represented by only single words, and synsets that include
multiple synonymous words.
• We generate the synset vectors using the algorithm we described in Section 6.2.
• Next, we generate the gloss vectors using the method we described in Section 6.3.
• Then, we compute the cosine similarity between synset_i and each gloss in the candidate
set.
• Finally, we select the gloss with the highest cosine similarity.
6.4.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets
In this section, we examine the precision of our method by applying it for the purpose
of selecting glosses from corpora to attach to the wordnets we create in the previous
chapters. In this experiment, we used the wordnets of three languages: Arabic, Assamese
and Vietnamese. Next, we describe the steps of evaluating glosses selected by our method
for the synsets of the target languages:
• For each synset_i in the target wordnet wordnet_t, we generate a set of candidate
glosses by extracting the set of sentences that include any member of synset_i from
the corpora we described in Section 5.7.
Synset Type     Number of Synsets   Precision
Single Member   1400                76.5%
Multi Member    600                 79.6%

Table 6.3. The precision of selecting glosses for PWN synsets
• We randomly select two types of synsets from wordnet_t: synsets that have single
words, i.e., synsets that are represented by only single words, and synsets that include
multiple synonymous words.
• We generate the synset vectors using the algorithm we described in Section 6.2.
• Next, we generate vectors for each sentence in the set of candidate glosses using the
method we described in Section 6.3.
• Then, we compute the cosine similarity between synset_i and each sentence in the
candidate set.
• Next, the top 3 sentences with the highest cosine similarity with synset_i are
selected.
• Finally, 3 native speakers of the target language are asked to evaluate the selected
sentences using a 5-point scale.
6.4.3 Results & Discussion
As shown in Table 6.3, we used our algorithm to select glosses for 1400 single-
member synsets from PWN. The algorithm achieved 76.5% precision. Also, we used it to
select glosses for 600 multi-member synsets from PWN. The precision was 79.6% in this
case.
In the second evaluation, we randomly selected 300 synsets from the Arabic, Assamese
and Vietnamese wordnets we create (100 synsets each). For each synset, we extracted
all the sentences that included any member of the synset from the corpora. The
Table 6.4. Examples of Arabic glosses we produce in our Arabic wordnet.
sentences were sorted according to their cosine similarity with the synset vector and the top
3 sentences were selected.
As shown in Table 6.7, the precision of selecting glosses for the Arabic synsets is
81.4% when selecting the sentences with the highest cosine similarity with the synset vector.
Furthermore, the precision of the top 2 and top 3 sentences is 70.4% and 65.8%
respectively. The overall precision of selecting glosses using our method for the Arabic synsets is
72.6%. Table 6.4 shows some examples of glosses we produce for the Arabic synsets along
with their cosine similarity values.
The precision of our method for selecting glosses for the Assamese synsets is 85.2%
when selecting the sentences with the highest cosine similarity. Moreover, the top 2 and
top 3 selected sentences achieved 83.2% and 84.6% respectively. The overall precision for
Assamese glosses is 84.4%. Table 6.5 shows some examples of glosses we produce for the
Assamese synsets along with their cosine similarity values.
The top Vietnamese glosses selected by our method have 39.4% precision. The top 2
and top 3 Vietnamese glosses selected by our method have 36.6% and 37% precision. Table
Table 6.5. Examples of Assamese glosses we produce in our Assamese wordnet.
6.6 shows some examples of glosses we produce for the Vietnamese synsets along with
their cosine similarity values.
In general, the precision of recently published algorithms (Apidianaki and Von Neumann,
2013) for the task of multilingual word-sense disambiguation is around 68.7%,
meaning that our algorithm shows better performance for English, Arabic and
Assamese. However, we notice that our method performs poorly on Vietnamese. The reason
behind the poor results with Vietnamese is that Vietnamese words are not separated by
white spaces (Gordon and Grimes, 2005). That means that the meaning of most words
can change based on the following words. This makes the process of generating the vectors
for both the synsets and sentences extremely difficult, since the word2vec algorithm assumes
that words are separated by white spaces. The same problem appears in the process of
automatically generating bilingual dictionaries for Vietnamese (Lam et al., 2015a). One
Table 6.6. Examples of Vietnamese glosses we produce in our Vietnamese wordnet.
Table 6.7. The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets
possible solution to this problem is replacing the white spaces within single Vietnamese
words with a special non-white character. This requires the existence of a language dictionary
to distinguish the words that include white spaces within them.
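A sketch of that proposed preprocessing, assuming such a word list is available. The greedy longest-match strategy and the underscore convention here are one plausible choice for illustration, not the dissertation's implementation:

```python
def join_compounds(tokens, dictionary, max_len=4):
    """Join multi-syllable Vietnamese words with underscores (sketch).

    tokens     -- whitespace-split syllables of a sentence
    dictionary -- set of known words with compounds written using '_',
                  e.g. {"máy_bay"} for the two-syllable word for "airplane"
    Greedy longest match: try the longest candidate compound first, so that
    word2vec later sees each compound as a single token.
    """
    out, i = [], 0
    while i < len(tokens):
        for k in range(min(max_len, len(tokens) - i), 1, -1):
            cand = "_".join(tokens[i:i + k])
            if cand in dictionary:
                out.append(cand)
                i += k
                break
        else:  # no compound starts here; keep the syllable as-is
            out.append(tokens[i])
            i += 1
    return out

print(join_compounds(["máy", "bay", "to"], {"máy_bay"}))  # ['máy_bay', 'to']
```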
6.5 Summary
In this chapter, we presented a new method for selecting synset glosses from a corpus.
The method can be used for low-resource languages to attach glosses to automatically
constructed wordnets. Our method presents a vector representation for wordnet synsets in
a multi-dimensional space. We construct a synset vector by combining the word embedding
vector of each synonym in the synset. Our evaluation showed that our method selects
glosses with precision up to 84.4%.
Chapter 7
LEXBANK: A MULTILINGUAL LEXICAL RESOURCE
Figure 7.1: An overview of LexBank system.
7.1 Introduction
In this chapter, we discuss the design and implementation of LexBank: a system that
provides access to the multilingual lexical resources we create in this dissertation. We aim
to give public users the ability to access and use the resources that we have created in our
project. The system provides wordnet search services for several resource-low languages in
addition to bilingual dictionary lookup services. In addition, the system receives evaluations
and feedback from users to improve the quality of the resources.
As Figure 7.1 shows, the system is divided into three layers: the Web interface, the
application layer and the database layer. The Web interface allows users to log into the system and
access the search services. The Web interface also provides a control panel for
administrators to allow them to manage the system. The application layer includes all the software
required to securely execute the users' requests. The database layer has two databases: the
lexical resources database and the system database. The system database stores user information
and the system settings. The design of the system allows the inclusion of new language resources
and easy modification.
7.2 Database Design
LexBank uses two databases: one for storing the system settings and one for storing
the lexical resources. We have used Microsoft SQL Server to construct the databases. The
SQL code we used to construct the databases is listed in Appendix B. Next, we describe
each database in detail.
7.2.1 The system settings database
There are two tables in the settings database: Users_Info and System_log. Next, we
describe both tables.
7.2.1.1 Users_Info
The Users_Info table contains information about the registered users. Following are the fields contained in the Users_Info table:
• UserId: a unique short alias, selected by the user, that identifies the user in the system.
• UserName: the full name of the user.
• UserEmail: the email address of the user.
• UserPwd: the encrypted password used by the user to access the system.
• UserPriv: a text field that determines the privileges of the user. There are two levels of users in the system. The first level, administrator, has the privilege of managing users and data in the system. The second level, client, has the privilege of browsing the available resources.
• UserStatus: this field specifies the status of the user. The status can be Active, Inactive or New.
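A row of this table can be pictured as a small typed record. The following Python sketch mirrors the fields described above; the class, the sample values, and the validation rules are illustrative assumptions, not the system's actual implementation (which stores rows in SQL Server).

```python
from dataclasses import dataclass

# Allowed values taken from the field descriptions above.
VALID_STATUSES = {"Active", "Inactive", "New"}
VALID_PRIVILEGES = {"administrator", "client"}

@dataclass
class UserInfo:
    user_id: str      # UserId: unique short alias chosen by the user
    user_name: str    # UserName: full name of the user
    user_email: str   # UserEmail: email address
    user_pwd: str     # UserPwd: stored in encrypted form
    user_priv: str    # UserPriv: "administrator" or "client"
    user_status: str  # UserStatus: "Active", "Inactive" or "New"

    def __post_init__(self):
        # Reject values outside the documented domains.
        assert self.user_status in VALID_STATUSES
        assert self.user_priv in VALID_PRIVILEGES

# A hypothetical row.
u = UserInfo("jdoe", "Jane Doe", "jdoe@example.com",
             "<encrypted>", "client", "New")
print(u.user_status)  # New
```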
7.2.1.2 System_log
The System_log table keeps a record of all user activities in the system. This helps
us with maintenance and with tracking the utilization of the system. The following fields
are contained in the System_log table:
• EventId: a unique key that identifies the event.
• EventDesc: a text description of the event.
• EventTime: the date and time of the event.
• UserId: the identification key of the user who performed the event.
7.2.2 The lexical resources database
The lexical resources database contains the resources we produce in this thesis. For
each language supported by the system, the database maintains tables for storing the core
wordnet, the semantic relations, the wordnet glosses, the evaluation data of the semantic
relations and the evaluation data of the wordnet glosses. Next, we describe each table in
this database.
7.2.2.1 CoreWordnet
The CoreWordnet table stores the wordnet synsets we create in this thesis. The
core wordnet groups synonym words into sets called synsets. In this table, synsets are
identified using the offset-pos of the corresponding synset in PWN. In PWN, the offset-pos
consists of two parts: a byte offset used to locate the synset in the data file and the part of
speech of the synset. Following are the fields in the CoreWordnet table:
• offset-pos: the offset-pos of the wordnet synset, which is used as an identifier for the synset.
• Member: a word belonging to the synset.
7.2.2.2 Sem_Relations
Whereas the synonymy relation is stored in the CoreWordnet table, other semantic
relations such as hypernymy and meronymy are stored in the Sem_Relations table. As
we described in Section 4.2, the semantic relations are directed. Therefore, we
must maintain the direction by specifying the side of each synset in the relation. The
Sem_Relations table contains the following fields:
• Left_offset-pos: this field specifies the offset-pos of the synset on the left side of the relation.
• Relation: a text field that specifies the relation between the left-side and right-side synsets.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation.
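Because the relations are directed, a lookup has to note which side of a relation a synset appears on. The following Python sketch shows the idea over toy Sem_Relations rows; the offsets and relation instances are invented for illustration.

```python
# Toy Sem_Relations rows: (Left_offset-pos, Relation, Right_offset-pos).
# Offsets are illustrative, not real PWN identifiers.
relations = [
    ("00001-n", "hypernym", "00002-n"),
    ("00002-n", "hypernym", "00003-n"),
]

def relations_of(offset_pos):
    """All relations in which a synset participates, tagged with its side."""
    hits = []
    for left, rel, right in relations:
        if left == offset_pos:
            hits.append(("left", rel, right))   # synset is the left side
        if right == offset_pos:
            hits.append(("right", rel, left))   # synset is the right side
    return hits

print(relations_of("00002-n"))
```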
7.2.2.3 WordnetGlosses
The WordnetGlosses table stores the wordnet glosses we generate in Chapter 6. Following are the fields of the WordnetGlosses table:
• offset-pos: the offset-pos of the wordnet synset.
• Gloss: a text field that contains the gloss of the synset.
7.2.2.4 Sem_Relations_Eval_Data
The Sem_Relations_Eval_Data table contains the sample of semantic relations used in the evaluation. This table contains the following fields:
• RelationKey: a unique identification number used to identify the semantic relation being evaluated.
• Left_offset-pos: the offset-pos of the synset on the left side of the relation being evaluated.
• Word1: the word on the left side of the relation being evaluated.
• Relation: a text field that specifies the type of relation being evaluated.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation being evaluated.
• Word2: the word on the right side of the relation being evaluated.
• COS: the cosine distance, as measured in Section 5.4, between the left word and the right word in the relation being evaluated.
7.2.2.5 Sem_Relations_Eval_Response
The Sem_Relations_Eval_Response table contains the responses collected from evaluators for the semantic relations we produce. This table consists of the following fields:
• AnswerKey: a unique, automatically generated integer that identifies the response.
• RelationKey: the key of the semantic relation being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator
to the semantic relation.
• UserId: the identification key of the evaluator who submitted the response.
7.2.2.6 WordnetGlosses_Eval_Data
The WordnetGlosses_Eval_Data table holds the sample of wordnet glosses being
evaluated by the users. The table includes the following fields:
• GlossKey: an automatically generated unique integer used to identify the gloss being evaluated.
• offset-pos: the offset-pos of the wordnet synset.
• Word: the word used in the gloss to represent the wordnet synset.
• Sentence: the sentence selected as the gloss for this wordnet synset.
• PWNGloss: the English gloss of the corresponding synset in PWN.
• CosSem: the cosine similarity between the selected sentence and the synset, as measured in Section 6.3.
• GlossRank: an integer value that represents the rank of the gloss among the other candidate glosses. The rank is assigned by the system to the gloss being evaluated based on the CosSem value; the gloss with the highest CosSem value has rank 1.
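The GlossRank assignment described above can be sketched in a few lines of Python; the candidate glosses and CosSem values below are hypothetical, and the sort simply places the highest-scoring candidate at rank 1.

```python
# Candidate glosses for one synset, with hypothetical CosSem values.
candidates = [
    {"gloss": "sentence A", "CosSem": 0.42},
    {"gloss": "sentence B", "CosSem": 0.87},
    {"gloss": "sentence C", "CosSem": 0.61},
]

# Sort by CosSem, descending; the best candidate receives rank 1.
ranked = sorted(candidates, key=lambda c: c["CosSem"], reverse=True)
for rank, cand in enumerate(ranked, start=1):
    cand["GlossRank"] = rank

print([(c["gloss"], c["GlossRank"]) for c in ranked])
```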
7.2.2.7 WordnetGlosses_Eval_Response
Responses from the users evaluating the wordnet glosses we produced in Section
6.3 are stored in the WordnetGlosses_Eval_Response table. This table consists of the following fields:
• AnswerKey: a unique integer number that is generated automatically to identify the
response.
• GlossKey: the key of the gloss being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator
to the gloss.
• UserId: identification key of the evaluator who evaluated the gloss.
7.3 Application layer
In this section, we describe the main functions provided by LexBank. To keep the implementation simple, we implement most of the functions of the system in one utility class (LexBankUtils.cs) written in Microsoft C#. The utility class, which is listed in Appendix C, consists of the following methods:
• IsUserIdAvailable(): takes a userId and returns true if it has never been used by another user.
• EncryptPassword(): takes a plain-text password and returns an encrypted password.
• DecryptPassword(): takes an encrypted password and returns the decrypted password.
• CreateNewUser(): takes the details of a new user and creates an account by storing the data in the Users_Info table.
• IsAuthenticated(): takes the user identification and password and returns true if they match the user information in the users table.
• FindSynSet(): takes a lexeme and returns a list of synsets that include this lexeme.
• FindSynSetLexemes(): takes the OffsetPos of a synset and returns the list of lexemes of that synset.
• IsSynSetAvailable(): takes the OffsetPos of a synset in a specific wordnet and returns true if the synset is available in the specified wordnet.
• FindSynSetRelations(): takes the OffsetPos of a synset and returns all the semantically related lexemes.
• FindGloss(): takes the OffsetPos of a synset and returns the gloss of the synset.
• ReadRelation(): takes a RelationKey and returns the details of the relation.
• ReadSynsetGloss(): takes a GlossKey and returns the details of the gloss.
• EvaluateRelation(): takes a RelationKey, Score and UserId and stores them in the evaluation table of the semantic relations.
• EvaluateGloss(): takes a GlossKey, Score and UserId and stores them in the evaluation table of the wordnet glosses.
• LogEvent(): takes an event description and stores it in the System_log table.
• ChangeUserStatus(): takes the UserId of a user and changes the user's status to a specified new status.
• RetrieveUsers(): returns a list of all the users in the system and their information.
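The authentication pair EncryptPassword()/IsAuthenticated() can be sketched as follows. The actual utility class is written in C# and, since it also offers DecryptPassword(), presumably uses reversible encryption; this Python analogue substitutes a one-way SHA-256 hash for simplicity, and the in-memory dictionary stands in for the Users_Info table.

```python
import hashlib

def encrypt_password(plain: str) -> str:
    # Stand-in for EncryptPassword(); a one-way hash, not the
    # reversible scheme the dissertation's C# class implies.
    return hashlib.sha256(plain.encode("utf-8")).hexdigest()

# Toy Users_Info store with one hypothetical user.
users_info = {"jdoe": {"UserPwd": encrypt_password("secret"),
                       "UserStatus": "Active"}}

def is_authenticated(user_id: str, password: str) -> bool:
    # Mirrors IsAuthenticated(): encrypt the supplied password and
    # compare it with the stored value for this user.
    user = users_info.get(user_id)
    return user is not None and user["UserPwd"] == encrypt_password(password)

print(is_authenticated("jdoe", "secret"))  # True
print(is_authenticated("jdoe", "wrong"))   # False
```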
7.4 Web Interface Design & Implementation
In this section, we describe the design of the web interface of LexBank. The web interface is implemented in ASP.NET using Microsoft Visual Studio 2012. Figure 7.2 shows the site map of the web interface. The interface is entered through the log-in web page (frmLogin.aspx). New users need to register to gain access to the system. Registration can be done by filling in the registration web form (frmRegister.aspx). Once a user logs into the system, the main menu web page (frmMainMenu.aspx) is shown. The main menu includes links to the services available in the system. In the following sections we describe each web page in the system.
Figure 7.2: LexBank web site map
7.4.1 Registration Form
New users need to register in the system using the registration form (frmRegister.aspx). As shown in Figure 7.3, a new user needs to provide their full name, email, email confirmation, user identification, password and password confirmation, then press the Register button.
The registration process starts when a new user submits their information through the registration web form. Once the registration form receives the information, it checks whether all the fields meet the requirements of the system. The requirements include a valid format for the email address and the password; they also require that the user identification has never been used by an existing user. If the information sent by the user passes the validation process, the registration form calls the CreateNewUser() method from the
Figure 7.3: The registration web form
utility class. The CreateNewUser() method uses the EncryptPassword() method to encrypt
the password, then writes the data into the Users_Info table. The registration process is
summarized in the sequence diagram shown in Figure 7.4.
7.4.2 Log-in Form
Registered users can log into the system using the log-in web page (frmLogin.aspx), which is shown in Figure 7.5. A user with an active account needs to provide a user identification and password to start the log-in process.
As shown in Figure 7.6, when the log-in web form (frmLogin.aspx) receives the userid and the password, it calls the IsAuthenticated() method from the utility class. Then, the password is encrypted using EncryptPassword() and compared with the encrypted
Figure 7.4: Sequence diagram of the registration process
Figure 7.5: The log-in web form
Figure 7.6: Sequence diagram of the log-in process
password stored in the users table. If the userid and the password provided by the user match those stored in the users table, the main menu of the web interface is shown to the user; otherwise, an error message is shown. The main menu is shown in Figure 7.7.
7.4.3 The Main Menu
The main menu includes links to the services available in the system. The services presented by the web interface are:
• Searching wordnet using lexeme, provided by the web page (frmWordnetSearch.aspx).
• Searching wordnet using OffsetPos, provided by the web page (frmSynsetDetails.aspx).
• Evaluating semantic relations between synsets, provided by the web page (frmEvalRelations.aspx).
• Evaluating wordnet glosses, provided by the web page (frmEvalGloss.aspx).
Figure 7.7: The main menu
• Users management, provided by the web page (frmManageUsers.aspx).
7.4.4 Searching Wordnet By Lexeme Web Form
The web form (frmWordnetSearch.aspx) allows users to search for the synsets of a lexeme in a specific language. As shown in Figure 7.8, this web form consists of the following components:
• A text box that allows the user to enter a lexeme.
• A drop-down menu that allows the user to select the language.
• A list box for showing the synsets of the entered lexeme.
• A list box for showing the synonyms of the entered lexeme.
• A list box for showing the related lexemes.
• A button to start the search process.
The search process, as shown in Figure 7.9, starts when the user submits a lexeme and a language to the frmWordnetSearch.aspx web form. Then, the FindSynset() method
Figure 7.8: The Web form for searching wordnet by lexeme. The form shows the result of searching for the Arabic lexeme (مصر), which means Egypt.
from the utility class is called to retrieve the synsets that include the entered lexeme and show the result in the synsets list. Next, when the user selects a synset from the synsets list, the frmWordnetSearch.aspx web form calls the FindSynsetLexemes() method from the utility class to show the synonyms of the lexeme in the synonyms list. It also calls the FindSynsetRelations() method to obtain the related lexemes and show them to the user in the related lexemes list. The user can also expand the details of a synset shown in the synsets list or the related lexemes list by double-clicking the synset OffsetPos. This shows the frmSynsetDetails.aspx web form, which we describe next.
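The lookup chain above can be sketched against toy CoreWordnet and Sem_Relations data. The function names echo the utility-class methods, but the offsets, words and relations below are invented for illustration.

```python
# Toy CoreWordnet rows grouped as offset-pos -> member words, plus
# toy Sem_Relations rows. All identifiers are illustrative.
core_wordnet = {
    "00001-n": ["car", "auto", "automobile"],
    "00002-n": ["car", "railcar"],
}
sem_relations = [("00001-n", "hypernym", "00003-n")]

def find_synset(lexeme):
    # Analogue of FindSynSet(): all synsets containing the lexeme.
    return [op for op, members in core_wordnet.items() if lexeme in members]

def find_synset_lexemes(offset_pos):
    # Analogue of FindSynSetLexemes(): the synonym list of a synset.
    return core_wordnet.get(offset_pos, [])

def find_synset_relations(offset_pos):
    # Analogue of FindSynSetRelations(): relations leaving a synset.
    return [(rel, right) for left, rel, right in sem_relations
            if left == offset_pos]

print(find_synset("car"))              # ['00001-n', '00002-n']
print(find_synset_lexemes("00001-n"))  # ['car', 'auto', 'automobile']
print(find_synset_relations("00001-n"))
```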
7.4.5 Searching Wordnet By OffsetPos Web Form
Wordnet search using OffsetPos is provided by the frmSynsetDetails.aspx web form, which is shown in Figure 7.10. This web form consists of the following components:
Figure 7.9: Sequence diagram of the process of searching wordnet using lexeme
• A text box for entering the OffsetPos of the synset.
• A drop-down menu that allows the user to select the language.
• A text box for showing the gloss of the synset.
• A text box for showing the English gloss of the synset.
• A list box for showing the synonym list of the synset.
• A list box for showing the related synsets and lexemes of the entered synset.
• A button to start the search process.
In this form, the user starts the wordnet search by submitting the OffsetPos of the synset and the target language to the frmSynsetDetails.aspx web form. The
Figure 7.10: The Web form for searching wordnet by OffsetPos. The form shows the result of searching for the Arabic synset (08897065-n).
web form calls the FindGloss() method from the utility class to retrieve the gloss of the synset. It also calls the FindSynSetLexemes() and FindSynSetRelations() methods to obtain the synonym list and related synsets of the input synset and show them in the form.
7.4.6 Evaluating Semantic Relations Between Synsets Web Form
The web form frmEvalRelations.aspx allows users to evaluate semantic relations between lexemes and synsets in the system. The form shows the relation as a sentence and asks the user to rate the correctness of the sentence using a Likert-type scale. The form consists of the following components:
• A text box showing the relation key.
• A text box showing the relation in the form of a sentence.
Figure 7.11: Sequence diagram of the process of searching wordnet using OffsetPos.
Figure 7.12: The Web form for evaluating semantic relations between synsets in a wordnet. The form shows an example of evaluating a hyponymy relation between the two Assamese lexemes radiotelegraph and radio.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the relation.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.13: Sequence diagram of the process of evaluating the relation between two lex-emes.
The evaluation form frmEvalRelations.aspx starts the evaluation process by calling the ReadRelation() method from the utility class to show the relation details to the user. When the user submits the score they assign to a relation, the evaluation form stores the score by calling the EvaluateRelation() method from the utility class. Then, the evaluation form reads the next relation and shows it to the user. The user can stop the evaluation process at any time by clicking the End Session button and can later resume it without re-evaluating the relations already evaluated.
7.4.7 Evaluating Wordnet Synsets Glosses Web Form
Figure 7.14: The Web form for evaluating wordnet synset glosses. The form shows an example of evaluating the Arabic synset (13108841-n).
The glosses of the wordnets are evaluated using the frmEvalGloss.aspx web form. To evaluate a synset gloss, the form attaches the English gloss of the synset, obtained from PWN, to the selected gloss in the target language. Then, the user is asked whether the lexeme in the selected gloss has the same meaning as the PWN gloss. This evaluation form is composed of the following components:
• A text box showing the gloss key.
• A text box showing a lexeme from a synset, a candidate gloss written in the target language, and the English gloss of the synset.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the candidate gloss.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.15: Sequence diagram of the process of evaluating wordnet glosses.
The web form frmEvalGloss.aspx starts the gloss evaluation process by calling the ReadSynsetGloss() method from the utility class to obtain the lexeme, the candidate gloss and the English gloss of the synset being evaluated. The web form then uses these data to construct a question for the user. When the user submits the score they assign to the candidate gloss, the evaluation form stores the score by calling the EvaluateGloss() method from the utility class. Then, the evaluation form reads the next gloss and shows it to the user. The user can stop the gloss evaluation process at any time by clicking the End Session button and can resume it later without re-evaluating the glosses already evaluated.
Figure 7.16: The Web form for managing users in LexBank.
7.4.8 Users Management Web Form
To allow the administrators of LexBank to manage the users, we designed the frmManageUsers.aspx web form. Access to this form is restricted to administrators. The form lists all registered users with their information. An administrator can activate the accounts of new users using this form and can also deactivate any user in the list. This form can be extended in the future by adding more functionality. As shown in Figure 7.16, this form consists of the following components:
• ID: the UserId of the user.
• Name: the full name of the user.
• Email: the email address of the user.
• Privilege: the privilege assigned to the user; this can be administrator or client.
• Status: the current status of the user.
• Change Status: a command link to change the current status of the user. The status of the user can be changed to Inactive or Active.
As summarized in the sequence diagram shown in Figure 7.17, an administrator starts the user management process by trying to access the frmManageUsers.aspx web
Figure 7.17: Sequence diagram of the process of managing users in LexBank.
form. The web form calls the IsAdmin() method from the utility class to verify whether the user is authorized to access the form. If the user is not authorized, an error message is shown. Otherwise, the web form calls the RetrieveUsers() method to obtain the list of registered users in the system. The administrator can then select a user from the list and click the change status link to change the user's current status. The web form then calls the ChangeUserStatus() method from the utility class to store the new status and reloads the updated users list on the screen.
7.5 Summary
In this chapter, we described the design and implementation of LexBank, the multilingual lexical resource we produce in this thesis. The architecture of LexBank consists of three layers: the database layer, the application layer and the web interface layer. The database layer consists of two databases: the system settings database and the resource database. The application layer of the system is implemented using Microsoft C#. It provides administrative and resource-access services to the web interface. The web interface is designed and implemented using Microsoft Visual Studio 2012. The interface includes web forms for managing users and provides different wordnet search services in several languages. The system can easily be updated to accommodate other lingual services and languages.
Chapter 8
CONCLUSIONS
In this chapter, we summarize the main contributions of this dissertation. This dissertation is motivated by the fact that so many languages around the world lack the computational lexical resources that are essential in natural language processing. Our first goal in this dissertation is to develop automatic techniques, which rely on a few available public resources, for constructing wordnets for low-resource languages. A wordnet is a structured lexical ontology that groups words based on their meaning into sets called synsets. A wordnet is a very important lexical resource that is used in many applications, such as translation, word-sense disambiguation, information retrieval and document classification. The second goal of this dissertation is to design and implement a system that makes the lexical resources we produced available to the public. Next, we list the main contributions of this dissertation.
• We have developed an approach for constructing structured wordnets. This approach was developed by extending the approach for constructing core wordnets presented by (Lam et al., 2014b). A core wordnet consists only of synsets that group synonym words in sets with unique ids. In a more comprehensive wordnet, these synsets are semantically connected to represent the relations between their meanings. Our approach produces synsets that are connected by semantic relations. Examples of the semantic relations we produced are: synonyms, hypernyms, topic-domain related, part-holonyms, instance-hypernyms and member-meronyms.
• We presented an approach for enhancing the quality of automatically constructed wordnets. The approach is based on the vector representation of words (word embeddings). Word embedding is a machine learning technique that maps words to real-number vectors in a multi-dimensional space. Our approach uses the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm is a feedforward neural network that predicts the vector representation of words within a multi-dimensional language model. Our approach computes the cosine similarity, using the word2vec vectors, between semantically related words in our constructed wordnets and filters any entries that do not satisfy a pre-selected threshold value.
• We introduced synset2vec, an algorithm for representing wordnet synsets in a multi-dimensional space. Word embeddings provide an excellent vector representation of words. However, word representations are affected by the fact that many words have multiple meanings. In order to represent meanings rather than words, we combine the vectors of a synset's lexemes into one vector that represents the meaning. We believe that this vector representation can be used in many important applications; for example, in word-sense disambiguation, machine translation and gloss selection for wordnet synsets.
• We used our synset2vec algorithm to add glosses to our automatically constructed synsets. Glosses are a very important part of wordnets: they declare or clarify the meaning of a synset. A gloss can be a definition statement or an example sentence that shows the usage of the synonyms of the synset. To select a gloss for a synset from a corpus, we used synset2vec to generate vector representations of the candidate glosses and the synset. Then we compute the cosine similarity between each candidate gloss and the synset. Finally, we select the gloss with the highest cosine similarity with the synset and attach it to the synset.
• We have developed LexBank, a web application that gives public users access to our created resources. LexBank provides useful services for users who seek linguistic assistance in a friendly manner. It also includes evaluation web forms that are used to gather feedback from human judges. The design of LexBank is flexible and can easily be expanded to accommodate additional languages and resources.
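The three embedding-based contributions above (threshold filtering of relations, synset2vec averaging, and gloss selection) can be illustrated together in one short Python sketch. All vectors below are toy three-dimensional values rather than real word2vec output, and the 0.5 threshold is arbitrary.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings" standing in for word2vec output.
emb = {
    "car": np.array([1.0, 0.1, 0.0]),
    "automobile": np.array([0.9, 0.2, 0.1]),
    "teapot": np.array([0.0, 0.1, 1.0]),
}

# 1) Filtering: keep a semantically related pair only if its cosine
#    similarity clears a pre-selected threshold.
threshold = 0.5
assert cosine(emb["car"], emb["automobile"]) >= threshold  # kept
assert cosine(emb["car"], emb["teapot"]) < threshold       # filtered out

# 2) synset2vec: one vector per meaning, the average of lexeme vectors.
def synset2vec(lexemes):
    return np.mean([emb[w] for w in lexemes], axis=0)

syn_vec = synset2vec(["car", "automobile"])

# 3) Gloss selection: pick the candidate gloss whose (toy) sentence
#    vector is closest to the synset vector.
candidate_glosses = {
    "a motor vehicle with four wheels": np.array([0.95, 0.15, 0.05]),
    "a pot for brewing tea":            np.array([0.05, 0.10, 0.90]),
}
best = max(candidate_glosses,
           key=lambda g: cosine(candidate_glosses[g], syn_vec))
print(best)  # the motor-vehicle gloss
```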
Chapter 9
FUTURE WORK
In this chapter, we propose some potential future work that can be done based on this dissertation. The general goal of the proposed future work is to enhance the quality and extend the coverage of the lexical resources. For example, we produced our core wordnets based on machine translation and some small dictionaries. The quality of these wordnets is limited by the resources we used to create them. It is well known that these resources do not guarantee high coverage and accuracy for all of the target languages. Next, we list some of the potential future work.
9.1 Extending Bilingual Dictionaries
In this section, we provide one additional task that can be undertaken in future work. We propose a new method to extend the bilingual dictionaries created in (Lam et al., 2015b). To increase the coverage of the bilingual dictionaries, we take advantage of the wordnets we have created in this dissertation. This section is divided into two parts. In the first part, we describe the approach we used in (Lam et al., 2015b) to create the bilingual dictionaries. In the second part, we describe the proposed method for extending these dictionaries.
9.1.1 Related Work
In (Lam et al., 2015b) we created a large number of new bilingual dictionaries using intermediate core wordnets and a machine translator. A dictionary, or a lexicon, as defined by (Landau, 1984), consists of sorted 2-tuple <LexicalUnit, Definition> entries. Each entry is called a LexicalEntry. The first part of a LexicalEntry is the phrase being defined, while the second part is its definition. The definition includes the meaning of the LexicalUnit and usually has several Senses, each of which is a separate representation of a single aspect of the meaning of a phrase. In (Lam et al., 2015b), the entries in the dictionaries are of the form <LexicalUnit, Sense1>, <LexicalUnit, Sense2>, ....
The approach for creating dictionaries using intermediate wordnets and a machine
translator (IW) is described in Figure 9.1 and Algorithm 2.
Figure 9.1: The IW approach for creating a new bilingual dictionary
Suppose that we would like to construct a bilingual dictionary Dict(S,D), where S is a source language and D is a target language, given the dictionary Dict(S,R), where R is a resource-rich intermediate language. The IW algorithm reads each LexicalEntry from Dict(S,R) and extracts SenseR from it. Then, it retrieves all Offset-POSs of SenseR from the wordnet of language R (Algorithm 2, lines 2-5). All the synonyms of the extracted Offset-POSs are extracted from all the available intermediate wordnets. Then, the algorithm constructs a candidate set candidateSet of final translations in language D by translating all the extracted synonyms to language D using machine translation (Algorithm 3). Each candidate in candidateSet has two attributes: word, which represents a translation in language D, and rank, which counts the occurrences of this translation. The rank attribute is used to order the candidates in descending order, where the top candidate is the best translation. Finally, the sorted candidates are inserted into the new dictionary Dict(S,D).
1: Dict(S,D) := ∅
2: for all LexicalEntry ∈ Dict(S,R) do
3:   for all SenseR ∈ LexicalEntry do
4:     candidateSet := ∅
5:     Find all Offset-POSs of synsets containing SenseR from the R Wordnet
6:     candidateSet = FindCandidateSet(Offset-POSs, D)
7:     sort all candidates in descending order based on their rank values
8:     for all candidate ∈ candidateSet do
9:       SenseD = candidate.word
10:      add tuple <LexicalUnit, SenseD> to Dict(S,D)
11:    end for
12:  end for
13: end for
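The rank-by-occurrence idea at the heart of the IW algorithm can be sketched in Python. The synonym sets, translation table and offsets below are toy stand-ins for the intermediate wordnets and the machine translator, so this illustrates only the counting-and-sorting step, not the full pipeline.

```python
from collections import Counter

# Toy data: Offset-POS -> synonyms gathered from intermediate wordnets,
# and a toy "machine translator" into target language D.
intermediate_synonyms = {
    "02958343-n": ["car", "auto", "automobile"],
}
toy_translations = {"car": "voiture", "auto": "voiture",
                    "automobile": "automobile"}

def find_candidate_set(offset_poses):
    # Analogue of FindCandidateSet(): translate every synonym and use
    # the occurrence count of each translation as its rank.
    counts = Counter()
    for op in offset_poses:
        for syn in intermediate_synonyms.get(op, []):
            counts[toy_translations[syn]] += 1
    return counts

candidates = find_candidate_set(["02958343-n"])
# Candidates sorted by rank, best translation first.
ranked = [word for word, _ in candidates.most_common()]
print(ranked)  # 'voiture' (rank 2) before 'automobile' (rank 1)
```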
9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets
In this section, we propose a new method to extend the dictionaries created in (Lam et al., 2015b) using the structured wordnets that we have created in this dissertation. The
###########################
# Program to compute cosine similarity
# between semantically related words in a WordNet
# using Word2Vec
# Author: Feras Al Tarouti
# Date : Feb 4 2016
import unicodecsv as csv
import codecs
import gensim
import editdistance

with open('LexBankVieSemRelatedWords_WithCOS.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['OffsetPos1', 'Word1', 'Relation', 'OffsetPos2', 'Word2',
                     'COS', 'ld'])

with open('LexBankVieSemRelatedWords.csv', 'rb') as f:
###########################
# A function for computing a synset vector
# Author: Feras Al Tarouti
# Date : May 18 2016
def GenerateVectorForSynset(syn, thislemma):
    FinalVector = np.zeros(100)
    VectorList = []  # define the vector set for this synset
    LemmasList = FindLemmasOfSyns(syn)  # the list of lemmas for this synset

        VectorList.append(Vector)  # add vector of word to the synset vector

    for vec in VectorList:
        FinalVector = np.add(FinalVector, vec)

    # compute the average
    numbofVec = len(VectorList)
    scalar = np.divide(float(1), float(numbofVec))
    FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector
A.3 GenerateVectorForGloss.py
###########################
# A function for computing a gloss vector
# Author: Feras Al Tarouti
# Date : May 18 2016
def GenerateVectorFor(thisSentence, lemma):
    VectorList = []  # define the vector set for this sentence
    FinalVector = np.zeros(100)
    for word in thisSentence.split():
        skip = False
        if word not in stopwrds and word != lemma:
            try:
                Vector = word2vecmodel[word]
                NofSyns = FindNumberOfSyns(word)
                # Scale the vector based on the number of synsets
                if NofSyns > 1:
###########################
# A program for computing similarity between synset and gloss
# Author: Feras Al Tarouti
# Date : May 18 2016
# First Step : Open the synset-gloss files, and read the sentence
# Second Step : Generate the vector for the synset
# Third Step : Generate the vector for the sentence
# Fourth Step : Compute the cosine similarity between the synset vector
#               and the sentence vector
# Fifth Step : Save the result
###########################

with open(InputDataFile, 'rb') as SentencesFile, open(outputfile, 'wb') as out_file:
    reader = csv.reader(SentencesFile, encoding='utf-8', delimiter=',')
    writer = csv.writer(out_file, encoding='utf-8')
    writer.writerow(['ID', 'CosSem'])
    rownum = 0
    for row in reader:
        if rownum != 0:
            print("Computing Cosine Similarity for Row numb: {0}".format(rownum))
            thisSenID = row[0]     # read the current sentence ID
            thisSynset = row[1]    # read the current synset ID
            thisSynMem = row[2]    # read number of members for this synset
            thiswrd = row[3]       # read the word used in this sentence
            thiswrdSyns = row[4]   # read the number of synsets for this word
            thisSentence = row[5]  # read the current sentence

            # Compute a vector for this synset
            thisSynsetVector = GenerateVectorForSynset(thisSynset, "")

            # Generate a vector for this sentence
            thisSentenceVector = GenerateVectorFor(thisSentence, "")
--
-- Database: `LexBank_System`
--
-- --------------------------------------------------------
--
-- Table structure for table `Users_Info`
--
USE [LexBank_System]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Users_Info](
    [UserId] [varchar](50) NOT NULL,
    [UserName] [varchar](100) NOT NULL,
    [UserEmail] [varchar](70) NOT NULL,
    [UserPwd] [varchar](max) NOT NULL,
    [UserPriv] [varchar](15) NOT NULL,
    [UserStatus] [varchar](15) NOT NULL,
OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `System_Log`
--
USE [LexBank_System]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[System_Log](
    [EventId] [int] IDENTITY(1,1) NOT NULL,
    [EventDesc] [varchar](200) NOT NULL,
    [EventTime] [datetime] NOT NULL,
    [UserId] [varchar](50) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Database: `LexBank_Resources`
--
-- --------------------------------------------------------
--
-- Table structure for table `Arabic_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
--
-- Table structure for table `Assamese_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ONGO
117
SET QUOTED_IDENTIFIER ONGO
SET ANSI_PADDING ONGO
CREATE TABLE [dbo].[Assamese_CorWordnet]([Offset_Pos] [nvarchar](10) NOT NULL,[Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_CoreWordnet`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_CoreWordnet](
    [Offset_Pos] [nvarchar](10) NOT NULL,
    [Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_Sem_Relations`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations](
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_WordnetGlosses`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGlosses](
    [Offset_Pos] [varchar](10) NOT NULL,
    [Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Data](
    [RelationKey] [int] IDENTITY(1,1) NOT NULL,
    [Left_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word1] [nvarchar](100) NOT NULL,
    [Relation] [nvarchar](50) NOT NULL,
    [Right_Offset_Pos] [nvarchar](10) NOT NULL,
    [Word2] [nvarchar](100) NOT NULL,
    [COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [RelationKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Data](
    [GlossKey] [int] IDENTITY(1,1) NOT NULL,
    [Offset-pos] [varchar](10) NOT NULL,
    [Word] [nvarchar](500) NULL,
    [Sentence] [nvarchar](4000) NULL,
    [PWNGloss] [nvarchar](900) NULL,
    [CosSem] [real] NULL,
    [GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Arabic_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Assamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

-- --------------------------------------------------------
--
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Response](
    [AnswerKey] [int] IDENTITY(1,1) NOT NULL,
    [GlossKey] [int] NOT NULL,
    [Score] [int] NOT NULL,
    [UserId] [varchar](50) NULL
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
Appendix C
LEXBANK UTILITY CLASS
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Data;
using System.Data.SqlClient;
using System.Web.Configuration;
using System.IO;
using System.Text;
using System.Security.Cryptography;

namespace LexBank2016
{
    public class LexBankUtils
    {
        private string LexBankConnectionString = WebConfigurationManager
            .ConnectionStrings["LexBankData"].ToString();

        public Boolean IsUserIdAvailable(string UserId)
        {
            // This function takes a user id and checks whether it is already in use
            Boolean result = false;

            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                //
                // Create a new SqlCommand object.
                //
                using (SqlCommand command = new SqlCommand(
                    "SELECT UserId FROM Users_Info WHERE UserId LIKE @UserId",
                    connection))
                {
                    // Define the parameters
                    command.Parameters.AddWithValue("@UserId", UserId.
            // ... (lines elided in the source listing) ...
            PasswordEncryptor.Key = PBKDF.GetBytes(32);
            PasswordEncryptor.IV = PBKDF.GetBytes(16);
            using (MemoryStream ms = new MemoryStream())
            {
                using (CryptoStream cs = new CryptoStream(ms,
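The key-derivation step above draws a 32-byte AES key and then a 16-byte IV from a `Rfc2898DeriveBytes` instance (PBKDF2 with HMAC-SHA1 by default, and consecutive `GetBytes` calls return consecutive bytes of the derived stream). The same derivation can be sketched in Python with the standard library; the password, salt, and iteration count below are placeholders, since the listing does not show how `PBKDF` was constructed:

```python
import hashlib

def derive_key_iv(password: bytes, salt: bytes, iterations: int = 1000):
    """Derive a 32-byte key and 16-byte IV as consecutive PBKDF2 output,
    mirroring GetBytes(32) followed by GetBytes(16) on Rfc2898DeriveBytes
    (which uses PBKDF2-HMAC-SHA1 by default)."""
    raw = hashlib.pbkdf2_hmac("sha1", password, salt, iterations, dklen=48)
    return raw[:32], raw[32:]

key, iv = derive_key_iv(b"example-password", b"example-salt")
```

The split of one 48-byte derivation into key and IV matters: deriving the IV with a second independent PBKDF2 call of `dklen=16` would reproduce the *first* 16 bytes of the stream and collide with the key prefix.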