Bengali Part-Of-Speech Tagging
A
PROJECT REPORT
ON
PART-OF-SPEECH TAGGING
FOR BENGALI
IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF
MASTER OF COMPUTER SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
ASSAM UNIVERSITY, SILCHAR
2016
Submitted by:
DEEPANKAR DAS
Roll: 101614 No.: 22220380
Under the Guidance of
PROF. BIPUL SYAM PURKAYASTHA
HEAD OF DEPARTMENT, PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
ASSAM UNIVERSITY, SILCHAR-788011
CERTIFICATE
This is to certify that Deepankar Das, bearing Roll: 101614 No.: 22220380, has carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under my supervision, in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science of Assam University, Silchar. He has worked sincerely in preparing this project and has fulfilled all the requirements laid down in the regulations of the MSc (2 years) 4th Semester Examination (Paper MS-405) of the Department of Computer Science, Assam University, Silchar, for the session 2015-2016.
Date: Signature of the Guide
Place: (PROF. BIPUL SYAM PURKAYASTHA)
Supervisor, Professor
Department of Computer Science
Assam University, Silchar
CERTIFICATE
This is to certify that Deepankar Das, bearing Roll: 101614 No.: 22220380, has carried out his work for the project entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under my supervision, in partial fulfillment of the requirements for the award of the degree of Master of Science in Computer Science of Assam University, Silchar. He has worked sincerely in preparing this project and has fulfilled all the requirements laid down in the regulations of the MSc (2 years) 4th Semester Examination (Paper MS-405) of the Department of Computer Science, Assam University, Silchar, for the session 2015-2016.
Date: Signature of the HOD
Place: (PROF. BIPUL SYAM PURKAYASTHA)
HOD, Professor
Department of Computer Science
Assam University, Silchar
DECLARATION
I, Deepankar Das, student of 4th semester (MSc 2 years),
Department of Computer
Science do hereby solemnly declare that I have duly worked on my
project
entitled “PART-OF-SPEECH TAGGING FOR BENGALI” under the
supervision of Prof. Bipul
Syam Purkayastha, Professor, Department of Computer Science,
Assam
University, Silchar.
Date: Signature
Place: ( Deepankar Das )
MSc 4th Semester
Roll: 101614 No.: 22220380
Regn. No.: 02-110018703 of 2011-12
Department of Computer Science
Assam University, Silchar
ACKNOWLEDGEMENT
At the very outset, I take the privilege of conveying my gratitude to those persons whose cooperation, suggestions and heartfelt support helped me to accomplish this project successfully.
I take immense pleasure in expressing my sincere thanks and profound gratitude to my respected guide, Prof. Bipul Syam Purkayastha, Head of the Department of Computer Science, Assam University, Silchar, for the able guidance, valuable suggestions and encouragement he rendered in completing this project.
I am also indebted to my family members, friends and well-wishers who encouraged me to do this work with vigour and seriousness.
Last but not the least, I would like to acknowledge the cooperation I received from the entire staff of our department, and I thank all those who directly or indirectly extended their helping hands and moral support while I was making this project.
( Deepankar Das )
Table of Contents
Chapter 1  Introduction
  1.1 NLP
  1.2 Applications of NLP
  1.3 POS Tagging
  1.4 The POS Tagging Problem
  1.5 Applications of POS Tagging
  1.6 Motivation
  1.7 Goals of Our Work
  1.8 Organization of the Report
Chapter 2  Prior Work
  2.1 Prior Work in POS Tagging
  2.2 Linguistic Taggers
  2.3 POS Tagging Approaches
  2.4 Indian Language POS Taggers
Chapter 3  Foundational Considerations
  3.1 Corpora Collection
  3.2 The Tagset
Chapter 4  Tagging with the Rule Based Approach
  4.1 Rule Based Approach
  4.2 Our Approach
Chapter 5  Experimental Results & Discussion
  5.1 Tools Used
  5.2 Graphical User Interface
  5.3 Experimental Results
  5.4 Result Discussion
Chapter 6  Conclusion & Future Directions
  6.1 Conclusion
  6.2 Future Work
References
Abstract
Part-of-Speech (POS) tagging is the process of assigning the appropriate part of speech, or lexical category, to each word in a natural language sentence. Part-of-speech tagging is an important part of Natural Language Processing (NLP) and is useful for most NLP applications. It is often the first stage of natural language processing, after which further processing such as chunking, parsing, etc. is done.
POS tagging is considered one of the basic necessary tools. Its simplified form is commonly taught to school-age children as the identification of words as nouns, pronouns, verbs, adjectives, adverbs, prepositions, conjunctions, interjections, etc. The development of a POS tagger for any Indian language will influence several pipelined modules of a natural language understanding system, including Information Extraction (IE), Information Retrieval (IR), Machine Translation (MT), Partial Parsing (PP) and Word Sense Disambiguation (WSD).
Our objective in this work is to develop an effective POS tagger for the Bengali language. Once performed manually, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS tagging algorithms fall into two distinct groups: rule based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule based algorithms.
Bengali is the main language spoken in Bangladesh, the second most commonly spoken language in India, and the seventh most commonly spoken language in the world, with nearly 230 million total speakers (189 million native speakers). Natural language processing of Bengali is in its infancy. POS tagging of Bengali is a necessary component for most NLP applications of Bengali.
The developed system has been tested with a set of experimental data and a result analysis has been made. The system gives an accuracy of over 74.50%. The performance can be increased by increasing the size of the lexicon.
CHAPTER 1
Introduction
1.1 NLP
The goal of natural language processing (NLP) is to build
computational models of natural
language for its analysis and generation. First, there is
technological motivation of building
intelligent computer systems such as machine translation
systems, natural language interfaces
to databases, man-machine interfaces to computers in general,
speech understanding systems,
text analysis and understanding systems, computer aided
instruction systems, systems that
read and understand printed or handwritten text. Second, there
is a cognitive and linguistic
motivation to gain a better insight into how humans
communicate using natural language
(NL).
Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages; it began as a branch of artificial intelligence. In theory, natural language processing is a very attractive method of human-computer interaction. Natural language understanding is sometimes referred to as an AI-complete problem, because it seems to require extensive knowledge about the outside world and the ability to manipulate it. Natural language processing is a collection of techniques used to extract grammatical structure and meaning from input in order to perform a useful task; in turn, natural language generation builds output based on the rules of the target language and the task at hand. NLP is useful in fields such as tutoring systems, duplicate detection, computer supported instruction and database interfaces, as it provides a pathway for increased interactivity and productivity.
The tools of work in NLP are grammar formalisms, algorithms and
data structures,
formalisms for representing world knowledge, reasoning
mechanisms, etc. Many of these have
been taken from and inherit results from Computer Science,
Artificial Intelligence,
Linguistics, Logic, and Philosophy.
1.2 Applications of NLP
Automatic summarization: Produce a readable summary of a chunk
of text. Often used to
provide summaries of text of a known type such as articles in
the financial section of a
newspaper.
Machine translation: Automatically translate text from one human
language to another. This
is one of the most difficult problems, and is a member of a
class of problems colloquially
termed "AI-complete", i.e. requiring all of the different types
of knowledge that humans
possess (grammar, semantics, facts about the real world, etc.)
in order to solve properly.
Morphological segmentation: Separate words into individual
morphemes and identify the
class of the morphemes. The difficulty of this task depends
greatly on the complexity of the
morphology (i.e. the structure of words) of the language being
considered. English has fairly
simple morphology, especially inflectional morphology, and thus
it is often possible to ignore
this task entirely and simply model all possible forms of a word
(e.g. "open, opens, opened,
opening") as separate words. In languages such as Turkish,
however, such an approach is not
possible, as each dictionary entry has thousands of possible
word forms. This is true not only of Turkish but also of Manipuri, a highly agglutinative Indian language.
Named entity recognition (NER): Given a stream of text, determine
which items in the text
map to proper names, such as people or places, and what the type
of each such name is (e.g.
person, location, organization). Note that, although
capitalization can aid in recognizing
named entities in languages such as English, this information
cannot aid in determining the
type of named entity, and in any case is often inaccurate or
insufficient. For example, the first
word of a sentence is also capitalized, and named entities often
span several words, only
some of which are capitalized. Furthermore, many other languages
in non-Western scripts
(e.g. Chinese or Arabic) do not have any capitalization at all,
and even languages with
capitalization may not consistently use it to distinguish names.
For example, German
capitalizes all nouns, regardless of whether they refer to
names, and French and Spanish do
not capitalize names that serve as adjectives.
Natural language generation: Convert information from computer
databases into readable
human language.
Natural language understanding: Convert chunks of text into more
formal representations
such as first-order logic structures that are easier for
computer programs to manipulate.
Natural language understanding involves the identification of
the intended semantic from the
multiple possible semantics which can be derived from a natural
language expression which
usually takes the form of organized notations of natural language concepts. The introduction and creation of a language metamodel and ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed world assumption (CWA) vs. the open world assumption, or
subjective Yes/No vs. objective True/False, is expected for the construction of a basis of semantics formalization.
Optical character recognition (OCR): Given an image representing
printed text, determine
the corresponding text.
Part-of-speech tagging (POST): Given a sentence, determine the
part of speech for each
word. Many words, especially common ones, can serve as multiple
parts of speech. For
example, "book" can be a noun ("the book on the table") or verb
("to book a flight"); "set"
can be a noun, verb or adjective; and "out" can be any of at
least five different parts of
speech. Some languages have more such ambiguity than others.
Languages with little
inflectional morphology, such as English are particularly prone
to such ambiguity. Chinese is
prone to such ambiguity because it is a tonal language during
verbalization. Such inflection is
not readily conveyed via the entities employed within the
orthography to convey intended
meaning.
Parsing: Determine the parse tree (grammatical analysis) of a
given sentence. The grammar
for natural languages is ambiguous and typical sentences have
multiple possible analyses. In
fact, perhaps surprisingly, for a typical sentence there may be
thousands of potential parses
(most of which will seem completely nonsensical to a human).
Question answering: Given a human-language question, determine
its answer. Typical
questions have a specific right answer (such as "What is the
capital of Canada?"), but
sometimes open-ended questions are also considered (such as
"What is the meaning of
life?"). Recent works have looked at even more complex
questions.
Relationship extraction: Given a chunk of text, identify the
relationships among named
entities (e.g. who is the wife of whom).
Sentence breaking (also known as sentence boundary
disambiguation): Given a chunk of
text, find the sentence boundaries. Sentence boundaries are
often marked by periods or other
punctuation marks, but these same characters can serve other
purposes (e.g. marking
abbreviations).
Sentiment analysis: Extract subjective information usually from
a set of documents, often
using online reviews to determine "polarity" about specific
objects. It is especially useful for
identifying trends of public opinion in the social media, for
the purpose of marketing.
Speech recognition: Given a sound clip of a person or people
speaking, determine the textual
representation of the speech. This is the opposite of text to
speech and is one of the extremely
difficult problems colloquially termed "AI-complete" (see
above). In natural speech there are
hardly any pauses between successive words, and thus speech
segmentation is a necessary
subtask of speech recognition (see below). Note also that in
most spoken languages, the
sounds representing successive letters blend into each other in
a process termed coarticulation, so the conversion of the analog signal to discrete
characters can be a very
difficult process.
Speech segmentation: Given a sound clip of a person or people
speaking, separate it into
words. A subtask of speech recognition and typically grouped
with it.
Topic segmentation and recognition: Given a chunk of text,
separate it into segments each of
which is devoted to a topic, and identify the topic of the
segment.
Word segmentation: Separate a chunk of continuous text into
separate words. For a language
like English, this is fairly trivial, since words are usually
separated by spaces. However, some
written languages like Chinese, Japanese and Thai do not mark
word boundaries in such a
fashion, and in those languages text segmentation is a
significant task requiring knowledge of
the vocabulary and morphology of words in the language.
Word sense disambiguation: Many words have more than one
meaning; we have to select the
meaning which makes the most sense in context. For this problem,
we are typically given a
list of words and associated word senses, e.g. from a dictionary
or from an online resource
such as WordNet.

In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:
Information retrieval (IR): This is concerned with storing,
searching and retrieving
information. It is a separate field within computer science
(closer to databases), but IR relies
on some NLP methods (for example, stemming). Some current
research and applications seek
to bridge the gap between IR and NLP.
Information extraction (IE): This is concerned in general with
the extraction of semantic
information from text. This covers tasks such as named entity
recognition, coreference resolution, relationship extraction, etc.
1.3 POS Tagging
Part-of-Speech (POS) tagging is the process of automatic
annotation of lexical categories.
Part-of-Speech tagging assigns an appropriate part of speech tag
for each word in a sentence
of a natural language. The development of an automatic POS
tagger requires either a
comprehensive set of linguistically motivated rules or a large
annotated corpus. But such
rules and corpora have been developed only for a few languages, such as English. POS taggers for Indian languages are not readily available due to the lack of such rules and large annotated corpora.
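As a small illustration of what a POS tagger produces, consider a lexicon-lookup tagger in Python. This is a toy sketch with an invented mini-lexicon, not the system developed in this report:

```python
# Toy lexicon mapping words to a single POS tag; a real lexicon is
# far larger and lists several possible tags per word.
LEXICON = {
    "keep": "VB", "the": "DET", "book": "NN",
    "on": "IN", "top": "JJ", "shelf": "NN",
}

def tag(sentence):
    """Pair each word with its lexicon tag, or UNK for unseen words."""
    return [(w, LEXICON.get(w.lower(), "UNK")) for w in sentence.split()]

print(tag("Keep the book on the top shelf"))
# [('Keep', 'VB'), ('the', 'DET'), ('book', 'NN'), ('on', 'IN'),
#  ('the', 'DET'), ('top', 'JJ'), ('shelf', 'NN')]
```

Such a simple lookup cannot resolve words with more than one possible tag, such as "book" (noun or verb); that is precisely the disambiguation problem discussed in Section 1.4.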
A part-of-speech is a grammatical category commonly including
nouns, pronouns,
verbs, adjectives, adverbs, prepositions, conjunctions,
interjections. Parts of speech can be
divided into two broad categories: closed classes and open
classes. Closed classes are those
that have relatively fixed membership. For example, pronouns are
categorized in closed class
because there is a fixed set of them in English; new pronouns
are rarely added. But nouns are
in open class because new nouns are continually added in every
language.
The linguistic approach, the classical approach to POS tagging, was initially explored in the middle sixties and seventies (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971). People manually engineered rules for tagging.
The most representative of such pioneering taggers was TAGGIT (Greene and Rubin, 1971), which
was used for initial
tagging of the Brown Corpus. The development of ENGTWOL (an
English tagger based on
constraint grammar architecture) can be considered most
important in this direction (Karlsson
et al., 1995). These taggers typically use rule-based models
manually written by linguists.
The advantage of this model is that the rules are written from a
linguistic point of view and
can be made to capture complex kinds of information. This allows
the construction of an
extremely accurate system. But handling all rules is not easy
and requires expertise. The
context frame rules have to be developed by language experts and
it is costly and difficult to
develop a rule based POS tagger. Further, if one uses rule based POS tagging, transferring the tagger to another language means starting from scratch again.
On the other hand, recent machine learning techniques make use of annotated corpora to acquire high-level language knowledge for different tasks, including POS tagging.
This knowledge is estimated from the corpora which are usually
tagged with the correct part
of speech labels for the words. Machine learning based tagging
techniques facilitate the
development of taggers in shorter time and these techniques can
be transferred for use with
corpora of other languages. Several machine learning algorithms
have been developed for the
POS disambiguation task. These algorithms range from instance
based learning to several
graphical models. The knowledge acquired may be in the form of
rules, decision trees,
probability distribution, etc. The encoded knowledge in
stochastic methods may or may not
have direct linguistic interpretation. But typically such
taggers need to be trained with a
sizeable amount of annotated data to achieve high accuracy.
Though significant amounts of
annotated corpus are often not available for most languages, it
is easier to obtain large
volumes of un-annotated corpus for most of the languages. The
implication is that one may
explore the power of semi-supervised and unsupervised learning
mechanism to get a POS
tagger.
Our interest is in developing taggers for the Bengali language.
Annotated corpora are
not readily available for this language, but the language is
morphologically rich. The use of
morphological features of a word, as well as word suffixes can
enable us to develop a POS
tagger with limited resources. In the present work, these
morphological features (affixes)
have been incorporated in different machine learning models
(Maximum Entropy,
Conditional Random Field, etc.) to perform the POS tagging task.
This approach can be
generalized for use with any morphologically rich language in a resource-poor scenario.
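As an illustration of how such affix information can be turned into features for a machine learning model such as Maximum Entropy or CRF, the sketch below extracts character-suffix features for a single word. The feature names and the suffix window are invented for illustration; they are not the actual feature set used in this work:

```python
def suffix_features(word, max_len=3):
    """Build a feature dictionary for one word, of the kind fed to
    MaxEnt/CRF taggers: the word itself, its length, and its
    character suffixes up to max_len characters long."""
    feats = {"word": word.lower(), "length": len(word)}
    for n in range(1, max_len + 1):
        if len(word) > n:  # only suffixes strictly shorter than the word
            feats["suffix%d" % n] = word[-n:].lower()
    return feats

print(suffix_features("Running"))
# {'word': 'running', 'length': 7, 'suffix1': 'g', 'suffix2': 'ng', 'suffix3': 'ing'}
```

For a morphologically rich language, such suffix features let the learner generalize from seen inflected forms to unseen ones that share the same endings.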
The development of a tagger requires either developing an
exhaustive set of linguistic
rules or a large amount of annotated text. However, no tagged
corpus was available to us for
use in this task. We had to start with creating tagged resources
for Bengali. Manual part of
speech tagging is quite a time consuming and difficult process.
So we tried to work with methods in which a small amount of tagged resources can be used to effectively carry out the part of speech tagging task.
1.4 The Part-of-Speech Tagging Problem
Natural languages are ambiguous in nature. Ambiguity appears at
different levels of the
natural language processing (NLP) task. Many words take multiple
part of speech tags. The
correct tag depends on the context.
Consider, for instance, the following English and Bengali sentences:
1. Keep the book on the top shelf.
2. সকাবো তারা ক্ষেবত াঙ দিবে কাজ কবর
The sentences have a lot of POS ambiguity which must be resolved before the sentences can be understood. For instance, in example sentence 1, the words “keep” and “book” can be a noun or a verb; “on” can be a preposition, an adverb or an adjective; finally, “top” can be either an adjective or a noun. Similarly, in Bengali example sentence 2, the word “তারা” can be either a noun or a pronoun; “দিবে” can be either a verb or a postposition; “করে” can be a noun, a verb, or a postposition. In most cases POS ambiguity can be resolved by examining the context of the surrounding words. Figure 1 shows a detailed analysis of the POS ambiguity of an English sentence considering only the basic 8 tags. A box with a single line indicates the correct tag for a particular word where no ambiguity exists, i.e. only one tag is possible for the word. On the contrary, boxes with double lines indicate the correct POS tag of a word from a set of possible tags.
Figure 1: POS ambiguity of an English sentence with eight basic
tags.
Figure 2: POS ambiguity of a Bengali sentence with tagset of
experiment.
Figure 2 illustrates the details of the ambiguity classes for the Bengali sentence as per the tagset used in our experiment. As we are using a fine-grained tagset compared to the basic 8 tags, the number of possible tags for a word increases. POS tagging is the task of assigning appropriate grammatical tags to each word of an input text in its context of appearance. Essentially, the POS tagging task resolves ambiguity by selecting the correct tag from the set of possible tags for a word in a sentence.
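The kind of context-based resolution described above can be sketched as follows. This is a hypothetical toy example: the word lists and the single context rule are invented for illustration and are not the tagset or the rules used in this report:

```python
# Ambiguous words list their possible tags, most frequent first.
POSSIBLE = {
    "the": ["DET"],
    "book": ["NN", "VB"],   # "the book" vs. "to book a flight"
    "keep": ["VB", "NN"],
}

def choose_tag(prev_tag, word):
    """Pick one tag for `word`, using the previous word's tag as context."""
    tags = POSSIBLE.get(word.lower(), ["UNK"])
    if len(tags) == 1:
        return tags[0]
    # Context rule: a word directly after a determiner is read as a noun.
    if prev_tag == "DET" and "NN" in tags:
        return "NN"
    return tags[0]  # otherwise fall back to the most frequent tag

def tag(sentence):
    result, prev = [], None
    for w in sentence.split():
        t = choose_tag(prev, w)
        result.append((w, t))
        prev = t
    return result

print(tag("keep the book"))
# [('keep', 'VB'), ('the', 'DET'), ('book', 'NN')]
```

Here “book” is ambiguous in isolation, but the determiner “the” immediately before it selects the noun reading, mirroring how a rule based tagger narrows the set of possible tags by context.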
1.5 Applications of POS Tagging
The POS disambiguation task is useful in several natural language processing tasks. It is often the first stage of natural language understanding, after which further processing, e.g. chunking, parsing, etc., is done. Part-of-speech tagging is of interest for a number of applications, including speech synthesis and recognition, machine translation, lexicography, etc.
Most of the natural language understanding systems are formed by
a set of pipelined
modules; each of them is specific to a particular level of
analysis of the natural language text.
Development of a POS tagger influences several pipelined modules
of the natural language
understanding task. As POS tagging is the first step towards
natural language understanding, it
is important to achieve a high level of accuracy which otherwise
may hamper further stages
of the natural language understanding. In the following, we
briefly discuss some of the above
applications of POS tagging.
Speech synthesis and recognition: Part-of-speech gives a significant amount of information about a word and its neighbours, which can be useful in a language model for speech recognition (Heeman et al., 1997). The part of speech of a word also tells us something about how the word is pronounced; for example, "object" is pronounced OBject as a noun and obJECT as a verb.
Information retrieval and extraction: By augmenting a query given to a retrieval system with POS information, more refined information extraction is possible. For example, if a person wants to search for documents containing "book" as a noun, adding the POS information will eliminate irrelevant documents that contain "book" only as a verb. Also, patterns used for information extraction from text often use POS references.
Machine translation: The probability of translating a word in the source language into a word in the target language is effectively dependent on the POS category of the source word.
As mentioned earlier, POS tagging has been used in several other applications, such as a preprocessor for high level syntactic processing (noun phrase chunking), lexicography, stylometry, and word sense disambiguation. These applications are discussed in some detail in (Church, 1988; Ramshaw and Marcus, 1995; Wilks and Stevenson, 1998).
1.6 Motivation
A lot of work has been done in part of speech tagging of several
languages, such as English.
While some work has been done on the part of speech tagging of
different Indian languages
(Ray et al., 2003; Shrivastav et al., 2006; Arulmozhi et al.,
2006; Singh et al., 2006; Dalal et
al., 2007), the effort is still in its infancy. Very little work
has been done previously with part
of speech tagging of Bengali. Bengali is the main language
spoken in Bangladesh, the second
most commonly spoken language in India, and the seventh most
commonly spoken language
in the world.
Apart from being required for further language analysis, Bengali
POS tagging is of
interest due to a number of applications like speech synthesis
and recognition. Part-of-speech
gives significant amount of information about the word and its
neighbours which can be
useful in a language model for different speech and natural
language processing applications.
Development of a Bengali POS tagger will also influence several
pipelined modules of
natural language understanding system including: information
extraction and retrieval;
machine translation; partial parsing and word sense
disambiguation. The existing POS
tagging technique shows that the development of a reasonably
good accuracy POS tagger
requires either developing an exhaustive set of linguistic rules
or a large amount of annotated
text. We have the following observations.
i. POS tagging has a wide range of applications.
ii. Reputed companies like Google and Microsoft are concentrating on NLP applications, so POS tagging has gained more importance.
iii. Part of speech tagging using a rule based approach is a challenging task, and part of speech tagging resolves ambiguities.
Therefore, there is a pressing need to develop an automatic Part-of-Speech tagger for Bengali. With this motivation, the major goals of this report have been set.
1.7 Goals of Our Work
The primary goal of the thesis is to develop a reasonably good
accuracy part-of-speech
tagger for Bengali. To address this broad objective, we identify
the following goals:
We wish to investigate different machine learning algorithms to develop a part-of-speech tagger for Bengali.
Bengali is a morphologically-rich language. We wish to use the
morphological
features of a word, as well as word suffixes, to enable us to develop a POS tagger with limited resources.
As stemming is one of the pre-processing steps in developing an effective POS tagger, we wish to stem a few Bengali text documents.
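A longest-match suffix stripper of the kind intended here can be sketched as follows. This is a hedged illustration: the handful of Bengali inflectional endings below is a tiny invented sample, and a practical stemmer would use a much larger, linguistically ordered suffix list:

```python
# A tiny sample of Bengali inflectional endings (plural and case
# markers); illustrative only, not an exhaustive suffix list.
SUFFIXES = ["গুলো", "দের", "কে", "ের", "রা", "টা"]

def stem(word, min_stem=2):
    """Strip the longest matching suffix while keeping at least
    min_stem characters of the stem; otherwise return the word."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print(stem("ছেলেদের"))  # "of the boys" -> the stem "ছেলে" ("boy")
```

Trying the longest suffixes first avoids stripping a short ending (such as the genitive marker) when a longer ending that contains it is actually present.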
1.8 Organization of the Report
The rest of this report is organized into chapters as follows:
Chapter 2 provides a review of the previous work on POS tagging.
A comparative review of the work is not given in this chapter, because such an attempt is extremely difficult due to the large number of publications in this area and the several theories and techniques used by researchers over the years. Instead, a brief review of the work based on the different techniques used for POS tagging is presented. This chapter also presents a discussion of English language POS taggers and Indian language POS taggers.
Chapter 3 supplies information about several important issues related to POS tagging which can greatly influence the performance of a tagger, i.e. the corpora and the Bengali tagset.
Chapter 4 provides information about the developed system and the way it is developed; the system architecture is also shown in this chapter.
Chapter 5 presents the experimental results and discusses them.
Chapter 6 presents the general conclusion; the summary of the work and its contributions are outlined, along with a discussion of the scope for future research work.
CHAPTER 2
Prior Work
2.1 Prior Work in POS Tagging
The area of automated part-of-speech tagging has been enriched over the last few decades by contributions from several researchers. Since its inception in the middle sixties and seventies (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971), many new concepts have been introduced to improve the efficiency of taggers and to construct POS taggers for several languages. Initially, people manually engineered rules for tagging. Linguistic taggers incorporate the knowledge as a set of rules or constraints written by linguists. More recently, several statistical or probabilistic models have been used for the POS tagging task, providing transportable adaptive taggers. Several sophisticated machine learning algorithms have been developed that acquire more robust information. In general, all the statistical models rely on manually POS-labelled corpora to learn the underlying language model, which is difficult to acquire for a new language. Finally, combinations of several sources of information (linguistic, statistical and automatically learned) have been used in current research directions.
This chapter provides a brief review of the prior work in POS tagging. For the sake of conciseness, we do not aim to give a comprehensive review of the related work. Instead, we provide a brief review of the different techniques used in POS tagging. Further, we focus on a detailed review of Indian language POS taggers.
2.2 Linguistic Taggers
Automated part-of-speech tagging was initially explored in the middle sixties and seventies, when people manually engineered rules for tagging. The most representative of these pioneering taggers was TAGGIT (Greene and Rubin, 1971), which was used for the initial tagging of the Brown Corpus. Since then, a lot of effort has been devoted to improving the quality of the tagging process in terms of accuracy and efficiency.
Recent linguistic taggers incorporate the knowledge as a set of rules or constraints written by linguists. The current models are expressive and accurate, and they are used in very efficient disambiguation algorithms. The linguistic rules range from a few hundred to several thousand, and they usually require years of labour. The development of ENGTWOL (an English tagger based on the constraint grammar architecture) can be considered the most important in this direction. The constraint grammar formalism has also been applied to other languages such as Turkish.
The accuracy reported by the first rule-based linguistic English tagger was slightly below 80%. A constraint grammar for English tagging (Samuelsson and Voutilainen, 1997) has been presented which achieves a recall of 99.5% with a very high precision of around 97%. The advantages of linguistic taggers are that the models are written from a linguistic point of view and explicitly describe linguistic phenomena, and that the models may contain many complex kinds of information. Both allow the construction of extremely accurate systems. However, the linguistic models are developed by introspection (sometimes with the aid of reference corpora). This makes it particularly costly to obtain a good language model, and transporting the model to other languages would require starting over again.
2.3 POS Tagging Approaches
POS taggers are broadly classified into three categories: rule-based, empirical and hybrid. In the rule-based approach, hand-written rules are used to resolve tag ambiguity. The empirical POS taggers are further classified into example-based and stochastic taggers. Stochastic taggers are either HMM-based, choosing the tag sequence which maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features. The stochastic taggers are further classified into supervised and unsupervised taggers. Each of these supervised and unsupervised taggers is categorized into different groups based on the particular algorithm used. Fig. 2.3 shows the classification of POS tagging approaches.
2.3.1 Rule Based POS tagging
The rule-based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to words. These rules are often known as context frame rules. For example, a context frame rule might say something like: "If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective". One of the first and most widely used English POS taggers employing rule-based algorithms is Brill's tagger. The earliest algorithms for automatically assigning parts of speech were based on a two-stage architecture. The first stage used a dictionary to assign each word a list of potential parts of speech. The second stage used large lists of hand-written disambiguation rules to bring down this list to a single part-of-speech for each word. The
ENGTWOL tagger is based on the same two-stage architecture, although both the lexicon and the disambiguation rules are much more sophisticated than in the early algorithms.
Fig. 2.3: Classification of POS tagging approaches
2.3.2 Empirical Based POS tagging
The relative failure of rule-based approaches, the increasing availability of machine-readable text, and the increase in hardware capability (CPU, memory, disk space) with a decrease in cost are some of the reasons researchers prefer corpus-based POS tagging. The empirical approach to part-of-speech tagging is further divided into two categories: the example-based approach and the stochastic approach. The literature shows that the majority of developed POS taggers belong to the empirical approach.
2.3.2(a) Example Based POS tagging
The example-based approach depends on a trained or tagged corpus, on which the machine is trained with a learning technique. In example-based morphosyntactic tagging, the problem is formulated as a classification task. The features usually include the POS of neighbouring tokens, their orthographic forms, and sometimes also fixed-width affixes of the word forms.
2.3.2(b) Stochastic based POS tagging
The stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in the unannotated text. A stochastic approach requires a sufficiently large corpus and calculates the frequency, probability or statistics of each and every word in the corpus. The problem with this approach is that it can come up with sequences of tags for sentences that are not acceptable according to the grammar rules of a language. The use of probabilities in tags is quite old; probabilities in tagging were first used in 1965, a complete probabilistic tagger with Viterbi decoding was sketched by Bahl and Mercer (1976), and various stochastic taggers were built in the 1980s (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988). Supervised and unsupervised are the two broad categories of the stochastic approach.
Supervised POS tagging: The supervised POS tagging models require pre-tagged corpora which are used for training to learn information about the tagset, word-tag frequencies, rule sets etc. The performance of the models generally increases with the size of this corpus. Two familiar examples of supervised POS taggers are the Hidden Markov Model and Support Vector Machines.
Hidden Markov Model (HMM) based POS tagging: An alternative to the word frequency approach is known as the n-gram approach, which calculates the probability of a given sequence of tags. It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2 or 3 for practical purposes. These are known as the unigram, bigram and trigram models. The most common algorithm for implementing an n-gram approach for tagging new text is the HMM's Viterbi algorithm. The Viterbi algorithm is a search algorithm that avoids the polynomial expansion of a breadth-first search by trimming the
search tree at each level using the best 'm' Maximum Likelihood Estimates (MLE), where 'm' represents the number of tags of the following word. For a given sentence or word sequence, HMM taggers choose the tag sequence that maximizes formula (1):

P(word | tag) × P(tag | previous n tags) (1)
A bigram-HMM tagger of this kind chooses the tag ti for word wi that is most probable given the previous tag ti-1 and the current word wi:

ti = argmaxj P(tj | ti-1, wi) (2)
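The search described by formulas (1) and (2) can be sketched with a tiny bigram Viterbi decoder. The two tags, the transition table and the emission table below are invented toy values, not probabilities estimated from any corpus.

```java
import java.util.Arrays;

// Toy bigram-HMM Viterbi decoder: for each position and tag it keeps the
// best-scoring tag sequence so far, then follows back-pointers. The values
// are invented toy probabilities for two tags, N and V, and two words.
public class ViterbiSketch {
    static final String[] TAGS = {"N", "V"};
    // P(tag_i | tag_{i-1}); row = previous tag, column = current tag
    static final double[][] TRANS = {{0.3, 0.7},   // after N
                                     {0.8, 0.2}};  // after V
    // P(word | tag); row = word position in the toy sentence, column = tag
    static final double[][] EMIT = {{0.6, 0.4},    // "fish"
                                    {0.1, 0.9}};   // "swim"

    static String[] decode(double[][] emit) {
        int n = emit.length, k = TAGS.length;
        double[][] best = new double[n][k];
        int[][] back = new int[n][k];
        for (int t = 0; t < k; t++) best[0][t] = (1.0 / k) * emit[0][t]; // uniform start
        for (int i = 1; i < n; i++)
            for (int t = 0; t < k; t++)
                for (int p = 0; p < k; p++) {
                    double s = best[i - 1][p] * TRANS[p][t] * emit[i][t];
                    if (s > best[i][t]) { best[i][t] = s; back[i][t] = p; }
                }
        int bestTag = 0;                            // best final tag
        for (int t = 1; t < k; t++) if (best[n - 1][t] > best[n - 1][bestTag]) bestTag = t;
        String[] tags = new String[n];
        for (int i = n - 1; i >= 0; i--) {          // follow back-pointers
            tags[i] = TAGS[bestTag];
            if (i > 0) bestTag = back[i][bestTag];
        }
        return tags;
    }

    public static void main(String[] args) {
        // decode the two-word toy sentence "fish swim"
        System.out.println(Arrays.toString(decode(EMIT))); // [N, V]
    }
}
```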
Support Vector Machines (SVM): SVM is a machine learning algorithm for binary classification, which has been successfully applied to a number of practical problems, including NLP. Let {(x1, y1), ..., (xN, yN)} be the set of N training examples, where each instance xi is a vector in R^N and yi ∈ {−1, +1} is the class label. In its basic form, an SVM learns a linear hyperplane that separates the set of positive examples from the set of negative examples with maximal margin (the margin is defined as the distance of the hyperplane to the nearest of the positive and negative examples). This learning bias has proved to provide good generalization bounds for the induced classifiers.
SVMTool is intended to comply with all the requirements of modern NLP technology by combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the Support Vector Machines (SVM) learning framework, and by offering NLP researchers a highly customizable sequential tagger generator.
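At prediction time, the linear hyperplane described above reduces to the sign of the decision function w·x + b. A minimal sketch with hand-picked toy weights; this is not a trained SVM and is unrelated to the SVMTool implementation.

```java
// Sketch of the prediction step of a linear classifier of the SVM kind:
// the label of an instance x is the sign of the decision function w.x + b.
// The weight vector and bias below are hand-picked toy values, not learned.
public class LinearDecision {
    static int classify(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) score += w[i] * x[i];
        return score >= 0 ? 1 : -1;   // class label in {-1, +1}
    }

    public static void main(String[] args) {
        double[] w = {1.0, -2.0};
        double b = 0.5;
        System.out.println(classify(w, b, new double[]{2.0, 0.5})); // 2 - 1 + 0.5 = 1.5, prints 1
        System.out.println(classify(w, b, new double[]{0.0, 1.0})); // -2 + 0.5 = -1.5, prints -1
    }
}
```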
Unsupervised POS tagging: Unlike the supervised models, the unsupervised POS tagging models do not require a pre-tagged corpus. Instead, they use advanced computational methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules etc. Based on this information, they either calculate the probabilistic information needed by the stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.
Transformation-based POS tagging: In general, the supervised tagging approach usually requires large pre-annotated corpora for training, which are difficult to obtain in most cases. But recently, a good amount of work has been done to automatically induce the transformation rules. One approach to
automatic rule induction is to run an untagged text through a tagging model and get the initial output. A human then goes through the output of this first phase and corrects any erroneously tagged words by hand. This tagged text is then submitted to the tagger, which learns correction rules by comparing the two sets of data. Several iterations of this process are sometimes necessary before the tagging model can achieve considerable performance. The transformation-based approach is similar to the rule-based approach in the sense that it depends on a set of rules for tagging.
2.3.3 Hybrid Based Tagger
A hybrid approach combines the features of both the rule-based and stochastic approaches. Like rule-based systems, they use rules to specify tags; like stochastic systems, they use machine learning to induce rules from a tagged training corpus automatically. The transformation-based learning (TBL) tagger, or Brill tagger, shares features of the hybrid approach. This approach inherits the advantages and disadvantages of both the rule-based and stochastic approaches.
2.4 Indian Language POS Taggers
There has been a lot of interest in Indian language POS tagging in recent years. POS tagging is one of the basic steps in many language processing tasks, so it is important to build good POS taggers for these languages. However, very little work has been done on Bengali POS tagging, and only a very limited amount of resources is available. The oldest work on Indian language POS tagging we found is by Bharati et al. (Bharati et al., 1995). They presented a framework for Indian languages where POS tagging is implicit and merged with the parsing problem, in their work on a computational Paninian parser.
For Bengali, Dandapat et al. (2007) studied the possibility of developing a tagger using HMM and Maximum Entropy (ME) models. They used a morphological analyzer to compensate for the shortage of annotated corpora. With these two models they implemented a supervised tagger and a semi-supervised tagger, and reported an accuracy of around 88% for the two approaches. Ekbal et al. (2007) annotated a news corpus and developed an SVM-based tagger. They reported an accuracy of 86.84% for their tagger.
An attempt at Hindi POS disambiguation was made by Ray et al. (2003). The part-of-speech tagging problem was solved as an essential requirement for local word grouping. Lexical sequence constraints were used to assign the correct POS labels for Hindi. A morphological analyzer was used to find the possible POS of every word in a sentence.
A rule-based POS tagger for Tamil (Arulmozhi et al., 2004) has been developed using a combination of both lexical rules and context-sensitive rules. They used a very coarse-grained tagset of only 12 tags. They reported an accuracy of 83.6% using only lexical rules and 88.6% after applying the context-sensitive rules. The accuracies reported in the work were tested on a very small reference set of 1,000 words.
Shrivastav et al. (2006) presented a CRF-based statistical tagger for Hindi. They used 24 different features (lexical features and spelling features) to generate the model parameters. They experimented on a corpus of around 12,000 tokens, annotated with a tagset of size 23. The reported accuracy was 88.95% with 4-fold cross-validation.
Smriti et al. (2006) describe a technique for morphology-based POS tagging in a limited-resource scenario. The system uses a decision tree based learning algorithm (CN2). They used a stemmer, a morphological analyzer and a verb group analyzer to assign morphotactic tags to all the words, which identify the ambiguity scheme and unknown words. Further, a manually annotated corpus was used to generate if-then rules to assign the correct POS tags for each ambiguity scheme and for unknown words. A tagset of 23 tags was used for the experiment. An accuracy of 93.5% was reported with 4-fold cross-validation on a modestly sized corpus (around 16,000 words).
In 2006, two machine learning contests were organized on part-of-speech tagging and chunking for Indian languages, providing a platform for researchers to work on a common problem. Both contests were conducted for three different Indian languages: Hindi, Bengali and Telugu. All the languages used a common tagset of 27 tags. The results of the contests give an overall picture of Indian language POS tagging. The first contest was conducted by the NLP Association of India (NLPAI) and IIIT Hyderabad in the summer of 2006.
CHAPTER 3
Foundational Considerations
In this chapter we discuss several important issues related to the POS tagging problem which can greatly influence the performance of a tagger. An important issue of POS tagging is collecting and annotating corpora; most statistical techniques rely on some amount of annotated data to learn the underlying language model. The size of the corpus and the amount of corpus ambiguity have a direct influence on the performance of a tagger. Finally, there are several other issues, e.g. how to handle unknown words and smoothing techniques, which contribute to the performance of a tagger.
In the following sections, we discuss two important issues related to POS tagging. The first section discusses the process of corpora collection; the second section presents the tagset used for our experiments.
3.1. Corpora Collection
The compilation of raw text corpora is no longer a big problem, since nowadays most documents are written in a machine-readable format and are available on the web. Collecting raw corpora is a somewhat more difficult problem for Bengali (and might be for other Indian languages too) compared to English and other European languages. This is due to the fact that many different encoding standards are in use. Also, the number of Bengali documents available on the web is comparatively limited.
Raw corpora do not carry much linguistic information. Corpora acquire higher linguistic value when they are annotated, that is, when some amount of linguistic information (part-of-speech tags, semantic labels, syntactic analysis, named entities etc.) is embedded into them. Although many corpora (both raw and annotated) are available for English and other European languages, we had no tagged data for Bengali to start the POS tagging task. The raw corpus developed at TDIL was available to us, and we used a portion of the TDIL corpus to develop the annotated data for the experiments.
3.2. The Tagset
With respect to the tagset, the main feature that concerns us is its granularity, which is directly related to the size of the tagset. If the tagset is too coarse, the tagging accuracy will be much higher, since only the important distinctions are considered and the classification may be easier both for human annotators and for the machine; but some important information may be missed due to the coarse-grained tagset. On the other hand, a too fine-grained tagset may enrich the supplied information, but the performance of the automatic POS
tagger may decrease. A much richer model needs to be designed to capture the encoded information when using a fine-grained tagset, and hence it is more difficult to learn.
So, when we are about to design a tagset for the POS disambiguation task, some issues need to be considered. Such issues include the type of application (some applications may require more complex information, whereas only category information may be sufficient for other tasks) and the tagging technique to be used (rule-based approaches can adopt large tagsets very well; supervised/unsupervised learning). Further, a large annotated corpus is usually required for rule-based POS taggers. A too fine-grained tagset might be difficult for human annotators to use during the development of a large annotated corpus. Hence, the availability of resources needs to be considered during the design of a tagset.
The Bureau of Indian Standards (BIS) has recommended the use of a common tagset for the part-of-speech annotation of Indian languages. The tagset, which incorporates the advice of experts and stakeholders in the area of natural language processing and language technology for Indian languages, is to be followed in the annotation tasks taking place for Indian languages after August 2010.
The BIS tagset has a total of 38 annotation-level tags which are common to all the Indian languages covered under this tagset. We are using the basic eight (8) part-of-speech tags, i.e. noun, pronoun, verb, adjective, adverb, preposition, conjunction and interjection, along with residuals and quantifiers, from the BIS tagset.
The table below describes the individual tags, with the examples used in our experiments:
Category                  Tag   Examples
Noun                      N     িীপঙ্কর, রাম, লযাম, দিল্লী etc.
Pronoun                   PR    ক্ষস, দতদি, তা, দযদি, আদম, তুদম, আমরা, তারা etc.
Verb                      V     কদর, করাম, খাওো, ে, ক্ষদখ etc.
Adjective                 JJ    খারাপ, ভাবা, েড়, ক্ষছটা etc.
Adverb                    RB    অদিকতর, অিবূর, এতটা etc.
Preposition/Postposition  PSP   ক্ষেবক, হইবত, উপবর, দভতর etc.
Conjunction               CC    এেং, দকন্তু, অেচ, অেো
Interjection              INJ   প্লীজ, িন্নোি, সােিাি, হাাঁ etc.
Residuals                 RD    । , ?, “ ”, ‘ ’
Quantifiers               QT    প্রেম, ১, ২ etc.

Table 3.2: The tagset for Bengali with 10 tags
CHAPTER 4
Tagging with Rule Based
Approach
In the first section we describe the rule-based approach to POS tagging, since only a small labelled training set is available to us for Bengali POS tagging. The second section is devoted to our particular approach to Bengali POS tagging using the rule-based approach.
4.1. Rule Based Approach
The rule-based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to each word in a sentence. These rules are often known as context frame rules. Most rule-based taggers have a two-stage architecture. The first stage is simply a dictionary look-up procedure, which returns a set of potential tags and appropriate syntactic features for each word. The second stage uses a set of hand-written rules to discard contextually illegitimate tags to get a single best POS for each word. A context frame rule might say something like: "If the current word is a postposition, then there is a high probability that the previous word is a noun." E.g., in the sentence "ক্ষস লদিির উপর পাের ছুবর মার।" the noun-adjective {N, JJ} ambiguity is present in the word "লদিির", so the mentioned rule resolves this ambiguity.
In addition to contextual information, many taggers use morphological information to help in the disambiguation process. An example of a rule that makes use of morphological information is: IF a word ends with "ইরেছি / ছিলাম" and the preceding word is a verb, THEN label it a verb (V).
Speed is an advantage of rule-based taggers, and unlike stochastic taggers, they are deterministic. However, maximum effort is required in writing the disambiguation rules. Also, a rule-based tagger is usable for only one language, i.e. it is language-dependent; using it for another language requires rewriting most of the program.
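The two kinds of rules above, the noun-before-postposition context rule and the verbal-suffix rule, can be sketched as follows. The romanized tokens and the postposition and suffix lists are hypothetical placeholders, not the actual Bengali resources used by our system.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the two rules described in this section, applied left to right
// over a token array: a context rule (the word before a postposition is likely
// a noun) and a morphological suffix rule (a known verbal inflection marks a
// verb). All tokens, postpositions and suffixes are invented placeholders.
public class BengaliRuleSketch {
    static final List<String> POSTPOSITIONS = Arrays.asList("upore", "theke");  // hypothetical
    static final List<String> VERB_SUFFIXES = Arrays.asList("chhi", "chhilam"); // hypothetical

    static String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            tags[i] = "UNK";                    // default: left to the dictionary stage
            if (POSTPOSITIONS.contains(tokens[i])) {
                tags[i] = "PSP";
                if (i > 0) tags[i - 1] = "N";   // context rule: noun before a postposition
            } else {
                for (String suffix : VERB_SUFFIXES) {
                    if (tokens[i].endsWith(suffix)) { tags[i] = "V"; break; } // suffix rule
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String[] tokens = {"lathir", "upore", "korchhi"};
        System.out.println(Arrays.toString(tag(tokens))); // [N, PSP, V]
    }
}
```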
4.2. Our Approach
4.2.1 System Flow Diagram
This section is concerned with how all the processing tasks are designed. Here we are concerned with the following:
Which modules need to be designed?
How are they interconnected?
Fig 4.2.1: Flow diagram. Start → show the GUI → accept a Bengali language sentence → divide the sentence into tokens → if a token carries a suffix/affix, split it into its stem by stemming → assign tags to the tokens in the tagger → find ambiguous words → assign tags to the ambiguous words using POS tagging rules → view the result → stop.
Fig 4.2.1 shows the diagrammatic representation of the flow of data throughout the system. It consists of the following components/modules:
GUI (Graphical User Interface): the interface through which the user communicates with the back-end files. The interface should be simple in appearance and easy to maintain.
Tokenizer: This module generates the tokens of the given input sentence. It also calls the other modules when required. The tokens of the sentence are stored in a String array for further processing.
Stemming: The stemming module splits a word into its stem, i.e. its root. Stemming is an important application and a common requirement of any natural language processing task. Word stemming is useful for indexing and search systems; indexing and searching are key concepts of text mining applications and IR systems. It has also been used to improve the performance of spelling checkers, where full morphological analysis would be computationally expensive. A stemmer can also reduce the size of a dictionary, which is the main reason to use a stemmer in spelling-checker applications on mobile and other handheld devices.
Tagging: The tagging module assigns tags to tokens, searches for ambiguous words and, according to their type, assigns special symbols to them. Words which are not present in the lexicon are treated as unknown. The ambiguous words are those which act as a noun and an adjective, or as an adjective and an adverb, depending on the context.
Resolving ambiguity: The ambiguity identified in the tagging module is resolved using Bengali grammar rules.
Displaying results: This module displays the final result. The tokens, i.e. the words in the sentence, are shown with their corresponding parts of speech.
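The tokenizer, stemmer and tagger modules above can be sketched as a minimal pipeline. The suffix list and lexicon here are toy English placeholders (chosen for readability), not the system's actual Bengali resources.

```java
import java.util.HashMap;
import java.util.Map;

// End-to-end sketch of the flow in Fig 4.2.1: tokenize the sentence, strip a
// known suffix from each token (stemming), then look the stem up in a lexicon.
// Unknown stems are tagged UNK, as described for the tagging module.
public class PipelineSketch {
    static final String[] SUFFIXES = {"ing", "ed", "s"};        // hypothetical suffixes
    static final Map<String, String> LEXICON = new HashMap<>(); // toy lexicon
    static {
        LEXICON.put("walk", "V");
        LEXICON.put("dog", "N");
        LEXICON.put("park", "N");
    }

    static String[] tokenize(String sentence) {                 // Tokenizer module
        return sentence.trim().split("\\s+");
    }

    static String stem(String token) {                          // Stemming module
        for (String suffix : SUFFIXES) {
            if (token.length() > suffix.length() && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    static String tagSentence(String sentence) {                // Tagging module
        StringBuilder out = new StringBuilder();
        for (String token : tokenize(sentence)) {
            String tag = LEXICON.get(stem(token));
            if (tag == null) tag = "UNK";                       // unknown word
            if (out.length() > 0) out.append(' ');
            out.append(token).append('/').append(tag);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tagSentence("dogs walked park"));    // dogs/N walked/V park/N
    }
}
```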
CHAPTER 5
Experimental Result &
Discussion
5.1 Tools Used
Software: A few open-source software tools were used in the development of the project work; they are mentioned below:
- jdk 1.7.0_05
- NetBeans IDE 7.1.1: NetBeans IDE lets you quickly and easily develop Java desktop, mobile and web applications. It can be downloaded directly at https://netbeans.org/downloads/
Fig 5.1.1: NetBeans IDE
- Notepad: Notepad is a simple text editor for Microsoft Windows and a basic text-editing program that you can use to create documents. It has been included in all versions of Microsoft Windows since Windows 1.0 in 1985, so there is no need to download it. It is a common text-only (plain text) editor; the resulting files are typically saved with the .txt extension. It looks like a simple application, but it has a great impact on software development: it can be used to write programming languages such as C, C++, Java, HTML and many more, saved with the appropriate extensions.
Fig 5.1.2: Notepad
Hardware: We designed and developed the whole system on an Acer notebook with the following specification:
Processor: Intel(R) Pentium(R) CPU 2030M @ 2.50GHz
RAM: 4.00 GB
HDD: 500 GB
Although the current system is adequate for development, it is poor at handling large amounts of data: the larger the data, the slower the system's response, simply because of the processor. With an i3 or better, the speed would improve.
5.2 Graphical User Interface
Snapshot 1: This is the welcome screen of our project. Click the Proceed button to go to the Tagging section.
Fig 5.2.1: Welcome Screen
Snapshot 2: Here we first enter the Bengali sentence to be tagged in the specified blank text field, then press the TAG button for tagging. The RESET button removes all the text from the text field.
Fig 5.2.2: The Tagging Menu
5.3 Experimental Results
The system has been tested with a set of data. The input text is taken from the corpus discussed in Chapter 3. Only four results are shown in the following snapshots.
Result I:
Result II:
Result III:
Result IV:
5.4 Result Discussion
The accuracy of the tagger is computed as the ratio of the number of words correctly tagged by the system to the total number of tested words:

Accuracy = (number of correctly tagged words / total number of tested words) × 100%
The following are the observations that have been made during
testing the system.
Test     No. of tested words   Accuracy
Test 1   150                   67%
Test 2   400                   71%
Test 3   800                   78%
Test 4   1200                  82%
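The overall figure reported for the system is the plain mean of the four per-test accuracies in the table, which can be checked in a couple of lines:

```java
// Check of the overall accuracy: the mean of the four per-test accuracies
// from the table above (67, 71, 78 and 82 percent).
public class AccuracyMean {
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[]{67, 71, 78, 82})); // prints 74.5
    }
}
```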
The overall accuracy of the system was computed by taking the mean of the four test results; the overall accuracy achieved was 74.50%.
CHAPTER 6
Conclusion & Future
Works
6.1 Conclusion
Part-of-speech tagging plays an important role in various speech and language processing applications in NLP. Since many reputed companies like Google and Microsoft are concentrating on natural language processing applications, it has gained more importance. Currently, many tools are available to do the task of part-of-speech tagging. In this report, our effort was a computational linguistic analysis of the Bengali language by developing a tagging system, and we achieved an accuracy of 74.50%. It was shown that the performance of the tagger depends upon the size of the lexicon and corpus; the performance can be increased by increasing the size of the lexicon.
6.2 Future Work
Future work is still to be done in several directions. Though we attained an accuracy of 74.50% for known words, it remains an open area to enhance the performance of the tagger. This can be achieved by enlarging the tagset and the lexicon, so that the tagger can perform a less ambiguous classification of the text. One can also compare our results with the results achieved by other Indian language tagging systems.
References
Church, K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas. 136-143.

Ramshaw, L. A. and Marcus, M. P. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora. ACL, 1995.

Wilks, Y. and Stevenson, M. 1997. Combining independent knowledge sources for word sense disambiguation. In Proceedings of the Third Conference on Recent Advances in Natural Language Processing (RANLP-97), Bulgaria. 1-7.

Heeman, P. A. and Allen, J. F. 1997. Incorporating POS tagging into language modelling. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech), Rhodes, Greece.

Ray, P. R., Harish, V., Basu, A. and Sarkar, S. 2003. Part of Speech Tagging and Local Word Grouping Techniques for Natural Language Processing. In Proceedings of the 1st International Conference on Natural Language Processing.

Shrivastav, M., Melz, R., Singh, S., Gupta, K. and Bhattacharyya, P. 2006. Conditional Random Field Based POS Tagger for Hindi. In Proceedings of MSPIL, Bombay. 63-68.

Dandapat, S., Sarkar, S. and Basu, A. 2007. "Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario". In: Association for Computational Linguistics. 221-224.

Bharati, A., Chaitanya, V. and Sangal, R. 1995. "Natural Language Processing: A Paninian Perspective". Prentice-Hall India, New Delhi.

Arulmozhi, P., Rao, R. K. and Sobha, L. 2006. A Hybrid POS Tagger for a Relatively Free Word Order Language. In Proceedings of Modeling and Shallow Parsing of Indian Languages (MSPIL), Bombay. 79-85.

Singh, S., Gupta, K., Shrivastav, M. and Bhattacharyya, P. 2006. Morphological Richness Offsets Resource Demand: Experience in Constructing a POS Tagger for Hindi. In Proceedings of COLING/ACL 2006. 779-786.

Dalal, A., Nagaraj, K., Sawant, U., Shelke, S. and Bhattacharyya, P. 2007. Building Feature Rich POS Tagger for Morphologically Rich Languages: Experience in Hindi. In Proceedings of ICON, India.

Greene, B. B. and Rubin, G. M. 1971. Automatic grammatical tagging of English. Technical Report, Department of Linguistics, Brown University.

Samuelsson, C. and Voutilainen, A. 1997. Comparing a linguistic and a stochastic tagger. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics (EACL), Madrid, Spain. 246-253.

Ekbal, A. and Bandyopadhyay, S. 2007. "Lexicon Development and POS Tagging Using a Tagged Corpus for Marathi Text". In: International Journal of Computer Science and Information Technologies, Vol. 5(2), 2014. 1322-1326.
APPENDIX
CD