Transcript
1. Presented by: Sujit Kumar Das, M.Tech 3rd sem, IT, Roll-021413, No-363202205. POS Tagging And Token Classification By Using Bangla Tokenizer. Under the supervision of Mr. Sourish Dhar, Asst. Professor, Dept. of IT, Assam University.
2. Contents: Introduction; Literature Survey; Our Proposal; Future Works To Be Done; Conclusions; References.
3. Introduction: What is NLP? A field of computer science,
artificial intelligence, and linguistics concerned with the
interactions between computers and human (natural) languages [1].
NLP provides a means of analyzing text. The goal of NLP is to make
computers analyze and understand the languages that humans use
naturally.
4. Cont Why Natural Language Processing? Computers see English
text the way we would see text in a language we do not know. People
have no trouble understanding language, but computers do: they have
no common-sense knowledge and no reasoning capacity.
5. Cont What do we need for an NLP task? Knowledge about language.
Knowledge about the world. A way to combine knowledge sources.
6. Cont Language Technology:
Mostly solved: Spam Detection, POS Tagging, Named Entity Recognition.
Making good progress: Sentiment Analysis, Word Sense Disambiguation, Parsing, Machine Translation.
Still really hard: Question Answering, Paraphrase, Summarization, Dialog.
7. Cont POS Tagging:
Input: The grand jury commented on a number of other topics.
Output: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
NE Recognition:
Input: Dan went to London to attend a conference on NLP in 2012.
Output: Dan went to London to attend a conference on NLP in 2012. Name: Dan. Location: London. Date: 2012.
8. Cont What Is Tokenization? Tokenization is the process of
breaking a stream of text up into words, phrases, symbols and other
meaningful elements called tokens. Token: a sequence of characters
that can be treated as a single logical entity. Typical tokens are:
Natural languages: words, numbers, abbreviations, symbols.
Programming languages: identifiers, keywords, operators, special symbols, constants.
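The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's implementation (the slides show no code): the name `tokenize` and the token pattern are hypothetical. The pattern keeps runs of characters from the Unicode Bengali block (U+0980 to U+09FF) together, because Bengali vowel signs are combining marks that Python's `\w` does not match.

```python
import re

# Assumed token pattern (illustrative), tried left to right at each position:
#   1. a run of characters from the Bengali block (letters, vowel signs, digits)
#   2. a run of ASCII letters
#   3. a run of ASCII digits
#   4. any other single non-space character (punctuation, symbols, the danda)
TOKEN_PATTERN = re.compile(r"[\u0980-\u09FF]+|[A-Za-z]+|[0-9]+|\S")

def tokenize(text):
    """Break a stream of text into a list of tokens."""
    return TOKEN_PATTERN.findall(text)

# "\u0964" is the danda, the Bengali full stop.
print(tokenize("আমি ভাত খাই\u0964"))
print(tokenize("NLP in 2012."))
```

The danda lies outside the Bengali block, so it falls through to the last alternative and comes out as a separate punctuation token.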
9. Cont What Is a Tokenizer? The job of a tokenizer is to break
up a stream of text into tokens. Why a tokenizer? It performs a
crucial task in pre-processing any natural language. It helps handle
semantic issues in the subsequent stages of machine translation.
It produces a structural description of an input sentence. For
language modeling, dividing the input text into tokens is
compulsory [9].
10. Cont What is Token Classification? Token classification
means identifying each token (word/term) in a document and
classifying it into one of some predefined categories. These
predefined categories can be the name of a person, symbols,
punctuation, abbreviations, numbers, dates, etc.
11. Cont Steps in Token Classification: Tokenize the given
input text. Assign to each token the class (or tag) that it belongs
to. For example (columns: Token, Class): Name, Number, Word.
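The second step above (assigning a class to each token) can be sketched with a few ordered rules. This is a minimal illustration, not the system described in the slides: the function name `classify_token`, the date format, and the abbreviation pattern are assumptions. The class names Date, Number, Abbreviation, Punctuation and Word follow the categories the slides mention.

```python
import re
import unicodedata

# Assumed surface patterns (illustrative only).
DATE = re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")      # e.g. 12/05/2014
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")                  # \d also matches Bengali digits
# One or more short letter groups, each followed by a dot (e.g. "Dr.", "U.P.").
# \u0980-\u09E3 covers Bengali letters and signs but not Bengali digits.
ABBREVIATION = re.compile(r"(?:[A-Za-z\u0980-\u09E3]{1,4}\.)+")

def is_punctuation(token):
    """True if every character is in a Unicode punctuation (P*) or symbol (S*) category."""
    return bool(token) and all(unicodedata.category(ch)[0] in "PS" for ch in token)

def classify_token(token):
    """Return the first class whose rule matches; fall back to 'Word'."""
    if DATE.fullmatch(token):
        return "Date"
    if NUMBER.fullmatch(token):
        return "Number"
    if ABBREVIATION.fullmatch(token):
        return "Abbreviation"
    if is_punctuation(token):
        return "Punctuation"
    return "Word"

for tok in ["Dan", "2012", "12/05/2014", "Dr.", ",", "\u0964"]:
    print(tok, classify_token(tok))
```

The rules are checked in a fixed order so that a date is not mistaken for a number. Classes such as Name, Title or Surname would need word lists or context, which this sketch does not attempt.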
12. Cont Why Bengali Language Processing: Bengali is one of the ten
most widely spoken languages in the world. Little research work has
been done so far. Challenges in Bengali Language Processing: Its
grammatical vastness. It is not as well structured as Western
languages (for example, English).
13. Cont Goals of Bengali Language Processing:
To develop technology and standards to make computer usage Bangla enabled.
To establish standards for Bangla text processing to ensure interoperability across platforms.
To develop a large standardized corpus for Bangla text and speech.
To create an ensemble of available Bangla software and corpora in a standardized form and make them easily available to all.
To develop new software and modify or enhance the existing software.
To develop suitable speech technology for Bangla.
14. Literature Survey: A tokenizer is a component of a parser.
Parsing natural language text is more difficult than parsing
computer languages (as done in compilers and word processors)
because the grammars of natural languages are complex and ambiguous
and the vocabulary is unbounded [8]. Natural language applications,
namely Information Extraction, Machine Translation, and Speech
Recognition, need an accurate parser [8]. A tokenizer plays a
significant part in a parser by identifying groups of words that
exist as a single, complex word in a sentence. Later on, it
breaks up the complex word into its
15. Cont Related Works: Some existing standard tokenizers:
Stanford Tokenizer for the English language [10]. Shallow Tokenizer
for the Bengali language. Vaakkriti Tokenizer for the Sanskrit
language [2]. These tokenizers were each developed for particular
languages only, i.e., no single tokenizer works for all languages.
16. Cont Stanford Tokenizer: Developed mainly for the English
language and later extended to Arabic, Chinese and Spanish.
It was developed in Java. Online Interface:
17. Cont Results after parsing: S = Sentence, NP = Noun Phrase,
NNS = Noun, plural, VP = Verb Phrase, VBZ = Verb, 3rd person
singular present, VBN = Verb, past participle, PP = Prepositional
Phrase, TO = to, IN = Preposition or subordinating conjunction.
18. Cont Shallow Bangla Tokenizer: The shallow parser gives
the analysis of a sentence in terms of: morphological analysis, POS
tagging, and chunking. Apart from the final output, the intermediate
output of the individual modules is also available.
19. Cont Online Interface:
20. Cont Result after submitting:
21. Cont Bengali Stemmers: A Rule-Based Stemmer for the Bengali
language by Sandipan Sarkar, IBM, and Sivaji Bandyopadhyay, Jadavpur
University [12]. A lightweight stemmer for Bengali, used in a
spelling checker, by Md. Zahurul Islam, Md. Nizam Uddin and
Mumit Khan, CRBLP, BRAC University, Dhaka, in 2007 [13]. Yet Another
Suffix Stripper (YASS), which uses a clustering-based approach built
on string distance measures and requires no linguistic knowledge, by
P. Majumdar, Gobinda Kole, ISI, Pabitra Mitra, IIT, and Kalyankumar
Dutta, Jadavpur University in
22. Cont Comparison of Three Stemmers:
Stemmer         Method Used                Accuracy (%)
Rule-Based      Orthographic syllable      89.0
Light weight    Longest match basis        90.8
YASS            String distance measure    88.0
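The "longest match" idea listed in the comparison can be illustrated with a small suffix stripper. This is a sketch under stated assumptions and not the stemmer of [13]: the function name `stem`, the minimum stem length, and the short suffix list (a handful of common Bengali plural and case endings) are chosen only for illustration.

```python
# Small illustrative suffix list (an assumption; a real stemmer needs a far
# larger, linguistically curated list).
SUFFIXES = ["গুলো", "দের", "ের", "রা", "কে", "টি", "টা"]

def stem(word, suffixes=SUFFIXES, min_stem_length=2):
    """Strip the longest matching suffix from word.

    Lengths are counted in Unicode code points. A suffix is removed only if
    at least min_stem_length code points remain, which guards against
    reducing short words to nothing.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["ছেলেদের", "বাংলাদেশের", "ভাত"]:
    print(w, "->", stem(w))
```

Trying the longest suffixes first matters: a word ending in "দের" also ends in "ের", and stripping only the shorter ending would leave part of the suffix attached to the stem.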
23. Cont POS Tagger: Supervised POS Tagging: Uses pre-tagged
corpora for training, to learn information about the tagset,
word-tag frequencies, rule sets, etc. [11]. E.g., N-Gram, Maximum
Entropy Model (ME), Hidden Markov Model (HMM), etc. Unsupervised POS
Tagging: Does not require a pre-tagged corpus. Such taggers use
advanced computational methods to automatically induce tagsets.
E.g., Brill, Baum-Welch algorithm, etc. [11].
24. Cont Supervised Model POS Taggers Comparison (Tagger: Applied Method):
Uni-Gram (N=1): Most likely tag approach.
HMM: One sentence at a time. Formula: P(word | tag) * P(tag | previous n tags).
Bi-Gram (N=2): Same as unigram, but also considers the tag of the previous word.
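The unigram ("most likely tag") approach in the comparison above can be sketched as follows. This is an illustration only, not the taggers evaluated in [11]: the function names are hypothetical, the training data is just the single tagged sentence from the POS tagging slide, and falling back to `NN` for unseen words is an assumption.

```python
from collections import Counter, defaultdict

# Training data: the single tagged sentence from the POS tagging slide.
TAGGED_SENTENCES = [[
    ("The", "DT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD"),
    ("on", "IN"), ("a", "DT"), ("number", "NN"), ("of", "IN"),
    ("other", "JJ"), ("topics", "NNS"), (".", "."),
]]

def train_unigram(tagged_sentences):
    """Map each (lowercased) word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default_tag="NN"):
    """Tag each word independently; unseen words get default_tag (an assumption)."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]

model = train_unigram(TAGGED_SENTENCES)
print(tag_words(["The", "jury", "commented", "."], model))
```

A bigram tagger would count tags per (previous tag, word) pair instead of per word, and an HMM tagger would score whole tag sequences with the product given in the formula above.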
25. Cont
Sentences   Tokens   Uni-Gram Accuracy (%)   Bi-Gram Accuracy (%)   HMM Accuracy (%)
87          1002     28.6                    28.6                   39.3
304         4003     42.4                    41.9                   49.7
532         8026     48.1                    47.9                   53.6
677         10001    49.8                    49.5                   54.3
Bangla: SPSAL corpus and tagset. Test data: 400 sentences, 5225 tokens
from the SPSAL test corpus [11].
26. Cont Problem Domain: Bangla is very rich in inflections,
vibhaktis (suffixes) and karakas, and these are often ambiguous.
It is not easy to provide the necessary semantic and world knowledge
that we humans often use when we parse and understand various
Bangla sentences. So, mainly due to this grammatical vastness, the
design of a Bangla tokenizer is not an easy task.
27. Cont Bengali Grammar: POS
28. Cont Bengali Grammar: Genders. There are four genders in
Bengali grammar: 1. Pung lingo (masculine) 2. Stree lingo (feminine)
3. Ubha lingo (common) 4. Klib lingo (material)
29. Cont Bengali Grammar: Numbers. Like English, Bengali has
two numbers. Singular: when we refer to a single object or person,
e.g., a man, a girl. Plural: when we refer to more than one object
or person, e.g., two men, mangoes.
30. Our Proposal: We are going to develop a system that can be
used to tokenize Bengali text and that will also be able to solve
the problem of token classification.
Fig: Our Model. Raw (unstructured) text -> Natural Language Processing (pre-processing, part-of-speech tagging, token classification) -> annotated (structured) text.
31. Cont Flow Chart: Input -> Words -> Stop Words Removal -> Stemming -> POS Tag -> Classify Text
32. Cont Input: The input will be a Bengali text.
Words (completed): The text is split into words after removing all
non-character symbols and white space, and the words are then stored
in an Excel file. Stop Words Removal (completed): Stop words are the
set of frequently occurring words that do not contribute relevant
information to the text classification task. Root words: The
original form of a word that remains after stripping its prefixes
and suffixes is known as the root
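The word-splitting and stop-word removal steps described above can be sketched in Python. This is not the presenter's implementation (the slides show only screenshots): the function names and the four-word stop list are illustrative assumptions, and writing the result to an Excel file is left out.

```python
import re

# Tiny illustrative stop-word list (an assumption): Bengali function words
# roughly meaning "and", "and/also", "this", "that".
STOP_WORDS = {"এবং", "ও", "এই", "যে"}

def split_words(text):
    """Keep only runs of Bengali-block characters; drop everything else."""
    return re.findall(r"[\u0980-\u09FF]+", text)

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop words that appear in the stop-word set, preserving order."""
    return [word for word in words if word not in stop_words]

words = split_words("আমি এবং তুমি\u0964")
print(words)
print(remove_stop_words(words))
```

Using a set for the stop list keeps each membership check constant-time, which matters once the list grows to a realistic size.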
33. Cont POS Tagging: After the root word is found (stemming),
each element is placed into one of a set of previously defined
classes. In this way a part-of-speech (POS) tag is attached to
each word. Token Classification: Token classification means
categorizing the tokens obtained from the tasks above into
some pre-defined classes. The classes we consider are mainly Title,
Surname, Collocation, Punctuation, Abbreviation, Number,
Date, Unknown and Foreign word.
34. Current Status Of Our Work: Snapshot 1: System Interface
35. Cont Snapshot 2: After Loading Using Load Button
36. Cont Snapshot 3: After getting tokens from
37. Cont Snapshot 4: Tokens after removing stop words
38. Cont Snapshot 3: After execution, words are split and
stored in an Excel file.
39. Future Works To Be Done: Stemming, i.e., finding root
words. POS Tagging. Classification.
40. Conclusions: Although tokenizing is a fundamental task in
language processing, the richness of Bengali grammar and the
structure of Bengali text make it a difficult one for the Bengali
language. Stemming is also a difficult task. Building an
effective Bangla tokenizer requires extensive knowledge of
Bengali grammar. We hope to develop a system that overcomes the
difficulties and limitations of the existing Bangla tokenizer,
produces tokens efficiently, and finally classifies those
tokens.
41. References:
[1] Wikipedia.
[2] Aasish Pappu and Ratna Sanyal, Vaakkriti: Sanskrit Tokenizer, Indian Institute of Information Technology, Allahabad (U.P.), India.
[3] Firoj Alam, S. M. Murtoza Habib, Mumit Khan, Text Normalization System for Bangla, Center for Research on Bangla Language Processing, Department of Computer Science and Engineering, BRAC University, Bangladesh.
[4] Goutam Kumar Saha, Parsing Bengali Text: An Intelligent Approach, Scientist-F, Centre for Development of Advanced Computing (CDAC), Kolkata.
42. Cont
[5] Kumar Sanjeeb and Shibi Panikkar, Magic of ASP.Net with C#.
[6] www.C-sharpcorner.com
[7] Ilia Smirnov, Overview of Stemming Algorithms, http://the-smirnovs.org/info/stemming.pdf.
[8] K. M. Azharul Hasan, Al-Mahmud, Amit Mondal, Amit Saha, Recognizing Bangla Grammar Using Predictive Parser, Department of Computer Science and Engineering (CSE), Khulna University of Engineering and Technology (KUET), Khulna-9203, Bangladesh.
[9] J. A. Mahar, H. Shaikh, G. Q. Memon, Model for Sindhi Text Segmentation into Word Tokens, Faculty of Engineering, Science and Technology,
43. Cont
[11] Fahim Muhammad Hasan, Comparison of Different POS Tagging Techniques for Some South Asian Languages, BRAC University, Dhaka, Bangladesh.
[12] Sandipan Sarkar (IBM India) and Sivaji Bandyopadhyay (Computer Science and Engineering Department, Jadavpur University, Kolkata), Design of a Rule-based Stemmer for Natural Language Text in Bengali.
[13] Md. Zahurul Islam, Md. Nizam Uddin and Mumit Khan, A Light Weight Stemmer for Bengali and Its Use in Spelling Checker, Center for Research on Bangla Language Processing, BRAC University, Dhaka, Bangladesh.
[14] Prasenjit Majumder, Mandar Mitra, Swapan K. Parui, Yet Another Suffix Stripper,