Top Banner
A COMPUTATIONAL LINGUISTIC APPROACH FOR METADATA GENERATION FOR HINDI POETRY A Ph.D. Synopsis submitted to GUJARAT TECHNOLOGICAL UNIVERSITY, AHMEDABAD for the Award of Doctor of Philosophy in Computer Science By MILIND KUMAR AUDICHYA 159997431001 Under the Supervision of DR. JATINDERKUMAR RAMDASS SAINI Professor and Director, Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed University), Pune, India. DPC Member 1 DR. RAVI M. GULATI Associate Professor, Veer Narmad South Gujarat University, Surat, India. DPC Member 2 DR. RUSTOM D. MORENA Professor, Veer Narmad South Gujarat University, Surat, India.
19

A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

Feb 17, 2023

Download

Documents

Khang Minh
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

A COMPUTATIONAL LINGUISTIC APPROACH FOR

METADATA GENERATION FOR HINDI POETRY

A Ph.D. Synopsis submitted to

GUJARAT TECHNOLOGICAL UNIVERSITY,

AHMEDABAD

for the Award of

Doctor of Philosophy

in

Computer Science

By

MILIND KUMAR AUDICHYA

159997431001

Under the Supervision of

DR. JATINDERKUMAR RAMDASS SAINI

Professor and Director, Symbiosis Institute of Computer Studies and Research,

Symbiosis International (Deemed University), Pune, India.

DPC Member 1

DR. RAVI M. GULATI

Associate Professor,Veer Narmad South Gujarat University,

Surat, India.

DPC Member 2

DR. RUSTOM D. MORENA

Professor,Veer Narmad South Gujarat University,

Surat, India.

Page 2: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

2

Table of Contents

A. Title of the thesis and abstract .......................................................................................... 3

a. Title ........................................................................................................................................... 3

b. Abstract ..................................................................................................................................... 3

B. A brief description of the state of the art of the research topic ....................................... 5

C. Motivation.......................................................................................................................... 7

D. Definition of the Problem ................................................................................................. 7

E. Objective and Scope of work ............................................................................................. 8

F. Original contribution by the thesis. .................................................................................. 8

G. The methodology of Research, Results / Comparisons ................................................... 9

H. Achievements concerning objectives .............................................................................. 16

I. Conclusion ....................................................................................................................... 17

J. Copies of papers published and a list of all publications arising from the thesis ........ 17

K. Patents (if any) ................................................................................................................ 18

L. References ....................................................................................................................... 18

Page 3: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

3

A. Title of the thesis and abstract

a. Title

A Computational Linguistic Approach For Metadata Generation For Hindi Poetry

b. Abstract

Poems are an unbreakable part of literature. Many poems have been written and

are being written in Hindi literature as well. Hindi poems are mainly based on verses

(‘Chhand’), the figure of speech (‘Alankar’), and emotions (‘Ras’). This trio itself is a

research series in terms of research. This research work focuses explicitly on the verses.

The composition of verses in Hindi literature requires the knowledge of special rules.

These rules of verses are gradually becoming extinct. People who know the

composition of the verses are gradually becoming extinct, and there is no official

standard of the rules of the verses except the well-known verses. How much better will

it be if a poem inputted to a computer and the computer identifies that this poem is

written using this verse? It sounds great, isn’t it? Well, this research work is trying to

do something beyond that.

In various research of languages by natural language processing (NLP), Hindi

has an important place. However, the verses are still untouched in this context. From

the point of view of computational linguistics (CL), knowledge of the extinct verses

can be saved. This research is the best combination of the NLP and CL, wherein NLP

one can do wonders using human language, and in CL, computers are used to

understand the languages. Here both are performing their duties very well to establish

an automatic metadata generator of poetries based on the different rules of Hindi verses.

As the research work is an initial one and unique in this segment, it opens a new wing

in the research domain for upcoming researchers.

The research proposes a systematic hierarchical structure of Hindi verses and

the proper management of the different composition rules of verses after identifying

and validating the rules. This research work mainly revolves around the rule-based

modeling of an automatic metadata generator based on the well-managed verse rules.

Apart from that, the proposed metadata generator is robust and sufficient to handle both

old and new poems. From the implementation and application perspective, for a better

Page 4: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

4

understanding of poem words, meaning and examples are suggested with the existing

wordnet integration. Stop word filtering, symbolic representation, and numerical

sequence with verse identification make this research work more innovatively

beautiful. A similar example suggestion for the detected verse is the cherry on the

writers’ community’s cake feature. With this all, the primary and regular metadata is

essential, including Character Count, Word Count, Diacritic Count, Total Laghu Matra,

Total Guru Matra, Total Half Characters, Line Count, Charan Count, Matra Count,

Varna Count, Charan Matra Count, Charan Matra Sum, Charna Varna, Charan Varna

Sum. One can now imagine this research work’s beauty and its value for the research

fraternity.

This research work is not associated with the development of an algorithm to

understand or interpret Hindi poems. However, it gives a mechanism for distinguishing

between the different kinds of Hindi verses by identifying them through the rule-based

modeled automatic metadata generator. As a perk, this study provided meanings and

examples of the words used in the poem with the help of Wordnet but also irrespective

of context.

The purpose of the research is to open a new research wing, emphasizing the

significance of verses identification in Hindi poetry. Moreover, one might find this a

meaningful application for learning about the different Hindi verses. Apart from that,

if anyone wants to search any Hindi verse-based poetry in a database or a search engine,

the results are generally delivered through keyword matching; with this traditional

approach, a user might not always get the correct result. Instead, suppose the metadata

generated by this automatic metadata generator gets stored while storing the poems. In

that case, it will help populate the exact results by using that metadata to the users, and

it will also provide users their specific interest-based result filtering features. It is just

one of the applications of this research. Additionally, research in this segment may

improve our existing knowledge by bringing out the high intellect hidden behind the

Hindi verses’ composition rules.

Page 5: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

B. A brief description of the state of the art of the research topic

Hindi is one of the constitutionally accepted official languages of the Republic

of India. Devanagari script is used to write the Hindi language and is also used for over

120 languages[1]. The Devanagari Unicode range for Hindi is 0900–097F [2]. To know

state of the art with an aspect to Hindi verses based researches, an extensive literature

review was carried out.

Since the 1950s, Natural Language Processing (NLP) and Computational

Linguistics (CL) researches are going parallelly to deliver and serve society something

extraordinary[3]. Usually, both are often considered a grouped bundled part of artificial

intelligence (AI), but both were present even before artificial intelligence came into the

picture[4]. In this combination of NLP and CL, both computers and human languages

are equally important.

Poems, written in any language, give a different experience in itself. Many

Writers follow specific predefined rules for writing poems so that the joy of those

poems is increased. Many poems have been written in the Hindi language for centuries.

A literature review is a fundamental part of any research work. For the same, best efforts

had been given to find out the relevant research works. Research work directly related

to the metadata generation, related to the Hindi Prosody, and precisely based on the

Computation Linguistics not found.

Some nearby research work related to computational linguistics or metadata

generation was seen, which is enlightening here. Only two slightly nearby research

works were found in the recent literature review. Joshi and Kushwah[5] did research

emphasizing the detection of “Chaupai Chhand” in Hindi Poems and achieved 95.03%

accuracy using an automatic Chaupai detection algorithm developed in Java (JDK 8.1).

In another research, Kushwah and Joshi[6] researched “Rola: An Equi-Matrik Chhand

of Hindi Poems” and achieved 89.83% accuracy through their algorithm.

In some other poetry classification-related works Bafna and Saini[7-9] classify

Hindi verses such as Bal geet, Bhajan, Desh Bhakti, and Updesh Geet using different

machine learning algorithms. Kaur and Saini[10] researched Punjabi poetry

classification tests on the ten machine learning algorithms on 240 poetries. They

achieved respective accuracy of 50.63% (Hyperpipes), 52.92% (K- nearest neighbor),

52.75% (Naive Bayes), and 58.79% (Support Vector Machine). In another research

work, Kaur and Saini[11] worked for poetry classification using machine learning

Page 6: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

6

algorithms with the reduced feature set and found that with 60% accuracy Naive Bayes

performed the best.

Kaur and Saini[12] worked for automatic Punjabi poetry classification using

poetic features on 2034 poetries and achieved 71.98% accuracy with the Support Vector

Machine (SVM) algorithm. Kaur and Saini[13-14] also attempted to develop a Punjabi

poetry classifier using logistics features and weighting in different research works and

tested using different algorithms with different textual features.

Some ‘Ras’ related works were also seen. Kaur and Saini[12] worked for

automatic Punjabi poetry classification using poetic features on 2034 poetries and

achieved 71.98% accuracy with the Support Vector Machine (SVM) algorithm. Kaur

and Saini[13-14] also attempted to develop a Punjabi poetry classifier using logistics

features and weighting in different research works and tested using different algorithms

with different textual features.

Bafna and Saini[17] also worked for Hindi and Marathi poems and stories token

extraction applications using Zipf’s law to count and compression extracted tokens.

They used 820 poems, 710 stories for Hindi and 505 poems, and 610 stories for Marathi.

Bafna and Saini[18] worked for a technique, “BaSa” to identify context-based common

tokens for Hindi verses and prose. Total 820 proses and 710 verses were processed.

In other works of other related languages, Sanskrit Computational Linguistics

related work by Kulkarni and Huet[19] were seen and observed. They worked on

Panini’s Astadhyayi structure and parsing related issues and machine translation. Han

et al.[20] researched the rule-based metadata extraction from documents. Different

research works related to computational linguistics based metadata generation were

explored [21-23]. Ghneim et al.[24] did extensive research work for Arabic poetry

based on gender, dialect, and age. Hamidi et al.[25] worked for the automatic meter

classification in Persian poetries. He et al.[26] worked with SVM based classification

methods for poetry styles. Abbasi et al.[27] did sentiment analysis in multiple

languages. Kaur and Saini[28] also studied opinion mining in Indo-Aryan, Dravidian,

and Tibeto-Burman language families.

After research-level articles, efforts were made to find the facts and information

in old ancient books also. Some ancient books published almost before 50 years were

found in the archive of the web in which we found some information regarding

Chhands, And we found that all the books were coming from a base book named

‘Chhand Prabhakar’ written by Jagannath Prasad[29-31].

Page 7: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

7

‘Chhands’ are one of the six Vedangs, which are auxiliary disciplines

associated with Vedas’ study in Vedic culture, developed in ancient times. We also

found the presence of information related to Chhands in Agni Purana[32], one of the

eighteen Puranas. The core information related to Chhands and other writing arts can

be seen in Agni Purana from Chapter 328 to 347. All this information is in Sanskrit,

which is an old Indo-Aryan language. Chhandsharstra, which provides knowledge of

creating Chhands, which is discussed and referenced in Vedas and Puranas, was created

by a saint and great old mathematician named Pingal is also known as Pingalacharya

and Sheshavtar.

Apart from the research kind of publication, some findings from the ancient

books and some information related to the ‘Chhands’ only can be obtained from the

various books, blogs, and experts contributions (Dr. G. Wankhede, Maharashtra and

Dr. G. R. Dudwe, Madhya Pradesh) [33-35]. The information found from these sources

vary, or we can say they are contradictory [36-37].

Despite all the research efforts so far, the authors could not discover any

research related to the verses identification in Hindi poetry. From the aspect of

computational linguistics, this lack of research motivated us to initialize this research

work.

C. Motivation

After a thorough literature review, it was observed that Hindi has a rich heritage

of poetry. Many verses are written in the Hindi language. This composition of verses

in Hindi has been from ancient times. The knowledge hidden behind the rules of the

creation of verses is genuinely inspiring. The metadata generation based on Hindi

Poetry can improve the current search engine results based on metadata information

based filtered search queries. Also, it can be used for the advancement of digital

libraries. However, not much research work was seen in this area, and this research gap

inspired us to do this research work.

D. Definition of the Problem

The official languages used by the Central Government of India are Hindi and

English. Hindi is one of the 23 constitutionally recognized official languages in India.

The Hindi language is having a vast heritage of poetry. A significant portion of poetry

is available in the form of physical books, magazines, and papers. Physical Hindi poetry

Page 8: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

8

will be digitized today or tomorrow as research works to digitize the offline physical

data at a good pace using Optical Character Recognition (OCR), which helps store data

through the higher quality of document imaging. Some advanced OCR based research

works even claim to extract text out of it. Here comes the need for the hour; there is a

need for a particular automatic system to generate the metadata about these poems

individually to manage and digitize the Hindi poetry systematically.

In this aspect, the problem is developing a strong rule-based automatic metadata

generator that can identify the poetry based on the existing rules and capable enough to

incorporate the new practices in upcoming times with no or minimal efforts.

E. Objective and Scope of work

● To come out with a systematic hierarchical structure of Hindi verses.

● To model the metadata generator through computational linguistics perspective

rule-based modeling of Hindi verses construction rules for Hindi poetry.

● To develop a corpus for Hindi poetry

F. Original contribution by the thesis.

The entire work mentioned in this synopsis is innovative and novel work, with

the research papers as evidence. The proposed automatic metadata generator has been

visualized as a collection of various functionalities with the relevant publications. The

details of the associated research papers are as follow:

Papers Presented and Published:

1. Computational linguistic prosody rule-based unified technique for automatic

metadata generation for Hindi poetry, IEEE - 2019 1st International Conference

on Advances in Information Technology (ICAIT), Chikmagalur, India, 2019

2. Stanza Type Identification using Systematization of Versification System of

Hindi Poetry, International Journal of Advanced Computer Science and

Applications (IJACSA), 2021

3. Towards Natural Language Processing with Figures of Speech in Hindi Poetry.

International Journal of Advanced Computer Science and Applications

(IJACSA), 2021

Page 9: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

9

G. The methodology of Research, Results / Comparisons

a. ‘Chhand’ Introduction

The research work is majorly based on the Hindi verses, and there are plenty of

rules for the construction of verse associated with the different Hindi verses. Each verse

is having its own unique rule of formation. Hindi verses are based on plenty of rules. It

is not possible to include all the rules in the synopsis hence just focusing on the Part of

Verse, Types of Verse, and Classification and Identification related parts only.

Character, No. of Characters and their sequence, diacritics count, and quantity

and pause-speed related rules create a Verse which is known as ‘Chhand’. There are six

major parts of ‘Chhand’, which are: 1. Stanza, 2. Characters and Quantity, 3. Flow, 4.

Pause, 5. End of Stanza, 6. Predefined Sequence of Characters.

There are two types of ‘Chhands’ which are widely known as 1. ‘Vedic Chhand’

and 2. ‘Laukik Chhand’. This research work is dealing with the second type, ‘Laukik

Chhand’. Classification of verses for this research is based on the Hindi’ Laukik

Chhand’, There are three main streams of this type, which are known as: 1. ‘Matrik

Chhand’, 2. ‘Varnik Chhand’, 3. ‘Mukt’ or ‘Muktak Chhand’. Chhands in which Matra

(Quantity) Count is fixed are called Matrik Chhand. In Varnik Chhand, all stanza will

be having the same number of characters, along with the sequence for the characters

based on the different rules. ‘Mukt Chhand’ doesn’t have any specific rule like ‘Varnik’

or ‘Matrik’. This type was introduced later and can be said to be free-form. Including

these mainstream class types, In this research work, 111 different Chhand classes were

included, which are having specific appropriate rules. These three stream classes are

further divided into several classes and subclasses, which are denoted as subtypes and

subtypes of subtypes. Although not each subtype is having further subtypes or ‘Chhand’

but this research is structured in such a way that in case if in any class any new kind of

‘Chhand’ is found or will be found, then it will not require much effort to integrate with

the system.

Based on the systemized rules, the best attempts were made to provide the best

possible concrete concept for the automatic metadata generation for Hindi poetries by

incorporating the rules-based unified modeling. Based on the systemized rules, the

best attempts were made to provide the best possible concrete concept for the

automatic metadata generation for Hindi poetries by incorporating the rules-based

unified modeling.

Page 10: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

10

b. Data Preprocessing

UTF-8 standard encoded data for Devanagari was chosen to work with as input

for automatic metadata generator to use and integrate with the latest technologies and

will be more comfortable in the future. The automatic metadata generator expects the

information in the form of the UTF-8 based complete poems or a few lines or part of

poems. These lines are further processed for the pre-processing of the input for passing

it for further calculations after some necessary trimming and cleaning operations. Once

the pre-processed data is ready after the trimming operation, the cleaned data can be

passed for the next separation-related operations. Multiple levels of the separations

take place as per the demand of this research work. Initially, the separation of data is

for the line-level break based on the new line character (‘\n’) as the standard delimiter

for separating the lines.

After the separation of lines, the lines need to be further split into the parts known as

stanzas or ‘Charans’ of the verse. Separation uses a few delimiters (‘,’,’ || ‘, ‘| ‘). Any

other delimiter can also be used with a few minor changes. Once ‘Charan’ gets

separated, the separate stanzas are further chopped into the words by performing word-

level separation. Separated words further get divided into the characters and diacritics.

c. Identification and Detection of ‘Chhand’ based on the classification

Let us understand the concept of identification and detection with an example for in-

depth conceptual clarity. Here is an example of a Hindi verse named ‘Doha’ from

‘Hanuman Chalisa’ written by Saint Tulsidas.

पवनतनय संकट हरन, मंगल मरूित �प ।

राम लखन सीता सिहत, �दय बसह� सुर भपू ।।

These lines are inputted into the metadata generator, and the lines will go through the

trimming operation before the separation operation takes place.

Line Separation: (2 Lines)

Line 1: 'पवनतनय संकट हरन, मंगल मरूित �प',

Line 2: 'राम लखन सीता सिहत, �दय बसह� सुर भपू’

Stanza / ‘Charan’ Separation: (4 Stanzas)

Stanza 1: 'पवनतनय संकट हरन'

Stanza 2: 'मंगल मरूित �प'

Stanza 3: 'राम लखन सीता सिहत'

Stanza 4: '�दय बसह� सुर भपू'

Page 11: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

11

Words Separation: (14 Words)

1. 'पवनतनय', 2. 'संकट', 3. 'हरन', 4. 'मंगल', 5. 'मरूित', 6. '�प', 7. 'राम',

8. 'लखन', 9. 'सीता', 10. 'सिहत', 11. '�दय', 12. 'बसह�', 13. 'सुर', 14. 'भपू'

Stanza Wise Words Separation:

1. i.'पवनतनय', ii.'संकट', iii.'हरन'

2. i.'मंगल', ii.'मरूित', iii.'�प',

3. i.'राम', ii.'लखन', iii.'सीता', iv.'सिहत'

4. i.'�दय', ii.'बसह�', iii.'सुर', iv'भपू'

Character Wise Separation: (53 Characters) :

'प','व','न','त','न','य','स','◌'ं,'क','ट','ह','र','न','म','◌ं','ग','ल','म','◌'ू,'र','त','ि◌','र','◌'ू,'प','र','◌ा,'म','ल','ख','न','स',

'◌ी','त','◌ा','स','ह','ि◌','त','ह','◌'ृ,'द','य','ब','स','ह','◌'ु,'स','◌'ु,'र','भ','◌'ू,'प'

After the separation, the Matra Allocation takes place based on the predefined

rule, including some special and exceptional rules which may vary as per the provided

input’s characters and related to different possibilities.

The ‘Matra’ allocation and the ‘Matra’ Count will be used further to detect verse

after the rule-based modeling of verse rules automatically. The ‘Matra’ allocation is

also used after a few modifications and merging for character sequence mapping,

specifically for ‘Varnik’ verses. After this allocation and ‘Matra’ counting, the input

passes through the different set of rule-based methods specifically designed for the

specific verse-based rules. Each verse follows its own unique set of rules. In the

discussed Example of ‘Doha’ the rule for the formation of ‘Doha’ is getting fulfilled

as the combination of the odd-even stanza matra count is 13-11,13-11. Similarly, other

‘Chhands’ have different rules based on those rules’ Chhands’ are classified and

detected. The fig.1 showing the core idea of the whole process of detection and

identification of the classified ‘Chhands’.

Page 12: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

12

Figure 1. Core Flowchart of Chhand Detection

d. Advanced Muktak Detection

For the advancement concerning research computationally means something

which is not already available. Throughout this research, the need which was felt the

Page 13: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

13

most is that usually, when the ‘Chhand’ detection takes place, the input gets detected

as ‘Muktak’.

In that case, the input gets separated into several parts again until it is possible to till

‘Charan’ level separation. Those separated parts are processed again from scratch as

separate input and from which the results get recorded. At last, the Metadata Generator

can recognize that in ‘Muktak’ verses, also it tries to find out if any part of the ‘Muktak’

verses is using any ‘Matrik’ or ‘Varnik’ verse rules or not. If any rules were used, then

that ‘Chhand’ will also be detected. That can be one or more than one ‘Chhand’ rules

combination also.

e. Stopwords Filtering

In this research work, stop words filtering is also managed very well using the

existing list of Hindi stopwords from a hybrid research work [38], specifically on the

Hindi’s stop words. Filtering stop words helps reduce the processing and save time

when in further stages of wordnet integration.

f. Meanings and Example of words using wordnet integration

With the use of the best available Hindi Wordnet named ‘pyiwn’ the meaning of the

words and examples of those particular words are populated as a part of the metadata

while generating metadata [39]. The benefit of this integration is to make people

understand the words better with the other Example of the uses of the words. Even

though the meaning of the word is not as per the context of the use of the word but it

will still help users understand the meaning of the word. Apart from this, the integration

is helpful for the betterment of this Wordnet as it also generates the list of words that

are not found in Wordnet while searching for the meaning of the words.

g. Example Suggestion of the detected Chhand

In this part of the metadata generator, a mechanism using JavaScript Object

Notation (JSON) file is implemented. In the JSON file, the examples are stored in the

form of Key-Value pairs, and whenever any type of ‘Chhand’ is detected, the same key

is used to access the Example of that type of ‘Chhand’ example. It will be very helpful

for those who are interested in that type of ‘Chhand’ and want to explore more about

the same.

h. Alankar Detection Integration

‘Alankars’ were also identified and structured. Total 58 classification types of

Alankars were found, out of which 3 classes can be identified using this metadata

generator which are (‘Shabdalankar’, ‘Anupras’, ‘Punrukti’).

Page 14: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

14

Apart from these efforts were made to classify 4 more classes named

(‘Arthalankar’, ‘Yamak’, ‘Utpeksha’, ‘Upma’). All 4 were also implemented but as

context based meaning of word is required and that’s still a challenge in Hindi NLP,

these classes can't be said or claimed to be truly classified with this automatic metadata

generator.

i. Several Additional Helpful Utilities

As a part of the research, while carrying out this research, the needs of the

common utilities were felt, which did not exist but were required. Data collection utility

was created for the collection of the data to create a systematic data corpus. Later on, a

bulk collection utility for the data collection was also developed through which bulk

data can be processed and stored via text and comma-separated value (CSV) files. To

the research progress status, a utility for the research stats generation was also designed.

j. Results / Comparisons

The results are always the crucial part of any research study. Before proceeding

with the results, one thing to enlighten here is as not much-existing work was found in

this area, so comparing research work and the results is not possible. Any exactly

similar work related to the classification and identification of Hindi verses was not

found, and that only makes this work unique and novel.

Only two nearby research works were found. The first one was based on

identifying ‘Chaupai’ with 95.03% accuracy, while this research assures 97.09%

accuracy. Similarly, the second research work was based on ‘Rola’ with 89.83%

accuracy, where this research is providing 95.24% accuracy.

The automatic metadata generator is modeled based on the different rules of the

111 ‘Chhands’ classification types, subtypes, and subtypes of subtypes. The 53

different types of ‘Chhands’ data were collected for testing and validation. Each class

was having at least 20 and a maximum of 310 records based on the data’s availability.

From various sources Total of 3330 records were found and tested, out of which 3195

records were detected successfully.

The final accuracy rate on the basis of the results is 94.98%, and the failure rate

is 05.02%. Fig 2. Graph of the ‘Chhand’ data found, detected, and not found along with

the individual accuracy and failure rates of different ‘Chhands’ showing that the various

‘Chhands’ accuracy rate range is between 84% to 100%, and the failure rate is between

0% to 16%.

Page 15: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

15

Out of all the 53 different types of ‘Chhands’ data, only 5 ‘Chhands’

(‘Chhappay’, ‘Chopaiya’, ‘Shankar’, ‘Sar’, ‘Tantak’) were having an accuracy rate

below 90%, and the rest (48) all were between 90-100%, and in that also 29

(‘Roopmala’, ‘Rola’, ‘Panchchamar’, ‘Doha’, ‘Dodhak’, ‘Manoram’, ‘Nidhi’,

‘Harigitika’, ‘Durmil Savaiya’, ‘Ullala (Sam Matrik)’, ‘Chandrika’, ‘Dhar’, ‘Barvai’,

‘Sur Dhanakshari’, ‘Pramanika’, ‘Chaupai’, ‘Piyushvarsh’, ‘Sortha’, ‘Trotak’,

‘Malini’, ‘Bhujangi’, ‘Vanshasth’, ‘Drutvilambit’, ‘Mandakranta’, ‘Bhujangprayag’,

‘Aansu’, ‘Janak’, ‘Tilka’, ‘Muktak’) were having 95 or more than 95-100% accuracy

rate. The lowest accuracy percentage, 84% was detected for ‘Chhapay Chhand’, which

is made up of a combination of two different ‘Chhands’, that makes its formation and

identification complex, and due to that, only the problems occur more. The best

performing ‘Chhands’ with 100% accuracy rate was ‘Aansu’, ‘Bhujangprayag’,

‘Janak’, ‘Mandakranta’, ‘Muktak’ and ‘Tilka’.

The research is not limited to the already mentioned 'Chhands'; 227 records of

50 more 'Chhand' types were modeled and tested based on these records, for which the

data for the same were between 1 to 15 records of each. Out of 50 'Chhands', 34

'Chhands' ('Gitika / Chanchri / Charchari','Sundari / Madhavi Savaiya', 'Chakor

Savaiya', 'Sukhi Savaiya', 'Arsat Savaiya', 'Lavanglata Savaiya', 'Mukthara Savaiya',

'Vam Savaiya', 'Mod Savaiya', 'Shuddh Gita', 'Gaganangana', 'Lavni', 'Madhumalti',

'Vijat', 'Janharan Dhanakshari', 'Dev Dhanakshari', 'Vijya Dhanakshari / Kamini',

Figure 2. Chhand Detection Results

Page 16: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

16

'Jalharan Dhanakshari', 'Kanak Manjari', 'Giridhari', 'Panktika', 'Mattgayand Savaiya',

'Sumukhi Savaiya', 'Bihari', 'Nil', 'Ullala (Ardh Sam Matrik)', 'Shikhrini', 'Arvind

Savaiya', 'Muktamani', 'Indira', 'Gath', 'Damru Dhanakshari', 'Asabandha',

'Padhyamala') were having 1 to 5 examples for each, 9 'Chhands'

('Kusumasamudita','Dhuni','Pavan','Chanchala','Udiyana','Kaamrup','Chanchrik/Haripr

iya', 'Kanthi', 'Veer / Aalha / Matrik Savaiya') were having 6 to 10 examples each, the

remaining seven ('Shalini','Kirit Savaiya','Shakti', 'Indravajra', 'Ghanshyam',

'Upendravajra', 'Pavan') were having 11-15 records for each type. All the records were

detected successfully to their associated 'Chhand' types.

The primary three classes('Matrik Chhand', 'Varnik Chhand', 'Muktak / Mukt

Chhand') were incorporated, along with their associative further six subclasses ('Sam

Matrik', 'Ardh-Sam Matrik', 'Visham Matrik', 'Sam Varnik', 'Ardh Sam Varnik',

'Visham Varnik') were also included. However, any 'Chhand' type rules or examples

under two of the subclasses ('Ardh Sam Varnik', 'Visham Varnik') were not found, to

maintain the hierarchical structure and with a vision of upcoming times, these two were

also added so in upcoming times if any 'Chhand' is found in these subclasses than that

can be included quickly.

After analyzing all the results, it was found out that the lengthy data inputs took

more time. ‘Chhands’, which have a few numbers of lines, words, or characters, gets

detected faster than the ‘Chhands’, which have a more number of lines, words, and

characters.

H. Achievements concerning objectives

● A hierarchical structure of Hindi verses constructed.

● The construction rules of Hindi verses were identified and validated manually.

● An automatic metadata generator based on Hindi verse’s rule-based modelling

with a computational linguistics perspective is developed.

● A Hindi poetry corpus based on the Hindi verses is ready, and utility to collect

data systematically for corpus creation for Hindi Poetry is developed.

● Apart from the core objectives, To explore the ‘Alankar’ detection, a separate

module for ‘Alankar’ detection was also developed. Similar to ‘Chhands’, the

hierarchical structure was constructed.

Page 17: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

17

I. Conclusion

To conclude, the Hindi verse hierarchy was taxonomically structured initially

to begin the research. Furthermore, ‘Chhand’ rules were identified, verified, and

appropriately structured with the aspect of the computational linguistics based research

works, which are managed very well by this research work. ‘Chhands’ were classified

in the standard classes for better and smooth hierarchical management in this research

work. It was experienced that ‘Chhand’ detection based on the rules is a complex

process. ‘Chhands’ made up of complicated rules, takes more time to detect.

Special exception rules slow down the execution time as it takes more time to

process and check the data. ‘Chhands’ made up of the combination of the one or more

‘Chhand’ rules, also takes more time as they need to be gone through more than once

while detection and metadata generation processing for the detection.

Additionally, this research can extract stop words by filtering from the existing

list of stop words. The meaning and example uses of the words can be suggested with

Wordnet integration, which helps understand the poems better. The words which were

still not included in the Wordnet are also filtered. After carrying out this much research

work, it can be powerfully conveyed that systematized Hindi Prosody Rules and the

concept of automatic metadata generation for Hindi poetry are capable enough to

generate meaningful metadata automatically.

The research work done so far is sufficient to open a new path for upcoming

researchers to think of some other relevant aspects and further contribute to Natural

Language Processing and Computational linguistics research domain.

J. List of all publications arising from the thesis

1. M. K. Audichya and J. R. Saini, “Computational linguistic prosody rule-based unified technique for automatic metadata generation for Hindi poetry,” 2019 1st International Conference on Advances in Information Technology (ICAIT), Jul. 2019, doi: 10.1109/icait47043.2019.8987239.

2. M. K. Audichya and J. R. Saini, “Stanza Type Identification using Systematization of Versification System of Hindi Poetry,” International Journal of Advanced Computer Science and Applications, vol. 12, no. 1, 2021, doi: 10.14569/ijacsa.2021.0120117.

3. M. K. Audichya and J. R. Saini, “Towards Natural Language Processing with Figures of Speech in Hindi Poetry,” International Journal of Advanced Computer Science and Applications, vol. 12, no. 3, 2021, doi: 10.14569/ijacsa.2021.0120316.

Page 18: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

18

K. Patents (if any)

Nil

L. References

[1] Wikipedia Contributors, “Hindi,” Wikipedia, 04-May-2019. [Online]. Available:

https://en.wikipedia.org/wiki/Hindi

[2] “Unicode 13.0.0,” Unicode Inc, 22-Jan-2021. [Online]. Available:

https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf. [Accessed: 22-Jan-2021]

[3] Wikipedia Contributors, “Natural language processing,” Wikipedia, 28-Jun-2019. [Online]. Available:

https://en.wikipedia.org/wiki/Natural_language_processing

[4] Wikipedia Contributors, “Computational linguistics,” Wikipedia, 27-Oct-2019. [Online]. Available:

https://en.wikipedia.org/wiki/Computational_linguistics

[5] B. K. Joshi and K. K. Kushwah, “A Novel Approach to Automatic Detection of Chaupai Chhand in Hindi

Poems,” IEEE Xplore, 01-Sep-2018. [Online]. Available: https://ieeexplore.ieee.org/document/8675052

[6] K. K. Kushwah and B. K. Joshi, “Rola: An Equi-Matrik Chhand of Hindi Poems,” Journal of Computer Science

and Information Security (IJCSIS), vol. 15, no. 3, Mar. 2017 [Online]. Available:

https://www.academia.edu/32482898/Rola_An_Equi_Matrik_Chhand_of_Hindi_Poems. [Accessed: 22-Jan-2021]

[7] P. B. Bafna and J. R. Saini, “On Exhaustive Evaluation of Eager Machine Learning Algorithms for Classification

of Hindi Verses,” International Journal of Advanced Computer Science and Applications, vol. 11, no. 2, 2020, doi:

10.14569/ijacsa.2020.0110224.

[8] P. Bafna and J. R. Saini, “Hindi Poetry Classification using Eager Supervised Machine Learning Algorithms,”

IEEE Xplore, 01-Mar-2020. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9167632

[9] P. B. Bafna and J. R. Saini, “Hindi Verse Class Predictor using Concept Learning Algorithms,” IEEE Xplore,

01-Mar-2020. [Online]. Available: https://ieeexplore.ieee.org/document/9074850

[10] J. Kaur and J. R. Saini, “Punjabi Poetry Classification,” Proceedings of the 9th International Conference on

Machine Learning and Computing - ICMLC 2017, pp. 1–5, 2017, doi: 10.1145/3055635.3056589.

[11] J. Kaur and J. R. Saini, “Automatic Punjabi poetry classification using machine learning algorithms with

reduced feature set,” International Journal of Artificial Intelligence and Soft Computing, vol. 5, no. 4, p. 311, 2016,

doi: 10.1504/ijaisc.2016.081353.

[12] J. Kaur and J. R. Saini, “Automatic classification of Punjabi poetries using poetic features,” International

Journal of Computational Intelligence Studies, vol. 7, no. 2, p. 124, 2018, doi: 10.1504/ijcistudies.2018.094901.

[13] J. Kaur and J. R. Saini, “PuPoCl: Development of Punjabi Poetry Classifier Using Linguistic Features and

Weighting,” INFOCOMP Journal of Computer Science, vol. 16, no. 1–2, pp. 1–7, Dec. 2017 [Online]. Available:

https://infocomp.dcc.ufla.br/index.php/infocomp/article/view/546

[14] J. Kaur and J. R. Saini, “Designing Punjabi Poetry Classifiers Using Machine Learning and Different Textual

Features,” The International Arab Journal of Information Technology, pp. 38–44, Jan. 2019, doi:

10.34028/iajit/17/1/5.

[15] K. Pal and B. V. Patel, “Model for Classification of Poems in Hindi Language Based on Ras,” Smart Systems

and IoT: Innovations in Computing, pp. 655–661, Oct. 2019, doi: 10.1007/978-981-13-8406-6_62.

[16] J. R. Saini and J. Kaur, “Kāvi: An Annotated Corpus of Punjabi Poetry with Emotion Detection Based on

‘Navrasa,’” Procedia Computer Science, vol. 167, pp. 1220–1229, Jan. 2020, doi: 10.1016/j.procs.2020.03.436.

[Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050920309030

[17] P. B. Bafna and J. R., “An Application of Zipf’s Law for Prose and Verse Corpora Neutrality for Hindi and

Marathi Languages,” International Journal of Advanced Computer Science and Applications, vol. 11, no. 3, 2020,

doi: 10.14569/ijacsa.2020.0110331.

[18] P. B. Bafna and J. R. Saini, “BaSa: A Technique to Identify Context based Common Tokens for Hindi Verses

and Proses,” 2020 International Conference for Emerging Technology (INCET), Jun. 2020, doi:

10.1109/incet49848.2020.9154124.

[19] A. Kulkarni and G. Huet, Eds., Sanskrit Computational Linguistics: Third International Symposium,

Hyderabad, India, January 15-17, 2009. Proceedings. Berlin Heidelberg: Springer-Verlag, 2009 [Online].

Available: https://www.springer.com/gp/book/9783540938842

[20] H. Han, E. Manavoglu, H. Zha, K. Tsioutsiouliklis, C. L. Giles, and X. Zhang, “Rule-based word clustering for

document metadata extraction,” Proceedings of the 2005 ACM symposium on Applied computing - SAC ’05, 2005,

doi: 10.1145/1066677.1066917.

[21] X. Yu et al., “Using Automatic Metadata Extraction to Build a Structured Syllabus Repository,” Asian Digital

Libraries. Looking Back 10 Years and Forging New Frontiers, pp. 337–346, doi: 10.1007/978-3-540-77094-7_43.

Page 19: A COMPUTATIONAL LINGUISTIC APPROACH FOR ... - Amazon AWS

19

[22] M.-T. . Sagri and D. Tiscornia, “Metadata for content description in legal information,” IEEE Xplore, 01-Sep-

2003. [Online]. Available: http://ieeexplore.ieee.org/document/1232110. [Accessed: 18-Sep-2020]

[23] J. L. Klavans et al., “Computational linguistics for metadata building (CLiMB): using text mining for the

automatic identification, categorization, and disambiguation of subject terms for image metadata,” Multimedia Tools

and Applications, vol. 42, no. 1, pp. 115–138, Nov. 2008, doi: 10.1007/s11042-008-0253-9.

[24] N. Ghneim, O. Alsharif, and D. Alshamaa, “Emotion Classification in Arabic Poetry using Machine Learning,”

International Journal of Computer Applications, vol. 65, no. 16, pp. 10–15, 2013 [Online]. Available:

https://www.ijcaonline.org/archives/volume65/number16/11006-6300. [Accessed: 23-Jan-2021]

[25] S. Hamidi, F. Razzazi, and M. P. Ghaemmaghami, “Automatic meter classification in Persian poetries using

support vector machines,” IEEE Xplore, 01-Dec-2009. [Online]. Available:

https://ieeexplore.ieee.org/document/5407514

[26] Z.-S. He, W.-T. Liang, L.-Y. Li, and Y.-F. Tian, “SVM-Based Classification Method for Poetry Style,” IEEE

Xplore, 01-Aug-2007. [Online]. Available: https://ieeexplore.ieee.org/document/4370650

[27] A. Abbasi, H. Chen, and A. Salem, “Sentiment analysis in multiple languages,” ACM Transactions on

Information Systems, vol. 26, no. 3, pp. 1–34, Jun. 2008, doi: 10.1145/1361684.1361685.

[28] J. Kaur and J. R. Saini, “A Study and Analysis of Opinion Mining Research in Indo-Aryan, Dravidian and

Tibeto-Burman Language Families,” Indian Journals, vol. 4, no. 2, 2014 [Online]. Available:

http://www.indianjournals.com/ijor.aspx?target=ijor:ijdmet&volume=4&issue=2&article=002. [Accessed: 23-Jan-

2021]

[29] Jagarnath Prasad, Chhand Sarawali. Jagarnath Press Bilaspur, 1917 [Online]. Available:

https://archive.org/details/in.ernet.dli.2015.478241. [Accessed: 25-Jan-2021]

[30] J. Prasad, Chhand Prabhakar. Jagarnath Press Bilaspur, 1935 [Online]. Available:

https://archive.org/details/in.ernet.dli.2015.322488/. [Accessed: 25-Jan-2021]

[31] N. Dass, Hindi Chhandolakshan, 2nd ed. Vani Prakashan Publisher, 2005.

[32] “स.ंअि�न पुराण | गीता�ेस, गोरखपुर.” [Online]. Available: https://book.gitapress.org/product-style-5/gita-press-570/

[33] “भारतीय छंद िवधान,” www.openbooksonline.com. [Online]. Available:

http://www.openbooksonline.com/groups/group/show?groupUrl=chhand. [Accessed: 25-Jan-2021]

[34] “छ�द - भारतकोश, �ान का िह�दी महासागर,” Bharatdiscovery.org, 2021. [Online]. Available:

https://bharatdiscovery.org/india/%E0%A4%9B%E0%A4%A8%E0%A5%8D%E0%A4%A6#gsc.tab=0.

[Accessed: 25-Jan-2021]

[35] “िह�दी सािह�य का�य संकलन – िह�दी क� किवता, लघुगीत, बालगीत, िफ�म गीत, गजल, शायरी, धािम�क लोकगीत का िवशाल संकलन.” [Online]. Available:

https://www.hindisahitya.org/. [Accessed: 25-Jan-2021]

[36] “Chhand In Hindi-छंद क� प�रभाषा, भेद और उदाहरण,” hindimeaning.com, 06-Feb-2018. [Online]. Available:

https://www.hindimeaning.com/2018/02/chhand-in-hindi.html

[37] “�व�थ िह�दी समाज: भुजंगी छंद (सािहल),” �व�थ िह�दी समाज, 13-May-2018. [Online]. Available:

http://hhindisamaj.blogspot.com/2018/05/blog-post_13.html

[38] V. Jha, M. N, P. D. Shenoy, and V. K R, “Hindi Language Stop Words List,” data.mendeley.com, vol. 1, Apr.

2018, doi: 10.17632/bsr3frvvjc.1. [Online]. Available: https://data.mendeley.com/datasets/bsr3frvvjc/1. [Accessed:

18-Sep-2020]

[39] P. Ritesh, K. Diptesh, and B. Pushpak, “pyiwn: A Python-based API to access Indian Language WordNets,”

Proceedings of the Global WordNet Conference, vol. 2018, 2018.