Sinmin - Corpus for Sinhala Language. Upeksha W. D., Wijayarathna D. G. C. D., Siriwardena M. P., Lasandun K. H. L. Supervisors: Dr. Chinthana Wimalasuriya, Prof. Gihan Dias, Mr. N. H. N. D. de Silva
Page 1: Sinmin final presentation

Sinmin - Corpus for Sinhala Language

Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.

Supervisors:
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. de Silva

Page 2: Sinmin final presentation

Outline

● Introduction
● Crawler Implementation and Design
● Data Cleaning and Tokenizing Mechanisms
● Selecting Data Storage Mechanism
● Data Storage Model of SinMin
● User Interface Design and Implementation
● API Design and Implementation
● Unit Testing
● Performance Testing of the API
● Implemented Sample Usages

Page 3: Sinmin final presentation

What is a Corpus?

“A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010)

Page 4: Sinmin final presentation

Usages of a Corpus

● Implementing translators, spell checkers and grammar checkers.
● Identifying lexical and grammatical features of a language.
● Identifying varieties of a language across contexts of usage and time.
● Retrieving statistical details of a language.
● Providing backend support for tools like OCR, POS taggers, etc.

Page 5: Sinmin final presentation

Sinmin is a corpus for the Sinhala language that

➢ Is continuously updated

➢ Is dynamic (scalable)

➢ Covers a wide range of language (structured and unstructured)

Page 6: Sinmin final presentation

Architecture of Sinmin


Page 7: Sinmin final presentation

Identified Sinhala Resources

News          Academic      Creative Writing    Spoken      Gazette
News Paper    Text books    Fiction             Subtitle    Gazette
News Items    Religious     Blogs
              Wikipedia     Magazine
              Mahawansa

Page 8: Sinmin final presentation

Identified Sinhala Resources


Page 9: Sinmin final presentation

Crawler Implementation and Design


Page 10: Sinmin final presentation

Crawlers are responsible for finding web pages that contain Sinhala content, then fetching, parsing and storing them in a manageable format.
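
A minimal sketch of a "contains Sinhala content" check, assuming a page counts as Sinhala when at least half of its non-whitespace characters fall in the Unicode Sinhala block (U+0D80 to U+0DFF). The function names and the 0.5 threshold are illustrative assumptions, not taken from the Sinmin crawlers.

```python
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF  # Unicode Sinhala block


def sinhala_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Sinhala block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    sinhala = sum(1 for c in chars if SINHALA_START <= ord(c) <= SINHALA_END)
    return sinhala / len(chars)


def looks_sinhala(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep a page when most of its text is Sinhala."""
    return sinhala_ratio(text) >= threshold
```

A crawler could apply such a filter to the extracted body text before parsing and storing a page.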

Page 11: Sinmin final presentation

Crawler Architecture


Page 12: Sinmin final presentation

Sample XML File With One Article Stored In It
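
The sample file itself is not reproduced in this transcript. As a rough illustration only, the sketch below writes one article to XML with Python's standard library. Every element name (`root`, `post`, `link`, `topic`, `date`, `author`, `content`) and the example URL are hypothetical, not the format the Sinmin crawlers actually emit.

```python
import xml.etree.ElementTree as ET


def article_to_xml(link: str, topic: str, date: str, author: str, content: str) -> str:
    """Serialize one crawled article as an XML string (hypothetical schema)."""
    root = ET.Element("root")
    post = ET.SubElement(root, "post")
    fields = (
        ("link", link),
        ("topic", topic),
        ("date", date),
        ("author", author),
        ("content", content),
    )
    for tag, value in fields:
        ET.SubElement(post, tag).text = value
    # encoding="unicode" returns a str, so Sinhala text stays readable.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(article_to_xml("http://example.com/article/1", "topic",
                         "2015-01-01", "author", "content"))
```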

Page 13: Sinmin final presentation

Crawler Controller

The crawler controller monitors and handles the status of the web crawlers.

Page 14: Sinmin final presentation


Page 15: Sinmin final presentation

Data Cleaning and Tokenizing Mechanisms

Page 16: Sinmin final presentation

Identified Issues

● Erroneous characters in the texts
● Short forms
● Consecutive Sinhala vowel signs

Page 17: Sinmin final presentation

Erroneous Characters Of The Texts

● Invalid Unicode characters
E.g.: characters in the Private Use Area, the replacement character

● Symbols
E.g.: “,”, “.”, “{”, “(”, “?”

Page 18: Sinmin final presentation

Erroneous Characters Of The Texts

● Unwanted non-Sinhala characters
E.g.: U+200C, Á, À, ®, ¡, ª, º

● Non-symbolic characters that terminate words
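
The character classes listed above can be filtered in one pass. The sketch below is one possible policy, not the Sinmin implementation: it drops Private Use Area characters, the replacement character and U+200C, keeps Sinhala letters and signs plus the zero-width joiner (an assumption, since Sinhala conjuncts use it), and turns every other character into a word separator.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # Unicode replacement character
ZWNJ = "\u200c"         # zero-width non-joiner, listed as unwanted above
ZWJ = "\u200d"          # zero-width joiner, kept here (assumption)


def clean_text(text: str) -> str:
    """Return text reduced to Sinhala words separated by single spaces."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch in (REPLACEMENT, ZWNJ) or category == "Co":
            continue  # invalid or unwanted: drop without splitting the word
        in_sinhala_block = "\u0d80" <= ch <= "\u0dff"
        if (in_sinhala_block and category[0] in "LM") or ch == ZWJ:
            out.append(ch)   # Sinhala letter, vowel sign or joiner
        else:
            out.append(" ")  # symbols and non-Sinhala text end the word
    return " ".join("".join(out).split())
```

Whether an unwanted character should be dropped or treated as a separator is a design choice; dropping U+200C keeps the surrounding word intact, while replacing symbols with spaces makes them terminate words.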

Page 19: Sinmin final presentation

Short Forms

● Short forms contain full stops.
● But those full stops separate neither sentences nor words.

E.g.: පෙ. ව. (pm), රු. (Rupees)

Page 20: Sinmin final presentation

Identified Common Short Forms

"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."

"පෙ.", "ව.", "�.", "රු."

"0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
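
A minimal sketch of sentence splitting that respects short forms, assuming whitespace-separated tokens. It uses only the letter short forms that survive intact in this transcript and treats any token ending in a digit plus a full stop as a short form. The function names are illustrative, and the real tokenizer may work differently.

```python
LETTER_SHORT_FORMS = {"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."}
DIGITS = set("0123456789")


def is_short_form(token: str) -> bool:
    """True when the trailing full stop belongs to a short form."""
    if token in LETTER_SHORT_FORMS:
        return True
    return len(token) >= 2 and token.endswith(".") and token[-2] in DIGITS


def split_sentences(text: str) -> list:
    """Split on full stops, except those that end a known short form."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and not is_short_form(token):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing full stop
        sentences.append(" ".join(current))
    return sentences
```

A side effect of the digit rule is that a sentence ending in a number is not split, which is the price of never breaking at "1." or "2." inside a sentence.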

Page 21: Sinmin final presentation

Consecutive Sinhala Vowel Sign Problem


Page 22: Sinmin final presentation

Consecutive Sinhala Vowel Sign Problem

● Solution: Mapping them into one format

● Convention: Only one vowel sign to a Sinhala letter
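
The mapping table used in Sinmin is not reproduced in this transcript. As a sketch under that caveat, the code below collapses the four two-sign sequences that Unicode defines as canonical decompositions of single Sinhala vowel signs, so that each letter ends up with one sign. The names `COMPOSITIONS` and `normalize_vowel_signs` are illustrative.

```python
# (sequence of two signs, equivalent single sign), applied in this order so
# that a three-sign sequence such as U+0DD9 U+0DCF U+0DCA collapses fully.
COMPOSITIONS = [
    ("\u0dd9\u0dcf", "\u0ddc"),  # kombuva + aela-pilla     -> kombuva haa aela-pilla
    ("\u0dd9\u0dca", "\u0dda"),  # kombuva + al-lakuna      -> diga kombuva
    ("\u0dd9\u0ddf", "\u0dde"),  # kombuva + gayanukitta    -> kombuva haa gayanukitta
    ("\u0ddc\u0dca", "\u0ddd"),  # U+0DDC + al-lakuna       -> kombuva haa diga aela-pilla
]


def normalize_vowel_signs(text: str) -> str:
    """Map consecutive vowel signs to the single equivalent vowel sign."""
    for sequence, single in COMPOSITIONS:
        text = text.replace(sequence, single)
    return text
```

Text that already uses the single-sign form passes through unchanged, so the function can be applied to every token before it is counted.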

Page 23: Sinmin final presentation

Consecutive Sinhala Vowel Sign Problem


Page 24: Sinmin final presentation

Selecting Data Storage Mechanism for Sinmin


Page 25: Sinmin final presentation

The performance of data insertion and retrieval mainly depends on the data storage mechanism used for the corpus.

Page 26: Sinmin final presentation

We tested the performance of several database systems to determine which one to use for storing the data.

Page 27: Sinmin final presentation

We Considered the Following Data Storage Systems

Page 28: Sinmin final presentation

We measured performance for inserting data and for retrieving data for 12 different information needs.

Data set and source code: https://github.com/madurangasiriwardena/performance-test

Page 29: Sinmin final presentation

Data Insertion Time Comparison


Page 30: Sinmin final presentation

Information Retrieval Performance Comparison - Part 1


Page 31: Sinmin final presentation

Information Retrieval Performance Comparison - Part 2


Page 32: Sinmin final presentation

Cassandra performed better than the others in most of the scenarios, and its insertion time increased linearly. So we chose it for implementing the corpus.

Page 33: Sinmin final presentation

Data Storage Model of Sinmin


Page 34: Sinmin final presentation

Cassandra

● We used Cassandra as the main storage system of Sinmin.

● Apache Cassandra version 2.1.2 was used.

● cqlsh version 5.0.1 was used.

Page 35: Sinmin final presentation

Cassandra

● Most API queries are served from the Cassandra database.

● The Cassandra database consists of more than 50 column families, each of which serves a specific information need.
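
One way to read "each column family serves a specific information need" is query-first modelling: the same counts are written to several tables, each keyed for one kind of lookup. The sketch below uses in-memory Python dictionaries as stand-ins for column families. The three table names are hypothetical and are not Sinmin's actual schema.

```python
from collections import Counter, defaultdict


class CorpusTables:
    """In-memory stand-in for a few query-specific column families."""

    def __init__(self):
        self.word_frequency = Counter()                          # word -> count
        self.word_frequency_by_year = defaultdict(Counter)       # year -> word -> count
        self.word_frequency_by_category = defaultdict(Counter)   # category -> word -> count

    def add_article(self, words, year, category):
        """Write every word to all tables so each query is a direct lookup."""
        for word in words:
            self.word_frequency[word] += 1
            self.word_frequency_by_year[year][word] += 1
            self.word_frequency_by_category[category][word] += 1


if __name__ == "__main__":
    tables = CorpusTables()
    tables.add_article(["a", "b", "a"], year=2014, category="News")
    print(tables.word_frequency_by_year[2014]["a"])
```

The trade-off is more work at insertion time in exchange for reads that never need joins or scans.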

Page 36: Sinmin final presentation

Oracle

● Oracle is used as a backup storage server.

Page 37: Sinmin final presentation

Oracle Schema


Page 38: Sinmin final presentation

Wildcard Search Feature

The wildcard search feature enables users to run wildcard queries on the corpus.

E.g.: පෙ ? හ*

Page 39: Sinmin final presentation

Wildcard Search Feature

● Implemented using Apache Solr
● More than 1.2 million distinct words
● Supports at most 10 asterisks and at most 10 question marks
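
Sinmin delegates wildcard matching to Apache Solr. Purely to illustrate the semantics, the sketch below translates a wildcard pattern into a Python regular expression, assuming `*` matches zero or more characters and `?` matches exactly one, and enforces the limit of 10 of each. The function names are illustrative and no Solr API is involved.

```python
import re

MAX_ASTERISKS = 10
MAX_QUESTION_MARKS = 10


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Compile a wildcard pattern, rejecting patterns over the stated limits."""
    if pattern.count("*") > MAX_ASTERISKS or pattern.count("?") > MAX_QUESTION_MARKS:
        raise ValueError("pattern uses more than 10 asterisks or 10 question marks")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))


def wildcard_search(pattern: str, words) -> list:
    """Return the words that match the whole pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]
```

Note that `.` here matches one code point, which is not the same as one Sinhala letter with its vowel sign.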

Page 40: Sinmin final presentation

Sinhala Vowel Sign Problem At Wildcard Search

In Sinhala Unicode, Sinhala vowel signs are separate Unicode characters


Page 41: Sinmin final presentation

Sinhala Vowel Sign Problem At Wildcard Search

Solution: Represent Sinhala letter and vowel sign as one entity
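
A minimal sketch of the "one entity" idea, assuming a unit is a base character followed by any combining marks (Unicode categories starting with "M", which covers Sinhala vowel signs). Matching then runs over units, so `?` consumes a letter together with its vowel sign. How Sinmin encodes the units for Solr is not described here; `to_units` and `unit_match` are illustrative names, and conjuncts joined with U+200D are not handled.

```python
import unicodedata


def to_units(word: str) -> list:
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in word:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch   # vowel sign or other mark joins the previous letter
        else:
            units.append(ch)
    return units


def unit_match(pattern: str, word: str) -> bool:
    """Wildcard match where '?' is one unit and '*' is zero or more units."""
    p, w = to_units(pattern), to_units(word)
    # reachable[j] is True when the pattern read so far can match w[:j]
    reachable = [True] + [False] * len(w)
    for token in p:
        nxt = [False] * (len(w) + 1)
        if token == "*":
            seen = False
            for j in range(len(w) + 1):
                seen = seen or reachable[j]
                nxt[j] = seen
        else:
            for j in range(len(w)):
                if reachable[j] and (token == "?" or token == w[j]):
                    nxt[j + 1] = True
        reachable = nxt
    return reachable[len(w)]
```

For a word whose first letter carries a vowel sign, a single `?` matches that whole first unit, whereas a code-point-level `?` would match only the bare consonant.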

Page 42: Sinmin final presentation

User Interface Design and Implementation


Page 43: Sinmin final presentation

● The web interface of Sinmin is designed for users who prefer a visualised and summarised view of Sinmin's statistical data.

● The visual design lets a user with no prior experience of the interface fulfil their information requirements with little effort.

Page 44: Sinmin final presentation

The Sinmin user interface allows users to:

● Find the probability of an n-gram
● Find the most probable word that comes after an n-gram
● Compare the usage of n-grams
● Find statistics of words, bigrams and trigrams
● Run wildcard searches
● Find the latest articles for an n-gram
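
The first two items can be illustrated with plain counting. The sketch below assumes the probability of an n-gram is its relative frequency among all n-grams of the same length, and that the most probable next word is the most frequent follower of the given n-gram. Both definitions and the function names are assumptions, since the slides do not state the exact formulas.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_probability(tokens, ngram):
    """Relative frequency of `ngram` among all n-grams of the same length."""
    ngram = tuple(ngram)
    counts = ngram_counts(tokens, len(ngram))
    total = sum(counts.values())
    return counts[ngram] / total if total else 0.0


def most_probable_next(tokens, context):
    """Most frequent word observed directly after `context`, or None."""
    context = tuple(context)
    n = len(context)
    followers = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tuple(tokens[i:i + n]) == context
    )
    return followers.most_common(1)[0][0] if followers else None


if __name__ == "__main__":
    sample = "a b a c a b".split()
    print(ngram_probability(sample, ("a", "b")))
    print(most_probable_next(sample, ["a"]))
```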

Page 45: Sinmin final presentation


Page 46: Sinmin final presentation


Page 47: Sinmin final presentation

API Design and Implementation


Page 48: Sinmin final presentation

REST API

● Exposes the corpus services

● Supports more complex and customizable data retrieval and filtering

● Provides an interface for third-party applications to consume

Page 49: Sinmin final presentation

REST API

● Depends on backend databases (Cassandra, Oracle, Solr)
● Cassandra acts as the main storage system
● Oracle is used as a backup database
● Solr is used for wildcard search functions

Page 50: Sinmin final presentation

Architecture


Page 51: Sinmin final presentation

API Functions

● wordFrequency
● bigramFrequency
● trigramFrequency
● frequentWords
● frequentBigrams
● frequentTrigrams
● latestArticlesForWord
● latestArticlesForBigram
● latestArticlesForTrigram
● frequentWordsAroundWord
● frequentWordsInPosition
● frequentWordsInPositionReverse
● frequentWordsAfterWordTimeRange
● frequentWordsAfterBigramTimeRange
● wordCount
● bigramCount
● trigramCount

Page 52: Sinmin final presentation

Performance Testing of the API


Page 53: Sinmin final presentation

Throughput Under Different Load Conditions


Page 54: Sinmin final presentation

Time Taken To Process Requests Under Different Load Conditions


Page 55: Sinmin final presentation

Full Stop Predictor For OCR

● One challenge in OCR development is identifying full stops.

● This tool is a consumer application of Sinmin that predicts the full stops in Sinhala texts.

Page 56: Sinmin final presentation

Publications

● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Presented)

● Comparison between performance of various database systems for implementing a language corpus - 11th International Beyond Databases, Architectures and Structures conference (Accepted)

Page 57: Sinmin final presentation

Future Work

● Annotate words with POS tags and lemmas.

● Implement tools and applications that make use of the corpus.

Page 58: Sinmin final presentation

Q & A


Page 59: Sinmin final presentation

Thank You!
