CLEAR Sep.2012 1
Approximate/Fuzzy String Matching using Mutation Probability Matrices
We consider the approximate/fuzzy string matching problem in the Malayalam language and propose a log-odds scoring matrix for score-based alignment. We report a pilot study designed and conducted to collect statistics about what we have termed the "accepted mutation probabilities" of characters in Malayalam, as they naturally occur. Based on these statistics, we show how a scoring matrix can be produced for Malayalam that can be used effectively in numeric scoring for approximate/fuzzy string matching. Such a scoring matrix would enable search engines to widen search operations in Malayalam. Being a unique first attempt, we point out a large number of areas in which further research and consequent improvement are required. We limit ourselves to a chosen set of consonant characters, and the matrix we report is a prototype for further improvement.
Authors:
Dr. Achuthsankar S Nair, Hon. Director, Centre for Bioinformatics, University of Kerala
Sajilal Divakaran, FTMS School of Computing, Kuala Lumpur
Linguistic computing issues in non-English languages are generally addressed with less depth and breadth, especially for languages with a small user base. Malayalam, one such language, is one of the four major Dravidian languages, with a rich literary tradition. The native language of the South Indian state of Kerala and of the Lakshadweep Islands off the west coast of India, Malayalam is spoken by 4% of India's population. While Malayalam is fairly well integrated with computers, with a user base that may not generate huge market interest, such fine issues of language computing for Malayalam remain unaddressed and unattended.
If we were to search Google for information on the senior author of this paper, Achuthsankar, and gave the query as Achutsankar or Achudhsankar, in both cases Google would land us correctly on the official web page of the author. This "Did you mean" feature of Google is managed by google-diff-match-patch [4]. The match part of the algorithm uses a technique known as approximate string matching or fuzzy pattern matching [10]. Close/fuzzy matching of any query received by the search engine is routine and obvious to the English-language user. However, when a non-English language such as Malayalam is used to query Google, the same facility is not seen in action.
When the word Pathinaayiram (the Malayalam word for the number ten thousand) is used as a query in Google Malayalam search, we are directed to documents that contain a similar word (Payinaayiaram, a common mispronunciation of the original word) but not the word itself. This is because approximate/fuzzy string matching has not been addressed in Malayalam. In this paper we make preliminary attempts toward addressing this very special issue of approximate/fuzzy string matching in Malayalam.
Approximate/Fuzzy String Matching
The field described as approximate or fuzzy string matching in computer science has been firmly established since the 1980s. Hall and Dowling [5] define the approximate string matching problem as follows: given a string s drawn from some set S of possible strings (the set of all strings composed of symbols drawn from some alphabet A), find a string t which approximately matches this string, where t is in a subset T of S. The task is either to find all those strings in T that are "sufficiently like" s, or the N strings in T that are "most like" s. One of the important requirements for analyzing similarity is to have a scientifically derived measure of similarity. The soundex system of Odell and Russell [13] is perhaps one of the earliest attempts to use such a measure. It uses a soundex code of one letter and three digits.
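As an illustration, the Soundex scheme just described (one letter followed by three digits) can be sketched in a few lines of Python. This is a simplified rendering of the standard algorithm, not code from the original system:

```python
def soundex(word):
    """Encode a word as one letter plus three digits (standard Soundex)."""
    codes = {}
    for chars, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for c in chars:
            codes[c] = digit
    word = word.lower()
    first = word[0].upper()
    prev = codes.get(word[0], "")
    digits = []
    for c in word[1:]:
        d = codes.get(c, "")
        if d and d != prev:
            digits.append(d)
        if c not in "hw":  # 'h' and 'w' do not separate a run of equal codes
            prev = d
    return (first + "".join(digits) + "000")[:4]
```

Notably, the query variants mentioned earlier, Achutsankar and Achuthsankar, encode to the same Soundex code, which is exactly the behaviour that makes such codes useful in practice.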
These codes have been used successfully in hospital databases and airline reservation systems [8]. The Damerau-Levenshtein metric [2] measures the smallest number of operations (insertions, deletions, substitutions, or reversals) needed to change one string into another. This metric can be used with standard optimization techniques [14] to derive an optimal score for each string match and thereby rank matches in order of closeness. Approximate or fuzzy string matching is in vogue not only for natural languages but also for artificial languages. In fact, approximate string matching has been developed into a fine art in computational sciences such as bioinformatics, which deals mainly with bio-sequences derived from DNA, RNA, and amino acid sequences [9]. Dynamic programming algorithms (the Needleman-Wunsch and Smith-Waterman algorithms) [11], which enable fast approximate string matching using carefully crafted scoring matrices, are in great use in bioinformatics. The equivalent of Google for the modern biologist is the Basic Local Alignment Search Tool (BLAST) [1], which uses scoring matrices such as the Point Accepted Mutation (PAM) matrices [3] and the BLOcks of Amino Acid SUbstitution Matrix (BLOSUM) [6]. To the best of the authors' knowledge, such a scoring system does not exist for any natural language, including English. Recently, an attempt has been made in this direction for the English language [7]. The statistics for accepted mutations in English were cleverly derived from already-designed Google searches. In the case of Malayalam, statistics of character mutations are not easily derivable from any corpus, existing search engine, or other language computing tool. Hence, data needs to be generated in order to proceed with the development of a scoring matrix system. We will now describe the generation of primary data on natural mutation in Malayalam.
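The Damerau-Levenshtein metric described above can be computed with a standard dynamic program. The sketch below is the optimal-string-alignment variant, counting insertions, deletions, substitutions and adjacent transpositions (reversals):

```python
def damerau_levenshtein(s, t):
    """Smallest number of insertions, deletions, substitutions, or
    adjacent transpositions needed to turn s into t."""
    m, n = len(s), len(t)
    # d[i][j] = distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

For instance, the misspelling Achutsankar is a single edit away from Achuthsankar, so candidate matches can be ranked by this distance.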
Occurrence and Mutation Probabilities
Malayalam has a set of 51 characters, and basic statistics of their occurrence and mutation are required for developing a scoring matrix. The occurrence probabilities are available, derived from corpora of considerable size in 1971 and again in 2003 [12]. We describe here only a subset of characters in view of economy of space. In Table 1, we give the probabilities of one set of consonants, which we have extracted from a small test corpus of Malayalam text derived from periodicals.

We then designed and conducted a study to extract the character mutation probabilities. We selected 150 words that cover all the chosen consonant characters. A dictation was administered among a small group of school children (N=30). The observed mistakes (natural mutations) are tabulated in Table 2 as probabilities. It is noted that a sample size of N=30 is inadequate for a linguistic study of this kind. However, as already highlighted, this paper reports a pilot study to demonstrate proof of concept. Moreover, the sample size can be made larger once the research community vets the approach put forward by us.

Log-odds Scoring Matrix
It is possible to use Table 2 itself for scoring string matches. However, it might be unwieldy in practice. For long strings we would need to multiply probabilities, which might result in numeric underflow. Hence, we use a logarithmic transformation. We also convert from probability to odds. The odds can be defined as the ratio of the probability of occurrence of an event to the probability that it does not occur: if the probability of an event is p, then the odds are p/(1-p). We will, however, not use this formula directly, but define the score for any given match i-j as:

S_ij = 10 log (p_ij / p_j)

In the above equation, p_ij is the probability that character i mutates to character j, and p_j is the probability of natural occurrence of character j. The score thus rewards or penalizes a mutation relative to how frequently the target character occurs naturally. The multiplier 10 is used just to bring the scores to a convenient range. Table 3 shows the log-odds scores thus derived, using the occurrence and mutation probabilities given in Tables 1 and 2. These can be used to score approximate matches and select the most similar one.
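The log-odds computation can be sketched as follows. The probability values below are invented placeholders standing in for the real figures in Tables 1 and 2, and the transliterations ka, kha, ga stand in for the Malayalam characters:

```python
import math

# Hypothetical occurrence probabilities p_j (placeholders for Table 1).
occurrence = {"ka": 0.30, "kha": 0.10, "ga": 0.60}

# Hypothetical mutation probabilities p_ij: row i mutates to column j
# (placeholders for Table 2).
mutation = {
    "ka":  {"ka": 0.90, "kha": 0.07, "ga": 0.03},
    "kha": {"ka": 0.20, "kha": 0.75, "ga": 0.05},
    "ga":  {"ka": 0.05, "kha": 0.05, "ga": 0.90},
}

def log_odds_matrix(mutation, occurrence):
    """S_ij = 10 * log10(p_ij / p_j), rounded to the nearest integer."""
    return {i: {j: round(10 * math.log10(p_ij / occurrence[j]))
                for j, p_ij in row.items()}
            for i, row in mutation.items()}

def score(chars_a, chars_b, S):
    """Score two equal-length character sequences by summing per-position log-odds."""
    return sum(S[a][b] for a, b in zip(chars_a, chars_b))

S = log_odds_matrix(mutation, occurrence)
```

With these placeholder numbers, identity matches score positive while cross-character mutations score negative, so summing the per-character scores ranks the exact match above any approximate one, as in Table 4.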
Results, Discussions, and Conclusion
The prototype scoring matrix we have designed above can be demonstrated to be capable of scoring approximate matches, and can therefore serve as a means of selecting the closest match. We demonstrate this with an example of scoring four approximate matches for a chosen word. Table 4 lists the scores for the four different matches; the exact match scores best. The next best match as per the new scoring scheme is കക.
Our demonstration has been on a chosen set of consonant characters, but it can be expanded to cover all Malayalam characters. For demonstrating more general words, a scoring matrix for vowels is essential. We have computed the same and will report it in a forthcoming publication. During our studies, we also noticed that the conventional grouping of characters may not suit our studies. For example, we found that one character can, very rarely, be a possible mutation of another, even though the two are not grouped together conventionally. A regrouping based on natural mutations is work we see as requiring attention.
To the best of our knowledge, our work is a unique proposition for the Malayalam language, which can be incorporated into Malayalam search engines. We would like to reiterate that our work is at the prototype stage. Neither the sample size of the corpus nor the number of subjects in the survey is substantial. The authors hope to expand the work with a sizable database from which statistics can be extracted, so that the scoring matrix can be made more reliable. We also propose to validate the scoring approach with sample trials involving language experts.
References
[1] Altschul, S F, et al. (1990). "Basic local alignment search tool", Journal of Molecular Biology, 215(3), 403-410.
[2] Damerau, F J (1964). "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7(3), 171-176.
[3] Dayhoff, M O, et al. (1978). "A model of evolutionary change in proteins", Atlas of Protein Sequence and Structure, 5(3), 345-358.
[4] google-diff-match-patch, [Online]. Available: http://code.google.com/p/google-diff-match-patch/, Accessed on 20 Jan. 2012.
[5] Hall, P A V and Dowling, G R (1980). "Approximate string matching", ACM Computing Surveys, 12(4), 381-402.
[6] Henikoff, S and Henikoff, J G (1992). "Amino acid substitution matrices from protein blocks", Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919.
[7] Kanitha, D (2011). "A scoring matrix for English", MPhil Dissertation in Computational Linguistics, Dept. of Linguistics, University of Kerala.
[8] Leon, D (1962). "Retrieval of misspelled names in an airlines passenger record system", Communications of the ACM, 5, 169-171.
[9] Nair, A S (2007). "Computational Biology & Bioinformatics: A Gentle Overview", Communications of the Computer Society of India, 31(1), 1-13.
[10] Navarro, G (2001). "A guided tour to approximate string matching", ACM Computing Surveys, 33(1), 31-88.
[11] Needleman, S B and Wunsch, C D (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins", Journal of Molecular Biology, 48(3), 443-453.
[12] Prema, S (2004). "Report of Study on Malayalam Frequency Count", Dept. of Linguistics, University of Kerala.
[13] Soundex, [Online]. Available: http://en.wikipedia.org/wiki/Soundex, Accessed on 2 Dec. 2011.
[14] Wagner, R A and Fischer, M J (1974). "The string-to-string correction problem", Journal of the ACM, 21(1), 168-178.
This article was published in CSI in May 2012 and is reused here with the author's permission.
Author:
M. Jathavedan, Emeritus Professor, Department of Computer Applications, CUSAT, Cochin
INDIAN SEMANTICS AND NATURAL LANGUAGE PROCESSING
The history of modern linguistics is sometimes chronologically divided into two eras: BC (Before Chomsky) and AD (After Dissertation). Here "dissertation" means the thesis which Chomsky submitted to the University of Pennsylvania for his doctorate. His ideas are considered epoch-making, comparable to Darwin's theory of evolution, and, like Darwin's, took time to gain recognition. Chomsky therefore published them himself as 'Syntactic Structures'.
Paninian grammar was introduced to modern linguistics as a forerunner of Chomsky's generative grammar, presented in the above book. 'Many linguists, foreign and Indian, joined the bandwagon and posed as experts in Paninian grammar in Chomskian terms' (Joshy S.D.). The renewed interest influenced the interpretation of Paninian grammar itself as generative grammar, that is, the idea that grammar consists of modules in a hierarchy of levels. The first contribution in this direction was due to Kiparsky and Staal (1969), who proposed a hierarchy of four levels of representation. This was criticized by Hauben (2002) because it did not permit semantic factors. Other important contributions are due to Cardona (1976). Joshy continues: 'Somewhat later Chomsky drastically revised his ideas, and after the enthusiasm for Chomsky subsided, it became clear that the idea of transformation is alien to Panini. Now a new type of linguistics has come up, called Sanskrit Computational Linguistics with three capital letters. Although Chomsky is out, Panini is still there, ready to be acclaimed as the forerunner of SCL.' But SCL was identified as a branch of study only in 2007, and there were other factors that led to its formation.
In a paper entitled 'Knowledge Representation in Sanskrit and Artificial Intelligence', the NASA scientist Rick Briggs drew the attention of computer scientists to the works on semantics in Sanskrit literature rather than to Paninium. The important fact to note is that he was referring to the 'Vaiyakarana Siddhanta Laghu Manjusha' of Bhatta Nagesa (1730-1810), perhaps the last Sanskrit scholar in the Indian tradition.
This paper, rightly or wrongly, aroused great enthusiasm among Sanskrit scholars. Some of them even went to the extent of claiming that the future direction of research in artificial intelligence would be decided by Sanskrit. The immediate result was the 'First Seminar on Knowledge Representation and Samskritam' (1986) held at Bangalore, in which Briggs presented a paper entitled 'Sastric Sanskrit: An Inter-lingua for Machine Translation'.
Thus computational Sanskrit emerged as a new branch of research. Apart from computer-assisted teaching and research of Sanskrit (as for any other subject), automated reconstruction of Sanskrit texts, machine-aided translation (MAT), the design of a working system of the Paninian grammatical framework for machine translation (especially for Indian languages), and its possible applications in cognitive science and AI are some areas of active research in the Sanskrit departments of many universities and the computer science departments of many institutes.
It is a surprising fact that we are not able to locate any further contribution by Briggs in this field. Further, comments continue to pour in on the internet for and against the arguments put forward by Briggs. Another point to be noted is that the author of the paper is Briggs in person, and not NASA, as wrongly conceived by many.
A question that naturally arose was the role of Sanskrit as a dedicated programming language, which would mean the development of a compiler accepting Sanskrit instructions. C-DAC, Bangalore had initiated some work in this direction in the early 1990s itself. It was claimed that the Astadhyayi (Paninium) was useful in this matter, i.e., that the meta-rule, meta-language and linguistic-marker system of Panini could be used to draw up the specification and requirements of such a processor. To what extent this search has been successful after twenty years remains a question.
The International Symposia on Sanskrit Computational Linguistics (SCL) were the result of the attempt to provide a common platform for traditional Sanskrit scholars and computational linguists. They were a culmination of the World Sanskrit Conferences, especially the thirteenth one held at Edinburgh, and the First National Symposium on Modeling and Shallow Parsing of Indian Languages in Mumbai, both held in the year 2006. The first Symposium was held in France in 2007 and the most recent at Jawaharlal Nehru University, New Delhi (2010).
LINGUISTICS AND PHILOSOPHY
Linguistics is considered a part of philosophy in India. It is often said that 'the grammatical method of Panini is as fundamental to Indian thought as the geometrical method of Euclid is to Western thought.'
Semantics in Sanskrit was never the well-defined domain of a separate discipline (Hauben). Rather, it remained a battlefield for exegetes, logicians and grammarians of various backgrounds and philosophical commitments. It was only a few centuries after Bhartrhari (4th century A.D.) that a sophisticated specialized language and terminology were developed for discussing semantic problems and theories of verbal understanding. Thus, during the period from the thirteenth to the sixteenth centuries, semantic issues were seriously taken up for discussion between different philosophical schools, not only focussing on language but also from a religious point of view.
The formal categories in their discussions were mainly those established in Paninium and investigated semantically and philosophically by Bhartrhari. We will consider two or three of them.
As an example we consider the sentence:
'Rama cooks rice'
In the subdivision of a sentence into words, the grammarians take the verb as central. Other words are related to this meaning-bearing word in one way or another. Kriya is the action of the verb in the sentence. The other words, which are "factors in the action" of the verb, are called karakas. Panini has defined six karakas.
For the sentence in our example, the grammarians may give the following analytical description:
"It is the activity of cooking, taking place in the present time, having an agent which is identical with Rama, having an object identical with rice."
Thus the sentence is split into elements such as stem, root, affix and ending, with the attribution of a well-defined meaning to each linguistic element. The central element in this analysis is the meaning expressed by the verb 'cooks', or, to be more precise, the meaning of the verb root 'to cook' (pac). The verbal form (in Sanskrit the verbal ending ti in pa(ca)ti) indicates that the activity takes place in the present time. The agent of the action is expressed by the grammatical subject, Rama; the object of the action is the grammatical object, rice.
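The grammarians' analysis above lends itself naturally to a structured representation. The sketch below is purely illustrative: the karaka labels karta (agent) and karma (object) are standard terms not introduced in the text, and the analysis is written by hand, not produced by any parser:

```python
# Hand-written karaka-style analysis of "Rama cooks rice".
analysis = {
    "kriya": "cook",        # action expressed by the verb root (pac)
    "tense": "present",     # signalled by the verbal ending (ti in pacati)
    "karakas": {
        "karta": "Rama",    # agent of the action (grammatical subject)
        "karma": "rice",    # object of the action (grammatical object)
    },
}
```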
For the Mimamsa thinkers also, the verb is the central element in a sentence. While grammarians take the verbal root and the activity expressed by it as more important than the verbal ending and its meaning, the latter are more important for the Mimamsakas. According to them, the basic meaning of all verbs is a creative urge which stimulates action. This basic urge is expressed, i.e., transmitted to the listener, by the verbal ending, not by the verbal root, which merely qualifies this creative urge. Thus according to them the sentence in our example can be given the following structural description:
"It is the creative urge which is conducive to cooking, taking place in the present time, having the same substratum as the agent residing in Rama, having as object rice."
Now for the Nyaya school, it is not the verb which is the central element in the sentence but, generally, the noun in the first ending (nominative). Thus the structure of the verbal knowledge in our example, according to them, is:
"It is Rama who possesses the volitional effort conducive to cooking, which produces the softening and moistening which is based in rice."
Underlying all these descriptions is the presupposition that the main structural relation in the sentence is that between the qualifier and the thing to be qualified (visesana/visesya). Unlike the grammarians and Mimamsakas, for whom the visesya is the verb, for the Nyaya thinkers the visesya is the noun in the first ending.
SANSKRIT COMPUTATIONAL LINGUISTICS
I have already quoted S.D. Joshy. The sentences were from his paper 'Background of the Astadhyayi', read at the Third International Symposium on Sanskrit Computational Linguistics held in 2009 at Hyderabad. He continues: 'Contrary to some western misconceptions, the starting point of Panini's analysis is not meaning or the intention of the speaker, but words formed from elements. Panini starts from morphology to arrive at a finished word.' But 'he developed a number of theoretical concepts which can be applied to other languages also.'
Coming back to Briggs, we note that, in contrast to other works, his paper for the first time drew the attention of computer scientists to the semantic theories available in Sanskrit. Since it is meaning that is important in a sentence, syntax is developed to tackle the semantic problem. But centuries elapsed before Bhartrhari (4th century A.D.) developed his sphota theory after Panini (4th century B.C.). Again centuries elapsed before Bhatta Nagesa brought the sphota theory to completion in the eighteenth century. The later development of linguistics can be considered a continuation of this.
There are four factors involved in proper cognition: expectancy, mutual compatibility, proximity and the intention of the speaker. It is difficult to include the last one in any syntactic solution. According to Bhartrhari, a speaker can seldom communicate through words all that he intended, and the hearer understands more, or at times less, than what he hears! Thus there is mutual dependency between Indian theories of syntax and semantics. It is said that the Indian linguists of the fifth century B.C. knew more of the subject than Western linguists of the nineteenth century A.D. Further, if there is any area where the ancient Sanskrit scholars have been much ahead of modern developments, it is in the field of semantics and systems of knowledge representation.
REFERENCES:
1. Briggs, Rick, 1985, Knowledge representation in Sanskrit and artificial intelligence, The AI Magazine.
2. Briggs, Rick, 1986, Shastric Sanskrit: an interlingua for machine translation, First National Conference on Knowledge Representation, Bangalore.
3. Chomsky, N, 1957, Syntactic Structures, The Hague, Mouton.
4. Cardona, George, 1976, Panini: A Survey of Research, The Hague, Mouton.
5. Kiparsky, Paul and Staal, J.F., 1969, Syntactic and semantic relations in Panini, FL 5.
6. Hauben, E.M., 2002, Semantics in the Sanskrit tradition on the eve of colonialism, Project report, Leiden University.
7. Joshy, S.D., 2009, Background of the Astadhyayi, Third International Symposium on Sanskrit Computational Linguistics, Hyderabad.
Overview of Question Answering System
Interaction between humans and computers is one of the most active areas of research in the modern world. In particular, interaction through natural language has become increasingly popular. Natural Language Processing is a computational technique for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, in order to achieve human-like language processing for a wide range of applications. One of the most powerful applications of NLP is the Question Answering (QA) system. The need for automated question answering systems has become more urgent due to the enormous growth of digital information in text form. A QA system involves the analysis of both questions and answers. In this overview, we focus on question type classification, question generation, and answer generation for both closed and open domains.
Introduction
Research in Natural Language Processing [1] has been going on for several decades, dating back to the late 1940s. The goal of NLP is to accomplish human-like language processing. The disciplines underlying the practice of NLP are: Linguistics, which focuses on formal, structural models of language and the discovery of language universals (in fact, the field of NLP was originally referred to as Computational Linguistics); Computer Science, which is concerned with developing internal representations of data and efficient processing of these structures; and Cognitive Psychology, which looks at language usage as a window into human cognitive processes and has the goal of modelling the use of language in a psychologically plausible way.
The most explanatory method for presenting what actually happens within a Natural Language Processing system is by means of the 'levels of language' approach. Phonology concerns how words are related to the sounds that realize them. Morphology concerns how words are constructed from more basic meaning units called morphemes; a morpheme is the primitive unit of meaning in a language. The syntax level concerns how words can be put together to form correct sentences, and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. The semantic level concerns what words mean and how these meanings combine in sentences to form sentence meanings. The pragmatic level concerns how sentences are used in different situations and how use affects the interpretation of the sentence. The discourse level concerns how the immediately preceding sentences affect the interpretation of the next sentence. World knowledge includes the general knowledge about the structure of the world that language users must have in order to maintain a conversation.
Natural language processing is used for a wide range of applications. The most frequent applications utilizing NLP include Information Retrieval (IR), Information Extraction (IE), Question Answering, Summarization, Machine Translation, and Dialogue Systems. In this paper we focus on Question Answering.
Question Answering can be performed in two domains: closed and open. Closed-domain question answering [4] deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Alternatively, closed-domain might refer to a situation where only a limited type of questions is accepted, such as questions asking for descriptive rather than procedural information. Open-domain question answering [4] deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Authors:
K.M. Arivuchelvan, Research Scholar, Periyar Maniammai University.
K. Lakshmi, Professor, Periyar Maniammai University.
Question Answering [5] is a specialized form of information retrieval. Given a collection of documents, a Question Answering system attempts to retrieve correct answers to questions posed in natural language. Open-domain question answering requires QA systems to be able to answer questions about any conceivable topic. Such systems cannot, therefore, rely on hand-crafted domain-specific knowledge to find and extract the correct answers.
Question Classification
Question Classification [2] is an important task in Question Answering. The best-known question taxonomy is the one proposed by Graesser and Person (1994), based on their two studies of human tutors' and students' questions during tutoring sessions in a college research-methods course and a middle-school algebra course. Six trained human judges coded the questions in the transcripts obtained from the tutoring sessions on four dimensions: question identification; degree of specification (e.g., a high degree means the question contains more words that refer to the elements of the desired information); question-content category; and question-generation mechanism (the reasons for generating questions include a knowledge deficit in the learner's own knowledge base, common ground between dialogue participants, social actions among dialogue participants, and conversation control). They defined the following 18 question categories according to the content of the information sought, rather than the interrogative words (i.e., why, how, where, etc.):
1. Interpretation: What does X mean?
2. Causal antecedent: Why/how did X occur?
3. Causal consequence: What next? What if?
4. Goal orientation: Why did an agent do X?
5. Instrumental/procedural: How did an agent do X?
6. Enablement: What enabled X to occur?
7. Expectation: Why didn't X occur?
8. Judgmental: What do you think of X?
9. Assertion
10. Request/Directive
11. Verification: invites a yes or no answer.
12. Disjunctive: Is X, Y, or Z the case?
13. Concept completion: Who? What? When? Where?
14. Example: What is an example of X?
15. Feature specification: What are the properties of X?
16. Quantification: How much? How many?
17. Instrumental/procedural: How did an agent do X?
18. Comparison: How is X similar to Y?
After analyzing 5,117 questions in the research-methods sample and 3,174 questions in the algebra sample, they found four frequent question categories: verification, instrumental/procedural, concept completion, and quantification.
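Graesser and Person coded questions by the content of the information sought rather than by interrogative words. Nevertheless, a crude interrogative-word heuristic is a common first pass in QA systems; the keyword lists below are our own illustrative choices, covering only a few of the eighteen categories:

```python
def classify_question(question):
    """Naive first-pass guess at a question category from its opening words."""
    q = question.lower().strip()
    if q.startswith(("is ", "are ", "do ", "does ", "did ", "can ", "was ", "were ")):
        return "verification"
    if q.startswith(("how many", "how much")):
        return "quantification"
    if q.startswith("how"):
        return "instrumental/procedural"
    if q.startswith(("who", "what", "when", "where")):
        return "concept completion"
    if q.startswith("why"):
        return "causal antecedent / goal orientation"
    return "unclassified"
```

A real classifier would, of course, need syntactic and semantic analysis to separate, say, a request from an assertion; keyword matching cannot make such content-based distinctions.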
Question Generation (QG)
For the first time in history [], a person can ask a question on the web and receive answers in a few seconds. Twenty years ago it would have taken hours or weeks to receive answers to the same questions, as a person hunted through documents in a library. In the future, electronic textbooks and information sources will be mainstream, and they will be accompanied by sophisticated question asking and answering facilities.
Applications of automated QG facilities are endless and far-reaching. Below is a small sample, some of which are addressed in this report:
1. Suggested good questions that learners
might ask while reading documents and
other media.
2. Questions that human and computer
tutors might ask to promote and assess
deeper learning.
3. Suggested questions for patients and
caretakers in medicine.
4. Suggested questions that might be asked in
legal contexts by litigants or in security
contexts by interrogators.
5. Questions automatically generated from
information repositories as candidates for
Frequently Asked Question (FAQ) facilities.
The time is ripe for a coordinated effort to tackle QG in the field of computational linguistics and to launch a multi-year campaign of shared tasks in Question Generation. We can build on the disciplinary and interdisciplinary work on QG that has been evolving in the fields of education, the social sciences and computer science. A QG system operates directly on the input text, executes implemented QG algorithms, and consults relevant information sources. Very often there are specific goals that constrain the QG system.
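As a toy illustration of a QG algorithm operating directly on input text, the rule below turns a simple 'X is Y.' sentence into a concept-completion question. Real QG systems use far richer syntactic and semantic analysis; the pattern here is our own oversimplification:

```python
import re

def generate_question(sentence):
    """Toy QG rule: 'X is Y.' -> 'What is X?'; returns None otherwise."""
    match = re.match(r"(.+?) is (.+)\.\s*$", sentence)
    if match:
        return f"What is {match.group(1)}?"
    return None
```

For example, the sentence "BLAST is a local alignment search tool." yields the candidate question "What is BLAST?", the kind of output one might feed into an FAQ facility.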
Question Answering
Today's question answering [7] is not limited by the type of document or data repository: it can address both traditional databases and more advanced ones that contain text, images, audio and video. Both structured and unstructured data collections can serve as information sources in question answering. Unstructured data allows querying only of raw features (for example, words in a body of text), without clear semantics attached. Related to this distinction between structured and unstructured data is the traditional distinction between restricted-domain question answering (RDQA) and open-domain question answering (ODQA).
RDQA systems are designed to answer questions
posed by users in a specific domain of competence,
and usually rely on manually constructed data or
knowledge sources. They often target a category of
users who know and use the domain-specific
terminology in their query formulation, as, for
example, in the medical domain. ODQA focuses on
answering questions regardless of the subject
domain. Extracting answers from a large corpus of
textual documents is a typical example of an ODQA
system. Recently, we have witnessed an approach to
question answering involving semi-structured data.
These data often comprise text documents in
which the structure of the document or certain
extracted information is expressed by a markup.
Such markups can be attributed manually (e.g.,
the structure of a document) and/or in an
automatic way, e.g., markups for identified
person and company names and their
relationships in newspaper articles.
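As a rough illustration of such automatic markup (the names and the gazetteer lists below are invented for the example; real systems use trained named-entity recognizers rather than fixed lists), known entity strings can be tagged directly in raw text:

```python
import re

# Toy gazetteers; a real system would use a trained NER model instead.
PERSONS = ["Amit Singhal", "Narayan Babu"]
COMPANIES = ["Google", "Microsoft", "Apple"]

def mark_up(text):
    """Wrap known person and company names in XML-style markup."""
    for name in PERSONS:
        text = re.sub(re.escape(name), f"<person>{name}</person>", text)
    for name in COMPANIES:
        text = re.sub(re.escape(name), f"<company>{name}</company>", text)
    return text

print(mark_up("Amit Singhal works at Google."))
# <person>Amit Singhal</person> works at <company>Google</company>.
```

A question answering system can then query these markups (for example, all `<company>` mentions near a `<person>`) instead of raw words.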
Conclusion
Question answering is a complex task that requires
advances in several research areas, including
question generation, question ranking, question
classification, information retrieval, natural
language processing, database technologies,
Semantic Web technologies, human-computer
interaction, speech processing and computer vision.
REFERENCES
1. Liddy, E. D. In Encyclopaedia of Library and Information Science, 2nd Ed. Marcel Dekker, Inc.
2. Ming Liu, Rafael A. Calvo. "G-Asks: An Intelligent Automatic Question Generation System for Academic Writing Support." Dialogue and Discourse 3(2) (2012) 101–124.
3. Mark Andrew Greenwood. "Open-Domain Question Answering." September 2005.
4. http://en.wikipedia.org/wiki/Question_answering
5. Andrew Lampert. "A Quick Introduction to Question Answering." December 2004.
6. Workshop Report: "The Question Generation Shared Task and Evaluation Challenge." Sponsored by the National Science Foundation.
7. Oleksandr Kolomiyets, Marie-Francine Moens. "A survey on question answering technology from an information retrieval perspective." Information Sciences 181 (2011) 5412–5434.
I-Search....
Future of Search Engines
In this web age, searching, or more precisely
surfing the web, may be a casual phrase in day-to-day
business. Netizens continuously enrich the
web vocabulary with words like "Googling". This
shows how important search engines have become in this
digital era. A web search engine is designed to
search for information on the World Wide Web.
Today's search engines come in two types.
Directory-based engines, like Yahoo, are still built
manually: you decide what your directory categories
are going to be (Business, Health, Entertainment and
so on), then you put a person in charge of each
category, and that person builds up an index of
relevant links. Crawler-based engines, like Google,
employ a software program, called a crawler, that
goes out and follows links, grabs the relevant
information, and brings it back to build your index.
Then you have an index engine that allows you to
retrieve the information in some order, and an
interface that allows you to see it. It's all done
automatically.
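The crawl-then-index pipeline described above can be sketched offline. Here the "web" is a hypothetical in-memory graph of pages and links, invented for the example; a real crawler would fetch pages over HTTP and would need politeness rules, parsing and deduplication:

```python
from collections import deque

# A hypothetical miniature web: page -> (text, outgoing links).
WEB = {
    "home":  ("welcome to the home page", ["news", "about"]),
    "news":  ("breaking news about search engines", ["home"]),
    "about": ("about this site", []),
}

def crawl_and_index(start):
    """Follow links breadth-first and build an inverted index: word -> pages."""
    index, seen, frontier = {}, {start}, deque([start])
    while frontier:
        page = frontier.popleft()
        text, links = WEB[page]
        for word in text.split():
            index.setdefault(word, set()).add(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl_and_index("home")
print(sorted(index["about"]))  # ['about', 'news']
```

The index engine then answers a query by looking up each query word in this mapping; the interface only has to display the resulting pages in some order.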
As the Web continues to grow, however, and to be
more and more important for commerce,
communication, and research, information-retrieval
problems become a more serious handicap. The
percentage of Web content that shows up on search
engines continues to wane. And as search engines
struggle to add more and more content, the
information they provide may be increasingly out-of-
date.
Recent advances in intelligent search suggest that
these limitations can be partially overcome by
providing search engines with more intelligence,
drawing on natural language processing and on the
user's underlying knowledge. A search engine might
also have to understand what the user needs, even
when it is not stated explicitly, and that requires
some knowledge of the user. These ideas led to the
birth of a new generation of web technologies,
popularly known as the Semantic Web.
Semantic Search
A semantic search engine attempts to make
sense of search results based on context. It
automatically identifies the concepts structuring
the texts. For instance, if you search for
"election", a semantic search engine might
retrieve documents containing the words "vote",
"campaigning" and "ballot", even if the word
"election" is not found in the source document.
Semantic Search systems consider various
points including context of search, location,
intent, variation of words, synonyms,
generalized and specialized queries, concept
matching and natural language queries to
provide relevant search results. Major search
engines like Google and Bing incorporate some
elements of Semantic Search. The objective of
this article is to discuss the recent advances in
the area of Semantic Search.
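The "election" example above can be mimicked with simple query expansion over a hand-made concept map. The synonym table and documents below are invented for illustration; real semantic engines derive such relations from ontologies or corpus statistics rather than a fixed dictionary:

```python
# Hypothetical concept map linking a query term to related words.
CONCEPTS = {"election": {"election", "vote", "ballot", "campaigning"}}

DOCS = {
    "d1": "citizens cast their vote at the ballot box",
    "d2": "the recipe calls for two eggs",
}

def semantic_search(query):
    """Return documents matching the query term or any related concept."""
    terms = CONCEPTS.get(query, {query})
    return [doc_id for doc_id, text in DOCS.items()
            if terms & set(text.split())]

print(semantic_search("election"))  # ['d1'] even though 'election' never appears
```

Document d1 is retrieved because it contains "vote" and "ballot", which the concept map links to "election".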
Google's Knowledge Graph:
Google usually returns the search results for any
query based on the text and the content. To put
it plainly, it does not understand the exact
meaning of the words: it matches the keywords
of the query with those of the sites and returns
pages that have significant authority on those
words.
Amit Singhal, Google's senior VP of engineering,
said [1]: "The introduction of Knowledge Graph
enables Google to understand whether a search
for 'Mars' refers to the planet or the
confectionary manufacturer. 'Search is a lot
about discovery' – the basic human need to
learn and broaden your horizons."
Author
Manu Madhavan M. Tech Computational Linguistics
Govt. Engg. College, Sreekrishnapuram [email protected]
Bing's Semantic Search
Microsoft specifically brands Bing as a "decision
engine", and not as a general-purpose search
engine (even though it provides that functionality
as well), in order to differentiate it from Google
Search. Bing's search is based on semantic
technology from Powerset, which was acquired by
Microsoft in 2008. Notable changes include the
listing of search suggestions as queries are
entered and a list of related searches (called the
"Explore pane"). Bing features semantic
capabilities like presenting more readable
captions based on linguistic and semantic
analysis of content. The concept of entity
extraction is leveraged in Bing, providing
knowledge on phrases and what they uniquely
refer to. [2]
Bing's new product, Adaptive Search, strives to
capitalize on semantic search technology.
Adaptive Search takes your user behaviour into
consideration, then tailors your Bing results to be
most appropriate. So if you have searched for a
word and then clicked on a specific site previously,
Bing will predict that what you are searching for
likely falls into the context of that site, and thus
it can provide you with results that are more
tailored. [5]
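A toy version of such behaviour-based tailoring might re-rank results by how often the user clicked each site before. The sites and click counts below are invented for the example; Bing's actual personalization signals are far richer:

```python
# Hypothetical click history: site -> number of past clicks by this user.
CLICKS = {"python.org": 5, "example.com": 0, "docs.python.org": 2}

def adaptive_rank(results):
    """Stable re-rank: previously clicked sites float toward the top."""
    return sorted(results, key=lambda site: -CLICKS.get(site, 0))

results = ["example.com", "docs.python.org", "python.org"]
print(adaptive_rank(results))
# ['python.org', 'docs.python.org', 'example.com']
```

Because the sort is stable, sites the user has never clicked keep their original relative order.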
Powerset
Powerset is a Microsoft-owned company building
a transformative consumer search engine based
on natural language processing. Their unique
innovations in search are rooted in breakthrough
technologies that take advantage of the structure
and nuances of natural language. Using these
advanced techniques, Powerset is building a
large-scale search engine that breaks the
confines of keyword search.
By making search more natural and intuitive,
Powerset is fundamentally changing how we
search the web, and delivering higher quality
results. [3]
Hakia:
Hakia is a general-purpose semantic search
engine that searches structured corpora (text)
like Wikipedia. For some queries (typically
popular queries and queries where there is
little ambiguity), Hakia produces resumes.
These are portals to all kinds of information on
the subject. Every resume has an index of
links to the information presented on the page
for quick reference. Often, Hakia will propose
related queries, which is also great for
research. [3]
Cognition
Cognition has a search business based on a
semantic map, built over the past 24 years,
which the company claims is the most
comprehensive and complete map of the
English language available today. It is used in
support of business analytics, machine
translation, document search, context search,
and much more. [3]
Swoogle:
Swoogle, the Semantic web search engine, is a
research project carried out by the ubiquity
research group in the Computer Science and
Electrical Engineering Department at the
University of Maryland. It‘s an engine tailored
towards finding documents on the semantic
web.
Swoogle is capable of searching over 10,000 ontologies and indexes more
than 1.3 million web documents. It also computes the importance of a
Semantic Web document. The indexing techniques include Google-style
page ranking as well as mining the documents for the inter-relationships
that are the basis of the Semantic Web. [4]
Conclusion
NLP is a complex area of research, requiring a solid understanding of
grammars (not just grammar) and a good grounding in computational
linguistics (in order to apply the techniques to machines, which is not always
easy). Understanding the techniques used in NLP allows us to provide the
best format and patterns for the search engine. Seeing as NLP seeks to
mimic human language understanding, using common sense is a good
idea. But before any broader, more sophisticated sort of intelligence can
be placed into a machine, we humans will have to get a better grasp on
just what intelligence is.
References:
1. http://mashable.com
2. http://semanticweb.com
3. http://thenextweb.com
4. http://web2innovations.com
5. http://blogs.wsj.com
Google synonyms and natural language processing
Google just blogged about synonyms as they relate to searcher intent. They provide several examples of how a
concept as simple as a synonym complicates natural language processing. This also brings up some important
recommendations for site owners with respect to SEO.
Prospective customers type in all kinds of variations on your most obvious keywords (hence the need for keyword
research). Often they make use of synonyms, some common, some not. These variations often represent less
competitive opportunities for high search engine rankings if you can incorporate those synonyms into your
website. In particular:
• Use common variations within your existing copy rather than using the same phrase repeatedly. (This
also tends to make long blocks of text more readable.)
• Develop pages that specifically focus on each of the most common and valuable synonyms.
• If there are enough synonyms and industry-specific terms, consider developing a glossary of terms.
• Find opportunities to talk about the synonyms, such as a blog post or article that discusses how
synonyms may actually be somewhat different or whose similarity is up for debate (e.g. SEM vs. Search
Engine Advertising).
http://www.web1marketing.com
PyLucene
PyLucene is a GCJ-compiled
version of Java Lucene
integrated with Python. Its
goal is to allow you to use
Lucene's text indexing and
searching capabilities from
Python.
Remolding Professional sectors: the SaaS way..
SaaS : Purpose and Functions
The cost and time-to-market benefits of outsourcing business
services like payroll, storage space, Customer Relationship
Management (CRM) applications, and company websites have
been proven for many businesses. The most recent term for
these types of outsourced services is Software as a
Service (SaaS).
Author
Dr. Sudheer S Marar
MCA MBA PhD Associate Professor and
HOD, Department of MCA Nehru College of Engineering
and Research Centre
Introducing new technology is an expensive undertaking,
usually requiring high capital outlays and many months of
training, installation and integration before service
can be delivered on the network. Outsourcing these services to
organizations that are experts in the technology lowers costs,
increases uptime, accelerates revenue realization and provides
increased flexibility and functionality.
Due to these results, hosting for these critical business
functions continues to grow, and many companies are looking
for similar opportunities in other operational areas.
Effects of Downturn
As stated in the Movius Corporation annual report, the economic
downturn has globally forced many companies to reduce
spending across the board. This has put companies that are in
highly competitive and innovation-driven industries, such as
telecommunications, in an exigent balancing act. While they
need to try to control expenses, if they do not also continue
to introduce the latest applications and services, they will
quickly begin to lose market share.
The ideal situation for a carrier would be to introduce
new services almost instantly without risking precious in-hand
capital. Under the best possible scenario, the carrier could
begin generating revenue within weeks of making
the decision to launch a new service. If the service could be
introduced without the need for additional staff, the
solution is essentially risk free.
Some applications are an
immediate success in the
market while others take time
or, in the worst case, never gain
a toehold in a given market. The
ideal introduction scenario for a
carrier would be to try a new
service in a particular market
without having to make a
significant investment, all the
while gaining key market data.
Therefore, companies today are
faced with the challenges of
controlling equipment and
operating costs, protecting their
current investments and having
the ability to deploy new
applications quickly. To add to
the challenge, many carriers
are faced with older application
platforms that are limited in
capability and potentially
approaching end of life. These
companies need cost-effective
solutions that permit the
conversion from legacy
networks to IP infrastructure
without major changes to
network infrastructure or
application design.
Enterprise-level applications
As extracted from a lead article of an IDC-SAP initiated
paper: "...Professional service firms focus their business
management energy on optimizing the utilization of an
expert's or a consultant's time. They attempt to develop
service offerings or skill sets that clients will find
compelling. Ultimately, they focus on properly charging
and receiving payment from clients. Larger firms tend to
broaden their offerings to ensure a greater wallet share,
while smaller firms tend toward key-field focus
and deep industry expertise, hoping to foster continuing
relationships with a small number of clients."
In short, all firms balance developing a talent pipeline
with maximizing utilization rates. Client satisfaction and
trusting relationships drive both repeat business and
referrals in most professional services segments.
Therefore, firms seek to ensure deliverables of the
highest possible quality and strive to fully meet client
expectations throughout the engagement process. Firms
increasingly use technology to support all parts of their
business: Finance and scheduling software are common,
knowledge management and data warehouse capability
help improve service quality, and client management and
engagement management software are increasingly used
to monitor and maximize customer satisfaction. The
increased use of technology has both aided and hindered
professional services firms' efforts to improve their key
value propositions.
Benefits of SaaS
The cost of a complex business management software
implementation is often the starting point for a discussion,
and often a point where the discussion meets a quick
end. In their research, IDC has identified several areas
where SaaS system delivery costs differ from on-premise
delivery costs. Primarily, they are the following:
• License fees, both initial and maintenance.
• Hardware costs.
• IT infrastructure costs.
• Test environment and development maintenance costs.
• IT personnel/support costs.
• Security, backups, and disaster recovery.
The Futuristic.
Clearly SaaS applications are maturing. The
number of companies that either are using
SaaS applications or plan to use SaaS
applications in the next year has grown
considerably over the past few years,
suggesting that the barriers to adoption —
either real or perceived — are being
overcome. We see a bright future for SaaS
across a broad range of application areas
and for large and small professional services
firms.
SaaS is not without its problems, however.
Functionality and security concerns linger,
and while these concerns are more
perception than reality, it is important when
considering applications from a SaaS vendor
that appropriate due diligence be applied to
ensure that the functionality meets critical
business needs. It is important for any
corporate client to choose its SaaS vendor
carefully; not all are created equal. As
this domain is still maturing, one
should make sure to select a vendor that
brings experience, financial stability, and a
good reputation for working effectively with
the company's professional services,
thereby assuring the client of business
benefits, scalable growth, and business
continuity.
Apple's SIRI
Author
Robert Jesuraj K
M. Tech Computational Linguistics
Govt. Engg College Sreekrishnapuram
What is Siri?
Siri (Speech Interpretation and Recognition Interface) is
an intelligent personal assistant and knowledge navigator which
works as an application for Apple's iOS. The application uses a
natural language user interface to answer questions, make
recommendations, and perform actions by delegating requests to
a set of Web services.
Siri was originally introduced as an iOS application
available in the App Store by Siri, Inc. Siri, Inc. was
acquired by Apple on April 28, 2010. Siri, Inc. had
announced that their software would be available
for BlackBerry and for Android-powered phones,
but all development efforts for non-Apple platforms
were cancelled after the acquisition by Apple.
Siri is now an integral part of iOS 5, and is available
only on the iPhone 4S, launched on October 14,
2011. On November 8, 2011, Apple publicly
announced that it had no plans to support Siri on
any of its older devices.
Using Siri
The app transcribes spoken text and then takes
these commands and routes them to the right web
services. If you try to book a table at a Thai
restaurant ("get me a table at a good Thai
restaurant nearby"), for example, Siri will check
where you are, query Yelp for reviews of nearby
Thai restaurants, show you the options and then
pre-populate a reservation form on OpenTable with
your information. All you have to do is to confirm
Siri's selection.
The software is surprisingly good at translating
voice queries into text. The application works so
well because it is able to recognize the context of
your queries. This kind of semantic analysis is a
very computing intensive problem, so most of the
actual number crunching happens on Siri's servers.
Siri outsources the voice recognition to Nuance and
if you are not comfortable with speaking into your
phone, you can always use a regular text query as
well.
Obviously, Siri won't be able to answer every
query - and sadly the app doesn't use Wolfram
Alpha to give you answers to factual questions
(yet). Should that happen, Siri will just route
your query to a search engine and display the
search results. As the Siri team told us,
however, users tend to learn which queries
work best pretty quickly (just like we learned
how to structure effective queries for Google).
To use the iPhone app, you just have to say
aloud a command like "Book a table for six at
7pm at McDonalds" (I'm sure you're classier
than that, but let's stick with it for now), and
then using speech-recognition technology and
the iPhone's GPS capabilities, your command is
translated and processed by the app,
responding with confirmation of booking—or
lack of availability.
Siri, which has ties with the Stanford Research Institute
and DARPA, has collaborated with OpenTable,
MovieTickets, StubHub, CitySearch and TaxiMagic to
help with bookings and information, which pretty
much wipes out the reason why you'd want to
download any of those services' apps individually.
Siri is all this, and something that could only be
described by the definition of true synergy: "Two or more
things functioning together to produce a result not
independently obtainable". None of the individual
parts are "new", but the combination Siri created has
never really been seen before.
It has been the Holy Grail of computer researchers
to one day create a device that could become
conversational and intelligent in such a way that it
would appear that the dialog is human generated.
Apple Siri can speak Hindi now
When Siri was announced with the iPhone 4S,
everyone thought the device would never
understand the Indian accent let alone be able to
speak Hindi. We were however left bewildered when
we found a video online where Siri responds to
users queries in Hindi!
Siri's support for Hindi comes to us courtesy of Kunal
Kaul. The hack connects Siri to Kunal's Google API
server and interacts in Hindi.
Another interesting aspect of the video is that the
questions are asked in English while the responses
given by Siri are in Hindi, with the Devanagari script
appearing on screen. The fact that the questions are
asked in English has led us to believe that Siri does
not understand questions asked in Hindi.
DARPA Helps Invent The Internet And
Helps Invent Siri
With Siri, Apple is using the results of over 40
years of research funded by DARPA
(http://www.darpa.mil/) via SRI
International's Artificial Intelligence Center
(http://www.ai.sri.com/; Siri Inc. was a spin-off
of SRI International), through the Personalized
Assistant that Learns (PAL,
https://pal.sri.com) and Cognitive Assistant that
Learns and Organizes (CALO) programs.
This includes the combined work from research
teams from Carnegie Mellon University, the
University of Massachusetts, the University of
Rochester, the Institute for Human and
Machine Cognition, Oregon State University,
the University of Southern California, and
Stanford University. This technology has come
a very long way with dialog and natural
language understanding, machine learning,
evidential and probabilistic reasoning, ontology
and knowledge representation, planning,
reasoning and service delegation.
Similar applications for hand-held devices
1) S Voice is an intelligent personal assistant
and knowledge navigator which works as an
application for Samsung's Android
smartphones, similar to Apple Inc.'s Siri on the
iPhone. It first appeared on the Samsung
Galaxy S III on May 3, 2012. The application
uses a natural language user interface to
answer questions, make recommendations,
and perform actions by delegating requests to
a set of Web services.
2) Assistant is the codename of a rumored
upcoming Google application that will integrate
voice recognition and a virtual assistant into
Android. It is expected to launch in Q4 of
2012. Before March 2, 2012, the project was
known as "Google Majel", a name that
originated from Majel Barrett-Roddenberry, the
actress best known as the voice of the
Federation Computer in Star Trek.
The software is an evolution of Google's Voice
Actions, which is currently available on most Android
phones, with the addition of natural language processing.
Where Voice Actions required users to issue
specific commands like "send text to…" or
"navigate to…", Assistant will allow users to
perform actions in their natural language.
According to search engineer Mike Cohen, the
Assistant project has three parts: "getting the
world's knowledge into a format a computer can
understand; creating a personalization layer —
experiments like Google +1 and Google+ are
Google's way of gathering data on precisely how
people interact with content; and building a mobile,
voice-centered 'Do engine' ('Assistant') that's less
about returning search results and more about
accomplishing real-life goals".
3) Iris is a personal assistant application for
Android. The application uses natural language
processing to answer questions based on the
user's voice requests. Iris currently supports Call,
Text, Contact Lookup, and Web Search actions,
including playing videos and looking up lyrics,
movie reviews, recipes, news, weather, places
and more. It was developed in 8 hours by Narayan
Babu and his team at Dexetra Software Solutions
Private Limited, a Kochi (India) based firm. The
name is Siri spelled backwards, after the
original application of the same kind built by
Apple Inc.
With the app, an Android user can just "ask"
Iris instead of "Google-searching" for
information. The developers claim Iris can talk
on topics ranging from Philosophy, Culture,
History, science to general conversation.
However, Android users need to have "Voice
Search" and "TTS library" installed in their
phones for Iris to work. Among its features are
voice actions including calling, texting,
searching on the web, and looking for a
contact.
About Whoosh
Whoosh is a fast, featureful full-text
indexing and searching library
implemented in pure Python. Programmers
can use it to easily add search functionality
to their applications and websites. Every
part of how Whoosh works can be
extended or replaced to meet your needs
exactly.
Some of Whoosh's features include:
• Pythonic API.
• Pure-Python: no compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval – faster than any other pure-Python search solution I know of. See Benchmarks.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Pure-Python spell-checker (as far as I know, the only one).
http://packages.python.org/Whoosh/quickstart.html#a-quick-introduction
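The quickstart linked above follows the standard pattern of any full-text library: define a schema, add documents, then run a query against a field. A dependency-free sketch of that pattern (illustrative only, not Whoosh's actual API):

```python
from collections import defaultdict

class TinyIndex:
    """Minimal fielded inverted index, showing the schema/index/search flow."""
    def __init__(self, fields):
        self.fields = fields
        self.postings = defaultdict(set)  # (field, word) -> doc ids

    def add_document(self, doc_id, **values):
        for field in self.fields:
            for word in values.get(field, "").lower().split():
                self.postings[(field, word)].add(doc_id)

    def search(self, field, word):
        return sorted(self.postings.get((field, word.lower()), set()))

ix = TinyIndex(fields=["title", "content"])
ix.add_document("doc1", title="First document", content="hello world")
ix.add_document("doc2", title="Second document", content="hello again")
print(ix.search("content", "hello"))  # ['doc1', 'doc2']
```

Whoosh layers analyzers, scoring and a query language on top of this same core idea.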
Inviting Articles for
CLEAR Dec2012
We are cordially inviting thought-provoking articles, interesting dialogues and healthy
debates on multi-faceted aspects of Computational Linguistics for the second issue of
CLEAR, to be published in Dec 2012. The topics of the articles would preferably be related to
the areas of Natural Language Processing, Computational Linguistics and
Information Retrieval.
Authors are requested to send their articles in doc/odt format to the Editor by email at
[email protected], before 15th November 2012.
-Editor
Thanks To
Principal, Govt. Engg. College Sreekrishnapuram,
Staff and Students, Dept. of CSE, Govt. Engg. College Sreekrishnapuram,
Authors of CLEAR Sep 2012 - Dr. Achuthsankar, Prof. Jathavedan M, Dr. Sudheer S Marar,
Mr. Sajilal D, Dr. Lakshi K, Mr. Arivuchelvan