CLEAR Sep.2012 1
Approximate/Fuzzy String Matching using Mutation Probability Matrices
We consider the approximate/fuzzy string matching problem in the Malayalam language and propose a log-odds scoring matrix for score-based alignment. We report a pilot study designed and conducted to collect statistics about what we have termed the "accepted mutation probabilities" of characters in Malayalam, as they naturally occur. Based on these statistics, we show how a scoring matrix can be produced for Malayalam that can be used effectively in numeric scoring for approximate/fuzzy string matching. Such a scoring matrix would enable search engines to widen search operations in Malayalam. Being a unique first attempt, we point out a large number of areas in which further research and consequent improvement are required. We limit ourselves to a chosen set of consonant characters, and the matrix we report is a prototype for further improvement.
Authors:
Dr. Achuthsankar S Nair, Hon. Director, Centre for Bioinformatics, University of Kerala
Sajilal Divakaran, FTMS School of Computing, Kuala Lumpur
Linguistic computing issues in non-English languages are generally addressed with less depth and breadth, especially for languages with a small user base. Malayalam, one such language, is one of the four major Dravidian languages, with a rich literary tradition. The native language of the South Indian state of Kerala and of the Lakshadweep Islands off the west coast of India, Malayalam is spoken by 4% of India's population. While Malayalam is fairly well integrated with computers, with a user base that may not generate huge market interest, such fine issues of language computing for Malayalam remain unaddressed and unattended.
If we were to search Google for information on the senior author of this paper, Achuthsankar, and gave the query as Achutsankar or Achudhsankar, in both cases Google would land us correctly on the official web page of the author. This "Did you mean" feature of Google is managed by google-diff-match-patch [4]. The match part of the algorithm uses a technique known as approximate string matching or fuzzy pattern matching [10]. Close/fuzzy matching of any query received by the search engine is routine and obvious to the English-language user. However, when a non-English language such as Malayalam is used to query Google, the same facility is not seen in action.
When the word Pathinaayiram (the Malayalam word for the number ten thousand) is used as a query in Google Malayalam search, we are directed to documents that contain a similar word (Payinaayiaram, a common mispronunciation of the original word) but not the word itself. This is because approximate/fuzzy string matching has not been addressed in Malayalam. In this paper we make preliminary attempts toward addressing this very special issue of approximate/fuzzy string matching in Malayalam.
Approximate/Fuzzy String Matching
The field described as approximate or fuzzy string matching in computer science has been firmly established since the 1980s. Hall and Dowling [5] define the approximate string matching problem as follows: given a string s drawn from some set S of possible strings (the set of all strings composed of symbols drawn from some alphabet A), find a string t which approximately matches this string, where t is in a subset T of S. The task is either to find all those strings in T that are "sufficiently like" s, or the N strings in T that are "most like" s. One of the important requirements for analyzing similarity is to have a scientifically derived measure of similarity. The soundex system of Odell and Russell [13] is perhaps one of the earliest attempts to use such a measure. It uses a soundex code of one letter and three digits.
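As an illustration, the Soundex scheme just described (one letter followed by three digits) can be sketched in a few lines of Python. This is a simplified rendering of the standard algorithm, not code from the original system:

```python
def soundex(word):
    """Encode a word as one letter plus three digits (standard Soundex)."""
    codes = {}
    for chars, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for c in chars:
            codes[c] = digit
    word = word.lower()
    first = word[0].upper()
    prev = codes.get(word[0], "")
    digits = []
    for c in word[1:]:
        d = codes.get(c, "")
        if d and d != prev:
            digits.append(d)
        if c not in "hw":  # 'h' and 'w' do not separate a run of equal codes
            prev = d
    return (first + "".join(digits) + "000")[:4]
```

Notably, the query variants mentioned earlier, Achutsankar and Achuthsankar, encode to the same Soundex code, which is exactly the behaviour that makes such codes useful in practice.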
These codes have been used successfully in hospital databases and airline reservation systems [8]. The Damerau-Levenshtein metric [2] measures the smallest number of operations (insertions, deletions, substitutions, or reversals) needed to change one string into another. This metric can be used with standard optimization techniques [14] to derive an optimal score for each string match and thereby rank matches in order of closeness. Approximate or fuzzy string matching is in vogue not only for natural languages but also for artificial languages. In fact, approximate string matching has been developed into a fine art in computational sciences such as bioinformatics, which deals mainly with bio-sequences derived from DNA, RNA, and amino acid sequences [9]. Dynamic programming algorithms (the Needleman-Wunsch and Smith-Waterman algorithms) [11], which enable fast approximate string matching using carefully crafted scoring matrices, are in great use in bioinformatics. The equivalent of Google for the modern biologist is the Basic Local Alignment Search Tool (BLAST) [1], which uses scoring matrices such as the Point Accepted Mutation (PAM) matrices [3] and the BLOcks of Amino Acid SUbstitution Matrix (BLOSUM) [6]. To the best of the authors' knowledge, such a scoring system does not exist for any natural language, including English. Recently, an attempt has been made in this direction for the English language [7]. The statistics for accepted mutations in English were cleverly derived from already-designed Google searches. In the case of Malayalam, statistics of character mutations are not easily derivable from any corpus, existing search engine, or other language computing tool. Hence, data needs to be generated in order to proceed with the development of a scoring matrix system. We will now describe the generation of primary data on natural mutation in Malayalam.
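The Damerau-Levenshtein metric described above can be computed with a standard dynamic program. The sketch below is the optimal-string-alignment variant, counting insertions, deletions, substitutions and adjacent transpositions (reversals):

```python
def damerau_levenshtein(s, t):
    """Smallest number of insertions, deletions, substitutions, or
    adjacent transpositions needed to turn s into t."""
    m, n = len(s), len(t)
    # d[i][j] = distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

For instance, the misspelling Achutsankar is a single edit away from Achuthsankar, so candidate matches can be ranked by this distance.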
Occurrence and Mutation Probabilities
Malayalam has a set of 51 characters, and basic statistics of their occurrence and mutation are required for developing a scoring matrix. The occurrence probabilities are available, derived from corpora of considerable size in 1971 and again in 2003 [12]. We describe here only a subset of characters in view of economy of space. In Table 1, we give the probabilities of one set of consonants, which we have extracted from a small test corpus of Malayalam text derived from periodicals.

We then designed and conducted a study to extract the character mutation probabilities. We selected 150 words that cover all the chosen consonant characters. A dictation was administered among a small group of school children (N=30). The observed mistakes (natural mutations) are tabulated in Table 2 as probabilities. It is noted that a sample size of N=30 is inadequate for a linguistic study of this kind. However, as already highlighted, this paper reports a pilot study to demonstrate proof of concept. Moreover, the sample size can be made larger once the research community vets the approach put forward by us.

Log-odds Scoring Matrix
It is possible to use Table 2 itself for scoring string matches. However, it might be unwieldy in practice. For long strings we would need to multiply probabilities, which might result in numeric underflow. Hence, we use a logarithmic transformation. We also convert from probability to odds. The odds can be defined as the ratio of the probability of occurrence of an event to the probability that it does not occur: if the probability of an event is p, then the odds are p/(1-p). We will, however, not use this formula directly, but define the score for any given match i-j as:

S_ij = 10 log (p_ij / p_j)

In the above equation, p_ij is the probability that character i mutates to character j, and p_j is the probability of natural occurrence of character j. The score thus rewards or penalizes a mutation relative to how frequently the target character occurs naturally. The multiplier 10 is used just to bring the scores to a convenient range. Table 3 shows the log-odds scores thus derived, using the occurrence and mutation probabilities given in Tables 1 and 2. These can be used to score approximate matches and select the most similar one.
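The log-odds computation can be sketched as follows. The probability values below are invented placeholders standing in for the real figures in Tables 1 and 2, and the transliterations ka, kha, ga stand in for the Malayalam characters:

```python
import math

# Hypothetical occurrence probabilities p_j (placeholders for Table 1).
occurrence = {"ka": 0.30, "kha": 0.10, "ga": 0.60}

# Hypothetical mutation probabilities p_ij: row i mutates to column j
# (placeholders for Table 2).
mutation = {
    "ka":  {"ka": 0.90, "kha": 0.07, "ga": 0.03},
    "kha": {"ka": 0.20, "kha": 0.75, "ga": 0.05},
    "ga":  {"ka": 0.05, "kha": 0.05, "ga": 0.90},
}

def log_odds_matrix(mutation, occurrence):
    """S_ij = 10 * log10(p_ij / p_j), rounded to the nearest integer."""
    return {i: {j: round(10 * math.log10(p_ij / occurrence[j]))
                for j, p_ij in row.items()}
            for i, row in mutation.items()}

def score(chars_a, chars_b, S):
    """Score two equal-length character sequences by summing per-position log-odds."""
    return sum(S[a][b] for a, b in zip(chars_a, chars_b))

S = log_odds_matrix(mutation, occurrence)
```

With these placeholder numbers, identity matches score positive while cross-character mutations score negative, so summing the per-character scores ranks the exact match above any approximate one, as in Table 4.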
Results, Discussions, and Conclusion
The prototype scoring matrix we have designed above can be demonstrated to be capable of scoring approximate matches, and can therefore serve as a means of selecting the closest match. We demonstrate this with an example of scoring four approximate matches for a chosen word. Table 4 lists the scores for the four different matches; the exact match scores best. The next best match as per the new scoring scheme is കക.
Our demonstration has been on a chosen set of consonant characters, but it can be expanded to cover all Malayalam characters. For demonstrating more general words, a scoring matrix for vowels is essential. We have computed the same and will report it in a forthcoming publication. During our studies, we also noticed that the conventional grouping of characters may not suit our studies. For example, we found that one character can, very rarely, be a possible mutation of another, even though the two are not grouped together conventionally. A regrouping based on natural mutations is work we see as requiring attention.
To the best of our knowledge, our work is a unique proposition for the Malayalam language, which can be incorporated into Malayalam search engines. We would like to reiterate that our work is at the prototype stage. Neither the sample size of the corpus nor the number of subjects in the survey is substantial. The authors hope to expand the work with a sizable database from which statistics can be extracted, so that the scoring matrix can be made more reliable. We also propose to validate the scoring approach with sample trials involving language experts.
References
[1] Altschul, S F, et al. (1990). "Basic local alignment search tool", Journal of Molecular Biology, 215(3), 403-410.
[2] Damerau, F J (1964). "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7(3), 171-176.
[3] Dayhoff, M O, et al. (1978). "A model of evolutionary change in proteins", Atlas of Protein Sequence and Structure, 5(3), 345-358.
[4] google-diff-match-patch, [Online]. Available: http://code.google.com/p/google-diff-match-patch/, Accessed on 20 Jan. 2012.
[5] Hall, P A V and Dowling, G R (1980). "Approximate string matching", ACM Computing Surveys, 12(4), 381-402.
[6] Henikoff, S and Henikoff, J G (1992). "Amino acid substitution matrices from protein blocks", Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919.
[7] Kanitha, D (2011). "A scoring matrix for English", MPhil Dissertation in Computational Linguistics, Dept. of Linguistics, University of Kerala.
[8] Leon, D (1962). "Retrieval of misspelled names in an airlines passenger record system", Communications of the ACM, 5, 169-171.
[9] Nair, A S (2007). "Computational Biology & Bioinformatics: A Gentle Overview", Communications of the Computer Society of India, 31(1), 1-13.
[10] Navarro, G (2001). "A guided tour to approximate string matching", ACM Computing Surveys, 33(1), 31-88.
[11] Needleman, S B and Wunsch, C D (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins", Journal of Molecular Biology, 48(3), 443-453.
[12] Prema, S (2004). "Report of Study on Malayalam Frequency Count", Dept. of Linguistics, University of Kerala.
[13] Soundex, [Online]. Available: http://en.wikipedia.org/wiki/Soundex, Accessed on 2 Dec. 2011.
[14] Wagner, R A and Fischer, M J (1974). "The string-to-string correction problem", Journal of the ACM, 21(1), 168-178.
This article was published in CSI in May 2012 and is reused here with the author's permission.
Author:
M. Jathavedan, Emeritus Professor, Department of Computer Applications, CUSAT, Cochin
INDIAN SEMANTICS AND NATURAL LANGUAGE PROCESSING
The history of modern linguistics is sometimes chronologically divided into two eras: BC (Before Chomsky) and AD (After Dissertation). Here "dissertation" means the thesis which Chomsky submitted to the University of Pennsylvania for his doctorate. His ideas are considered epoch-making, comparable to Darwin's theory of evolution, and, like Darwin's, took time to gain recognition. Chomsky therefore published them himself as 'Syntactic Structures'.
Paninian grammar was introduced to modern linguistics as a forerunner of Chomsky's generative grammar, presented in the above book. 'Many linguists, foreign and Indian, joined the bandwagon and posed as experts in Paninian grammar in Chomskian terms' (Joshy S.D.). The renewed interest influenced the interpretation of Paninian grammar itself as generative grammar, that is, the idea that grammar consists of modules in a hierarchy of levels. The first contribution in this direction was due to Kiparsky and Staal (1969), who proposed a hierarchy of four levels of representation. This was criticized by Hauben (2002) because it did not permit semantic factors. Other important contributions are due to Cardona (1976). Joshy continues: 'Somewhat later Chomsky drastically revised his ideas, and after the enthusiasm for Chomsky subsided, it became clear that the idea of transformation is alien to Panini. Now a new type of linguistics has come up, called Sanskrit Computational Linguistics with three capital letters. Although Chomsky is out, Panini is still there, ready to be acclaimed as the forerunner of SCL.' But SCL was identified as a branch of study only in 2007, and there were other factors that led to its formation.
In a paper entitled 'Knowledge Representation in Sanskrit and Artificial Intelligence', the NASA scientist Rick Briggs drew the attention of computer scientists to the works on semantics in Sanskrit literature rather than to Paninium. The important fact to note is that he was referring to the 'Vaiyakarana Siddhanta Laghu Manjusha' of Bhatta Nagesa (1730-1810), perhaps the last Sanskrit scholar in the Indian tradition.
This paper, rightly or wrongly, aroused great enthusiasm among Sanskrit scholars. Some of them even went to the extent of claiming that the future direction of research in artificial intelligence would be decided by Sanskrit. The immediate result was the 'First Seminar on Knowledge Representation and Samskritam' (1986) held at Bangalore, in which Briggs presented a paper entitled 'Sastric Sanskrit: An Inter-lingua for Machine Translation'.
Thus computational Sanskrit emerged as a new branch of research. Apart from computer-assisted teaching and research of Sanskrit (as for any other subject), automated reconstruction of Sanskrit texts, machine-aided translation (MAT), the design of a working system of the Paninian grammatical framework for machine translation (especially for Indian languages), and its possible applications in cognitive science and AI are some areas of active research in the Sanskrit departments of many universities and the computer science departments of many institutes.
It is a surprising fact that we are not able to locate any further contribution by Briggs in this field. Further, comments continue to pour in on the internet for and against the arguments put forward by Briggs. Another point to be noted is that the author of the paper is Briggs in person, and not NASA, as wrongly conceived by many.
A question that naturally arose was the role of Sanskrit as a dedicated programming language, which would mean the development of a compiler accepting Sanskrit instructions. C-DAC, Bangalore had initiated some work in this direction in the early 1990s itself. It was claimed that the Astadhyayi (Paninium) was useful in this matter, i.e., that the meta-rule, meta-language and linguistic-marker system of Panini could be used to draw up the specification and requirements of such a processor. To what extent this search has been successful after twenty years remains a question.
The International Symposia on Sanskrit Computational Linguistics (SCL) were the result of the attempt to provide a common platform for traditional Sanskrit scholars and computational linguists. They were a culmination of the World Sanskrit Conferences, especially the thirteenth one held at Edinburgh, and the First National Symposium on Modeling and Shallow Parsing of Indian Languages in Mumbai, both held in the year 2006. The first Symposium was held in France in 2007 and the most recent at Jawaharlal Nehru University, New Delhi (2010).
LINGUISTICS AND PHILOSOPHY
Linguistics is considered a part of philosophy in India. It is often said that 'the grammatical method of Panini is as fundamental to Indian thought as the geometrical method of Euclid is to Western thought.'
Semantics in Sanskrit was never the well-defined domain of a separate discipline (Hauben). Rather, it remained a battlefield for exegetes, logicians and grammarians of various backgrounds and philosophical commitments. It was only a few centuries after Bhartrhari (4th century A.D.) that a sophisticated specialized language and terminology were developed for discussing semantic problems and theories of verbal understanding. Thus, during the period from the thirteenth to the sixteenth centuries, semantic issues were seriously taken up for discussion between different philosophical schools, not only focussing on language but also from a religious point of view.
The formal categories in their discussions were mainly those established in Paninium and investigated semantically and philosophically by Bhartrhari. We will consider two or three of them.
As an example we consider the sentence:
'Rama cooks rice'
In the subdivision of a sentence into words, the grammarians take the verb as central. Other words are related to this meaning-bearing word in one way or another. Kriya is the action of the verb in the sentence. The other words, which are "factors in the action" of the verb, are called karakas. Panini has defined six karakas.
For the sentence in our example, the grammarians may give the following analytical description:
"It is the activity of cooking, taking place in the present time, having an agent which is identical with Rama, having an object identical with rice."
Thus the sentence is split into elements such as stem, root, affix and ending, with the attribution of a well-defined meaning to each linguistic element. The central element in this analysis is the meaning expressed by the verb 'cooks', or, to be more precise, the meaning of the verb root 'to cook' (pac). The verbal form (in Sanskrit the verbal ending ti in pa(ca)ti) indicates that the activity takes place in the present time. The agent of the action is expressed by the grammatical subject, Rama; the object of the action is the grammatical object, rice.
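The grammarians' analysis above lends itself naturally to a structured representation. The sketch below is purely illustrative: the karaka labels karta (agent) and karma (object) are standard terms not introduced in the text, and the analysis is written by hand, not produced by any parser:

```python
# Hand-written karaka-style analysis of "Rama cooks rice".
analysis = {
    "kriya": "cook",        # action expressed by the verb root (pac)
    "tense": "present",     # signalled by the verbal ending (ti in pacati)
    "karakas": {
        "karta": "Rama",    # agent of the action (grammatical subject)
        "karma": "rice",    # object of the action (grammatical object)
    },
}
```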
For the Mimamsa thinkers also, the verb is the central element in a sentence. While grammarians take the verbal root and the activity expressed by it as more important than the verbal ending and its meaning, the latter are more important for the Mimamsakas. According to them, the basic meaning of all verbs is a creative urge which stimulates action. This basic urge is expressed, i.e., transmitted to the listener, by the verbal ending, not by the verbal root, which merely qualifies this creative urge. Thus according to them the sentence in our example can be given the following structural description:
"It is the creative urge which is conducive to cooking, taking place in the present time, having the same substratum as the agent residing in Rama, having as object rice."
Now for the Nyaya school, it is not the verb which is the central element in the sentence but, generally, the noun in the first ending (nominative). Thus the structure of the verbal knowledge in our example, according to them, is:
"It is Rama who possesses the volitional effort conducive to cooking, which produces the softening and moistening which is based in rice."
Underlying all these descriptions is the presupposition that the main structural relation in the sentence is that between the qualifier and the thing to be qualified (visesana/visesya). Unlike the grammarians and Mimamsakas, for whom the visesya is the verb, for the Nyaya thinkers the visesya is the noun in the first ending.
SANSKRIT COMPUTATIONAL LINGUISTICS
I have already quoted S.D. Joshy. The sentences were from his paper 'Background of the Astadhyayi', read at the Third International Symposium on Sanskrit Computational Linguistics held in 2009 at Hyderabad. He continues: 'Contrary to some western misconceptions, the starting point of Panini's analysis is not meaning or the intention of the speaker, but words formed from elements. Panini starts from morphology to arrive at a finished word.' But 'he developed a number of theoretical concepts which can be applied to other languages also.'
Coming back to Briggs, we note that, in contrast to other works, his paper for the first time drew the attention of computer scientists to the semantic theories available in Sanskrit. Since it is meaning that is important in a sentence, syntax is developed to tackle the semantic problem. But centuries elapsed before Bhartrhari (4th century A.D.) developed his sphota theory after Panini (4th century B.C.). Again centuries elapsed before Bhatta Nagesa brought the sphota theory to completion in the eighteenth century. The later development of linguistics can be considered a continuation of this.
There are four factors involved in proper cognition: expectancy, mutual compatibility, proximity and the intention of the speaker. It is difficult to include the last one in any syntactic solution. According to Bhartrhari, a speaker can seldom communicate through words all that he intended, and the hearer understands more, or at times less, than what he hears! Thus there is mutual dependency between Indian theories of syntax and semantics. It is said that the Indian linguists of the fifth century B.C. knew more of the subject than Western linguists of the nineteenth century A.D. Further, if there is any area where the ancient Sanskrit scholars have been much ahead of modern developments, it is in the field of semantics and systems of knowledge representation.
REFERENCES:
1. Briggs, Rick, 1985, Knowledge representation in Sanskrit and artificial intelligence, The AI Magazine.
2. Briggs, Rick, 1986, Shastric Sanskrit: an interlingua for machine translation, First National Conference on Knowledge Representation, Bangalore.
3. Chomsky, N, 1957, Syntactic Structures, The Hague, Mouton.
4. Cardona, George, 1976, Panini: A Survey of Research, The Hague, Mouton.
5. Kiparsky, Paul and Staal, J.F., 1969, Syntactic and semantic relations in Panini, FL 5.
6. Hauben, E.M., 2002, Semantics in the Sanskrit tradition on the eve of colonialism, Project report, Leiden University.
7. Joshy, S.D., 2009, Background of the Astadhyayi, Third International Symposium on Sanskrit Computational Linguistics, Hyderabad.
Overview of Question Answering System
Interaction between humans and computers is one of the most active areas of research in the modern world. In particular, interaction through natural language has become increasingly popular. Natural Language Processing is a computational technique for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, in order to achieve human-like language processing for a wide range of applications. One of the most powerful applications of NLP is the Question Answering (QA) system. The need for automated question answering systems has become more urgent due to the enormous growth of digital information in text form. A QA system involves the analysis of both questions and answers. In this overview, we focus on question type classification, question generation, and answer generation for both closed and open domains.
Introduction
Research in Natural Language Processing [1] has been going on for several decades, dating back to the late 1940s. The goal of NLP is to accomplish human-like language processing. The disciplines underlying the practice of NLP are: Linguistics, which focuses on formal, structural models of language and the discovery of language universals (in fact, the field of NLP was originally referred to as Computational Linguistics); Computer Science, which is concerned with developing internal representations of data and efficient processing of these structures; and Cognitive Psychology, which looks at language usage as a window into human cognitive processes and has the goal of modelling the use of language in a psychologically plausible way.
The most explanatory method for presenting what actually happens within a Natural Language Processing system is by means of the 'levels of language' approach. Phonology concerns how words are related to the sounds that realize them. Morphology concerns how words are constructed from more basic meaning units called morphemes; a morpheme is the primitive unit of meaning in a language. The syntax level concerns how words can be put together to form correct sentences, and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. The semantic level concerns what words mean and how these meanings combine in sentences to form sentence meanings. The pragmatic level concerns how sentences are used in different situations and how use affects the interpretation of the sentence. The discourse level concerns how the immediately preceding sentences affect the interpretation of the next sentence. World knowledge includes the general knowledge about the structure of the world that language users must have in order to maintain a conversation.
Natural language processing is used for a wide range of applications. The most frequent applications utilizing NLP include Information Retrieval (IR), Information Extraction (IE), Question Answering, Summarization, Machine Translation, and Dialogue Systems. In this paper we focus on Question Answering.
Question Answering can be performed in two domains: closed and open. Closed-domain question answering [4] deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Alternatively, closed-domain might refer to a situation where only a limited type of questions is accepted, such as questions asking for descriptive rather than procedural information. Open-domain question answering [4] deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Authors:
K.M. Arivuchelvan, Research Scholar, Periyar Maniammai University.
K. Lakshmi, Professor, Periyar Maniammai University.
Question Answering [5] is a specialized form of information retrieval. Given a collection of documents, a Question Answering system attempts to retrieve correct answers to questions posed in natural language. Open-domain question answering requires QA systems to be able to answer questions about any conceivable topic. Such systems cannot, therefore, rely on hand-crafted domain-specific knowledge to find and extract the correct answers.
Question Classification
Question Classification [2] is an important task in Question Answering. The best-known question taxonomy is the one proposed by Graesser and Person (1994), based on their two studies of human tutors' and students' questions during tutoring sessions in a college research-methods course and a middle-school algebra course. Six trained human judges coded the questions in the transcripts obtained from the tutoring sessions on four dimensions: question identification; degree of specification (e.g., a high degree means the question contains more words that refer to the elements of the desired information); question-content category; and question-generation mechanism (the reasons for generating questions include a knowledge deficit in the learner's own knowledge base, common ground between dialogue participants, social actions among dialogue participants, and conversation control). They defined the following 18 question categories according to the content of the information sought, rather than the interrogative words (i.e., why, how, where, etc.):
1. Interpretation: What does X mean?
2. Causal antecedent: Why/how did X occur?
3. Causal consequence: What next? What if?
4. Goal orientation: Why did an agent do X?
5. Instrumental/procedural: How did an agent do X?
6. Enablement: What enabled X to occur?
7. Expectation: Why didn't X occur?
8. Judgmental: What do you think of X?
9. Assertion
10. Request/Directive
11. Verification: invites a yes or no answer.
12. Disjunctive: Is X, Y, or Z the case?
13. Concept completion: Who? What? When? Where?
14. Example: What is an example of X?
15. Feature specification: What are the properties of X?
16. Quantification: How much? How many?
17. Instrumental/procedural: How did an agent do X?
18. Comparison: How is X similar to Y?
After analyzing 5,117 questions in the research-methods sample and 3,174 questions in the algebra sample, they found four frequent question categories: verification, instrumental/procedural, concept completion, and quantification.
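Graesser and Person coded questions by the content of the information sought rather than by interrogative words. Nevertheless, a crude interrogative-word heuristic is a common first pass in QA systems; the keyword lists below are our own illustrative choices, covering only a few of the eighteen categories:

```python
def classify_question(question):
    """Naive first-pass guess at a question category from its opening words."""
    q = question.lower().strip()
    if q.startswith(("is ", "are ", "do ", "does ", "did ", "can ", "was ", "were ")):
        return "verification"
    if q.startswith(("how many", "how much")):
        return "quantification"
    if q.startswith("how"):
        return "instrumental/procedural"
    if q.startswith(("who", "what", "when", "where")):
        return "concept completion"
    if q.startswith("why"):
        return "causal antecedent / goal orientation"
    return "unclassified"
```

A real classifier would, of course, need syntactic and semantic analysis to separate, say, a request from an assertion; keyword matching cannot make such content-based distinctions.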
Question Generation (QG)
For the first time in history [], a person can ask a question on the web and receive answers in a few seconds. Twenty years ago it would have taken hours or weeks to receive answers to the same questions, as a person hunted through documents in a library. In the future, electronic textbooks and information sources will be mainstream, and they will be accompanied by sophisticated question asking and answering facilities.
Applications of automated QG facilities are endless and far-reaching. Below is a small sample, some of which are addressed in this report:
1. Suggested good questions that learners
might ask while reading documents and
other media.
2. Questions that human and computer
tutors might ask to promote and assess
deeper learning.
3. Suggested questions for patients and
caretakers in medicine.
4. Suggested questions that might be asked in
legal contexts by litigants or in security
contexts by interrogators.
5. Questions automatically generated from
information repositories as candidates for
Frequently Asked Question (FAQ) facilities.
The time is ripe for a coordinated effort to tackle QG in the field of computational linguistics and to launch a multi-year campaign of shared tasks in Question Generation. We can build on the disciplinary and interdisciplinary work on QG that has been evolving in the fields of education, the social sciences and computer science. A QG system operates directly on the input text, executes implemented QG algorithms, and consults relevant information sources. Very often there are specific goals that constrain the QG system.
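As a toy illustration of a QG algorithm operating directly on input text, the rule below turns a simple 'X is Y.' sentence into a concept-completion question. Real QG systems use far richer syntactic and semantic analysis; the pattern here is our own oversimplification:

```python
import re

def generate_question(sentence):
    """Toy QG rule: 'X is Y.' -> 'What is X?'; returns None otherwise."""
    match = re.match(r"(.+?) is (.+)\.\s*$", sentence)
    if match:
        return f"What is {match.group(1)}?"
    return None
```

For example, the sentence "BLAST is a local alignment search tool." yields the candidate question "What is BLAST?", the kind of output one might feed into an FAQ facility.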
Question Answering
Today's question answering [7] is not limited by the type of document or data repository: it can address both traditional databases and more advanced ones that contain text, images, audio and video. Both structured and unstructured data collections can serve as information sources in question answering. Unstructured data allows querying only of raw features (for example, words in a body of text), without clear semantics attached. Related to this distinction between structured and unstructured data is the traditional distinction between restricted-domain question answering (RDQA) and open-domain question answering (ODQA).
RDQA systems are designed to answer questions
posed by users in a specific domain of competence,
and usually rely on manually constructed data or
knowledge sources. They often target a category of
users who know and use the domain-specific
terminology in their query formulation, as, for
example, in the medical domain. ODQA focuses on
answering questions regardless of the subject
domain. Extracting answers from a large corpus of
textual documents is a typical example of an ODQA
system. Recently, we have witnessed an approach to
question answering involving semi-structured data.
These data often comprise text documents in
which the structure of the document or certain
extracted information is expressed by a markup.
Such markups can be attributed manually (e.g.,
the structure of a document) and/or in an
automatic way, e.g., markups for identified
person and company names and their
relationships in newspaper articles.
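As a rough illustration of such automatic markup (the names and the gazetteer lists below are invented for the example; real systems use trained named-entity recognizers rather than fixed lists), known entity strings can be tagged directly in raw text:

```python
import re

# Toy gazetteers; a real system would use a trained NER model instead.
PERSONS = ["Amit Singhal", "Narayan Babu"]
COMPANIES = ["Google", "Microsoft", "Apple"]

def mark_up(text):
    """Wrap known person and company names in XML-style markup."""
    for name in PERSONS:
        text = re.sub(re.escape(name), f"<person>{name}</person>", text)
    for name in COMPANIES:
        text = re.sub(re.escape(name), f"<company>{name}</company>", text)
    return text

print(mark_up("Amit Singhal works at Google."))
# <person>Amit Singhal</person> works at <company>Google</company>.
```

A question answering system can then query these markups (for example, all `<company>` mentions near a `<person>`) instead of raw words.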
Conclusion
Question answering is a complex task that requires
advances in several research areas, including
question generation, question ranking, question
classification, information retrieval, natural
language processing, database technologies,
Semantic Web technologies, human-computer
interaction, speech processing and computer vision.
REFERENCES
1. Liddy, E. D. In Encyclopaedia of Library and Information Science, 2nd Ed. Marcel Dekker, Inc.
2. Ming Liu, Rafael A. Calvo. "G-Asks: An Intelligent Automatic Question Generation System for Academic Writing Support." Dialogue and Discourse 3(2) (2012) 101–124.
3. Mark Andrew Greenwood. "Open-Domain Question Answering." September 2005.
4. http://en.wikipedia.org/wiki/Question_answering
5. Andrew Lampert. "A Quick Introduction to Question Answering." December 2004.
6. Workshop Report: "The Question Generation Shared Task and Evaluation Challenge." Sponsored by the National Science Foundation.
7. Oleksandr Kolomiyets, Marie-Francine Moens. "A survey on question answering technology from an information retrieval perspective." Information Sciences 181 (2011) 5412–5434.
I-Search....
Future of Search Engines
In this web age, searching, or more precisely
surfing the web, may be a casual phrase in day-to-day
business. Netizens continuously enrich the
web vocabulary with words like "Googling". This
shows how important search engines have become in this
digital era. A web search engine is designed to
search for information on the World Wide Web.
Today's search engines come in two types.
Directory-based engines, like Yahoo, are still built
manually: you decide what your directory categories
are going to be (Business, Health, Entertainment and
so on), then you put a person in charge of each
category, and that person builds up an index of
relevant links. Crawler-based engines, like Google,
employ a software program, called a crawler, that
goes out and follows links, grabs the relevant
information, and brings it back to build your index.
Then you have an index engine that allows you to
retrieve the information in some order, and an
interface that allows you to see it. It's all done
automatically.
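The crawl-then-index pipeline described above can be sketched offline. Here the "web" is a hypothetical in-memory graph of pages and links, invented for the example; a real crawler would fetch pages over HTTP and would need politeness rules, parsing and deduplication:

```python
from collections import deque

# A hypothetical miniature web: page -> (text, outgoing links).
WEB = {
    "home":  ("welcome to the home page", ["news", "about"]),
    "news":  ("breaking news about search engines", ["home"]),
    "about": ("about this site", []),
}

def crawl_and_index(start):
    """Follow links breadth-first and build an inverted index: word -> pages."""
    index, seen, frontier = {}, {start}, deque([start])
    while frontier:
        page = frontier.popleft()
        text, links = WEB[page]
        for word in text.split():
            index.setdefault(word, set()).add(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl_and_index("home")
print(sorted(index["about"]))  # ['about', 'news']
```

The index engine then answers a query by looking up each query word in this mapping; the interface only has to display the resulting pages in some order.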
As the Web continues to grow, however, and to be
more and more important for commerce,
communication, and research, information-retrieval
problems become a more serious handicap. The
percentage of Web content that shows up on search
engines continues to wane. And as search engines
struggle to add more and more content, the
information they provide may be increasingly out-of-
date.
Recent advances in intelligent search suggest that
these limitations can be partially overcome by
providing search engines with more intelligence,
drawing on natural language processing and on the
user's underlying knowledge. A search engine might
also have to understand what the user needs, even
when it is not stated explicitly, and that requires
some knowledge of the user. These ideas led to the
birth of a new generation of web technologies,
popularly known as the Semantic Web.
Semantic Search
A semantic search engine attempts to make
sense of search results based on context. It
automatically identifies the concepts structuring
the texts. For instance, if you search for
"election", a semantic search engine might
retrieve documents containing the words "vote",
"campaigning" and "ballot", even if the word
"election" is not found in the source document.
Semantic Search systems consider various
points including context of search, location,
intent, variation of words, synonyms,
generalized and specialized queries, concept
matching and natural language queries to
provide relevant search results. Major search
engines like Google and Bing incorporate some
elements of Semantic Search. The objective of
this article is to discuss the recent advances in
the area of Semantic Search.
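The "election" example above can be mimicked with simple query expansion over a hand-made concept map. The synonym table and documents below are invented for illustration; real semantic engines derive such relations from ontologies or corpus statistics rather than a fixed dictionary:

```python
# Hypothetical concept map linking a query term to related words.
CONCEPTS = {"election": {"election", "vote", "ballot", "campaigning"}}

DOCS = {
    "d1": "citizens cast their vote at the ballot box",
    "d2": "the recipe calls for two eggs",
}

def semantic_search(query):
    """Return documents matching the query term or any related concept."""
    terms = CONCEPTS.get(query, {query})
    return [doc_id for doc_id, text in DOCS.items()
            if terms & set(text.split())]

print(semantic_search("election"))  # ['d1'] even though 'election' never appears
```

Document d1 is retrieved because it contains "vote" and "ballot", which the concept map links to "election".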
Google's Knowledge Graph:
Google usually returns the search results for any
query based on the text and the content. To put
it plainly, it does not understand the exact
meaning of the words: it matches the keywords
of the query with those of the sites and returns
pages that have significant authority on those
words.
Amit Singhal, Google's senior VP of engineering,
said [1]: "The introduction of Knowledge Graph
enables Google to understand whether a search
for 'Mars' refers to the planet or the
confectionary manufacturer. 'Search is a lot
about discovery' – the basic human need to
learn and broaden your horizons."
Author
Manu Madhavan M. Tech Computational Linguistics
Govt. Engg. College, Sreekrishnapuram [email protected]
Bing's Semantic Search
Microsoft specifically brands Bing as a "decision
engine", and not as a general-purpose search
engine (even though it provides that functionality
as well), in order to differentiate it from Google
Search. Bing's search is based on semantic
technology from Powerset, which was acquired by
Microsoft in 2008. Notable changes include the
listing of search suggestions as queries are
entered and a list of related searches (called the
"Explore pane"). Bing features semantic
capabilities like presenting more readable
captions based on linguistic and semantic
analysis of content. The concept of entity
extraction is leveraged in Bing, providing
knowledge on phrases and what they uniquely
refer to. [2]
Bing's new product, Adaptive Search, strives to
capitalize on semantic search technology.
Adaptive Search takes your user behaviour into
consideration, then tailors your Bing results to be
most appropriate. So if you have searched for a
word and then clicked on a specific site previously,
Bing will predict that what you are searching for
likely falls into the context of that site, and thus
it can provide you with results that are more
tailored. [5]
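A toy version of such behaviour-based tailoring might re-rank results by how often the user clicked each site before. The sites and click counts below are invented for the example; Bing's actual personalization signals are far richer:

```python
# Hypothetical click history: site -> number of past clicks by this user.
CLICKS = {"python.org": 5, "example.com": 0, "docs.python.org": 2}

def adaptive_rank(results):
    """Stable re-rank: previously clicked sites float toward the top."""
    return sorted(results, key=lambda site: -CLICKS.get(site, 0))

results = ["example.com", "docs.python.org", "python.org"]
print(adaptive_rank(results))
# ['python.org', 'docs.python.org', 'example.com']
```

Because the sort is stable, sites the user has never clicked keep their original relative order.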
Powerset
Powerset is a Microsoft-owned company building
a transformative consumer search engine based
on natural language processing. Their unique
innovations in search are rooted in breakthrough
technologies that take advantage of the structure
and nuances of natural language. Using these
advanced techniques, Powerset is building a
large-scale search engine that breaks the
confines of keyword search.
By making search more natural and intuitive,
Powerset is fundamentally changing how we
search the web, and delivering higher quality
results. [3]
Hakia:
Hakia is a general-purpose semantic search
engine that searches structured corpora (text)
like Wikipedia. For some queries (typically
popular queries and queries where there is
little ambiguity), Hakia produces resumes.
These are portals to all kinds of information on
the subject. Every resume has an index of
links to the information presented on the page
for quick reference. Often, Hakia will propose
related queries, which is also great for
research. [3]
Cognition
Cognition has a search business based on a
semantic map, built over the past 24 years,
which the company claims is the most
comprehensive and complete map of the
English language available today. It is used in
support of business analytics, machine
translation, document search, context search,
and much more. [3]
Swoogle:
Swoogle, the Semantic web search engine, is a
research project carried out by the ubiquity
research group in the Computer Science and
Electrical Engineering Department at the
University of Maryland. It‘s an engine tailored
towards finding documents on the semantic
web.
Swoogle is capable of searching over 10,000 ontologies and indexes more
than 1.3 million web documents. It also computes the importance of a
Semantic Web document. The indexing techniques include Google-style
page ranking as well as mining the documents for the inter-relationships
that are the basis of the Semantic Web. [4]
Conclusion
NLP is a complex area of research, requiring a solid understanding of
grammars (not just grammar) and a good grounding in computational
linguistics (in order to apply the techniques to machines, which is not always
easy). Understanding the techniques used in NLP allows us to provide the
best format and patterns for the search engine. Seeing as NLP seeks to
mimic human language understanding, using common sense is a good
idea. But before any broader, more sophisticated sort of intelligence can
be placed into a machine, we humans will have to get a better grasp on
just what intelligence is.
References:
1. http://mashable.com
2. http://semanticweb.com
3. http://thenextweb.com
4. http://web2innovations.com
5. http://blogs.wsj.com
Google synonyms and natural language processing
Google just blogged about synonyms as they relate to searcher intent. They provide several examples of how a
concept as simple as a synonym complicates natural language processing. This also brings up some important
recommendations for site owners with respect to SEO.
Prospective customers type in all kinds of variations on your most obvious keywords (hence the need for keyword
research). Often they make use of synonyms, some common, some not. These variations often represent less
competitive opportunities for high search engine rankings if you can incorporate those synonyms into your
website. In particular:
• Use common variations within your existing copy rather than using the same phrase repeatedly. (This
also tends to make long blocks of text more readable.)
• Develop pages that specifically focus on each of the most common and valuable synonyms.
• If there are enough synonyms and industry-specific terms, consider developing a glossary of terms.
• Find opportunities to talk about the synonyms, such as a blog post or article that discusses how
synonyms may actually be somewhat different or whose similarity is up for debate (e.g. SEM vs. Search
Engine Advertising).
http://www.web1marketing.com
PyLucene
PyLucene is a GCJ-compiled
version of Java Lucene
integrated with Python. Its
goal is to allow you to use
Lucene's text indexing and
searching capabilities from
Python.
Remolding Professional sectors: the SaaS way..
SaaS : Purpose and Functions
The cost and time-to-market benefits of outsourcing business
services like payroll, storage space, Customer Relationship
Management (CRM) applications, and company websites have
been proven for many businesses. The most recent term for
these types of outsourced services is Software as a
Service (SaaS).
Author
Dr. Sudheer S Marar
MCA MBA PhD Associate Professor and
HOD, Department of MCA Nehru College of Engineering
and Research Centre
Introducing new technology is an expensive undertaking,
usually requiring high capital outlays and many months of
training, installation and integration before service
can be delivered on the network. Outsourcing these services to
organizations that are experts in the technology lowers costs,
increases uptime, accelerates revenue realization and provides
increased flexibility and functionality.
Due to these results, hosting for these critical business
functions continues to grow, and many companies are looking
for similar opportunities in other operational areas.
Effects of Downturn
As stated in the Movius Corporation annual report, the economic
downturn has globally forced many companies to reduce
spending across the board. This has put companies that are in
highly competitive and innovation-driven industries, such as
telecommunications, in an exigent balancing act. While they
need to try to control expenses, if they do not also continue
to introduce the latest applications and services, they will
quickly begin to lose market share.
The ideal situation for a carrier would be to introduce
new services almost instantly without risking precious in-hand
capital. Under the best possible scenario, the carrier could
begin generating revenue within weeks of making
the decision to launch a new service. If the service could be
introduced without the need for additional staff, the
solution is essentially risk free.
Some applications are an
immediate success in the
market while others take time
or, in the worst case, never gain
a toehold in a given market. The
ideal introduction scenario for a
carrier would be to try a new
service in a particular market
without having to make a
significant investment, all the
while gaining key market data.
Therefore, companies today are
faced with the challenges of
controlling equipment and
operating costs, protecting their
current investments and having
the ability to deploy new
applications quickly. To add to
the challenge, many carriers
are faced with older application
platforms that are limited in
capability and potentially
approaching end of life. These
companies need cost-effective
solutions that permit the
conversion from legacy
networks to IP infrastructure
without major changes to
network infrastructure or
application design.
Enterprise-level applications
As extracted from a lead article of an IDC-SAP initiated
paper: "...Professional service firms focus their business
management energy on optimizing the utilization of an
expert's or a consultant's time. They attempt to develop
service offerings or skill sets that clients will find
compelling. Ultimately, they focus on properly charging
and receiving payment from clients. Larger firms tend to
broaden their offerings to ensure a greater wallet share,
while smaller firms tend toward key-field focus
and deep industry expertise, hoping to foster continuing
relationships with a small number of clients."
In short, all firms balance developing a talent pipeline
with maximizing utilization rates. Client satisfaction and
trusting relationships drive both repeat business and
referrals in most professional services segments.
Therefore, firms seek to ensure deliverables of the
highest possible quality and strive to fully meet client
expectations throughout the engagement process. Firms
increasingly use technology to support all parts of their
business: Finance and scheduling software are common,
knowledge management and data warehouse capability
help improve service quality, and client management and
engagement management software are increasingly used
to monitor and maximize customer satisfaction. The
increased use of technology has both aided and hindered
professional services firms' efforts to improve their key
value propositions.
Benefits of SaaS
The cost of a complex business management software
implementation is often the starting point for a discussion,
and often a point where the discussion meets a quick
end. In their research, IDC has identified several areas
where SaaS system delivery costs differ from on-premise
delivery costs. Primarily, they are the following:
• License fees, both initial and maintenance.
• Hardware costs.
• IT infrastructure costs.
• Test environment and development maintenance costs.
• IT personnel/support costs.
• Security, backups, and disaster recovery.
The Futuristic.
Clearly SaaS applications are maturing. The
number of companies that either are using
SaaS applications or plan to use SaaS
applications in the next year has grown
considerably over the past few years,
suggesting that the barriers to adoption —
either real or perceived — are being
overcome. We see a bright future for SaaS
across a broad range of application areas
and for large and small professional services
firms.
SaaS is not without its problems, however.
Functionality and security concerns linger,
and while these concerns are more
perception than reality, it is important when
considering applications from a SaaS vendor
that appropriate due diligence be applied to
ensure that the functionality meets critical
business needs. It is important for any
corporate client to choose its SaaS vendor
carefully; not all are created equal. As
this domain is still maturing, one
should make sure to select a vendor that
brings experience, financial stability, and a
good reputation for working effectively with
the company's professional services,
thereby assuring the client of business
benefits, scalable growth, and business
continuity.
Apple's SIRI
Author
Robert Jesuraj K
M. Tech Computational Linguistics
Govt. Engg College Sreekrishnapuram
What is Siri?
Siri (Speech Interpretation and Recognition Interface) is
an intelligent personal assistant and knowledge navigator which
works as an application for Apple's iOS. The application uses a
natural language user interface to answer questions, make
recommendations, and perform actions by delegating requests to
a set of Web services.
Siri was originally introduced as an iOS application
available in the App Store by Siri, Inc. Siri, Inc. was
acquired by Apple on April 28, 2010. Siri, Inc. had
announced that their software would be available
for BlackBerry and for Android-powered phones,
but all development efforts for non-Apple platforms
were cancelled after the acquisition by Apple.
Siri is now an integral part of iOS 5, and is available
only on the iPhone 4S, launched on October 14,
2011. On November 8, 2011, Apple publicly
announced that it had no plans to support Siri on
any of its older devices.
Using Siri
The app transcribes spoken text and then takes
these commands and routes them to the right web
services. If you try to book a table at a Thai
restaurant ("get me a table at a good Thai
restaurant nearby"), for example, Siri will check
where you are, query Yelp for reviews of nearby
Thai restaurants, show you the options and then
pre-populate a reservation form on OpenTable with
your information. All you have to do is to confirm
Siri's selection.
The software is surprisingly good at translating
voice queries into text. The application works so
well because it is able to recognize the context of
your queries. This kind of semantic analysis is a
very computing intensive problem, so most of the
actual number crunching happens on Siri's servers.
Siri outsources the voice recognition to Nuance and
if you are not comfortable with speaking into your
phone, you can always use a regular text query as
well.
Obviously, Siri won't be able to answer every
query - and sadly the app doesn't use Wolfram
Alpha to give you answers to factual questions
(yet). Should that happen, Siri will just route
your query to a search engine and display the
search results. As the Siri team told us,
however, users tend to learn which queries
work best pretty quickly (just like we learned
how to structure effective queries for Google).
To use the iPhone app, you just have to say
aloud a command like "Book a table for six at
7pm at McDonalds" (I'm sure you're classier
than that, but let's stick with it for now), and
then using speech-recognition technology and
the iPhone's GPS capabilities, your command is
translated and processed by the app,
responding with confirmation of booking—or
lack of availability.
Siri, which has ties with the Stanford Research Institute
and DARPA, has collaborated with OpenTable,
MovieTickets, StubHub, CitySearch and TaxiMagic to
help with bookings and information, which pretty
much wipes out the reason why you'd want to
download any of those services' apps individually.
Siri is all this, and something that could only be
described by the definition of true synergy: "Two or more
things functioning together to produce a result not
independently obtainable". None of the individual
parts are "new", but the combination Siri created has
never really been seen before.
It has been the Holy Grail of computer researchers
to one day create a device that could become
conversational and intelligent in such a way that it
would appear that the dialog is human generated.
Apple Siri can speak Hindi now
When Siri was announced with the iPhone 4S,
everyone thought the device would never
understand the Indian accent let alone be able to
speak Hindi. We were however left bewildered when
we found a video online where Siri responds to
users queries in Hindi!
Siri's support for Hindi comes to us courtesy of Kunal
Kaul. The hack connects Siri to Kunal's Google API
server and interacts in Hindi.
Another interesting aspect of the video is that the
questions are asked in English while the responses
given by Siri are in Hindi, with the Devanagari script
appearing on screen. The fact that the questions are
asked in English has led us to believe that Siri does
not understand questions asked in Hindi.
DARPA Helps Invent The Internet And
Helps Invent Siri
With Siri, Apple is using the results of over 40
years of research funded by DARPA
(http://www.darpa.mil/) via SRI
International's Artificial Intelligence Center
(http://www.ai.sri.com/; Siri Inc. was a spin-off
of SRI International), through the Personalized
Assistant that Learns (PAL,
https://pal.sri.com) and Cognitive Assistant that
Learns and Organizes (CALO) programs.
This includes the combined work from research
teams from Carnegie Mellon University, the
University of Massachusetts, the University of
Rochester, the Institute for Human and
Machine Cognition, Oregon State University,
the University of Southern California, and
Stanford University. This technology has come
a very long way with dialog and natural
language understanding, machine learning,
evidential and probabilistic reasoning, ontology
and knowledge representation, planning,
reasoning and service delegation.
Similar applications for hand-held devices
1) S Voice is an intelligent personal assistant
and knowledge navigator which works as an
application for Samsung's Android
smartphones, similar to Apple Inc.'s Siri on the
iPhone. It first appeared on the Samsung
Galaxy S III on May 3, 2012. The application
uses a natural language user interface to
answer questions, make recommendations,
and perform actions by delegating requests to
a set of Web services.
2) Assistant is the codename of a rumored
upcoming Google application that will integrate
voice recognition and a virtual assistant into
Android. It is expected to launch in Q4 of
2012. Before March 2, 2012, the project was
known as "Google Majel", a name that
originated from Majel Barrett-Roddenberry, the
actress best known as the voice of the
Federation Computer in Star Trek.
The software is an evolution of Google's Voice
Actions, which is currently available on most Android
phones, with the addition of natural language processing.
Where Voice Actions required users to issue
specific commands like "send text to…" or
"navigate to…", Assistant will allow users to
perform actions in their natural language.
According to search engineer Mike Cohen, the
Assistant project has three parts: "getting the
world's knowledge into a format a computer can
understand; creating a personalization layer —
experiments like Google +1 and Google+ are
Google's way of gathering data on precisely how
people interact with content; and building a mobile,
voice-centered 'Do engine' ('Assistant') that's less
about returning search results and more about
accomplishing real-life goals".
3) Iris is a personal assistant application for
Android. The application uses natural language
processing to answer questions based on the
user's voice requests. Iris currently supports Call,
Text, Contact Lookup, and Web Search actions,
including playing videos and looking up lyrics,
movie reviews, recipes, news, weather, places
and more. It was developed in 8 hours by Narayan
Babu and his team at Dexetra Software Solutions
Private Limited, a Kochi (India) based firm. The
name is Siri spelled backwards, after the
original application of the same kind built by
Apple Inc.
With the app, an Android user can just "ask"
Iris instead of "Google-searching" for
information. The developers claim Iris can talk
on topics ranging from Philosophy, Culture,
History, science to general conversation.
However, Android users need to have "Voice
Search" and "TTS library" installed in their
phones for Iris to work. Among its features are
voice actions including calling, texting,
searching on the web, and looking for a
contact.
About Whoosh
Whoosh is a fast, featureful full-text
indexing and searching library
implemented in pure Python. Programmers
can use it to easily add search functionality
to their applications and websites. Every
part of how Whoosh works can be
extended or replaced to meet your needs
exactly.
Some of Whoosh's features include:
• Pythonic API.
• Pure-Python: no compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval – faster than any other pure-Python search solution I know of. See Benchmarks.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Pure-Python spell-checker (as far as I know, the only one).
http://packages.python.org/Whoosh/quickstart.html#a-quick-introduction
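The quickstart linked above follows the standard pattern of any full-text library: define a schema, add documents, then run a query against a field. A dependency-free sketch of that pattern (illustrative only, not Whoosh's actual API):

```python
from collections import defaultdict

class TinyIndex:
    """Minimal fielded inverted index, showing the schema/index/search flow."""
    def __init__(self, fields):
        self.fields = fields
        self.postings = defaultdict(set)  # (field, word) -> doc ids

    def add_document(self, doc_id, **values):
        for field in self.fields:
            for word in values.get(field, "").lower().split():
                self.postings[(field, word)].add(doc_id)

    def search(self, field, word):
        return sorted(self.postings.get((field, word.lower()), set()))

ix = TinyIndex(fields=["title", "content"])
ix.add_document("doc1", title="First document", content="hello world")
ix.add_document("doc2", title="Second document", content="hello again")
print(ix.search("content", "hello"))  # ['doc1', 'doc2']
```

Whoosh layers analyzers, scoring and a query language on top of this same core idea.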
Inviting Articles for
CLEAR Dec2012
We are cordially inviting thought-provoking articles, interesting dialogues and healthy
debates on multi-faceted aspects of Computational Linguistics for the second issue of
CLEAR, to be published in Dec 2012. The topics of the articles would preferably be related to
the areas of Natural Language Processing, Computational Linguistics and
Information Retrieval.
Authors are requested to send their articles in doc/odt format to the Editor by email at
[email protected], before 15th November 2012.
-Editor
Thanks To
Principal, Govt. Engg. College Sreekrishnapuram,
Staff and Students, Dept. of CSE, Govt. Engg. College Sreekrishnapuram,
Authors of CLEAR Sep 2012 - Dr. Achuthsankar, Prof. Jathavedan M, Dr. Sudheer S Marar,
Mr. Sajilal D, Dr. Lakshi K, Mr. Arivuchelvan