Automatic Acquisition Of
Bilingual Comparable Corpus

Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Technology
in
Computer Science & Engineering

Submitted by
Mayank Gupta (06CS6015)

Under the guidance of
Prof. Sudeshna Sarkar
Department of Computer Science and Engineering, IIT Kharagpur

Indian Institute of Technology
Kharagpur, West Bengal – 721302
9 May 2008
Department of Computer Science and Engineering

CERTIFICATE

This is to certify that the thesis entitled "Automatic Acquisition Of Bilingual Comparable Corpora" is a record of bona fide work carried out by Mr. Mayank Gupta (06CS6015), under my supervision and guidance, during the academic session 2007-2008, in partial fulfillment of the requirement for the degree of Master of Technology in Computer Science and Engineering, Department of Computer Science & Engineering, Indian Institute of Technology, Kharagpur. The results presented in this thesis have not been submitted elsewhere for the award of any other degree or diploma.

Dr. Sudeshna Sarkar
Department of Computer Science & Engineering,
Indian Institute of Technology Kharagpur – 721302
Date: 30 April 2008
Place: Kharagpur
ACKNOWLEDGEMENTS

I wish to extend my sincere thanks to my supervisor, Prof. Sudeshna Sarkar, for keeping faith in me and for being a driving force all through the way. The project would not have been so smooth and so interesting without her encouragement. I am indebted to the Department of Computer Science & Engineering and IIT Kharagpur for providing me with all the facilities required to carry out my work in a congenial environment.

I also wish to thank Mr. Debasis Mandal and Mr. Sujan Saha for providing me with some valuable resources for my project, and for adding a lot to my expertise through their invaluable suggestions. I extend my gratitude to the CSE department lab staff for providing the needful from time to time whenever requested.

Above all, I am grateful to my parents, friends and well-wishers for their patience and continuous supply of inspiration and suggestions for my ever-growing performance. Last but not least, I thank the Almighty for making me a part of the world.

MAYANK GUPTA (06CS6015)
Dedicated to…
My mom, who loves me more than I love myself;
And to all those who love her…
Contents

Abstract

1. Introduction
   1.1 Cross-Language Information Retrieval
   1.2 Multilingual nature of web
   1.3 Motivation
   1.4 Problem Statement
   1.5 Organization of thesis

2. Background
   2.1 Notion of corpora
   2.2 Comparable Corpora
   2.3 Classification of CC
   2.4 Some Examples of Comparable Corpora
   2.5 Applications of Comparable Corpora
   2.6 Limitations of CC

3. Related Work
   3.1 Effect of poor dictionary

4. Work Done
   4.1 Newspaper Sites: Open Source of CC
   4.2 Methodology
   4.3 Details of the implementation
       4.3.1 Gathering News
       4.3.2 Named Entity Recognition (NER)
       4.3.3 Preprocessing
       4.3.4 Translation
             4.3.4.1 Dictionary Lookup (DL)
             4.3.4.2 Gazetteer List Lookup (GL)
             4.3.4.3 Abbreviation List Lookup (AL)
             4.3.4.4 Transliteration Similarity (TS)
   4.4 Similarity Calculation
   4.5 Filtering
   4.6 Conflict Resolution
   4.7 Example

5. Results
   5.1 Sample data
   5.2 Evaluations
   5.3 Analysis
       5.3.1 Positive results
       5.3.2 Negative results
             5.3.2.1 Pair unable to cross the thresholds
             5.3.2.2 Absence of corresponding news article in target language
             5.3.2.3 Effect of similar names found in dissimilar stories
             5.3.2.4 Effect of similar dictionary words in stories depicting news on similar topic, yet describing different events
             5.3.2.5 Effect of wrongly identified pair of names
             5.3.2.6 Effect of poor dictionary
   5.4 Improvements

6. Conclusion
   6.1 Future work

Bibliography
List of Figures

1. Difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people
2. Frequency distribution of number of translations in Hindi bilingual dictionary
3. Recall plotted versus Precision for the six runs for Bengali and Hindi to English CLIR Evaluation experiment
4. Precision versus evaluation of system on ten random points for a sample data of stories collected over a period of one month
5. Average precision values for the evaluations for one month sample data
6. Number of documents retrieved versus precision obtained for one month test data
7. Number of documents retrieved versus precision obtained on improved system
List of Tables

1. Cross language runs submitted in CLEF 2007
2. Summary of bilingual runs of the CLIR evaluation experiment
3. Phonetic rules for recognizing similar sounding names
4. Number of retrieved documents versus the Precision values for top 70 pairs
5. Number of retrieved documents versus the Precision values for top 70 pairs for improved case
Abstract

Corpora are the main knowledge foundation for progress in the field of information retrieval [Arora]. Processing of multilingual corpora helps in the construction of efficient language-specific resources and in cross-lingual information access [Peters] [Mandal]. This report presents work aimed at automatically constructing a comparable corpus in English and the most widely spoken Indian language, Hindi, by collecting similar news stories from newspaper websites. Our system identifies comparable news articles by recognizing intersecting proper names and content overlap, using a medium-coverage dictionary, transliteration similarity, temporal closeness and filtering mechanisms at various levels. The system scored 67.4% and 38.4% precision for the best and worst case respectively, evaluated at random points on our sample set of Hindi news articles from 30 consecutive days. After in-depth scrutiny, we adopted some enhancements to the present system and achieved substantial improvements in the results.

Based on the achieved results, and in accordance with the general perception, we observe that the first half of a news article furnishes the largest part of the essential information conveyed. However, it is the consideration of the whole story that results in better identification of similar stories: precision remains consistent when whole stories are compared, whereas it fluctuates when only the first half of each story is examined. Detection of association between a source-target pair through intersecting proper names is indispensable, but should not be the only criterion, particularly for languages like Hindi that are highly undersupplied with valuable cross-language assets.
Chapter 1
Introduction

The first step towards advancement in research in
various emerging areas of natural
language processing and interconnected fields is the
availability of huge collections of
easily accessible text [Arora] [Resnik]. Such corpora are useful
in the extraction of
knowledge for the participating languages, and for the
construction of many efficient
resources for these languages. Resnik and Smith [Resnik] lists
some of the uses of Parallel
corpora, and elucidate their algorithm for building one through
mining the web
(STRAND).
A corpus may be monolingual or multilingual. A monolingual corpus, as the name implies, involves only one language. A multilingual corpus links two or more languages, and is further classified as parallel or comparable. Parallel corpora refer to near-exact translations of a text in the participating languages, whereas comparable corpora refer to collections of texts in different languages selected according to some criterion of similarity.
Examples of parallel corpora could be translations of a text in
different languages, such as
user manuals, educational books or proceedings of some event
either written
independently by different persons, or translated from a single
source. In the case of comparable data, the texts may be similar in content (comparable content) or along some other dimension such as time or domain. Newspaper articles in multiple languages from independent press agencies, or descriptions of a product or occasion by different people, are some examples of comparable text.
Aligning comparable text at the sentence and word level can help automatically create efficient resources for languages that are at present deficient in these assets. Such a corpus
acts as the starting step of cross-lingual studies. It also
helps in multilingual
summarization, which is the automatic procedure for extracting
information from multiple
texts written about the same topic in multiple languages. It
offers a wide scope of
applications for research in Discourse analysis (Analysing
written, spoken or signed
language use), Pragmatics (ability to understand the intended
meaning of speaker,
pragmatic competence), Information retrieval etc.
1.1 Cross-Language Information Retrieval
Cross-Language Information Retrieval (CLIR) is an enormous and ever-growing field. It is not limited to one particular subject or discipline but is multidisciplinary: people from interconnected fields such as Information Retrieval, NLP, Machine Translation and Speech Processing come together for information access. There are various information processing issues (cross-lingual information access, speech processing) and a strong need for language resources (dictionaries, thesauri, corpora, test collections).

The escalating demand for CLIR has various reasons. Growing internationalization has made many developed and developing countries multilingual, where the number of speakers of non-native languages is no smaller than that of native speakers. The United States, Canada and even India are countries where no single language dominates, and where people from all parts of the world share boundaries with their own cultures and languages. Globalization of the economy has reduced the problem of localization of employment, and multinational companies now have employees and customers from all parts of the world working under one roof, speaking and using multiple languages. The global information society has dissolved the physical boundaries of the world and shrunk the distance between people seeking education and entertainment.
Of course, to attain better communication in such places, the desire to achieve proficiency in some common language, in addition to the native speech, often arises. For many this is not a setback, but for those with no background in other languages, communicating becomes a nuisance. The availability of resources such as translators, multilingual dictionaries and other cross-lingual packages has made information access in multiple languages much easier than ever. For the ease of customers, and of producers too, official documents and manuals are no longer restricted to a single language. Distance learning, digital libraries and other resources provided to students worldwide by top educational institutions no longer constrain students to follow the path to success in a language they are not comfortable with.
1.2 Multilingual nature of web

A survey shows that the Internet is no longer monolingual and that non-English content is growing rapidly [CNNIC]. Figure 1 shows the difference between the volume occupied by English-speaking users on the web and that occupied by non-English speaking people. This is where the need for multilingual information access (MLIA) arises. Carol Peters [Carol] estimated that 78% of Internet users would be non-English speaking by 2005. The latest figures put 70% of the online population in the non-English group, as on 30 June 2007. Even though the number of such corpus collections is increasing, the number of languages in which such collections are available is still limited [Somers]. The coverage of such collections for a particular need is also not always satisfactory.
Figure 1: Difference between the volumes occupied by English-speaking users on the web, as against that occupied by non-English speaking people
Though English still tops the chart, more and more people with
diverse backgrounds are
connecting to the web. In addition to the largely exploited
language English, the work in
European languages such as French, German and Portuguese, and
Asian languages such as
Chinese, Korean and Japanese has also escalated in recent times
[McEnery]. With the
advancement in technology, more and more languages are becoming
a part of the efforts in
the field of Natural Language Processing and associated
areas.
1.3 Motivation
With 22 regional languages listed in the eighth schedule of the Constitution1, India has been a multi-language country with a wealthy unexplored reserve of knowledge [Yale] [Languages]. According to the latest survey, Hindi is the
fifth and Bengali seventh
highest spoken language in the world, with Chinese and English
having ranks one and
three respectively [Ethnologue]. Most of the inhabitants of
India are bilingual in nature,
and are exposed to English or Hindi (or both), in addition to
their mother tongue, in
general [Mandal]. They, as well as those who are not, seek
information from different
domains (like news) and often face trouble in doing so. The
motivation of this venture is
easing the troubles in information access between Hindi and
English. We selected Hindi to
maximize the effect of the project.
However, merely collecting texts from different sources does not
constitute a corpus.
Inferring knowledge from the corpora is as important as
selection of corpora and the
development of tools. In addition to collecting pairs of similar news stories in the participating languages, we develop, as a by-product of our project, a list of English Named Entities collected from English news stories, and an automatically generated gazetteer list that contains English equivalents for the Hindi names that the system recognizes as proper nouns.
1.4 Problem Statement
Our work is an attempt at building a comparable corpus in English and the most widely spoken Indian language, Hindi, by automatically collecting similar news articles from online news websites. The objective of this project is to ease information access between Hindi and English, and for the collection to act as an initiator for further research in language technology. We selected Hindi to maximize the impact of the project.
1 http://languages.iloveindia.com/
The process extracts proper names from the news stories in each
language. For languages
other than English, we use the limited coverage bilingual
lexicon available with us for
translation of some key words. The similarity between two
stories in different languages is
a function of the extent of correspondence between the
identified names, and of
translations. We use various phonetic substitutions to identify
names with same
pronunciation in the two languages. The pair closest to each
other in terms of temporal
closeness in addition to similarity in names and translations is
tagged the best match.
1.5 Organization of thesis
We present the thesis in the following manner.
Chapter 1 introduces the concept of Cross-Language Information
Retrieval, highlighting
the escalating call for proficient cross-language resources. The motivation of this project, enhancing the resources for Indian languages, is also explained along with the problem statement.
The next chapter, that is, Chapter 2, discusses the background
of the area of natural
language processing, particularly the explanation of what a
corpus is. This chapter
familiarizes the reader with the concept of comparable corpus,
classifications and its
applications. A section describing some of the available corpora is also present.
Chapter 3 describes the efforts worldwide for creation of text
corpora. The section
includes our own effort to understand the need of a corpus for
enrichment of the deficient
and undersupplied resources in Indian languages Hindi and
Bengali. We highlight the need
of an effective bilingual lexicon for noticeable Cross-Language
Information Retrieval, in
addition to other language-specific needs like Named Entity
Recognizer and Feedback
System.
Chapter 4 explains the algorithm we adopted for identifying similar stories among the crawled
news stories from the websites in participating languages
English and Hindi.
Chapter 5 analyzes the results of our test run on a sample data
set. The set was a
collection of more than 1700 stories in the source language
Hindi, collected over the
period of 30 days. The target collection contained English news
articles collected over the
same period. We analyzed the causes of failure of the present
work and evaluated the
system again after adopting some improvement measures. The section discusses the achieved performance and how the accomplished results compare with expectations.

We conclude the thesis with Chapter 6, where we discuss the future work we wish to undertake in this area. The references follow the scope and limitations of our present work.
Chapter 2
Background
In computational linguistics, a corpus is a self-contained compilation of texts, spoken and/or written, accumulated and assembled according to a set of clearly defined criteria. The intention of corpus collection is generally to serve a particular purpose for the person gathering it, and for others working in the language the corpus is in, through its exploitation for various resources and language studies. ICAME (International Computer Archive of Modern English) is a centre that aims to organize and assist the sharing of computer-based corpora. Some examples of available English corpora are listed at the corpus linguistics website2.
Some of the corpora in English are as follows. In British
English, we have The BNC, a
corpus in written and spoken British English, used extensively
by researchers and for the
Oxford University Press, Chambers and Longman publishing houses.
CANCODE (Cambridge Nottingham Corpus of the Discourse of English) is a corpus of spoken British English, used at length by researchers and Cambridge
University Press. In
addition, there is ICE (International Corpus of English),
including in itself international
varieties of spoken and written English. The corpus has a major
drawback that most of the
corpus is not yet available.
The Brown University Corpus and the LOB (Lancaster-Oslo-Bergen) Corpus are parallel corpora of written texts, but are now rather outdated. The Bank of English is a compilation of written and spoken English, an important resource for researchers and for the COBUILD series of English language books. The London-Lund Corpus (Survey of English Usage) is a collection of spoken British English, but it is now quite old. The Santa Barbara Corpus collects text in spoken American English; this corpus has a drawback similar to the ICE corpus, in that most of it is not yet accessible for use. The Hong Kong Corpus of Spoken English is still under compilation.
2 http://www.engl.polyu.edu.hk/corpuslinguist/corpus.htm#Definition%20of%20a%20corpus.
2.1 Notion of corpora
There are two approaches to multilingual corpora: parallel and comparable. A parallel corpus is a compilation of texts, each of which is translated into at least one language other than the original, and which thus clubs together perfectly aligned (parallel) translated text. The simplest case exists where only two languages participate: one corpus is an exact transformation of the other, with a virtually non-existent direction of translation. Examples of parallel corpora and some efforts worldwide can be found in [Arora].
In order to analyze a parallel or comparable text, some kind of
text alignment is essential,
which identifies equivalent text segments like sentences or
words. One example of a parallel corpus is the European Parliament (Europarl) corpus of pairwise-aligned files created by Philipp
Koehn. The corpus is available in Danish-English,
German-English, Greek-English,
Spanish-English, Finnish-English, French-English,
Italian-English, Dutch-English,
Portuguese-English and Swedish-English. Each corpus is about 100
MB [Athel].
TRIPTIC, that is, TRIlingual Parallel Text Information Corpus,
forms part of the empirical
data used for research on the contrastive analysis of
prepositions. Developed in English,
French and Dutch, the corpus investigates the way in which
languages converge and
diverge in the semantic structure of so-called function words.
According to [Athel], the paragraph-aligned corpus consists of 20 lakh words, with 10 lakh each of fiction and non-fiction data. The corpus offers automatic selection of the n-th paragraph in each of the three languages.
Parallel corpora are objects of curiosity because of the prospect they offer for aligning original and translation and gaining insights into the nature of
translation. Tools to aid
translation can be formulated. In addition to it, probabilistic
machine translation systems
can be trained on such a collection of parallel text.
2.2 Comparable Corpora
Comparable corpora collect alike texts in more than one language or variety, based on some criterion of similarity. The sub-corpora are not exact translations of each other, but are collected using either the same sampling frame or some measure of comparability. In simpler words, comparable corpora are corpora chosen under "...similar circumstances of communication". There is no strict agreement on the nature of the similarity, and there are very few examples of comparable corpora [Resnik]. One such example is ICE, the International Corpus of English.
One example of comparable content could be descriptions of a new product written independently by different people in the languages they are comfortable with. The style of writing and the presentation would vary a lot. In addition, one writer may highlight more features depending upon his or her perception of the product, and may even include comments and suggestions for further improvement. An author may also include feedback from other users known to the writer. Even with such variability in the accounts of the same product, there is a high chance that sentences in different languages speak of the same feature of the product. We can identify and extract such sentences, and exploit a huge collection of such accounts of different products to build an enormous collection of highly valuable multilingual corpora.
Such corpora would be an example of corpora generated based on content. If there exists a time bound on the selection of descriptions, assigning more closeness to descriptions within a specified time window, then the corpus thus chosen is a concurrent corpus. Newspaper reports are an example of such corpora. The next
subsection describes major
classifications of comparable corpora based on the similarity
condition used.
Comparable corpora enjoy many advantages over parallel corpora in terms of availability, versatility, extensibility and accessibility. Moreover, parallel corpora work on the assumption that the amount of variation in the texts under consideration is limited; the procedures for acquiring comparable corpora largely relax this limitation. Mountains of comparable text exist online in the form of news reports, and in print in the form of legal texts, socially conventional texts like marriage announcements and advertisements, books and magazines, etc. Academic and scientific text written in accordance with neighborhood conventions is a high-quality source of related data. Comparable text benefits from being easily extensible, with negligible data acquisition issues in most cases, something parallel corpora are deficient in.
2.3 Classification of CC
Based on the similarity measure used, text can be treated as comparable on the basis of four criteria. The first one is the form of data, that is, the size
of files, number of words,
sentences, paragraphs or even the length of texts. It can also
be on the file’s format - .txt,
.doc, .html, .xml etc. Content can be compared for finding
similar documents. The corpus
can be in general language or talking of specialised domains.
Newspaper articles, reports
of war or politics, interviews and discussions, views and
reviews all come under this
category. Structure of the documents can be considered too,
where the text can be formal,
carefully constructed texts like Legal texts, or Informal,
loosely organized discourse like
transcriptions of conversation. Mode is the fourth category
where the similarity measure is
based on whether the text is spoken: e.g. speech, formal
dialogue, conversation, or written:
e.g. book, essay, instruction manual. A very large corpus can be treated as comparable to another corpus of similar size that is constructed according to the same criteria of quantity and quality of text types. Many more
measures are possible other
than the ones described above, to categorize collected data as
comparable.
2.4 Some Examples of Comparable Corpora
This section describes three major efforts globally to acquire
comparable corpus.
2.4.1 ICE (International Corpus of English)
It is a corpus of around 1 million words in each of many
varieties of English around the
world. It began in 1990 with the primary aim of collecting
material for comparative
studies of English worldwide. Fifteen research teams around the
world are preparing
electronic corpora of their own national or regional variety of
English.
Each ICE corpus consists of 1 million words of spoken and
written English produced after
1989. To ensure compatibility among the constituent corpora,
each team is following a
common corpus design, as well as a common scheme for grammatical
annotation. More
information is available on the website
http://www.ucl.ac.uk/english-usage/ice/index.htm.
2.4.2 The Brown Corpus (American English)
The Brown Corpus of Standard American English was the first of the modern, computer-readable, general corpora, compiled by W.N. Francis and H. Kucera at Brown University, Providence, RI. It has 1 million words of American English texts printed in 1961, sampled from fifteen different text categories to make the corpus a good standard reference. The corpus is undersized and slightly dated, but is still used and imitated by other corpus compilers. The LOB corpus (British English) and the Kolhapur Corpus (Indian English) are two examples of corpora made to match the Brown corpus. Comparing the same language as it exists in different varieties, such as English, is easy with the availability of these corpora.
2.4.3 LOB Corpus (British English)
Researchers in Lancaster, Oslo and Bergen compiled the
Lancaster-Oslo-Bergen Corpus.
It has 1 million words of British English texts from 1961
sampled from fifteen different
text categories. Each text is just over 2,000 words long (cut at
the first sentence boundary
after 2,000 words for longer texts) and the number of texts in
each category varies. The
corpus has been grammatically tagged (all words have been given
a word-class label). The
tagged and untagged versions of the corpus are available through
ICAME. This corpus is
the British counterpart of the Brown Corpus of American English,
which contains texts
printed in the same year for comparison between both
varieties.
2.4.4 Kolhapur Corpus (Indian English)
This corpus is comparable to the Brown and LOB corpora. The motive behind its construction was to serve as source material for comparative studies of American, British and Indian English; it is drawn from materials printed and published in 1961 for the Brown and LOB corpora and in 1978 for the Indian corpus. It consists of 500
texts sampled from 15 different text categories, each consisting
of just over 2,000 words
drawn mainly from Govt. Documents, foundation reports, industry
reports, College
catalogue, Fiction, Religion, Press Editorials etc.
2.5 Applications of Comparable Corpora
From comparable texts in different languages we can extract multilingual lexicons, paraphrases and other language-specific resources, and enrich the resources already available in those languages. This is particularly helpful in creating effective
bilingual dictionaries for
language pairs for which either no dictionary exists, or the
available ones are very
ineffective. It also helps in multilingual summarization, which
is the automatic procedure
for extracting information from multiple texts written about the
same topic in multiple
languages. It offers a wide scope of applications for research
in Discourse analysis
(Analyzing written, spoken or signed language use), Pragmatics
(ability to understand the
intended meaning of speaker, pragmatic competence), Information
retrieval etc.
2.6 Limitations of CC

The disadvantages of a comparable corpus lie in the difficulty of managing the corpus for more delicate analysis. Also, comparable data is not applicable to all areas of language technology, and becomes unnecessary for certain types of research.
Chapter 3

Related Work

Karunesh Arora et al., CDAC, Noida [Arora] have built a parallel corpus for 12 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Malayalam, Assamese, Kannada), including Nepali, aligned at the paragraph level.
The corpus was OCRed from various books available in different Indian languages from publishers such as National Book Trust India, Sahitya Akademi, and Navjivan Publications. They found candidate
pairs for paragraph
alignment based on the size of content. The paper also describes
some of the major efforts
towards the acquisition of corpus in many languages all over the
world.
Harold Somers [Somers], England, deals with techniques for building parallel corpora from the web through 'tricks', like filename suffixes (.fr, .en) or 'anchors' in the text. They
find the candidate pairs based on content, that is, the amount
of text available between
each anchor. In addition, they highlight the issues related to
alignment of thus obtained
parallel data. Identification of anchor points and evaluation of the extent of match of the text between these anchors, or other features like the use of machine-readable dictionaries and other language-specific resources, are some of the techniques discussed.
Almeida and Alberto [Almeida] grab parallel corpora from the web by getting file pairs
from a list of URLs; user supplied file pairs; from the result
of queries on search engines;
from a web site etc. They go for in-depth validation techniques,
such as file size
comparison, string normalization and edit distance. They use a variety of normalization techniques, for example normalizing both "index_pt.html" (for Portuguese) and "index_en.html" (for English) to "index". Shinmaya & Sekine [Shinmaya], 2003, took stories involving deaths from Japanese newspapers and compared the sentences based on the extent of match of the named entities present in them.
Resnik and Smith [Resnik] utilize the Internet Archive for grabbing parallel data, using their STRAND (Structural Translation Recognition, Acquiring Natural Data) web-mining architecture to identify pages that might be translations of each other in different languages. Their algorithm first locates pages that might have parallel text in multiple languages, spawning candidate pairs that could be translations through a URL-matching algorithm, and then applies structural filtering to throw out negative pairs. They use their algorithm to build an English-Arabic corpus containing 2910 translation pairs.
Parallel Text Miner (PTMiner) used by Chen and Nie [Chen]
applies the concept of
utilizing existing search engines to locate parallel data
through a query in a specific
language. The search engine returns links to pages, which they first verify for the language of interest using a length filter and automatic language identification. Once candidate multilingual sites are located, they are crawled deeply for data. Based on examination, Chen and Nie accumulate a 118MB/135MB English-French
corpus having a
95% precision, and 137MB/117MB English-Chinese corpus of 90%
precision [Chen]
[Resnik].
Ma and Liberman [Ma] in 1999 proposed BITS, that is, Bilingual
Internet Text Search,
wherein multilingual pages from a pre-specified list of domains
are identified using
language identification. The recognized pages are crawled
exhaustively, filtered and
compared based on content and filtered on threshold.
To highlight the need for a bilingual lexicon for effective CLIR, we developed a system to retrieve relevant documents from an English target collection in response to queries in Hindi and Bengali using a machine translation approach. The system returned the results in ranked order. The best MAP (Mean Average Precision) values for
Bengali and Hindi
CLIR for our experiment were 7.26% and 4.77%, which were 20% and
13% of our best
monolingual retrieval, respectively. The system became a part of
our first participation in
Cross-Language Evaluation Forum (CLEF) 2007 [Mandal].
We followed a dictionary-based machine translation approach to generate the corresponding English query from the Indian language (Hindi, Bengali) topics. Our main challenge was to work with a limited-coverage dictionary (coverage ~20%) that was available for Hindi-English, and a virtually non-existent dictionary for Bengali-English. Therefore, we depended mostly on a phonetic transliteration system to overcome this. We had access to a Hindi-English bilingual lexicon of approximately 26000 Hindi words, a Bengali bio-chemical lexicon of around 9000 Bengali words, a Bengali morphological analyzer and a Hindi stemmer. In order to achieve successful retrieval with this limited set of resources, we adopted the following strategies: structured query translation, phoneme-based followed by list-based named entity transliteration, and performing no relevance judgment. Finally, we fed the English query into the Lucene search engine, which follows the Vector Space Model (VSM) of information retrieval, and retrieved the documents along with their normalized scores. A sketch of this query translation step appears below.
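The following sketch (Python, not part of the original system) illustrates the kind of dictionary-based query translation with a transliteration fallback described above; the dictionary format, the transliterate function, the English name list and the 0.8 threshold are assumptions made for illustration, and the Lucene indexing and retrieval stages are omitted.

```python
# A minimal sketch, assuming a plain dict bilingual lexicon and a caller-supplied
# romanization function; not the exact CLEF 2007 implementation.
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], standing in for an edit-distance score."""
    return SequenceMatcher(None, a, b).ratio()

def translate_query(hindi_terms, bilingual_dict, transliterate, english_names,
                    stopwords=frozenset(), min_sim=0.8):
    """Build an English bag-of-words query from Hindi topic terms."""
    english_terms = []
    for term in hindi_terms:
        if term in stopwords:
            continue                                    # stopword removal
        if term in bilingual_dict:
            english_terms.extend(bilingual_dict[term])  # all translations kept
            continue
        # Out-of-dictionary term: assume it is a named entity, transliterate it,
        # then match against a list of known English names.
        roman = transliterate(term)                     # e.g. an ITRANS-style romanization
        best = max(english_names, key=lambda n: edit_similarity(roman, n), default=None)
        if best and edit_similarity(roman, best) >= min_sim:
            english_terms.append(best)
        else:
            english_terms.append(roman)                 # keep the raw transliteration
    return " ".join(english_terms)
```

The resulting query string would then be handed to the search engine in the usual way.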
3.1 Effect of poor dictionary
We emphasize in this section the effect of an impoverished bilingual lexicon on the overall results.
Figure 2: Frequency distribution of number of translations in Hindi bilingual dictionary
The above graph (Figure 2) shows the frequency distribution of
number of translations for
Hindi words in the Hindi bilingual dictionary we used. With the
increase of lexical entries
and Structured Query Translation (SQT), more and more 'noisy words' were incorporated into the final query in the absence of any translation disambiguation algorithm, thus bringing down the overall performance. The average number of English translations per Hindi word in the lexicon was 1.29, with 14.89% of Hindi words having two or more translations. For example, the Hindi word 'रोकना' (to stop) had 20 translations in the dictionary, making it highly susceptible to noise. The process followed for CLIR was as follows:
[Flow diagram: Query generation — stopword removal, stemming (all possible stems), structured query translation via bilingual lexicon lookup (all translations used), transliteration (ITRANS) with an edit-distance algorithm for named-entity matching. Corpus processing — language-specific stopword removal, morphological analyzer (Bengali), stemmer (Hindi), Lucene stemmer (English), indexing in Lucene (1.35 lakh documents, ~433 MB). Document retrieval through Lucene produces the results.]
The topics were in the Indian languages Hindi and Bengali, with
a target collection of
English documents. The topics consisted of three fields namely
Title, Description and
Narration. Table 1 shows the various Cross language runs
submitted in CLEF 2007. Table
2 shows the summary of bilingual runs of the CLIR evaluation
experiment.
Table 1: Cross language runs submitted in CLEF 2007

Table 2: Summary of bilingual runs of the CLIR evaluation experiment

Figure 3: Recall plotted versus Precision for the six runs for Bengali and Hindi to English CLIR Evaluation experiment
The above graph (Figure 3) shows Recall plotted versus Precision for the six runs of the Bengali and Hindi to English CLIR evaluation experiment we evaluated the system on, two for each of English (monolingual), Hindi and Bengali. We found that the difference in performance is due to missing specialized vocabulary, missing general terms, wrong translations due to ambiguity, and correct identical translations. There is a strong need for effective translation, memory and processing capacity, an effective bilingual lexicon, and the availability of a parallel corpus to build a statistical lexicon.
Translation disambiguation during query generation explains the
anomalous behavior of
Hindi. Query wise score breakup revealed that the queries with
more named entities
always provided better results than those lacking them. The
poorer performance of our
system with respect to other resource-rich participants clearly
pointed out the necessity of
a rich bilingual lexicon, a good transliteration system, and
effective Named Entity
recognition. We found that a trilingual comparable corpus in the
three languages English,
Hindi and Bengali could prove to be the key in constructing the
needful resources. Since at
present no such corpus exists that could serve our purpose, we
decided to construct one
such corpus in English and Hindi. The next phase of the project
is towards the fulfillment
of the preliminary requirements without which no CLIR in these
two languages can be of
much help.
Chapter 4
Work Done
4.1 Newspaper Sites: Open Source of CC
For this project, the source of comparable data is the freely
available news articles from
online newspapers Navbharat Times (NBT) for Hindi and Times of
India (TOI) for
English. The choice of online news articles is motivated by the fact that news items offer high variety and easy availability, with no legal issues of data acquisition.
In the case of a comparable corpus there is no complete information overlap, particularly for languages like English and Hindi with very different foundations. Two texts written by different press agencies, describing the same incident or event, can vary widely in presentation and content, depending on when each covers the event and on the author. One may contain additional facts or comments, and the comparable information may appear in a different arrangement, adding to the "noise". We assumed beforehand that the amount of insertions and deletions between the texts of similar stories is limited. If this assumption does not hold, the chances of false positives cropping up in the final pairs' list escalate.
4.2 Methodology
Different authors writing about the same event in their respective newspapers will result in numerous versions of similar content uploaded to websites.
However, certain kinds of
noun phrases such as names, dates and numbers behave as
“anchors” which are shared by
similar articles. Our key inspiration is to identify these
anchors among comparable articles
and compute a similarity score. This way we can extract stories that convey similar information.
4.3 Details of the implementation
[System flow diagram: news articles in Hindi and English are gathered from the WWW by a crawler and accumulated; each article undergoes preprocessing and named entity recognition; Hindi articles are translated using dictionary lookup, gazetteer list lookup, abbreviations list lookup and transliteration similarity; candidate pairs then pass through temporal filtering, size filtering, similarity finding, threshold filtering and conflict resolution to yield the comparable corpus.]
4.3.1 Gathering News
The first step was to gather news items from the mentioned websites automatically. On a daily basis, we collected crawled news stories and extracted the actual news article along with the title, date and place of publication. Special reports with no place defined, such as letters to the editor, discussions or reviews, were saved as well. This helped in matching such special reports across the different newspapers.
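A minimal sketch of what this daily extraction step could look like is given below; the HTML selectors and the Story record are hypothetical, since the actual NBT/TOI page structure is not described here.

```python
# A minimal sketch, assuming each crawled page has already been fetched as HTML.
# The CSS selectors are assumptions for illustration only.
from dataclasses import dataclass
from bs4 import BeautifulSoup

@dataclass
class Story:
    title: str
    date: str      # e.g. "2008-04-25"
    place: str     # may be empty for letters, reviews and other special reports
    body: str
    lang: str      # "hi" or "en"

def extract_story(html: str, lang: str) -> Story:
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1").get_text(strip=True)               # assumed selector
    date = soup.select_one(".published-date").get_text(strip=True)   # assumed selector
    place_tag = soup.select_one(".dateline")                         # assumed selector
    place = place_tag.get_text(strip=True) if place_tag else ""      # keep empty place
    body = " ".join(p.get_text(" ", strip=True)
                    for p in soup.select(".article-body p"))         # assumed selector
    return Story(title, date, place, body, lang)
```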
4.3.2 Named Entity Recognition (NER)
For identifying proper names in the English stories, we used the freely available LingPipe Named Entity demo code3, which identifies proper names present in English files and tags them with three classes: Person, Organization and Location. However, the number of names identified by the system was small and included some false positives. To enhance the collection of identified names, we built a "crude" Named Entity recognizer for English that considers any capitalized word a proper name, including the first word of a sentence. It treats multi-word names like "Prime Minister" as a single name. To suppress false positives, we constructed a new stopword list containing common sentence initiators such as "Currently" and "Today", in addition to common stopwords such as "The".
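A minimal sketch of such a crude recognizer is shown below; the stopword list is only illustrative.

```python
# A minimal sketch of the "crude" English NER described above: capitalized tokens
# outside the stopword/sentence-initiator list are taken as names, and adjacent
# capitalized tokens are merged into one multi-word name such as "Prime Minister".
import re

SENTENCE_INITIATOR_STOPWORDS = {"The", "A", "An", "Currently", "Today", "However"}

def crude_english_ner(text: str) -> list[str]:
    names, current = [], []
    for token in re.findall(r"[A-Za-z][A-Za-z'.-]*", text):
        if token[0].isupper() and token not in SENTENCE_INITIATOR_STOPWORDS:
            current.append(token)              # extend the running multi-word name
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

# crude_english_ner("Prime Minister Manmohan Singh met Sonia Gandhi today.")
# -> ["Prime Minister Manmohan Singh", "Sonia Gandhi"]
```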
For identifying names in the Hindi stories, we used a Named Entity recognizer4 and collected the Hindi proper names it identified. The Named Entity recognizer processes the supplied Hindi file and extracts and tags the proper names present in it with four classes: Name, Location, Organization and Date. For example, the system tags "Manmohan Singh" as PERSON and 20 June 2008 as DATE.

3 http://alias-i.com/lingpipe/web/demo-ne.html
4 Courtesy Mr. Sujan Saha, Communication Empowerment Laboratory, Department of Computer Science and Engineering, IIT Kharagpur
4.3.3 Preprocessing
Stopword removal and processing of special cases for both
languages were undertaken.
Special cases include normalizing currencies like $20 to 20 dollars, mathematical quantities like 35.5% to 35.5 percent, breaking ranges of years
like 2007-09 to 2007 2009,
and others. To take care of the domain dependency of stopwords, we constructed a news-specific stopword list for both languages. For instance, words like "report", "article" and "incident", though meaningful, do not add much to the semantic content of the text when present in news. We appended a list of such words to the existing stopwords list.
In addition, common abbreviations were expanded using
abbreviations’ list for English
articles (AL). For preprocessing the articles in languages other
than English, we used the
techniques of dictionary lookup (DL), gazetteer list lookup
(GL), abbreviation list lookup
(AL) and Transliteration Similarity (TS).
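The normalization of special cases can be pictured with a few regular-expression rules, as in the following sketch for the English side; the patterns shown are illustrative rather than the exact rule set used.

```python
# A minimal sketch of the special-case normalization described above.
import re

def normalize_special_cases(text: str) -> str:
    # $20 -> 20 dollars
    text = re.sub(r"\$(\d+(?:\.\d+)?)", r"\1 dollars", text)
    # 35.5% -> 35.5 percent
    text = re.sub(r"(\d+(?:\.\d+)?)%", r"\1 percent", text)
    # 2007-09 -> 2007 2009 (expand a two-digit year range)
    text = re.sub(r"\b(\d{2})(\d{2})-(\d{2})\b", r"\1\2 \1\3", text)
    return text

# normalize_special_cases("Budget of $20 rose 35.5% during 2007-09")
# -> "Budget of 20 dollars rose 35.5 percent during 2007 2009"
```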
4.3.4 Translation
4.3.4.1 Dictionary Lookup (DL)
The translation step at DL used a Hindi-English dictionary with 24824 Hindi words. The list was one-to-many, that is, in many cases more than one English translation was present for a single Hindi word. This dictionary was different from the one we used for our earlier results. To suppress the effect of noise, we considered at most the first two translations for each Hindi word found, as sketched below.
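```python
# A minimal sketch of the dictionary lookup (DL) step: each Hindi content word
# is mapped to at most its first two English translations to limit noise.
# The dictionary is represented as a plain dict for illustration; the real
# lexicon had ~24824 entries.
def dictionary_lookup(hindi_words, bilingual_dict, max_translations=2):
    translated = []
    for word in hindi_words:
        translations = bilingual_dict.get(word, [])
        translated.extend(translations[:max_translations])  # cap at two senses
    return translated
```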
4.3.4.2 Gazetteer List Lookup (GL)
We manually constructed a list containing the corresponding English names for regularly found Hindi proper names. The list was incremental in nature: at the end of each execution, every Hindi proper name identified by the system, together with the corresponding English name found by TS, was automatically appended to the list for future use.
4.3.4.3 Abbreviation List Lookup (AL)
We used a list of commonly found acronyms and their expansions.
In addition, print media
tends to adopt many compression techniques and abbreviations,
generally to shorten the
length of title, in a bid to pack more significant terms in the
title. For instance, “President”
is frequently written as “Prez”, “Prime Minister” and “Chief
Minister” as PM and CM
respectively. Such abbreviations were also included in the
gazetteer list to facilitate better
detection of correspondence between titles, and even within the news stories.
4.3.4.4 Transliteration Similarity (TS)
For identifying names, we exploited the phonetic correspondence
of alphabets and sub-
strings in English-Hindi. For example, “ph” and “f” both map to
the same sound of “फ” (f).
Likewise, “sha” in Hindi (as in Roshan) and “tio” in English (as
in ration) sound similar.
Prior to executing the content-based similarity-finding algorithm, we applied the TS approach to the collected list of English Named Entities identified by the system. Using an edit-distance method, we collected the mappings identified by the system as valid name matches and appended them to the present gazetteer list for future use.

For each of the cases, we calculated the similarity between the identified proper names for each source language and target language pair. The system considered the pair with maximum similarity above a certain threshold level to be a valid pair. To ease similarity matching based on phonetics, we wrote language-specific rules for both languages. We list some of them below in Table 3, and sketch the matching procedure after the table:
Table 3: Phonetic rules for recognizing similar sounding names

ष -> श (sh)        ट / ठ / त / थ -> ट (t)        ढ / ड / द / ध / ङ -> द (d)
ज / झ -> ज (j)     ब / भ / व -> व (v)            श / प / फ -> प (p)
ग / घ -> ग (g)     न / ण / ञ -> न (n)            ख / क -> क (k)
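The sketch below illustrates how transliteration similarity could be computed; the romanized substitution rules and the 0.8 threshold are assumptions standing in for the Devanagari rules of Table 3 and the actual edit-distance settings.

```python
# A minimal sketch of the transliteration similarity (TS) step: names are first
# normalized with phonetic substitution rules in the spirit of Table 3 (here on
# romanized text, for illustration), then matched by an edit-distance ratio.
from difflib import SequenceMatcher

PHONETIC_SUBSTITUTIONS = [          # applied to lowercase romanized strings
    ("ph", "f"), ("sha", "sa"), ("tio", "sa"),
    ("th", "t"), ("dh", "d"), ("bh", "v"), ("w", "v"), ("q", "k"),
]

def phonetic_normalize(name: str) -> str:
    s = name.lower()
    for src, dst in PHONETIC_SUBSTITUTIONS:
        s = s.replace(src, dst)
    return s

def best_transliteration_match(hindi_name_romanized, english_names, threshold=0.8):
    """Return the English name most similar to the romanized Hindi name,
    or None if no candidate crosses the threshold."""
    norm = phonetic_normalize(hindi_name_romanized)
    best_name, best_score = None, 0.0
    for candidate in english_names:
        score = SequenceMatcher(None, norm, phonetic_normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None
```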
4.4 Similarity Calculation
The final similarity between two stories is the sum of the similarity of the names found in them and of the matches between words appearing in the title as well as in the actual news story. The prime focus was on boosting, through several techniques, the similarity value of a true positive pair, while heavily penalizing sure-shot mismatches, thus pushing the truly matching pairs above all others.
Extent of Match = N / √(SOURCE × TARGET)    (1)

where
N: number of intersecting words in the source string and target string
SOURCE: number of words in the source language string
TARGET: number of words in the target language string
We calculated the similarity between a source and a target string by computing the cosine of the angle between the two string vectors (equation 1 above). Matching proper names were assigned the highest weight (6.0), followed by title matches (3.0) and matches in the story (2.0), each weight being multiplied by the corresponding extent-of-match value. The total similarity value was the sum of the similarity values found from the correspondence in the three cases, viz. proper names, title and actual story. We normalize the total value to the range 0 to 100, depicting no match at all and a perfect match respectively.
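A minimal sketch of this weighted similarity computation is given below; the normalization to the 0-100 range is simplified for illustration, and the field names are assumptions.

```python
# extent_of_match implements equation (1); the weights (6.0, 3.0, 2.0) follow the text.
import math

def extent_of_match(source_words, target_words):
    source, target = set(source_words), set(target_words)
    if not source or not target:
        return 0.0
    n = len(source & target)                      # intersecting words
    return n / math.sqrt(len(source) * len(target))

def story_similarity(src, tgt):
    """src/tgt are dicts with 'names', 'title' and 'story' word lists."""
    weighted = (6.0 * extent_of_match(src["names"], tgt["names"]) +
                3.0 * extent_of_match(src["title"], tgt["title"]) +
                2.0 * extent_of_match(src["story"], tgt["story"]))
    max_weight = 6.0 + 3.0 + 2.0
    return 100.0 * weighted / max_weight          # scale to the 0-100 range
```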
A match in titles increased the chances of the pair being comparable, since in reality the chances of related stories having highly similar titles are not high: newspapers often present titles in sensational ways to attract readers, frequently deviating from the "real" content.

Based on the closeness of the stories, we imposed heavy penalties on pairs with different dates of publication and/or different places. The idea is that even if a matching pair appears on the websites with a wide time gap, its higher similarity value would make up for the penalty. This
step reduces the chances of mismatching stories attaining higher
similarity value due to
accidental similarities in content.
4.5 Filtering
Rejecting, at an early stage, pairs of stories that cannot be similar under any circumstances can lead to a substantial saving in the time and effort required to get the results. The filtering techniques used to reject obvious-looking mismatches were a combination of temporal and size filtering. We permitted a tolerance of seven days on either side between the dates of publication of the source and target stories. The tolerance was adjustable to any extent, but we kept it at a considerable value to reduce the time required to run the program without losing any true pair.

We also use length filtering to throw away pairs with a large size variation between the stories. When the match between titles was very low and the difference between the lengths of the stories was high, we rejected the current target language text and continued with the next one. As already stated, we assume beforehand that the amount of insertions and/or deletions at both ends of a matching pair is limited. After a careful analysis of news stories, we kept the maximum ratio of sizes at four. This step reduced the total number of target files considered for each source story, escaping the comparison of obviously mismatching stories. A sketch of these two filters follows.
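```python
# A minimal sketch of the temporal and size filters described above.
# The seven-day tolerance and the size ratio of four follow the text;
# date handling and the decoupling from the title-match check are simplifications.
from datetime import date

MAX_DAY_GAP = 7
MAX_SIZE_RATIO = 4.0

def passes_temporal_filter(src_date: date, tgt_date: date) -> bool:
    return abs((src_date - tgt_date).days) <= MAX_DAY_GAP

def passes_size_filter(src_len: int, tgt_len: int) -> bool:
    if min(src_len, tgt_len) == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= MAX_SIZE_RATIO

# A pair is compared in detail only if both filters pass:
# if passes_temporal_filter(s.date, t.date) and passes_size_filter(len(s.body), len(t.body)): ...
```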
4.6 Conflict Resolution
At each point of time, for one source story, we find the similarity value with the corresponding target story. If the calculated similarity is above a certain pre-specified threshold, and greater than the maximum similarity found until now, we save the corresponding pair.

It is possible that at a certain point the total similarity value attained for a source story is the same for two or more target stories. In this case, we select the appropriate target
story through a process of conflict resolution. The process followed when two or more pairs have the same total similarity value for the same source story is as follows. First, we check the temporal closeness of both target stories with the source news. Temporal closeness is a measure of the closeness of two stories based on the variation in date and place of publication. A pair with the same date as well as place of publication is closest. Next, with lesser closeness, comes the case wherein the date of publication is one day on either side and the place of publication is the same; this takes care of instances where two different press agencies publish the same story on adjacent dates for various reasons. Closeness reduces further when the date of publication is the same but the place of publication is different; this helps differentiate dissimilar articles published on the same date. The final case, with still lesser closeness, is the one where the place is different and the date is one day on either side. After that, if none of these cases holds, the closeness is a function of the difference between the dates of publication of the source-target pair, and goes on reducing as the difference grows.
[Decision flow: two pairs of news items with the same maximum similarity for one source file -> choose the pair with maximum temporal closeness -> if the same, choose the pair with more similarity from the story -> if still the same, add both pairs to the final set.]
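A minimal sketch of this conflict resolution step follows; the numeric closeness ranks are illustrative, and the story-level similarity function is assumed to be the one sketched in Section 4.4.

```python
# Candidate target stories tied on total similarity are ranked by temporal
# closeness (same date and place, then adjacent date with same place, then same
# date with a different place, then adjacent date, then by growing date gap),
# falling back to story-level similarity; surviving ties are all kept.
def temporal_closeness(src, tgt):
    day_gap = abs((src["date"] - tgt["date"]).days)
    same_place = src.get("place") == tgt.get("place")
    if day_gap == 0 and same_place:
        return 4
    if day_gap == 1 and same_place:
        return 3
    if day_gap == 0:
        return 2
    if day_gap == 1:
        return 1
    return -day_gap                    # closeness keeps dropping as the gap grows

def resolve_conflict(src, tied_targets, story_similarity):
    def key(t):
        return (temporal_closeness(src, t), story_similarity(src, t))
    ranked = sorted(tied_targets, key=key, reverse=True)
    best_key = key(ranked[0])
    return [t for t in ranked if key(t) == best_key]
```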
4.7 Example
This hypothetical situation demonstrates the series of decisions made for a source story (at the center) surrounded by a set of target stories. Some of the target stories are not comparable to the source, some articles describe the same happening, and there is one (or more) story in the target language that matches the source. The source story is in Hindi, translated here for illustration; the target stories are in English.

The breadth of the arrow between a pair depicts the closeness of a story to the source story at the center. For the stories on the left-hand side, the dotted arrows indicate that the stories are far away from the source story, while the ones on the right are closer to it. In each bubble, the first line shows the title, followed by the date of publication, and the text marked between hyphens (- ABC -) shows the possible reason for either its selection or rejection.
[Figure: example source story "Sanjay Dutt released from Pune jail", August 25, 2007 - (Hindi) -, surrounded by candidate target stories:
- "PM talks to Sonia Gandhi", 08/25/2007 - uncommon names -
- "Salman Khan released from jail", Aug 25, 2007 - names uncommon -
- "Sanjay Dutt's movie attracting crowds", Aug 25, 2007 - total similarity below threshold -
- "3 killed in Nandigram", Aug 25, 2007 - NE -
- "Sanjay arrested", Aug 21, 2007 - total similarity -
- "Sanjay Dutt released", Aug 25, 2006 - temporal filtering -
- "Munnabhai's story till now – '93–'07", Aug 25, 2007 - size -
- "Sanjay celebrates freedom", Aug 26, 2007 - total similarity -
- "Dutt to be released", Aug 24, 2007 - closeness -
- "Sanjay's case hearing today", Aug 24, 2007 - same date -
- "Sanjay's verdict deferred", Aug 23, 2007 - total similarity -
- "Sanjay released from Pune", Aug 25, 2007 - maximum similarity -
- "Bollywood relieved, Sanjay back home", Aug 25, 2007 - same maximum similarity -]
Chapter 5
Results

Precision is the ratio of true positives to the total retrieved pairs, whereas recall is the fraction of all relevant documents that are retrieved. We evaluated our system on precision, that is, the number of pairs that are actually comparable divided by the number of pairs tagged positive by the system. The desire is to get a high-precision system, though a system high on recall along with precision is difficult to attain.
Out of the total pairs retrieved, we pick some random pairs and
evaluate our system on
those pairs. For manual evaluation of the returned results, we considered a pair a valid match if at least one of the three conditions described below holds good.
A news story pair is obviously comparable if both stories talk of the same event; news describing the Indian cricket team winning a Twenty-20 World Cup match in a particular year in both the newspapers is an example of such a case. When both stories describe the same event with related contents and/or actions, we also treat them as comparable; for instance, a report in one language on the Indian team winning a Twenty-20 cricket World Cup match, and an interview of the captain of the winning team for the same competition in the other. Finally, stories that describe similar events, with at least one comparable sentence present in the pair, are also considered comparable, as they serve our purpose of getting similar news and comparable sentences; a report on the Indian team winning a Twenty-20 cricket World Cup match in one language, and a report in the other on World Cup matches won by India against the same team in the past, is an example of such news.
For each source-target pair, we compared the proper names (N) present in the whole of the paired stories for the calculation of similarity. In addition, depending upon the amount of content taken from the compared news articles, we calculated similarity for the following four cases:
1. Title only (T); the actual news story was NOT considered.
2. Title plus the first half of the news story (HS), up to a maximum of the first 10 lines.
3. Title and the whole news story (FS).
4. Only names; no title or news story.
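A small sketch of how the compared text could be assembled for each case, assuming a story is available as a title string plus a list of body lines; the helper and its exact behaviour are illustrative, not the actual implementation (proper names are always compared separately):

    def content_for_case(title, body_lines, case):
        """Return the text that is compared for a given case (1-4)."""
        if case == 1:                      # N+T: title only
            return title
        if case == 2:                      # N+T+HS: first half, capped at 10 lines
            half = body_lines[: min(10, max(1, len(body_lines) // 2))]
            return "\n".join([title] + half)
        if case == 3:                      # N+T+FS: title plus the whole story
            return "\n".join([title] + body_lines)
        if case == 4:                      # N: names only, no title or story text
            return ""
        raise ValueError("case must be between 1 and 4")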
5.1 Sample data
The sample data set consisted of 1711 different stories in the source language, Hindi, collected over a period of 30 days (29 March 2008 to 27 April 2008). The target-language (English) side had 3500+ stories collected over the period 27 March 2008 to 1 May 2008, surrounding the dates of the Hindi collection so as to cover the time gap between the uploading of comparable items on the websites. We are in the process of evaluating the system on a larger data set. For the improvements, we used a slightly bigger data set consisting of 2400+ source-language files collected over 35 days, and corresponding English data for a similar sampling period.
5.2 Evaluations
Figure 4: Precision (Y-axis) versus ten random points of evaluation (X-axis) for a sample data of stories collected over a period of one month; series N+T, N+T+HS, N+T+FS, N.
The graph (Figure 4) shows precision on the Y-axis plotted against evaluations at ten random points on the X-axis. The next graph (Figure 5) shows the average precision values of the evaluations for the four cases considered.
Figure 5: Average precision values of the evaluations, for the four cases, on the one-month sample data.
The overall system behaved satisfactorily: the average precision over the evaluations turned out to be 67.4 percent for the best case () and 37.4 percent for the worst case () (Figure 4). We infer that comparing news articles on only their proper names is not a good choice and in general leads to false positives. Comparing only the titles along with the named entities is better in some places, as similar words in the title boost the similarity value of similar news stories. However, as stated earlier, similar news stories often have mismatching titles, twisted and compressed in a bid to make them more striking to the reader, and two dissimilar stories can also share common words in their titles for various reasons (analyzed next).
The graph clearly shows Case 2 outperforming Case 3 at most of the points. In fact, the highest precision attained at any point is 78%, achieved at two points (points 4 and 10), which is higher than the highest precision obtained by Case 3 (76%, at point 4). We conclude that, for the majority of cases, the first half of the news carries the largest part of the vital information. This agrees with the usual scenario in which the first few sentences highlight the actual current event, while the trailing sentences provide
additional related information or comments, which are generally repetitions of material from preceding incidents.
However, on investigation we deduce that, for appropriate alignment of news articles at the document level, we need to consider the key features, such as names and dictionary words, in the whole story. The average-precision graph (Figure 5) shows Case 3 (N+T+FS) outshining Case 2 (N+T+HS) over all the random points taken together.
Figure 6 shows the number of documents retrieved versus the precision obtained. Nevertheless, for alignment of stories at the document level, considering only the first half of the story is not a good choice, as depicted in the average-precision curve; this is due to the consistent performance of Case 3 against the fluctuating precision of Case 2 in the preceding graph. In any case, considering only the titles is not a good choice.
Figure 6: Number of documents retrieved versus precision obtained for the top 140 results; series N+T, N+T+HS, N+T+FS, N.
Table 4 below shows the number of retrieved documents versus the precision values for the top 70 pairs.
Table 4: Number of retrieved documents versus precision values for the top 70 pairs

CASE      P@10    P@20    P@30    P@40    P@50    P@60    P@70
N+T       100     100     96.66   92.5    88      86.67   85.71
N+T+HS    100     100     90      85      86      85      84.29
N+T+FS    90      95      90      87.5    86      83.33   81.43
N         80      70      73.33   75      70      63.33   65.71
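The precision-at-k values in the table can be reproduced from a ranked list of manually judged pairs with a small routine such as the one below; the 0/1 judgement encoding is an assumption:

    def precision_at_k(judgements, k):
        """Precision@k (in percent) for a ranked list of 0/1 relevance judgements."""
        top = judgements[:k]
        return 100.0 * sum(top) / len(top) if top else 0.0

    # Example: ten ranked pairs, eight of which were judged comparable.
    ranked = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
    print(precision_at_k(ranked, 10))   # 80.0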
5.3 Analysis
5.3.1 Positive results
A pair is termed legitimate if it conforms to the specifications laid out at the start of this chapter, irrespective of the dates and places of publication of the paired articles. The image on the previous page shows a comparable story pair marked positive by the system.
The system identifies the pair correctly even though both the dates and the places of publication differ. The result can be attributed to the high count of the common proper name "मनु" (Manu) found in both stories (frequency 6 in Hindi and 10 in English). As expected, even after a high penalty on the stories for the difference in publication date and the place mismatch, the strong proper-name score kept the total similarity value above all others. We tuned the penalties to make sure that the similarity values of pairs like these remain above the threshold level after the punishment is applied, and thus become part of the final set if no other target story achieves a higher similarity.
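A minimal sketch of how such a penalty might be applied; the penalty factors, the threshold behaviour and the example values are purely illustrative, not the tuned values of the system:

    def penalised_similarity(total_sim, date_gap_days, same_place,
                             date_penalty=0.05, place_penalty=0.1):
        """Reduce the raw similarity for mismatches in publication date and place.

        The factors are illustrative; the intent is that a pair with a very
        strong proper-name match stays above the acceptance threshold even
        after both penalties are applied.
        """
        sim = total_sim - date_penalty * date_gap_days   # grows with the date gap
        if not same_place:
            sim -= place_penalty
        return max(sim, 0.0)

    # A pair with a strong name match survives a 2-day gap and a place mismatch.
    print(round(penalised_similarity(0.72, date_gap_days=2, same_place=False), 2))  # 0.52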
5.3.2 Negative results
As explained earlier, in some places the method did not behave the way it was intended to. We identified the following points of failure in the present arrangement that lead to wrong results.
5.3.2.1 Pair unable to cross the thresholds
With fewer matches in names, coupled with a low number of translated dictionary words, the total similarity value of a correct pair might not be adequate to cross the threshold boundary. The example below shows such a case.
The similarity of the pair shown was above the threshold when considering only proper names, and the pair made it to the final list of comparable documents. Nevertheless, due to the absence of an adequate number of translations, the total similarity value was reduced for the other cases. To boost the similarity value of such pairs, we enhanced the effect of a match in names to an extent where the effect of named entities is substantial, without overshadowing the effect of the other parameters. Tuning the weight associated with the names, we set it to twice the weight of the title and three times the weight of the story.
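A minimal sketch of the resulting combination, with the name weight fixed at twice the title weight and three times the story weight as stated above; the absolute scale of the weights is an assumption:

    def total_similarity(name_sim, title_sim, story_sim, story_weight=1.0):
        """Combine the three similarity components.

        The name weight is twice the title weight and three times the
        story weight; only the ratios matter, the scale is arbitrary here.
        """
        title_weight = 1.5 * story_weight   # so that name = 2 * title
        name_weight = 3.0 * story_weight    #          and name = 3 * story
        return (name_weight * name_sim
                + title_weight * title_sim
                + story_weight * story_sim)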
5.3.2.2 Absence of corresponding news article in target
language
With no story to match, accidental matches with a dissimilar target-language story can cross the thresholds and be counted as a "positive" pair. Accidental matches can occur in proper names and in dictionary words; sometimes numeric figures also match, further increasing the similarity value. Although we put a certain threshold value in place to suppress the effect
of such "accidental" matches on the overall results, and thus on the precision, we cannot rule out the possibility of such cases occurring.
5.3.2.3 Effect of similar names found in dissimilar stories
When similar names occur in dissimilar stories, the chances are that the match in names results in a high similarity value for that pair. Two stories, one on the population of India and the other on some world event with India as one of the participants, with the name India occurring frequently in both, can obtain a high name-similarity score. Since the total similarity depends on both the source and target file, even in the presence of a comparable target story, a different target story with a higher density of such names might produce a higher correspondence.
The next example shows a situation where a highly mismatching pair is tagged positive by the system due to the occurrence of the same name "राहुल" (Rahul) in both news items.
While the Hindi story speaks of cricket and contains the name "Rahul Dravid", the target story describes the latest fluctuations in politics, with "Rahul Gandhi" occurring frequently. On analysis we found that no corresponding news item was present in the target collection, due to which the system paired the above target story with the source story at the end of the run.
There was also a pair in our sample data set for which the system identified the correct target story in cases 1 to 3, but in case 4 it paired the same source story with another, mismatching story because of a larger number of intersecting names. The images below show the two pairs. The first is the correct pair, identified when some content from the story is compared, viz. the title in case 1, the first half of the story in case 2, and the full story in case 3. However, in case 4 only the names are compared for calculating the total similarity; the next figure depicts that case.
The total count of proper-name occurrences common to the true pair was six ("इरान" (Iran) occurring six times in the source story and nine times in the target), whereas in the mismatching pair "इरान" (Iran) occurred six times in each and "तेहरान" (Tehran) occurred twice, making the total count common to both eight. When the comparison is made on names alone, the pair with the larger number of common name occurrences achieves the higher total.
Topic identification could be one possible way to minimize such false positives. A classification of news stories into independent fields, based on identification of the context, can group similar news together; in that case, the above two stories would fall into separate clusters, and penalizing matches across dissimilar clusters could reduce false positives like this to a great degree. One possible realization is sketched below.
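This sketch uses a hypothetical keyword-based topic classifier; the topics, keyword lists and penalty factor are illustrative and are not part of the implemented system:

    # Illustrative keyword lists for a coarse topic classifier.
    TOPIC_KEYWORDS = {
        "sports":   {"cricket", "match", "wicket", "captain", "bowler"},
        "politics": {"parliament", "minister", "election", "party", "congress"},
    }

    def guess_topic(tokens):
        """Assign the topic whose keyword list overlaps the story the most."""
        scores = {t: len(kw & set(tokens)) for t, kw in TOPIC_KEYWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "unknown"

    def cluster_penalised(similarity, src_tokens, tgt_tokens, penalty=0.5):
        """Penalise a pair whose source and target fall into different clusters."""
        src_topic, tgt_topic = guess_topic(src_tokens), guess_topic(tgt_tokens)
        if "unknown" in (src_topic, tgt_topic) or src_topic == tgt_topic:
            return similarity
        return similarity * penalty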
In addition, we identified many pairs in which matching proper names had a strong effect despite the stories being mismatched. We discovered that most of them had a negligible similarity value in the story itself. To suppress pairs whose total similarity value results from similarity in proper names alone, we rejected pairs with very low similarity in the story; in this way, a pair needs some correspondence in all the measures applied to appear in the final list. A sketch of this filter follows.
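The sketch assumes each candidate pair exposes its individual component similarities; the threshold value is illustrative:

    def passes_story_filter(pair, min_story_sim=0.05):
        """Reject pairs whose correspondence comes from proper names alone."""
        return pair["story_sim"] >= min_story_sim

    candidates = [
        {"target": "a", "name_sim": 0.9, "story_sim": 0.01},   # names only
        {"target": "b", "name_sim": 0.6, "story_sim": 0.20},
    ]
    kept = [p for p in candidates if passes_story_filter(p)]
    print([p["target"] for p in kept])   # ['b']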
5.3.2.4 Effect of similar dictionary words in stories depicting
news on similar topic, yet
describing different events
This happened when the names were either all mismatching or their effect was negligible, yet a considerable number of dictionary translations were common to both stories. For instance, separate accidents occurring at different places, with unintentional matches in dictionary words and possibly even in names, resulted in a higher similarity value from the story and/or the title.
Some of the pairs, when analyzed, revealed that the dictionary too was responsible for erroneous matches: a translation found in the dictionary was actually a stopword for the news domain (or in general) and was included in the final similarity calculation. For example, the DL step translated the Hindi word "अब" to "now", and a story pair containing this commonly found word received a boost in its correspondence value. In particular, our test data contained a pair with "now" in the title itself, which helped that pair cross the minimum similarity levels.
Below is a snapshot of a case where the words "train" and "rail" were prominent in both stories even though they speak of different happenings. The pair is not comparable, yet it was marked positive by the system.
We tuned the weights associated with the similarities to minimize the effect of matches in dictionary words alone. Particularly in titles, small coincidental matches led to a large boost in similarity. To suppress this effect, at matching time we rechecked the translations against the target-language stopword list and filtered out any such words; this reduced such accidental matches in the final evaluations.
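A sketch of the stopword recheck applied to dictionary translations; the stopword set shown here is only a small illustrative fragment:

    # Fragment of an English stopword list, extended with words that behave
    # as stopwords in the news domain (illustrative only).
    TARGET_STOPWORDS = {"now", "the", "a", "an", "is", "said", "today"}

    def filter_translations(translations):
        """Drop dictionary translations that are target-language stopwords,
        so coincidental matches on words like "now" do not boost similarity."""
        return [w for w in translations if w.lower() not in TARGET_STOPWORDS]

    print(filter_translations(["now", "train", "accident"]))   # ['train', 'accident']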
5.3.2.5 Effect of wrongly identified pair of names
This was the result of errors in the TS stage, with some mismatching pairs crossing the threshold and being marked "true" by the system. It occurred when a proper name identified by the system in the target language matched some source-language word: treating the target word as a translation of the source word, the system included it in the final list of identified names. For instance, "बुकिंग" (booking) in Hindi was transformed into the name "Viking" after passing through the TS step. The corresponding word booking, not being a
proper name, was not present in the English named-entity list. The example shows the case where the Hindi word "ली" (to take) gets mapped to "Lee" and comes close to the name "Lee" present in the target story.
To suppress such mismatches, we assigned higher threshold levels for similarity in the TS step, tuning the thresholds on the length of the compared strings, with higher thresholds required for longer strings. A sketch of such a length-dependent test follows.
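The sketch uses a simple edit-distance ratio in place of the similarity measure actually used in the TS step; both the measure and the cut-off values are assumptions:

    import difflib

    def transliteration_match(src, tgt):
        """Accept a transliteration pair only if its similarity clears a
        threshold that grows with the length of the compared strings."""
        sim = difflib.SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
        length = max(len(src), len(tgt))
        threshold = 0.7 if length <= 4 else 0.8 if length <= 8 else 0.85
        return sim >= threshold

    print(transliteration_match("tehran", "teheran"))   # True: very close strings
    print(transliteration_match("booking", "viking"))   # False: longer strings need a higher score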
5.3.2.6 Effect of poor dictionary
The bilingual dictionary used for this experiment is still at the development stage; when last used, it contained 24,824 Hindi words. The inclusion of more source-language
words will lead to more translations being identified in the DL step, increasing the effect of similarity in dictionary words and thus resulting in the identification of more similar pairs.
5.4 Improvements
To achieve a further improvement in the results, we adopted some additional techniques. Firstly, we adopted a three-level weighting for the English names found in the stories: the highest weight is assigned to proper names identified by both the LingPipe tool and our custom-made NER, followed by names identified only by LingPipe, and then names found only by our recognizer. This ensures the maximum effect for names confidently identified by both systems for English. When finding intersecting names, we checked the weight of the identified Hindi name's counterpart in the English name collection, and the total similarity was a function of that weight. A sketch of this weighting is given below.
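The sketch assumes every English name is tagged with the recognizer(s) that produced it; the weight values themselves are illustrative:

    # Names confirmed by both LingPipe and the custom NER count the most,
    # followed by LingPipe-only names, then custom-NER-only names.
    NAME_WEIGHTS = {"both": 1.0, "lingpipe": 0.75, "custom": 0.5}

    def name_similarity(hindi_names, english_names):
        """english_names maps a (transliterated) name to its recognizer tag;
        each intersecting Hindi name contributes the weight of the tag of
        its English counterpart."""
        return sum(NAME_WEIGHTS[english_names[n]]
                   for n in hindi_names if n in english_names)

    english = {"manu": "both", "pune": "lingpipe", "dutt": "custom"}
    print(name_similarity(["manu", "pune", "sonia"], english))   # 1.75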
In addition, a deeper analysis of mismatching pairs revealed a common trend in which the stories came close due to a high similarity in names but had significantly low correspondence in the actual news item. We reduced the effect of such mismatches by rejecting any pair with a low similarity value for the news story. As expected, the system responded positively, and we attained results much better than the previous run on the same sample set.
Figure 7 shows the graph obtained with the improved version of the system on a larger collection. The graph plots the number of top retrieved documents, taken 10 at a time, against precision. This collection included Hindi news for a period of 35 days, with a supporting target collection in English. The graph met our expectations and shows better precision for the case in which the whole story is included for document-level alignment, contrary to Figure 6, where including only the title along with the proper names in the similarity calculation shows better precision. The precision values against the number of retrieved documents for the top 70 retrieved pairs are in Table 5.
Figure 7: Number of documents retrieved (in steps of 10, up to 70) versus precision obtained for the improved system; series N+T, N+T+HS, N+T+FS, N.
Table 5: Number of retrieved documents versus precision values for the top 70 pairs, for the improved system

CASE      P@10    P@20    P@30    P@40    P@50    P@60    P@70
N+T       100     100     96.67   97.5    92      90      87.14
N+T+HS    100     100     96.67   97.5    96      90      87.14
N+T+FS    100     100     100     100     96      90      85.71
N         100     95      96.67   90      82      73.33   70
Thus, based on the improved results, we restate the observation, in accordance with the general perception, that the former half of a news article furnishes the largest part of the essential information conveyed; however, it is the consideration of the whole story that results in better identification of similar stories. Detection of association in a source-target pair through intersecting proper names is indispensable, but it should not be the only criterion, specifically for languages like Hindi, which is highly undersupplied with valuable cross-language assets.
Chapter 6
Conclusion
The size of the corpus obtained so far is not yet sufficient for appreciable use in the applicable areas of research and technology. At the time of writing this thesis, we were able to generate a repository consisting of more than 100 true pairs of comparable stories. We look forward to accumulating many more such pairs, aligned at the document level, so that the corpus becomes helpful to researchers everywhere. In addition, the present work covers the language pair Hindi-English; with language-specific resources in hand, it is possible to extend the system to generate a multilingual corpus in the participating languages. The resources for Bengali, the second most spoken language in India [Languages], are under development, and we wish to extend the present bilingual corpus to a trilingual corpus (English-Hindi-Bengali).
The present system crawls news articles from only one website for each language. NavBharatTimes lacks archives, so a story missed on a particular day might never enter the final collection. We found that Bhaskar.com contains archives of news articles published in preceding years, and we wish to extend the present system to more websites like Bhaskar.com in order to extract a large amount of comparable data.
The next step in the project is to align the comparable documents found at the sentence level, thus gathering comparable sentences from among the similar documents.
This project is a foundation stone for progress in numerous areas related directly or indirectly to natural language processing. An appreciable amount of such a corpus will act as a catalyst for research in the languages we built it for, and in many related languages. We hope this effort helps the Indian community attain our goal of shrinking the gap between English and prominent Indian languages like Hindi and Bengali, and of making information access easier.
6.1 Future work
We did not include Word Sense Disambiguation in the present system. Many words have different meanings depending upon the words around them in the sentence in which they are used; for instance, "bank" can refer to a financial institution as well as to the bank of a river, and in some contexts the word is used as a synonym for trust. A dissimilar pair using the word in different senses can come closer, harming the precision.
Though the precision is acceptable, the recall of the present system is low. This is the cascading effect, on the overall similarity, of the inefficiency of the Named Entity Recognizers in both languages: there were a significant number of documents in which the recognizers mined only a low percentage of the proper names. Improving the Named Entity Recognition systems so that they extract most of the named entities will help the positive pairs attain higher similarity values, thereby reducing the number of false positives and helping many correctly identified pairs, whose total association value falls below the associated threshold, to cross the boundary.
Bibliography

[Arora] Karunesh Kr. Arora, Sunita Arora, Vijay Gugnani, V. N. Shukla, S. S. Agrawal. GyanNidhi: A Parallel Corpus for Indian Languages including Nepali.

[Shinyama] Yusuke Shinyama, Satoshi Sekine. Paraphrase Acquisition for Information Extraction. Computer Science Department, New York University. In Proceedings of the Second International Workshop on Paraphrasing, Volume 16, Sapporo, Japan, pages 65-71, 2003.

[Resnik] P. Resnik, N. A. Smith. The Web as a Parallel Corpus. Computational Linguistics, Vol. 29, No. 3, pages 349-380, 2003.

[Somers] Harold Somers. Bilingual Parallel Corpora and Language Engineering. Department of Language Engineering, UMIST, Manchester, England.

[Almeida] José João Almeida, Alberto Manuel Simões, José Alves de Castro. Grabbing Parallel Corpora from the Web. Departamento de Informática, Universidade do Minho.

[Ma] Xiaoyi Ma, Mark Liberman. BITS: A Method for Bilingual Text Search over the Web. In Machine Translation Summit VII, 1999.

[Barzilay] Regina Barzilay, Noemie Elhadad. Sentence Alignment for Monolingual Comparable Corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 25-32, 2003.

[Mandal] Debasis Mandal, Sandipan Dandapat, Mayank Gupta, Pratyush Banerjee, Sudeshna Sarkar. Bengali and Hindi to English Cross-language Text Retrieval under Limited Resources. In Working Notes for the CLEF 2007 Workshop, 2007.

[Liddy] Elizabeth D. Liddy. How Might CLIR Be Accomplished? ASIST Annual Meeting, Chicago, IL, 11/13/2000. http://www.cnlp.org/presentations/slides/CLIR.pdf

[Ethnologue] Ethnologue list of most spoken languages in the world. http://en.wikipedia.org/wiki/Ethnologue_list_of_most_spoken_languages

[Languages] Languages of India. http://languages.iloveindia.com/

[Arist] Douglas W. Oard, Anne R. Diekema. Cross-Language Information Retrieval. ARIST chapter.
[McEnery] Anthony McEnery, Zhonghua Xiao. Parallel and Comparable Corpora: What Are They Up To?

[Maia] Belinda Maia. Creating Parallel and Comparable Corpora for Work in Domain-Specific Areas of Language. FLUP.

[Peters] Carol Peters. Multilingual Information Access for Digital Libraries. ISTI-CNR, Pisa.

[CNNIC] Serving the Needs of the Community — IDN and Alternatives.

[Yale] Mohan Yale. Building a Sustainable Framework for a Multilingual Internet, Internationalized Domain Names (IDNs). A2K2 Conference, New Haven, USA, 27 April 2007.

Tools and resources:
LingPipe home: http://alias-i.com/lingpipe/