17 January 2006 Hughes @ OpenRoad 2006 1 Towards a Web Search Service for Minority Language Communities Baden Hughes Department of Computer Science and Software Engineering The University of Melbourne [email protected]
Towards a Web Search Service for Minority Language Communities

Jan 16, 2015

Baden Hughes

Talk at OpenRoad 2006 (17 January 2006, Melbourne)
Transcript
Page 1: Towards a Web Search Service for Minority Language Communities

17 January 2006 Hughes @ OpenRoad 2006 1

Towards a Web Search Service for Minority Language Communities

Baden Hughes
Department of Computer Science and Software Engineering
The University of Melbourne
[email protected]

Page 2: Towards a Web Search Service for Minority Language Communities

Diversity in Australia

• Well recognised cultural and linguistic diversity of Australia’s population
  - SIL Ethnologue
    - 311 languages (14th edition, 2000)
    - 318 languages (15th edition, 2005)
    - Australia in top 10 countries for linguistic diversity ( = languages in a country / languages globally )
  - ABS: 364 languages (2005)
• Considerable number of low density languages used within immigrant communities

Page 3: Towards a Web Search Service for Minority Language Communities

Inefficiency of Web Search

• General web search is a low precision activity in the best case scenario
  - Google: 8 billion web pages
• Web search for materials in lesser-used languages is even lower precision than the general case
• Web search for minority (“low density”) languages is even lower precision again
  - Mining the ‘long tail’ of the web is a specialist domain of research

Page 4: Towards a Web Search Service for Minority Language Communities

Harvesting vs Enabling

• Previous work in linguistically-oriented data mining of web content to create derivative works: corpora, dictionaries
  - None of these address the low precision issues for generalized web search
• Our work is aimed at increasing the likelihood that end users searching for resources in minority languages on the web will find useful results from searching
  - Developing use-case specific tools for web search and leveraging existing broad coverage web search tools

Page 5: Towards a Web Search Service for Minority Language Communities

Open Language Archives Community (OLAC)

• OLAC is a consortium of linguistic data archives
  - http://www.language-archives.org/
  - 34 archives, 28K+ objects in catalogue
• OLAC metadata is based on Dublin Core, with extensions for specifically linguistically-oriented properties eg language, data type, subject language, linguistic subject
• OLAC is an Open Archives Initiative (OAI) subcommunity
  - Uses standard OAI Protocol for Metadata Harvesting to promote data access and integration
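
A minimal sketch of what harvesting over the OAI Protocol for Metadata Harvesting can look like, assuming a repository that serves unqualified Dublin Core (`oai_dc`). The base URL, record identifier and sample response below are hypothetical and only illustrate the shape of a `ListRecords` exchange; they are not taken from any OLAC archive.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body, trimmed to one record for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample word list</dc:title>
          <dc:language>din</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return (identifier, title, language) tuples from a ListRecords response."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        language = record.findtext(".//dc:language", default="", namespaces=NS)
        rows.append((identifier, title, language))
    return rows


if __name__ == "__main__":
    print(list_records_url("http://archive.example.org/oai"))  # hypothetical endpoint
    print(parse_records(SAMPLE_RESPONSE))
```

The sketch parses an embedded sample rather than fetching over the network, so it runs offline; a real harvester would also follow resumption tokens for large result sets.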

Page 6: Towards a Web Search Service for Minority Language Communities

In vs About

• OLAC Metadata crucially distinguishes between
  - The language a resource is in (‘language’)
  - The language a resource is about (‘subject language’)
• Such differentiation allows for additional precision in classifying, indexing and searching for low density language resources
  - ‘In-ness’ is more interesting than ‘About-ness’
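
The in/about distinction can be illustrated with a small record type. This is a sketch only: the class name, field names, URIs and language codes are hypothetical, chosen to mirror the ‘language’ and ‘subject language’ properties named above.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceDescription:
    uri: str
    # Languages the resource is written or spoken in ('language').
    languages: list = field(default_factory=list)
    # Languages the resource is about ('subject language').
    subject_languages: list = field(default_factory=list)


def resources_in(records, code):
    """Resources whose content is in the given language."""
    return [r for r in records if code in r.languages]


def resources_about(records, code):
    """Resources that describe the given language, whatever they are written in."""
    return [r for r in records if code in r.subject_languages]


# Hypothetical catalogue: a news page written in Dinka, and an
# English-language grammar that describes Dinka.
CATALOGUE = [
    ResourceDescription("http://news.example.org/din", languages=["din"]),
    ResourceDescription(
        "http://grammar.example.org/dinka",
        languages=["eng"],
        subject_languages=["din"],
    ),
]

if __name__ == "__main__":
    print([r.uri for r in resources_in(CATALOGUE, "din")])
    print([r.uri for r in resources_about(CATALOGUE, "din")])
```

A speaker looking for reading material wants the first query; a linguist looking for descriptions wants the second. Collapsing the two fields into one would mix both result sets.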

Page 7: Towards a Web Search Service for Minority Language Communities

Service Architecture

• Building on previous work in developing robust strategies for identifying web resources for lesser used languages on the web, the LangGator service architecture provides
  - Language-centric web resource identification and acquisition
  - Language-centric resource description
  - Language-aware end-user resource discovery

Page 8: Towards a Web Search Service for Minority Language Communities

Crawler Internals

• Crawl seeded by language name variants (Ethnologue), place and country names and variants (Getty TGN), lexical items (Rosetta)
• Programmatic queries against Google, Yahoo, A9, DogPile
  - Essentially guided metasearch
• Resulting URIs merged and sorted using rank aggregation techniques
• Highly ranked documents from metasearch used for focused crawling around URI
  - TF/IDF for low frequency items in found documents
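
The slide does not say which rank aggregation technique is used. As one possible illustration, the sketch below merges per-engine result lists with a Borda count, a common choice for metasearch. The function name and the example URIs are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge ranked URI lists with a Borda count.

    With n the length of the longest list, each list awards n points to
    its first URI, n - 1 to its second, and so on. A URI absent from a
    list receives nothing from that list. Each list is assumed to contain
    no duplicate URIs.
    """
    n = max((len(ranking) for ranking in rankings), default=0)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, uri in enumerate(ranking):
            scores[uri] += n - position
    # Highest score first; ties broken alphabetically so output is stable.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


# Hypothetical result lists from three search engines for one seed query.
ENGINE_A = ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
ENGINE_B = ["http://b.example/2", "http://a.example/1"]
ENGINE_C = ["http://b.example/2", "http://c.example/3", "http://d.example/4"]

if __name__ == "__main__":
    for uri in borda_aggregate([ENGINE_A, ENGINE_B, ENGINE_C]):
        print(uri)
```

URIs returned near the top by several engines rise in the merged list, which is the property needed before choosing seeds for focused crawling.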

Page 9: Towards a Web Search Service for Minority Language Communities

Crawler Status

• Running intermittently since July 2004 on high bandwidth research infrastructure
• >1.6 million web resources have been identified in over 3000 languages
• Some exposed via standard OLAC search
  - Majority exposed to standard search engines via DP9 gateway
  - Full circle exploitation of web search
  - Evaluation of precision improvement is ongoing
• More details in the paper (or Hughes 2005 paper)

Page 10: Towards a Web Search Service for Minority Language Communities

Metadata Descriptions

• Describing resources separately from their realization is required since the web based language-centric resources are not held locally
• Metadata creation is an effort intensive process
  - Automatic description generation is well studied in the general digital libraries community (eg Paynter 2005)
  - Some metadata elements are well supported by existing automatic metadata creation tools
  - We focus particularly on language vs subject language metadata creation since it is of primary importance

Page 11: Towards a Web Search Service for Minority Language Communities

Metadata Descriptions Status

• We use a combination of machine learning approaches to compare and classify a given resource against human curated gold standard data for known languages
  - Primary data points: encoding, word n-grams, character n-grams
  - Secondary data points: geographical referent colocation, lexical item occurrence, URI
• Currently described around 40% of the >1.6 million URIs found by crawler at probability of 0.8 or higher as threshold for acceptable language identification
  - Computationally bound at present, but re-engineering
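
The slides name character n-grams as a primary data point and 0.8 as the acceptance threshold, but do not specify the classifiers. The sketch below substitutes one simple technique, a naive Bayes model over character trigrams with add-one smoothing and a uniform prior, to show how such a threshold can be applied. The training sentences, language codes and function names are illustrative assumptions, far smaller than any real gold standard.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of the lower-cased text, padded with one space each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """samples maps a language code to a list of texts; returns n-gram counts per language."""
    return {
        lang: Counter(g for text in texts for g in char_ngrams(text, n))
        for lang, texts in samples.items()
    }


def classify(text, model, n=3, threshold=0.8):
    """Return (language or None, probability of the best language)."""
    vocabulary = set().union(*model.values())
    grams = char_ngrams(text, n)
    log_scores = {}
    for lang, counts in model.items():
        # Add-one smoothing; the extra 1 reserves mass for unseen n-grams.
        total = sum(counts.values()) + len(vocabulary) + 1
        log_scores[lang] = sum(math.log((counts[g] + 1) / total) for g in grams)
    # Convert log scores to a posterior under a uniform prior (log-sum-exp).
    peak = max(log_scores.values())
    norm = sum(math.exp(score - peak) for score in log_scores.values())
    best = max(log_scores, key=log_scores.get)
    probability = math.exp(log_scores[best] - peak) / norm
    return (best if probability >= threshold else None), probability


# Tiny illustrative training set (assumption, not the project's data).
SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "this is a page about language resources on the web",
    ],
    "id": [
        "saya sedang mencari bahan bacaan dalam bahasa indonesia",
        "ini adalah halaman tentang sumber daya bahasa di web",
    ],
}
MODEL = train(SAMPLES)

if __name__ == "__main__":
    print(classify("the language of the web page", MODEL))
    print(classify("", MODEL))  # no evidence, so the classifier abstains
```

Returning `None` below the threshold leaves a resource undescribed rather than mislabelled, which matches the slide's framing of 0.8 as the bar for an acceptable identification.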

Page 12: Towards a Web Search Service for Minority Language Communities

Search Facilities

• Currently search delivered via OLAC Search Engine (http://www.language-archives.org/tools/search/)
• Features
  - Web search style interface, UTF-8 support, no restrictions on string, operators, inline syntax
  - Fuzzy string matching for geographical entities and language names
  - ‘Click minimization’ strategy for empty search: pre-composed derivative queries
  - Exploits Ethnologue and Getty ontologies
  - Exploits linguistic knowledge (eg language families)
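
Fuzzy matching of language names can be sketched with Python's standard `difflib` module. This is an illustration of the idea only, not the OLAC search engine's implementation; the function name, cutoff and name list are assumptions.

```python
import difflib


def match_language_name(query, known_names, cutoff=0.6, limit=3):
    """Return known names that approximately match the query, best match first."""
    # Compare case-insensitively but return names in their original form.
    lookup = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(query.lower(), list(lookup), n=limit, cutoff=cutoff)
    return [lookup[hit] for hit in hits]


# Hypothetical slice of a language name list.
KNOWN_NAMES = ["Dinka", "Ewe", "Kabiye"]

if __name__ == "__main__":
    print(match_language_name("Dinca", KNOWN_NAMES))  # misspelt query
    print(match_language_name("xyz", KNOWN_NAMES))    # nothing close enough
```

A tolerant match of this kind helps when users type a variant spelling or transliteration, which is common for language and place names.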

Page 13: Towards a Web Search Service for Minority Language Communities

Search Facilities

• Localization-oriented interface
  - XML core with XSL
  - Entirely user preference driven with a default
  - Post-query encoding/language change
  - Currently code auditing for upgrading interface strings to XLIFF Portable Objects
  - Interest for localization into French, Spanish, Bahasa Indonesia, Vietnamese, Thai
• More search architecture detail in Kamat and Hughes (2005)

Page 14: Towards a Web Search Service for Minority Language Communities

Language Search: Dinka

Page 15: Towards a Web Search Service for Minority Language Communities

Country Search: Togo

Page 16: Towards a Web Search Service for Minority Language Communities

Future Work

• Increased frequency of web crawling
• More efficient and reliable language identification
• End user documentation and accessibility
• API documentation for third party data consumers and documentation for service/interface customization
• Map based search GUI; better geographical context-aware search
• Linguistic or geographical proximity based language matching
• Basic Language Resource Kits (BLARK)
• Integration with MyLanguage

Page 17: Towards a Web Search Service for Minority Language Communities

Conclusion

• Language-centric broad coverage web search is a strongly motivated user function

• Major search providers do not focus on precision improvement per se, but their results can be incrementally improved through covert means

• A multilingual web and multilingual web users can be supported effectively, even down to low densities

• Interested in leveraging our existing research and service development in other ways

Page 18: Towards a Web Search Service for Minority Language Communities

Acknowledgements

• Research supported by the Australian Research Council under the funding program for Special Research Initiatives (E-Research) Grant SR0567353 “An Intelligent Search Infrastructure for Language Resources on the Web”.