WebLicht Application and Workspaces Munich September 2010 www.d-spin.org WebLicht Application and “Workspaces” Erhard Hinrichs & Thomas Zastrow University Tübingen

Page 1: WebLicht  Application and “Workspaces”

WebLicht Application and Workspaces

Munich, September 2010

www.d-spin.org

WebLicht Application and “Workspaces”

Erhard Hinrichs & Thomas Zastrow, University Tübingen

Page 2: WebLicht  Application and “Workspaces”

Outline

Web-based Linguistic Chaining Tool (WebLicht) for incremental filtering and access of language corpus data

• WebLicht – Motivation
• WebLicht – Architecture
• WebLicht – Future Requirements
• Test Case – Gutenberg Corpus

Page 3: WebLicht  Application and “Workspaces”

CLARIN Mission

CLARIN (Common Language Resource and Technology Infrastructure Network)

• is committed to establishing an integrated and interoperable research infrastructure (RI) supporting easy access to and use of language resources and technology

• aims to overcome the current fragmentation and offer a stable, persistent and extendable infrastructure

• will offer its services to researchers and scholars across a wide spectrum of domains, in particular in the humanities and social sciences

• ESFRI roadmap project; implementation phase starts in 2011

Page 4: WebLicht  Application and “Workspaces”

Typical CLARIN user scenario

Scenario: A PhD student investigates regional differences in vocabulary and in word collocations in different variants of German.

Data: large text corpora available at BBAW in Berlin, at the Austrian Academy of Science in Vienna, the Swiss Text Corpus Project in Basel, and at EURAC, Bolzano.

Tools for targeted data access: WebLicht offers customizable chains of web services for filtering and analyzing the data

Page 5: WebLicht  Application and “Workspaces”

WebLicht - Motivation

• Many linguistic resources (corpora, dictionaries, …) and tools (tokenizer, tagger, parser, …) are available

• Most of them are implemented to run on local machines, which can be inconvenient and error-prone

• Requirements: go beyond “do-it-yourself” and “download-first” strategies

• The CLARIN solution: make tools and resources available as web services

Page 6: WebLicht  Application and “Workspaces”

WebLicht - Architecture

WebLicht is a service-oriented architecture (SOA) for accessing and processing text corpora

Development started in October 2008

WebLicht consists of the following components:

• Distributed services: offer functionality (resources & tools) over the (inter-)net; implemented as web services (ca. 90 at the moment)

• Repository: stores metadata and technical information about the services

• Web 2.0 based user interface: interacts with the user and combines services and information from the repository; access is still possible via scripts / programming code

Page 7: WebLicht  Application and “Workspaces”

WebLicht - Architecture

[Architecture diagram. Labels: Web 2.0 application for tool chaining and execution; Repository; standard-conformant text corpus encoding; sites: Stuttgart, Tübingen, Berlin, Leipzig, Finland, Romania, Iceland, UK]

Page 8: WebLicht  Application and “Workspaces”

WebLicht – Architecture

Services are implemented as REST-style web services

The HTTP POST method is used to send data from the UI to the services

Anything that can speak the HTTP protocol can be used as a client:

• Browser
• Command-line tools (wget, curl)
• Programming languages

Anyone can implement their own interface to WebLicht
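
As an illustration of a scripted client, here is a minimal Python sketch that prepares and sends an HTTP POST request using only the standard library. The endpoint URL and the plain-text payload format are assumptions made for this example; the slides do not specify service addresses or the exchange format.

```python
import urllib.request

# Hypothetical endpoint; real service URLs would come from the WebLicht repository.
SERVICE_URL = "https://example.org/weblicht/tokenizer"


def build_request(text, url=SERVICE_URL):
    """Prepare an HTTP POST request carrying UTF-8 text to a service."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        # Assumed payload type; the actual services may expect another format.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def call_service(text, url=SERVICE_URL, timeout=60):
    """Send the text and return the decoded response body."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request needs no network access; call_service() would send it.
request = build_request("Das ist ein Test.")
```

The same request could be issued from the command line with curl or wget; the point is only that any HTTP-capable client is sufficient.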

Page 9: WebLicht  Application and “Workspaces”

WebLicht - Processing Chains

Page 10: WebLicht  Application and “Workspaces”

WebLicht - Results

Page 11: WebLicht  Application and “Workspaces”

WebLicht - Results

Page 12: WebLicht  Application and “Workspaces”

WebLicht - Features

With REST-style web services, everyone can implement a web service for WebLicht (4-page tutorial)

The SOA infrastructure is independent of programming languages and operating systems

The chaining algorithm is independent of the data format used

From a legal point of view, the web services remain located at the institute where they were created
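
One way to picture a format-independent chaining step is the Python sketch below. It is not the WebLicht algorithm itself: the service names, the layer names, and the greedy selection strategy are all assumptions for illustration. The chainer never inspects the data; it only compares metadata about which annotation layers a service requires and which layer it produces.

```python
# Hypothetical service descriptions: name -> (required layers, produced layer).
SERVICES = {
    "tokenizer": ({"text"}, "tokens"),
    "tagger": ({"text", "tokens"}, "pos"),
    "parser": ({"tokens", "pos"}, "parse"),
}


def next_services(available, services=SERVICES):
    """Names of services that can run now and would add a new layer."""
    return sorted(
        name
        for name, (required, produced) in services.items()
        if required <= available and produced not in available
    )


def build_chain(target, available=frozenset({"text"}), services=SERVICES):
    """Greedily append runnable services until the target layer is produced."""
    layers = set(available)
    chain = []
    while target not in layers:
        candidates = next_services(layers, services)
        if not candidates:
            raise ValueError(f"no chain produces {target!r}")
        name = candidates[0]
        chain.append(name)
        layers.add(services[name][1])
    return chain


chain = build_chain("parse")
```

The greedy strategy may include services that the target does not strictly need; an interactive interface would instead let the user pick from `next_services` at each step.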

Page 13: WebLicht  Application and “Workspaces”

WebLicht – Future Requirements

• Web services are synchronous, but some linguistic annotation processes are very time-consuming
  • Asynchronous behavior of these services would be desirable

• The processing power is limited by local computing resources
  • Scalability is only possible with strong centers

• The current architecture is not sufficiently parallelized and therefore does not scale up
  • Accommodate a large number of simultaneous users
  • Parallelization of processes
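
The asynchronous behavior asked for here could look roughly like the following Python sketch: a job is submitted, the caller gets an identifier back immediately, and the result is fetched later while several jobs run in parallel. The `JobQueue` class and the `slow_annotation` placeholder are hypothetical names invented for this example, and a thread pool merely stands in for whatever execution backend a real deployment would use.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobQueue:
    """Toy asynchronous front end: submit returns at once, results are fetched later."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}

    def submit(self, func, *args):
        """Start the job in the background and return its identifier."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Report whether the job has finished, without blocking."""
        return "done" if self._jobs[job_id].done() else "running"

    def result(self, job_id, timeout=None):
        """Block until the job finishes and return its result."""
        return self._jobs[job_id].result(timeout=timeout)


def slow_annotation(text):
    """Placeholder for a time-consuming annotation step (hypothetical)."""
    return text.split()


queue = JobQueue()
job_ids = [queue.submit(slow_annotation, t) for t in ("ein Satz", "noch ein Satz")]
results = [queue.result(job_id) for job_id in job_ids]
```

A client would typically poll `status` (or be notified) instead of holding an HTTP connection open for the full duration of the annotation.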

Page 14: WebLicht  Application and “Workspaces”

WebLicht – Future Requirements

• Currently, users have to store the input data and their results on their local machines
  • Online storage in the form of personal workspaces with reliable backup solutions

• Linguistic tools are typically developed in a variety of heterogeneous software environments and programming languages (Java, Perl, Python, C/C++, Prolog, Lisp, …)
  • Encapsulation of individual services with common APIs for interoperability

• Currently, WebLicht services are limited to processing text corpora
  • Extending web services also to spoken language and multi-modal datasets (MPI is already working on this)
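
To make the idea of encapsulation behind a common API concrete, the Python sketch below wraps two toy tools behind one shared interface. The `AnnotationService` interface, the dictionary-based document, and both example tools are assumptions for illustration only; the slides do not define an actual API.

```python
from abc import ABC, abstractmethod


class AnnotationService(ABC):
    """Hypothetical common interface that every wrapped tool would expose."""

    @abstractmethod
    def process(self, document: dict) -> dict:
        """Return a copy of the document with one more annotation layer."""


class WhitespaceTokenizer(AnnotationService):
    """Toy tokenizer that splits the text on whitespace."""

    def process(self, document):
        return {**document, "tokens": document["text"].split()}


class LengthTagger(AnnotationService):
    """Stand-in for a tool written elsewhere; tags each token with its length."""

    def process(self, document):
        return {**document, "lengths": [len(t) for t in document["tokens"]]}


def run_chain(services, document):
    """Apply each service in order; callers rely only on the shared interface."""
    for service in services:
        document = service.process(document)
    return document


doc = run_chain([WhitespaceTokenizer(), LengthTagger()], {"text": "WebLicht chains tools"})
```

Because `run_chain` depends only on `process`, the implementation language of the tool behind each wrapper does not matter to the caller.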

Page 15: WebLicht  Application and “Workspaces”

Test Case: Gutenberg Corpus

On the basis of this structure, part of the freely available Project Gutenberg collection was annotated in Tübingen

• Ca. 20,000 texts from 800 authors
• Runtime: ca. 3.5 weeks
• Result: 217 million tokens (words), 533 million constituents, 110 GB of data
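
As a rough cross-check of these figures, the following Python sketch derives a few averages. It assumes that the 3.5 weeks were continuous processing time and that 1 GB means 10^9 bytes; neither is stated on the slide, and all inputs are approximate.

```python
# Figures quoted on the slide; all are approximate ("ca.").
texts = 20_000
tokens = 217_000_000
constituents = 533_000_000
data_bytes = 110 * 10**9                 # assumes 1 GB = 10^9 bytes
runtime_seconds = 3.5 * 7 * 24 * 3600    # assumes uninterrupted processing

tokens_per_second = tokens / runtime_seconds
tokens_per_text = tokens / texts
bytes_per_token = data_bytes / tokens
constituents_per_token = constituents / tokens

print(f"{tokens_per_second:.1f} tokens per second")
print(f"{tokens_per_text:.0f} tokens per text")
print(f"{bytes_per_token:.0f} bytes per token")
print(f"{constituents_per_token:.2f} constituents per token")
```

Under these assumptions the run averages roughly 100 tokens per second, about 10,850 tokens per text, and about 500 bytes of annotated data per token.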

Page 16: WebLicht  Application and “Workspaces”

Gutenberg Corpus – Analyzing

• Full-text index (Lucene)
• Database for the linear part of the data
• Tree-like structures can be analyzed with XML-based techniques (XPath, XQuery)
• DOM-based techniques are slow and resource-hungry
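
The contrast between loading a whole tree and processing it incrementally can be sketched in Python with the standard `xml.etree.ElementTree` module, which supports a limited XPath subset. The XML layout below is a made-up constituent encoding, not the actual corpus format, and the example does not involve Lucene or XQuery.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical constituent-tree encoding; the real corpus format is not given here.
SAMPLE = """<corpus>
  <sentence id="s1">
    <constituent cat="S">
      <constituent cat="NP"><token>Der</token><token>Hund</token></constituent>
      <constituent cat="VP"><token>schläft</token></constituent>
    </constituent>
  </sentence>
  <sentence id="s2">
    <constituent cat="S">
      <constituent cat="NP"><token>Sie</token></constituent>
      <constituent cat="VP"><token>liest</token></constituent>
    </constituent>
  </sentence>
</corpus>"""

NP_PATH = ".//constituent[@cat='NP']"


def noun_phrases_in_memory(xml_text):
    """Load the whole tree, then select NP constituents with an XPath expression."""
    root = ET.fromstring(xml_text)
    return [" ".join(t.text for t in np.iter("token")) for np in root.findall(NP_PATH)]


def noun_phrases_streaming(xml_bytes):
    """Same query, but each sentence is discarded as soon as it has been processed."""
    phrases = []
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "sentence":
            for np in elem.findall(NP_PATH):
                phrases.append(" ".join(t.text for t in np.iter("token")))
            elem.clear()  # free the finished sentence subtree
    return phrases


in_memory = noun_phrases_in_memory(SAMPLE)
streamed = noun_phrases_streaming(SAMPLE.encode("utf-8"))
```

Both functions return the same phrases; the streaming variant keeps only one sentence's subtree populated at a time, which matters at the scale of hundreds of millions of constituents.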

Page 17: WebLicht  Application and “Workspaces”

Links etc.

CLARIN homepage: http://www.clarin.eu

The D-Spin homepage: http://www.d-spin.org

WebLicht (login via DFN AAI): https://weblicht.sfs.uni-tuebingen.de/

Erhard Hinrichs, Thomas Zastrow
Seminar für Sprachwissenschaft
Universität Tübingen
Wilhelmstr. 19
D-72074 Tübingen

Page 18: WebLicht  Application and “Workspaces”

WebLicht - Combinations