JRC-Ispra, 16.09.04, Slide 1 Multilingual text analysis applications based on automatic Eurovoc indexing Ralf Steinberger Addressing the Language Barrier Problem in the Enlarged EU Automatic Eurovoc Descriptor Assignment JRC Workshop, Ispra, 16/17 September 2004 http://www.jrc.cec.eu.int/langtech


JRC-Ispra, 16.09.04, Slide 1

Multilingual text analysis applications based on automatic Eurovoc indexing

Ralf Steinberger

Addressing the Language Barrier Problem in the Enlarged EU Automatic Eurovoc Descriptor Assignment

JRC Workshop, Ispra, 16/17 September 2004

http://www.jrc.cec.eu.int/langtech


JRC-Ispra, 16.09.04, Slide 2

Applications mentioned so far

• Thesaurus indexing (summarise main concepts of document)
  – Fully automatic
  – Interactive
  – Monolingual and cross-lingual

• Document retrieval
  – Monolingual and cross-lingual

Eurovoc indexing can be used for MUCH MORE …


JRC-Ispra, 16.09.04, Slide 3

Main goals of JRC’s Language Technology (LT) activity

• Gather potentially user-relevant documents

• Analyse texts in various languages
  – Extract information from texts (Eurovoc)
  – Identify similarity between documents (Eurovoc)
  – Classify documents (Eurovoc)

• Visualise contents
  – of individual documents (Eurovoc)
  – of whole document collections (Eurovoc)


JRC-Ispra, 16.09.04, Slide 4

Eurovoc indexing as part of a tool set


JRC-Ispra, 16.09.04, Slide 5

(Cross-lingual) document similarity calculation

[Figure: an English text ("Resolution on radioactive waste") and a Spanish text ("Resolución sobre los residuos radioactivos"), each shown with numeric codes (6621020304; 52160104); label: "monolingual"]
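
Eurovoc descriptors are shared across languages, so two documents indexed with them can be compared even when the texts are in different languages. The following is a minimal sketch of that idea, assuming cosine similarity over weighted descriptor profiles. The similarity measure, the descriptor IDs (`D1`, `D2`, …) and the weights are illustrative assumptions, not taken from the slides.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptor profiles: made-up descriptor IDs mapped to made-up weights.
english_doc = {"D1": 0.9, "D2": 0.4, "D3": 0.1}
spanish_doc = {"D1": 0.8, "D2": 0.5, "D4": 0.2}
unrelated_doc = {"D7": 1.0}

cross_lingual = cosine_similarity(english_doc, spanish_doc)
unrelated = cosine_similarity(english_doc, unrelated_doc)
```

Because the comparison only looks at descriptor IDs, the same function serves the monolingual and the cross-lingual case.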


JRC-Ispra, 16.09.04, Slide 6

(Multilingual) text classification

• Most current approaches to text classification are monolingual

[Figure: documents sorted into Category 1, Category 2 and Category 3; document labels "Es" and "Fr"]

• Text classification, via Eurovoc, is multilingual


JRC-Ispra, 16.09.04, Slide 7

(Multilingual) document map (© Cartia’s ThemeScape)


JRC-Ispra, 16.09.04, Slide 8

‘Translation Spotting’

Why?
• To test document similarity calculation
• To compile a collection of parallel texts (for the training and testing of other multilingual text analysis applications)
• To detect cross-lingual document plagiarism


JRC-Ispra, 16.09.04, Slide 9

‘Translation Spotting’ - Results

Task: find Spanish translations of English source document in a parallel text collection

[Chart legend:]
• DS considering the length of documents
• DS correcting the monolingual bias (83%)
• Simple document similarity (DS)
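
The task on this slide can be read as a ranking problem: score every Spanish candidate against the English source and pick the best one. The following is a rough sketch under stated assumptions. The cosine measure, the `spot_translation` helper, the length factor (ratio of shorter to longer document) and all data are hypothetical. The slides do not say how document length or the monolingual bias are actually handled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse descriptor profiles (dicts)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def spot_translation(source, candidates, use_length=False):
    """Return (doc_id, score) of the candidate most similar to the source.

    source is (profile, length); candidates maps doc_id -> (profile, length).
    """
    src_profile, src_len = source
    best_id, best_score = None, -1.0
    for doc_id, (profile, length) in candidates.items():
        score = cosine(src_profile, profile)
        if use_length and max(src_len, length) > 0:
            # Hypothetical length factor: penalise candidates whose length
            # differs strongly from the source document's length.
            score *= min(src_len, length) / max(src_len, length)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id, best_score


# Made-up example: an English source and three Spanish candidates.
source = ({"D1": 0.9, "D2": 0.4}, 1000)
candidates = {
    "es_a": ({"D1": 0.8, "D2": 0.5}, 1100),  # similar profile, similar length
    "es_b": ({"D1": 0.9, "D2": 0.4}, 200),   # identical profile, much shorter
    "es_c": ({"D5": 1.0}, 1000),             # unrelated profile
}

simple_best, _ = spot_translation(source, candidates)
length_best, _ = spot_translation(source, candidates, use_length=True)
```

In this toy data the profile-only ranking prefers the short document with the identical profile, while the length-aware variant prefers the candidate of comparable length.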


JRC-Ispra, 16.09.04, Slide 10

(Multilingual) clustering of documents

• To organise unknown document collections
• Algorithm:
  – Find pairs of texts that are most similar
  – Group them in one cluster, repeat the operation until only one cluster remains

[Figure: cluster tree with similarity levels 90%, 80%, 75%, 40%, 10%]
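
The bottom-up procedure on this slide (merge the most similar pair, repeat until a single cluster remains) can be sketched as follows. This is an illustration under assumptions: cosine similarity between descriptor profiles and average linkage between clusters are choices made here, since the slides do not name the similarity measure or the linkage rule. The document IDs and weights are made up.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse descriptor profiles (dicts)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_similarity(c1, c2, docs):
    """Average pairwise similarity of members (average linkage, an assumption)."""
    sims = [cosine(docs[i], docs[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def build_cluster_tree(docs):
    """Merge the most similar pair of clusters until only one cluster remains.

    docs maps doc_id -> profile. Returns the merges in order as
    (cluster_a, cluster_b, similarity) tuples.
    """
    clusters = [(doc_id,) for doc_id in docs]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_similarity(pair[0], pair[1], docs))
        sim = cluster_similarity(a, b, docs)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, sim))
    return merges


# Made-up profiles for four documents in three languages.
docs = {
    "en1": {"D1": 1.0, "D2": 0.5},
    "es1": {"D1": 0.9, "D2": 0.6},
    "fr1": {"D3": 1.0},
    "en2": {"D3": 0.8, "D4": 0.3},
}
merges = build_cluster_tree(docs)
```

The recorded merge similarities decrease as the tree grows, which corresponds to the percentage levels shown in the slide's cluster tree.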


JRC-Ispra, 16.09.04, Slide 11

Building a (multilingual) cluster tree


JRC-Ispra, 16.09.04, Slide 12

Application to (multilingual) news analysis

EMM system in JRC’s Web Technology sector retrieves about 20,000 news articles per day in ~20 languages (4000 articles in English) (http://emm.jrc.it)

• Cluster related news stories and identify duplicates (news topic identification)

• Identify keywords, people’s names, place names, main sentences (information extraction)

• Find related news stories over time (news topic tracking)

• Find related news stories in other languages (cross-lingual topic tracking mainly via Eurovoc and place names)


JRC-Ispra, 16.09.04, Slide 13

Detection of the major news of the day (EMM)


JRC-Ispra, 16.09.04, Slide 14

Establish Links to Related News over time


JRC-Ispra, 16.09.04, Slide 15

Establish links to related news in other languages


JRC-Ispra, 16.09.04, Slide 16

Subject-specific summarisation (1)

Title: "Resolution on the 10th anniversary of the Chernobyl accident"

Eurovoc descriptors:


JRC-Ispra, 16.09.04, Slide 17

Subject-specific summarisation (2)

Eurovoc descriptors:


JRC-Ispra, 16.09.04, Slide 18

Further JRC LT applications

• Recognition and translation of:

– Place names; + visualisation

– People’s names; + retrieval of images and further information

– Dates

– Products

• Recognition of text language


JRC-Ispra, 16.09.04, Slide 19

Place name recognition / Cross-lingual display


JRC-Ispra, 16.09.04, Slide 20

Place name recognition / Visualisation

18 references (Boston, American, America, New York)

11 references (Vietnam)

5 references (Iraq) + 1 reference to Sweden (Andre Heinz (…) Swedish based environmental consultant)


JRC-Ispra, 16.09.04, Slide 21

Place name recognition / Disambiguation

Requires disambiguation
• 14 Paris’, 7 Birminghams
• cities called ‘And’, ‘Annan’
• name variants (exonyms)

Zoom on Europe


JRC-Ispra, 16.09.04, Slide 22

Recognising names, places, … - News navigation

Top-mentioned personalities En/Fr news

26 July 2004


JRC-Ispra, 16.09.04, Slide 23

Automatic recognition of name variants


JRC-Ispra, 16.09.04, Slide 24

Automatic link to online encyclopaedia


JRC-Ispra, 16.09.04, Slide 25

News clusters mentioning a person


JRC-Ispra, 16.09.04, Slide 26

Persons talked about in same news clusters


JRC-Ispra, 16.09.04, Slide 27

Countries talked about in same news clusters


JRC-Ispra, 16.09.04, Slide 28

Frequent keywords for these news clusters


JRC-Ispra, 16.09.04, Slide 29

Recognising products and product groups

Sample text


JRC-Ispra, 16.09.04, Slide 30

Recognising products and product groups

Identified products


JRC-Ispra, 16.09.04, Slide 31

Recognising products and product groups

Cross-lingual display of products found


JRC-Ispra, 16.09.04, Slide 32


Multilingual Information Extraction
  – Language recognition (demo)
  – Keywords (monolingual; cross-lingual)
  – Geographical place names (intro; new EU languages; demo)
  – Products and product groups (slides; demo JRC, demo CIS)
  – Names of people (demo news names, demo recognition, related names, Cyrillic/Greek fuzzy name matching, demo fuzzy matching)
  – Dates (demo recognition)
  – Terminology extraction
  – Summarisation (standard sentence extraction; subject-specific summarisation)

Cross-lingual navigation and classification
  – Document similarity (monolingual; cross-lingual; translation spotting)
  – Bottom-up document clustering; topic detection (demo news analysis)
  – Classification (multi-monolingual and cross-lingual; pre-classification clustering)
  – Relevance-ranking of documents (slides)
  – News topic tracking (monolingual historical; cross-lingual; demo news analysis)
  – Navigate text collections via people, countries, keywords, clusters, across languages (slides; demo news names)

Visualisation of textual contents
  – Individual documents (document profile)
  – Whole document collections (document map)
  – Geographical information (maps; animated maps, demo)
  – Clustering (ascii, star, tree), key-word-in-context (KWIC), search, …

Further tools
  – Document Gathering (Lang-Tech crawler; WT’s EMM system)
  – Document format conversion (PDF, MS-Word, PS, HTML, XML)
  – Character set conversion (UTF-8, ISO-Latin, HTML, …)

Projects
  – IDoRA for OLAF (slides)
  – Cross-lingual Indexing (EUROVOC)
  – Breaking News: Detection and Visualisation (BNDV / State-of-the-World)
  – SVM for Text Classification
  – Modus Operandi
  – Ad-hoc analyses (REACH, AM, INFSO project proposals, ADMIN job descriptions, ENV Public Consultation Sustainable Development)

JRC Introduction

Multilingual and crosslingual text analysis