••• 1 Roberto Cencioni Kimmo Rossi Challenge 2 – Objective 2.2 Language based Interaction DG Information Society and Media Unit INFSO.E1 Language Technologies & Machine Translation [email protected] ICT 2008, Lyon, 26 Nov 08
Mar 27, 2015
••• 2
Outline
• Opening remarks
• FP7 ICT Call 4 – Essence
• FP7 ICT Call 4 – Ingredients
• Q&A
• CIP ICT-PSP Call 3 – Opportunities
• Q&A, close
••• 3
Here we are
• a new unit established in July 2008
– Language Technologies & Machine Translation (INFSO.E1)
– high expectations vs. low rate of EC S&T activity in the last few years
• language is everywhere
– written & spoken; documents, messages, databases, webpages, multimedia objects etc; information as well as meta-information
• but our resources are limited, so initial focus on
– multilingual technologies, services, applications
• two instruments in 2009:
– Research: FP7 ICT, call 4, Objective 2.2 – Language based Interaction
– Innovation: CIP ICT-PSP, call 3, Theme 5 – Multilingual Web
• total budget of 40 Meuro
••• 4
Baseline
• Why?
– new online paradigms centred around communication, collaboration, co-creation … but significant language barriers remain
– EU comprises 27 countries & 23 official languages
– single European Information Space – one of the i2010 objectives
– EC communication on Multilingualism (Sept ‘08) calls for a broader policy framework & joint action
• Purpose: support & enhance
interpersonal & business communication
information access & publishing
across languages
••• 5
A few facts
• EU official languages: 23 x 22 = 506 pairs
– EC MT (Systran core engine) has 18 pairs in operation & 10 more pairs at prototype stage
– 60+ national, regional & minority languages within the EU
• English accounts for 30% of today’s Web content
– 50% in 2000, 35% in 2004
– Arabic, Chinese, Portuguese … growing very fast
• nearly 1.5 billion internet users worldwide (2008)
– c. 320 million native EN speakers in the world
• basic requirements for the “digital translation market”:
– volume
– access
– personalisation
real quick, real cheap
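The pair count on this slide comes from counting ordered (source, target) combinations of the 23 official languages. A minimal sketch, assuming translation direction matters (so EN to FR and FR to EN count separately); the function names are illustrative and not part of the presentation:

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n languages."""
    return n_languages * (n_languages - 1)


def coverage(pairs_in_operation: int, n_languages: int) -> float:
    """Share of all directed pairs that a system already covers."""
    return pairs_in_operation / directed_pairs(n_languages)


if __name__ == "__main__":
    # 23 official languages, as stated on the slide
    print(directed_pairs(23))
    # 18 operational pairs is the figure quoted for EC MT on the slide
    print(f"{coverage(18, 23):.1%}")
```

On these figures, the 18 pairs in operation cover roughly 3.6% of the 506 possible directions.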
••• 6
Can’t this be done elsewhere?
• indeed EU RTD projects often exhibit multilingual features
• yet approaches are too often naïve, short term, sectoral
• hence a dedicated focal point
– stimulating upstream research
– enhancing research capacity
– thus enabling more ambitious & impactful domain specific actions
••• 7
Research vs. Innovation: division of labour
• from
– long term foundational research (FP7)
• through
– applied research & technology development (FP7)
• to
– integration & demonstration (FP7 + PSP)
– infrastructure & resources (FP7 + PSP)
• different scale of ambition (€)
• different level of maturity (technology → service)
• different timescales & partnerships
••• 8
FP7-ICT Call
I. Workprogramme: R&D topics & outcomes
••• 9
What technology can offer today
• machine translation & translation memory
– making sense of online content
– improving productivity of human translation
– automatic translation of “acceptable” quality in specific domains / language pairs
• information search & retrieval
– find relevant information across languages
• information extraction, filtering, categorisation
– incl. summarization, routing & alert services, …
– for a variety of purposes eg business intelligence
• speech technology
– command & control, dictation systems
– call center services, conversational systems
••• 10
Trends
• new requirements, new approaches
– from Web 1.X to Web 2.0 – we are all content producers
– from static & uni-directional to dynamic, volatile, collaborative
– from service to self-service, translations are needed “on the fly”
are language technologies up to the task?
• what happens to online content
– disappearing document?
– Europa website: 6 million “documents”
– elusive distinction between content & service
how to manage multilingual content effectively
• multilingualism on the rise
– in the EU (from 4 to 23 languages) and globally
– English gains ground but mother tongues remain
online content becomes even more multilingual
••• 11
What technology might offer tomorrow
• machine translation
– MT that learns from its mistakes
– embedded in products/services, can cover any use context esp. online: chats, blogs, dynamic content ...
– broader coverage, fill in missing languages
• information search & retrieval
– truly multilingual access to information: query in any language, content automatically translated
• website content development & management
– new content is translated automatically
– changes automatically applied in all language versions
• speech technology
– real-time speech-to-speech translation (eg phone call, in a conference)
••• 12
Challenges for MT
• bring MT to the users
– understand what users need
– novel use scenarios
– communication rather than translation
– better evaluation metrics
• MT that learns & adapts
– how to exploit feedback from users
– how to use readily available “world knowledge”
• towards a paradigm shift?
– inspiration from:
machine learning, cognitive systems, psycho-linguistics, sociology, semantic web, data mining, new computing paradigms ...
••• 13
a) Core research exploring new avenues for machine translation (IP)
ground breaking, multidisciplinary, high risk – high promise research
architectures & technologies that learn and adapt flexibly & effectively to different languages, domains & tasks
catering for new forms of language & communication (eg online communities; dynamic, volatile …)
b) Problem oriented research for specific tasks & usage contexts (STR)
online translation for the masses
translation in distributed collaborative environments
managing multilingual communication & content
automatic acquisition & annotation of language resources
c) Community building & networking (NOE)
reinvigorate European machine translation (MT) community
build bridges between MT & MLT and other relevant disciplines
help develop & coordinate shared technical infrastructure, promote reusability & interoperability, foster evaluation
FP7-ICT Call 4 at a glance
••• 14
Outcome a) IP
Core research. Explore new research avenues (one IP, up to 8 Meuro)
– break new ground, foster a novel multi-disciplinary approach to machine translation
– architectures & technologies that can learn and adapt flexibly & effectively to different languages, domains & tasks
– catering for new forms of language & communication (eg online communities)
– high risk but high promise (accuracy, speed, scalability)
– language & translation models coupled with data driven, machine learning methods
automatic acquisition & representation of linguistic facts
semantics, models of world knowledge relevant for translation
approaches inspired from social networks …
••• 15
Outcome b) STR
Problem oriented research. A clearly defined usage context (~5 STRs, c. 12 Meuro)
– online translation for the masses
wide coverage (beyond GoogleTranslate); adequate quality, suitable at least for gisting/browsing; language embedded in documents, web pages, multimedia objects …
– translation in distributed environments
support non-linear collaborative interplay between authors, translators, editors/publishers & active users; innovative integration of automatic, interactive & human translation beyond current practice; technologies as well as processes & social interaction
– managing multilingual content & communication
a superset of the above addressing the development & management of online content & services esp. their versioning & maintenance in multiple languages
– acquisition & annotation of language resources
(nearly-)automatic, high volume, high performance
mining the web as well as available repositories (eg corpora) and public information sources
••• 16
Outcome b) managing multilingual Web content
• methods, techniques, metrics … for developing & managing multilingual web content & services
– much more than translation; significant cultural elements
• think of
– one big website in many languages, or
– several interrelated websites, one country/language each
• now think of how to maintain the integrity & consistency of such resources, effectively & over a long period of time
– and how to detect & repair gaps or inconsistencies
• so, beyond the “translation” step:
– design, authoring, versioning & maintenance of (multiple, parallel, interconnected …) websites, portals or repositories
– in a distributed collaborative environment, possibly across organisational boundaries
••• 17
Outcome c) NoE
Community building & networking (1 or 2 NoEs, up to 6 Meuro)
– reinvigorate Europe’s machine translation (MT) community
bring together key players from scientific, technical & commercial circles (esp. SMEs)
stimulate cross-border cooperation (teams, institutions, national initiatives)
assess skills, foster training & exchanges; support smaller teams & not well-served languages
identify gaps, establish roadmap encompassing technologies, resources & applications
– build bridges between MT & MLT community and other relevant disciplines
stimulate dialogue between diverse communities; identify opportunities & bottlenecks
initiate integrative research, prepare the ground for further collaboration
explore medium to long term approaches, identify possible shifts in paradigm
– develop & coordinate shared technical infrastructure, reusability & interoperability, evaluation
infrastructural support: portal services, inventories & repositories of general interest tools & raw/annotated datasets, their documentation
active promotion of reusability & open-source; harmonisation of representation & annotation schemes
foster widely recognized benchmarks ...
••• 18
What we don’t do
Not supported under Call 4:
• approaches that do not promise to deliver performance along with portability, scalability & maintainability
– yes: emphasis on automation, flexibility & cost effectiveness
• developments addressing immediate commercial concerns
– no: adding a language pair to an existing product
• proposals that do not address « language transfer »
– yes: focus on mapping a source language into one or several target languages
• issues covered by other Challenges and Objectives
– no: HMI, interaction with robots, ambient intelligence …
• topics well covered by recent & ongoing projects
– no: sign languages, dialogue systems …
••• 19
Practical info
FP7-ICT Objective 2.2 – Language based Interaction
budget: 26 Meuro under Call 4
managed by: Unit E1
Email: [email protected]
EC contact: Mr Kimmo Rossi
• inquiries: available
• pre-proposals: from Dec 1st until 3 weeks before the call closing date (Apr 1st)
Language Technology Days: 14-15 January 2009, Luxembourg
ICT Proposers’ Day: 22 January 2009, Budapest
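The 26 Meuro budget matches the sum of the indicative amounts quoted on the outcome slides (one IP up to 8 M, about five STRs at c 12 M, one or two NoEs up to 6 M). Adding the 14 Meuro for ICT-PSP Theme 5 gives the 40 Meuro total stated in the opening remarks. A small sketch of that arithmetic; the variable names are illustrative, and treating the quoted amounts as exact envelopes is an assumption:

```python
# Indicative amounts in Meuro, as quoted on the outcome slides.
# Assumption: each "up to" / "c" figure is taken at face value.
CALL4_ENVELOPES_MEURO = {"IP": 8, "STR": 12, "NoE": 6}

# ICT-PSP Call 3, Theme 5 budget in Meuro, as quoted in the deck.
PSP_THEME5_MEURO = 14

call4_total = sum(CALL4_ENVELOPES_MEURO.values())
overall_total = call4_total + PSP_THEME5_MEURO

if __name__ == "__main__":
    print(call4_total, overall_total)
```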
••• 20
Web sources
INFSO.E1 website (under construction):
cordis.europa.eu/fp7/ict/language-technologies/..
• FP7-ICT: ../fp7-call4_en.html
• ICT-PSP: ../cip-psp_en.html
– Events & Presentations
– Call guidance notes
– Background material & useful Links …
EC contact: Mrs Susan Fraser
••• 21
FP7-ICT Call
II. Practicalities & Success Factors
••• 22
LT Days
14-15 January, 2009
Luxembourg, JMO conference complex
EC presentations, sessions w/ext speakers, proposal clinics, self-presentations & posters
Agenda & registrations:
cordis.europa.eu/fp7/ict/language-technologies/fp7-call4_en.html
••• 23
Pre-proposals & Clinics
3 pages max, mail to: [email protected]
• describe the problem your proposal addresses, in particular
– specify the intended user profile and related tasks
– describe actual or prospective applications
– detail data sets: source(s), typology, volume
• how will the proposed project contribute to the outcomes and impacts set out in the work programme?
– what are the key innovations?
– what will be the main concrete results?
– what public outputs are foreseen?
– what impact do you expect?
• describe the consortium
– give partners' names or profiles and the intended skills mix
– indicate the intended instrument (if known)
• indicate the scale of your ambition
– what is the estimated effort (man-months)?
– how long will the proposed project last?
– what amount of EU funding are you looking for?
••• 24
Overall approach
• research for a purpose, problem driven
• centred around people & tasks, data & flows
– a compelling use case is as important as the underlying research
• meaningful demonstrator(s)
– field validation & assessment
• active promotion & dissemination of results beyond purely scientific circles
– public outputs, public final showcase
••• 25
Instruments
• IP
– up to 4 years, 5-8 Meuro (EU funding)
• NoE
– up to 3 years, 3-6 Meuro
• STR
– up to 3 years, 2-3 Meuro
••• 26
Partnerships
• keep the consortium manageable:
IPs 7-11 partners
STRs 5-7 partners
NoEs 3-4 “core” partners
• select competent, committed & reliable partners; geography not an issue!
• industry, SME, academia … participation as dictated by project needs
• user/industrial/commercial organisations to provide a demanding problem & validation context
••• 27
Language coverage
• most of the work is expected to be language independent
– flexibility & ease of adaptation to other languages are indeed key factors
– many of the ancillary tasks & tools are language independent anyway
• project outcomes must however be validated in 3+ languages
– preferably belonging to different linguistic families
• target languages are chosen & justified by the proposers bearing in mind the following priorities (from high to low):
1. EU official languages
2. nationally recognised languages
3. regional languages
4. minority languages
• Non-EU world languages linked to global markets & exports can be considered as well
– on a proposal by proposal basis
••• 28
Target industrial sectors
• look for
– huge & growing data volumes
– competitive pressure
– high growth & innovation
– international markets
• obvious candidates
– ICT & media
– manufacturing
– process industries eg pharmaceuticals
– energy & utilities
– engineering & construction
– financial services …
••• 29
Reasons for failure
• RTD content
– narrow scope, little or no EU dimension
– lack of focus, aims too general
– lack of innovation, current state of art missing
• planning
– links missing between objectives & work plan
– milestones missing or too general
– risk factors not addressed, no contingency plans
– no monitorable indicators, no metrics
• management
– consortium not balanced, gaps in the skills mix
– lack of integration between partners
– vague management structure
– weak or narrow dissemination plans
– ill-defined exploitation prospects
••• 30
Success factors .1
• Quality
• Impact
• Effectiveness
but also
• Relevance wrt. WP
• Credibility
Evaluators will have access to Web sources: previous projects, teams & skills, background & reference documents …
••• 31
Success factors .2
It’s a project, not a dissertation:
– problem?
– user?
– data?
– outputs (incl. public ones)?
– metrics?
– impact?
– exploitation channels?
– …
••• 32
Success factors .3
• preserve your credibility: select one proposal & make it win
• ensure that the proposal brings out both innovation & exploitation potential
• full depth of participation rather than long list of organisations with limited involvement
• key individuals, expertise & achievements rather than long list of previous projects
• make the proposal compelling for a busy reader (the first 5-10 pages are key!)
••• 33
Time schedule
• call due to close 1 April, 2009
• evaluation & selection until end June
• negotiation from mid-July on
• contract awarding in December
• projects due to start Q1 2010
… highly selective & demanding process
••• 34
ICT-PSP Call
Overview (subject to forthcoming adoption of WP, call budget & schedule)
••• 35
ICT-PSP Call 3, Q1 09
ICT Policy Support Programme (PSP) within the Competitiveness & Innovation Framework Programme (CIP) (adopted in October 2006)
• geared towards innovation & ICT uptake:
– development of the Single European information space
– strengthening of the internal market for ICT products and services and ICT-based products and services
– stimulation of innovation through the wider adoption of and investment in ICT
• ensure seamless access to ICT-based services
• improve the conditions for the development of digital content, taking into account multilingualism & cultural diversity
Takes over eContentplus activities from Jan 2009
••• 36
“Europe’s language is Translation”
• translation & interpretation market (excl. in-house):
– c. $15 billion; €1.1 billion for EU institutions alone (2006)
– top EU-based translation company posted a revenue of $175 million in 2006
• market fragmentation
– big players < 1000 employees
– est. 300,000 full time salaried translators worldwide (37% in Europe)
• a good European base
– SDL, Star, RWS, XRX, Euroscript, Logos, Moravia, VistaTEC, Semantix …
– ESTeam, Lucy Software …
• a largely untapped potential
– 4x according to some companies
••• 37
Business world
• new models: Most companies follow the age-old translate-edit-proofread model of translation. Collaborative, web-based technologies allow translation to become more agile, faster, and better with fewer steps (CSA Inc.)
• new markets: Language Weaver is entering the three new strategic markets – Web Content, Business Intelligence and Customer Care – to provide high-volume, high-speed, and accurate automated translation solutions at a price that would have been unfathomable just a few years ago
• new approaches: If you don't see your native language here, you can help Google create it by becoming a volunteer translator. Check out our Google in Your Language program
• and then of course:
Unfortunately for Google as a person with 7 years of translation experience myself I can tell that you will hardly ever find a translator who will agree that machine translation can be useful for anything. (a Russian translator)
••• 38
ICT-PSP Call 3, Theme 5: Multilingual Web
• 3 objectives:
– machine translation for the multilingual Web (pilot projects)
– multilingual Web content management (pilot projects)
– standards & best practices for the multilingual Web (thematic network)
• 14 Meuro in total, around 6 projects
“The duration of the pilot is expected to be 24 to 36 months within which there should be a 12-month operational phase.”
••• 39
ICT-PSP Call 3, Theme 5: Multilingual Web
• research: no, at least not ICT research …
• development/engineering:
– optimisation, customisation, integration … of existing (state of the art) methods, tools & services with a view to defining new approaches, offerings & practices
• demonstration:
– innovative combination is key; new business models, processes & services, organisational setups, usability …
– evaluation along user, technical & (socio-)economic dimensions
• problem orientation:
– useful & useable although possibly not perfect; think ROI
••• 40
Scope & defs
• MT as defined in the ICT-PSP workprogramme encompasses
1. fully automatic machine translation, whatever the technology
2. interactive computer-aided translation (eg TM)
3. a suitable combination of 1. and/or 2. with web based
– human translation, proof-reading & post-editing incl. where relevant methods inspired from social networks
– workflow & content management systems, …
• innovative & effective combination of people, processes & technology; the end result is not science, rather
– more and/or better output
– save time
– cut cost
• emphasis on language transfer, from source language to target language(s)
– language input-output (e.g. speech-to-text) is not the focus
– cross-platform, multi-format content access/delivery is key
••• 41
Language coverage
• some of the work is expected to be language independent
– flexibility & ease of adaptation to other languages are key factors
– content authoring & management, collaboration & workflow … are language independent anyway
• project outcomes must be validated in 3+ languages
– preferably belonging to different linguistic families
• target languages are chosen & justified by the proposers bearing in mind the following priorities (from high to low):
1. EU official languages
2. nationally recognised languages
3. regional languages
4. minority languages
• Non-EU world languages linked to global markets & exports can be considered as well
– on a proposal by proposal basis
••• 42
Cont’d
• project’s language coverage driven by the need to:
– address gaps & overcome barriers e.g. cross-border communication for less-developed languages, or
– exploit opportunities e.g. address emerging markets & sizeable language communities
• impact is key, so: viability, sustainability, exploitation channels, deployment prospects …
• main findings must be pro-actively disseminated
• some form of public showcase is mandatory
• participants should include
– private or public sector content owners & aggregators
– providers of language services, technology suppliers
– (online) communities of interest where relevant
• 6-7 partners/project, up to €2.5 million funding, up to 36 months
••• 43
ICT-PSP Call 3, Feb 09
3 intertwined objectives:
5.1 machine translation for the multilingual Web (projects)
information access: MT and other multilingual solutions for information access & use, esp. cross-lingual search & retrieval
information publishing: MT to create, distribute and (re-)use more widely & effectively online content in a multilingual environment
5.3 multilingual Web content management (projects)
communication: multilingual Web content development & management; design, authoring, versioning & maintenance of multilingual Web sites, portals or repositories
5.2 standards & best practices for the multilingual Web (network)
conventions & best practices for multilingual Web content
••• 44
ICT-PSP, 5.3 multilingual Web content management
• methods, techniques, metrics … for developing & managing multilingual web content & services
– much more than translation; significant cultural elements
• think of
– one big website in many languages, or
– several interrelated websites, one country/language each
• now think of how to maintain the integrity & consistency of such resources, effectively & over a long period of time
– and how to detect & repair gaps or inconsistencies
• so, beyond the “translation” step (obj 5.1):
– design, authoring, versioning & maintenance of (multiple, parallel, interconnected …) websites, portals or repositories
– in a distributed collaborative environment, possibly across organisational boundaries
• so as to turn a multi-million endeavour into a viable proposition for a much broader range of companies & administrations
••• 45
ICT-PSP, 5.1 machine translation for the multilingual Web
5.1 can be seen as a subset & central component of obj 5.3 (its “translation box”)
• different usages:
– web at large, enterprise, public information repositories …
• different users:
– teams as well as individuals, engineers as well as analysts, sales & marketing, language professionals, … you & me
• different content rich, information bound sectors, private & public
• quality depends on task & user
– from raw translation & “gisting” up to error-free translation
• two important conditions:
– widely recognised, well argued problem; clearly identified target community
– thorough validation in a given domain / for a given task volume metrics
••• 46
ICT-PSP, 5.2 standards & best practices
Thematic network
• covers the same broad issues as 5.3
– “the web as THE vehicle for multilingual content & services”
• provides a forum for multilateral exchange of experience & consensus building
• structure & tasks to be defined by the proposers, indicative list:
– bring together a meaningful subset of the main stakeholders, possibly through their own groups & associations
– ICT & language industries, content aggregators/distributors, e-services, multinational agencies, industry & de-jure standards bodies …
– analyse current situation, identify gaps & bottlenecks; assess market failures if any, specify technical & non-technical conditions to be met and the respective actors
– establish roadmap (trends, requirements, dependencies …) for further developments in the coming years
– stimulate consensus & active involvement/coordination; take part in leading conferences, liaise with primary associations etc.
– explore means to promote best practice (conferences, portals, publications, training …) beyond current channels
– propose suitable follow-on actions
••• 47
ICT-PSP Instruments & Funding
• pilot B projects:
– min. 4 partners from 4 different countries
– 50% of eligible direct costs
– flat 30% overhead rate of personnel costs
• thematic networks:
– min. 7 partners from 7 different countries
– lump sum; for 3 years and 1+10 participants:
coordinator: 95 Keuro
other participants: 24 Keuro each
ec.europa.eu/information_society/activities/ict_psp/participating/index_en.htm
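The thematic network lump sum for the quoted configuration (3 years, one coordinator plus 10 other participants) can be worked out directly from the two amounts above. A minimal sketch; the function name is illustrative, and applying the same per-participant amounts to other network sizes is an assumption the slide does not confirm. The pilot B rules (50% of direct costs, 30% flat overhead) are not modelled here because the slide does not spell out how the two rates combine.

```python
# Amounts in Keuro, as quoted on the slide for a 3-year network
# with 1 coordinator + 10 other participants.
COORDINATOR_KEURO = 95
PARTICIPANT_KEURO = 24


def network_lump_sum_keuro(other_participants: int = 10) -> int:
    """Total lump sum in Keuro: coordinator plus the other participants.

    Assumption: the per-participant amount stays the same for sizes
    other than the 1+10 case quoted on the slide.
    """
    return COORDINATOR_KEURO + other_participants * PARTICIPANT_KEURO


if __name__ == "__main__":
    print(network_lump_sum_keuro())  # 95 + 10 * 24 = 335
```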
••• 48
Practical info
ICT-PSP Theme 5 – Multilingual Web
budget: 14 Meuro under Call 3
managed by: Unit E1
Email: [email protected]
EC contact: Mr Kimmo Rossi
• inquiries: from the call publication date (~Feb)
• pre-proposals: from publication until 3 weeks before the call closing date
••• 49
Events
Language Technology Days:
14-15 Jan 2009, Luxembourg
ICT-PSP Info Day:
26 Jan 2009, Brussels (tbc)
Email: [email protected]
URL: cordis.europa.eu/fp7/ict/language-technologies/..
FP7-ICT: ../fp7-call4_en.html
ICT-PSP: ../cip-psp_en.html
••• 50
Thank you!