Chances and Challenges in Comparing Cross-Language Retrieval Tools
Giovanna Roda, Vienna, Austria
IRF Symposium 2010 / June 3, 2010
CLEF-IP: the Intellectual Property track at CLEF
CLEF-IP is an evaluation track within the Cross-Language Evaluation Forum (CLEF). [1]
organized by the IRF
first track ran in 2009
running this year for the second time
[1] http://www.clef-campaign.org
What is an evaluation track?
An evaluation track in Information Retrieval is a cooperative action aimed at comparing different techniques on a common retrieval task.
produces experimental data that can be analyzed and used to improve existing systems
fosters exchange of ideas and cooperation
produces a reusable test collection, sets milestones
Test collection
A test collection traditionally consists of target data, a set of queries, and relevance assessments for each query.
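The three components of a test collection map directly onto a data structure, and a recall-style score can be computed against the assessments. A minimal sketch, with invented names and toy data (not from any CLEF-IP tooling):

```python
from dataclasses import dataclass

@dataclass
class TestCollection:
    """A classical IR test collection: target documents, queries,
    and relevance assessments (qrels) naming the documents judged
    relevant for each query."""
    documents: dict[str, str]    # doc_id -> text
    queries: dict[str, str]      # query_id -> query text
    qrels: dict[str, set[str]]   # query_id -> relevant doc_ids

    def recall(self, query_id: str, ranked_doc_ids: list[str]) -> float:
        """Fraction of the judged-relevant documents that a run retrieved."""
        relevant = self.qrels[query_id]
        found = relevant.intersection(ranked_doc_ids)
        return len(found) / len(relevant) if relevant else 0.0

collection = TestCollection(
    documents={"d1": "...", "d2": "...", "d3": "..."},
    queries={"q1": "prior art for some invention"},
    qrels={"q1": {"d1", "d3"}},
)
print(collection.recall("q1", ["d3", "d2"]))  # 0.5
```

Submitted runs are then scored by comparing their ranked lists against the qrels, which is what makes the collection reusable after the track ends.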
CLEF-IP 2009: the task
The main task in the CLEF-IP track was to find prior art for a given patent.
Prior art search
Prior art search consists in identifying all information (including non-patent literature) that might be relevant to a patent's claim of novelty.
Participants - 2009 track
1 Tech. Univ. Darmstadt, Dept. of CS, Ubiquitous Knowledge Processing Lab (DE)
2 Univ. Neuchatel - Computer Science (CH)
3 Santiago de Compostela Univ. - Dept. Electronica y Computacion (ES)
4 University of Tampere - Info Studies (FI)
5 Interactive Media and Swedish Institute of Computer Science (SE)
6 Geneva Univ. - Centre Universitaire d'Informatique (CH)
7 Glasgow Univ. - IR Group Keith (UK)
8 Centrum Wiskunde & Informatica - Interactive Information Access (NL)
Participants - 2009 track
9 Geneva Univ. Hospitals - Service of Medical Informatics (CH)
10 Humboldt Univ. - Dept. of German Language and Linguistics (DE)
11 Dublin City Univ. - School of Computing (IE)
12 Radboud Univ. Nijmegen - Centre for Language Studies & Speech Technologies (NL)
13 Hildesheim Univ. - Information Systems & Machine Learning Lab (DE)
14 Technical Univ. Valencia - Natural Language Engineering (ES)
15 Al. I. Cuza University of Iasi - Natural Language Processing (RO)
Participants - 2009 track
15 participants
48 experiments submitted for the main task
10 experiments submitted for the language tasks
2009-2010: participants
2009-2010: evolution of the CLEF-IP track

2009                          | 2010
1 task: prior art search      | prior art candidate search and classification task
targeting granted patents     | patent applications
15 participants               | 20 participants
all from academia             | 4 industrial participants
families and citations        | include forward citations
manual assessments            | expanded lists of relevant docs
standard evaluation measures  | new measure: PRES, more recall-oriented
What are relevance assessments?
A test collection (also known as a gold standard) consists of a target dataset, a set of queries, and relevance assessments corresponding to each query.
The CLEF-IP test collection:
target data: 2 million EP patents
queries: full-text patents (without images)
relevance assessments: extended citations
Relevance assessments
We used patents cited as prior art as relevance assessments.
Sources of citations:
1 applicant's disclosure: the USPTO requires applicants to disclose all known relevant publications
2 patent office search report: each patent office will do a search for prior art to judge the novelty of a patent
3 opposition procedures: patents cited to prove that a granted patent is not novel
Extended citations as relevance assessments
direct citations and their families
direct citations of family members, and their families
Patent families
A patent family consists of patents granted by different patent authorities but related to the same invention.
simple family: all family members share the same priority number
extended family: there are several definitions; in the INPADOC database, all documents which are directly or indirectly linked via a priority number belong to the same family
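Under the simple-family definition, grouping reduces to bucketing publications by their shared priority number. A minimal sketch with invented record data (real patent data would come from a family database such as INPADOC):

```python
from collections import defaultdict

# Hypothetical patent records: (publication_id, priority_number)
patents = [
    ("EP1000001", "P-001"),
    ("US7000001", "P-001"),   # same priority -> same simple family
    ("JP2000001", "P-002"),
    ("EP1000002", "P-002"),
]

def simple_families(records):
    """Group publications into simple families: one family per
    shared priority number."""
    families = defaultdict(list)
    for pub_id, priority in records:
        families[priority].append(pub_id)
    return dict(families)

print(simple_families(patents))
# {'P-001': ['EP1000001', 'US7000001'], 'P-002': ['JP2000001', 'EP1000002']}
```

An INPADOC-style extended family would instead require transitively merging any groups that share a priority number (e.g. with a union-find structure), since documents can carry several priorities.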
Patent families
Patent documents are linked by priorities (INPADOC family).
CLEF-IP uses simple families.
Relevance assessments 2010
Expanding the 2009 extended citations:
1 include citations of forward citations ...
2 ... and their families
This is apparently a well-known method among patent searchers.
Zig-zag search?
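The two expansion steps amount to set operations over a citation graph. A sketch on a toy graph (the lookup tables and patent IDs are invented for illustration, not CLEF-IP data):

```python
# Toy citation graph: F1 cites A and C; A cites B.
CITATIONS = {           # patent -> patents it cites
    "A": {"B"},
    "F1": {"A", "C"},
}
FORWARD = {"A": {"F1"}}  # patent -> patents citing it (forward citations)
FAMILY = {"B": {"B", "B2"}, "C": {"C", "C2"}}

def family_of(p):
    return FAMILY.get(p, {p})

def expanded_citations(patent):
    """2010-style expansion: direct citations, plus citations of
    forward citations, plus the families of everything collected."""
    relevant = set(CITATIONS.get(patent, set()))   # direct citations
    for fwd in FORWARD.get(patent, set()):         # forward citations...
        relevant |= CITATIONS.get(fwd, set())      # ...contribute their citations
    relevant.discard(patent)                       # drop the topic patent itself
    expanded = set()
    for p in relevant:                             # step 2: add families
        expanded |= family_of(p)
    return expanded

print(sorted(expanded_citations("A")))  # ['B', 'B2', 'C', 'C2']
```

Here C enters the relevant set only via the forward citation F1, mirroring the "zig-zag" (out, back, out again) traversal that patent searchers perform by hand.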
How good are the CLEF-IP relevance assessments?
CLEF-IP uses families + citations:
how complete are extended citations as relevance assessments?
will every prior art patent be included in this set?
and if not, what percentage of prior art items is captured by extended citations?
when considering forward citations, how good are extended citations as a prior art candidate set?
Feedback from patent experts needed
The quality of prior art candidate sets has to be assessed; the know-how of patent search experts is needed.
at CLEF-IP 2009, 7 patent search professionals assessed 12 search results
the task was not well defined and there were misunderstandings about the concept of relevance
the amount of data was not sufficient to draw conclusions
Some initiatives associated with CLEF-IP
The results of evaluation tracks are mostly useful for the research community.
This community often produces prototypes that are of little interest to the end user.
Next I'd like to present two concrete outcomes - not of CLEF-IP directly, but arising from work in patent retrieval evaluation.
Soire
developed at Matrixware
service-oriented architecture - available as a Web service
allows replication of IR experiments based on the classical evaluation model
tested on the CLEF-IP data
customized for the evaluation of machine translation
Spinque
a spin-off (2010) from CWI (the Dutch National Research Center in Computer Science and Mathematics)
introduces search-by-strategy
provides optimized strategies for patent search - tested on CLEF-IP data
transparency: understand your search results to improve your strategy
CLEF-IP 2009 learnings
Humboldt University implemented a model for patent search that produced the best results.
The model combined several strategies:
using metadata (IPC, ECLA)
indexes built at lemma level
an additional phrase index for English
a cross-lingual concept index (multilingual terminological database)
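The slides name the components but not how their evidence was merged. One common way to combine several indexes is a weighted linear fusion of per-index retrieval scores; the sketch below is illustrative only (weights, scores, and function names are invented, not the Humboldt system's actual method):

```python
def fuse_scores(per_index_results, weights):
    """Weighted linear fusion of scored results from several indexes.
    per_index_results: {index_name: {doc_id: score}}
    weights:           {index_name: float}
    Returns (doc_id, fused_score) pairs, best first."""
    fused = {}
    for index_name, results in per_index_results.items():
        w = weights.get(index_name, 0.0)
        for doc_id, score in results.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores from three hypothetical indexes (lemma, phrase, concept).
results = {
    "lemma":   {"EP1": 0.9, "EP2": 0.4},
    "phrase":  {"EP2": 0.8},
    "concept": {"EP1": 0.2, "EP3": 0.7},
}
ranking = fuse_scores(results, {"lemma": 0.5, "phrase": 0.3, "concept": 0.2})
print([doc for doc, _ in ranking])  # ['EP1', 'EP2', 'EP3']
```

The point of the sketch is that a document missing from one index (EP3 appears only in the concept index) can still enter the final ranking, which is exactly what a cross-lingual concept index contributes for documents in other languages.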
Some additional investigations
Some citations were hard to find

% of runs     class
x ≤ 5         hard
5 < x ≤ 10    very difficult
10 < x ≤ 50   difficult
50 < x ≤ 75   medium
75 < x ≤ 100  easy
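The classes in the table can be assigned mechanically: for each citation, take the percentage of submitted runs that retrieved it and bin it. A sketch using the table's boundaries (the citation IDs and run counts are invented):

```python
def difficulty_class(pct_of_runs):
    """Map the percentage of runs that found a citation to the
    difficulty classes from the table above."""
    if pct_of_runs <= 5:
        return "hard"
    elif pct_of_runs <= 10:
        return "very difficult"
    elif pct_of_runs <= 50:
        return "difficult"
    elif pct_of_runs <= 75:
        return "medium"
    return "easy"

# found_by: citation -> set of runs that retrieved it (toy data)
found_by = {"cit-1": {"run1"}, "cit-2": {"run1", "run2", "run3"}}
total_runs = 4
for cit, runs in found_by.items():
    pct = 100 * len(runs) / total_runs
    print(cit, difficulty_class(pct))  # cit-1 -> difficult, cit-2 -> medium
```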
Some additional investigations
We looked at the content of citations and citing patents. These investigations are ongoing.
Thank you for your attention.