CERN, European Organization for Nuclear Research
Automatic Keyword Assignment for High Energy Physics Literature
Arturo Montejo Ráez
ETT/SI Data Handling Group, CERN, Geneva (Switzerland)
Joint Research Center, Ispra (Italy), 4 March 2002
Automatic Keywording for HEP literature Ispra (Italy) 4 March 2002
What we are going to see today...
• Keyword assignment process
• Why keywords?
• How it is done for High Energy Physics papers
• The HEPindexer project: data, algorithm, experiments, results
• Future work
Keyword assignment process
[Diagram: Authors, Indexer, Keyworded papers]
The document...
• Full text paper
• Stored in a database
• Simplified representation needed
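The slide does not say which simplified representation is meant. One common choice, assumed here purely for illustration, is a bag of words: the full text is reduced to term counts. The function name `bag_of_words` and the tiny stopword list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "is", "to", "for"}

def bag_of_words(text: str) -> Counter:
    """Reduce a full-text paper to term counts (one possible simplified representation)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

# Made-up sentence standing in for the full text of a paper.
paper = "Measurement of the B meson lifetime in the decay of the B meson"
vector = bag_of_words(paper)
print(vector.most_common(3))
```

Word order is lost in this representation, which is acceptable when the goal is only to decide which thesaurus concepts a paper is about.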
The thesaurus...
• Controlled vocabulary of concepts
• Relationships between keywords
• Categories and subcategories
• Can be domain specific
• Can be translated into multiple languages
The thesaurus: a relational model for terms
cheese
    MT 6016 processed agricultural produce
    BT1 milk product
The thesaurus: a subject tree
04 POLITICS
    0406 political framework
    0411 political party
    0416 electoral procedure and voting
    0421 parliament
    0426 parliamentary proceedings
    0431 politics and public safety
    0436 executive power and public service
08 INTERNATIONAL RELATIONS
    0806 international affairs
    0811 cooperation policy
    0816 international balance
    0821 defence
10 EUROPEAN COMMUNITIES
    1006 Community institutions and European civil service
    1011 Community law
    1016 European construction
    1021 Community finance
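The two views of the thesaurus shown above can be sketched as plain data structures. This is a minimal illustration, not the format of any real thesaurus software: the `Term` class and the helper names are hypothetical, "MT" is read as the code of the group a term belongs to and "BT" as a link to a more general term, and only entries visible on the slides are used.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    microthesaurus: str = ""                      # "MT" code, e.g. "6016"
    broader: list = field(default_factory=list)   # "BT" links to more general terms

# Relational model: the "cheese" entry from the slide.
thesaurus = {
    "cheese": Term("cheese", microthesaurus="6016", broader=["milk product"]),
    "milk product": Term("milk product"),  # its own MT/BT are not shown on the slide
}

# Subject tree: a few codes from the slide; the first two digits name the domain.
subject_tree = {
    "04": "POLITICS",
    "0406": "political framework",
    "0421": "parliament",
    "08": "INTERNATIONAL RELATIONS",
    "0821": "defence",
}

def ancestors(label: str) -> list:
    """Follow broader-term links upwards, nearest ancestor first."""
    result, queue = [], list(thesaurus[label].broader)
    while queue:
        current = queue.pop(0)
        if current not in result:
            result.append(current)
            if current in thesaurus:
                queue.extend(thesaurus[current].broader)
    return result

def domain_of(code: str) -> str:
    """Return the top-level domain of a subject-tree code."""
    return subject_tree[code[:2]]

cheese_ancestors = ancestors("cheese")
parliament_domain = domain_of("0421")
print(cheese_ancestors, parliament_domain)
```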
The indexer...
• An expert in the domain of the documents
• An expert in the use of the thesaurus
• A heavy task
• Does not always propose the same keywords
• Expensive!
Why keywords?
• They allow documents to be indexed in a coherent way
• They can be viewed like the "index" at the end of a book
• They are concepts that better represent the content
• Human made (added value)
• Meaningful
• They can establish relations between documents
• Multilingual
Access to documents
But... we already have fulltext indexing!
Classification:
• To store (libraries)
• To access (narrow searches)
[Diagram: Category 1, Category 2, Category 3]
Crosslingual access
[Diagram labels: Navaja, Razor, Couteau, Lametta; "Razor?"]
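Crosslingual access works because a keyword stands for a concept rather than a string. The sketch below assumes that each concept has one identifier with a label per language and that documents are indexed by identifier. The labels are copied from the slide and treated as naming a single concept, as the slide does; the identifier `C001`, the language tags and the document names are hypothetical.

```python
# One concept, several language labels (labels as they appear on the slide).
LABELS = {
    "C001": {"es": "navaja", "en": "razor", "fr": "couteau", "it": "lametta"},
}

# Reverse lookup: any label, in any language, leads to the concept identifier.
CONCEPT_OF = {
    label: concept_id
    for concept_id, by_language in LABELS.items()
    for label in by_language.values()
}

# Documents are indexed by concept identifier, whatever language they are written in.
DOCUMENTS = {
    "doc-es": {"C001"},
    "doc-it": {"C001"},
    "doc-other": set(),
}

def search(keyword: str) -> list:
    """Return every document indexed with the concept behind `keyword`."""
    concept = CONCEPT_OF.get(keyword.lower())
    if concept is None:
        return []
    return sorted(doc for doc, concepts in DOCUMENTS.items() if concept in concepts)

hits = search("Razor")
print(hits)
```

A query typed in English retrieves the Spanish and Italian documents because all three labels resolve to the same identifier.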
Multilingual comparison
[Diagram labels: Razor, Lametta, Murder, Frabbica]
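Once keywords are mapped to language-independent concepts, two documents written in different languages can be compared directly. The sketch below uses Jaccard similarity over concept sets, which is an illustrative choice and not something the slides specify. The concept identifiers and the assignment of keywords to the two documents are assumptions; the labels are spelled as on the slide.

```python
# Hypothetical mapping from labels (as they appear on the slide) to concept identifiers.
CONCEPT_OF = {"razor": "C1", "lametta": "C1", "murder": "C2", "frabbica": "C3"}

def to_concepts(keywords: list) -> set:
    """Replace language-specific keywords by language-independent concept identifiers."""
    return {CONCEPT_OF[k.lower()] for k in keywords}

def jaccard(a: set, b: set) -> float:
    """Share of concepts two documents have in common (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Assumed keyword assignments for an English and an Italian document.
english_doc = to_concepts(["Razor", "Murder"])
italian_doc = to_concepts(["Lametta", "Frabbica"])

similarity = jaccard(english_doc, italian_doc)
print(similarity)
```

The two documents share one concept out of three distinct ones, so they are partially similar even though they have no word in common.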
Advantages over fulltext searches:
• No ambiguity
• Better relevance and precision
More advanced tools for searching and classification are coming!
The BIG problem...
EXPENSIVE
The BIG problem?
EXPENSIVE?
CERN
• The world's largest particle physics centre
• Explores what matter is made of, and what forces hold it together
• Employs just under 3000 people
• 6500 scientists come for their research
How it is done for High Energy Physics papers
DESY: Deutsches Elektronen-Synchrotron (Hamburg, Germany)
• DESY thesaurus
• Group of indexers (students, experts...)
• Only High Energy Physics related papers
The DESY thesaurus
A
    *a4(2040) ('postulated particle, a4(2040)', was delta(2040))
    *a6(2450) ('postulated particle, a6(2450)', was delta(2450))
    *abelian
    *aberration
    absorption
    -absorptive model (model, absorption)
    accelerator
    . . .
B
    B
    B anti-B
    B+
    B+L number
    B*(5320) (excited B)
    -B** ('B*2...', similar for B/s, etc.)
    *B*2(5732) (postulated particle, B*2(5732))
    B-
    -B-factory (B, particle source)
    B-L number
    . . .
The DESY thesaurus:
• Few categories, rarely used
• Only two types of keywords: main keywords (1191) and secondary keywords (949)
• No relationships between terms
• Specific terminology
The HEPindexer project
Software
Future Work
• Automatic proposition of secondary keywords
• Improve the algorithm (lemmatizer, multiwords, segmentation...)
• Use of references to link documents based on common concepts
• Specific algorithms for handling energies, particle decays, disintegrations, etc.
• Agents OAI
• Apply Semantic Web approaches