Towards a learning approach for abbreviation detection and resolution
Klaar Vanopstal, Bart Desmet, Veronique Hoste
LT3, Language and Translation Technology Team, University College Ghent
{klaar.vanopstal,bart.desmet,veronique.hoste}@hogent.be
Department of Applied Mathematics & Computer Science, Ghent University
Krijgslaan 281 (S9), 9000 Gent, Belgium
May 19, 2010
1 Background
2 Annotation
3 Pattern-based approach
4 Learning-based approach
5 Conclusions and future work

1 Background

Problem
Information explosion ⇒ growing number of (bio)medical abbreviations.
New abbreviations are created; not always known to the reader.
⇒ automatic detection and resolution
Use
information retrieval
information extraction
NER
anaphora resolution

2 Annotation

Corpus
English
- AbbRE: reliable standard but limited size
- Medstract: publicly available and commonly used
Dutch: no resources available
Abstracts from 2 medical journals:
- Nederlands Tijdschrift voor Geneeskunde (NTvG); 29,978 words
- Belgisch Tijdschrift voor Geneeskunde (TvG); 36,757 words
⇒ total of 66,739 words
Different types of abbreviations included in annotations:
Truncation
Example
adm for administration
First letter initialization
Example
AAA for abdominal aortic aneurysm
Opening letter initialization
Example
HeLa for Henrietta Lacks
Syllabic initialization
Example
BZD for benzodiazepine
Substitution initialization
Example
Fe for iron
Combination of letters and numbers
Example
CXCR4 for chemokine receptor fusin
Labels
1. ABBR: Dutch abbreviations which have a full form in their local context
Example
Hoge-resolutie-computertomografie (HRCT)
EN: High resolution computed tomography (HRCT)
2. ABBR DE: Dutch abbreviations with full form in abstract (not in local context)
Example
de pathofysiologie van het CFS
EN: the pathophysiology of CFS
3. DEF: Dutch full forms which define an abbreviation in their local context
Example
Hoge-resolutie-computertomografie (HRCT)
EN: High resolution computed tomography (HRCT)
4. ABBR IN COMP: part of a compound word; no definition in the abstract
Example
HIV-patiënten
EN: HIV patients
5. ABBR IN COMP DE: part of a compound word; full form in abstract
Example
ernstige reumatoïde artritis (RA)-vasculitis. Bij de ziekte van Wegener en RA-vasculitis...
EN: severe rheumatoid arthritis (RA) vasculitis. Wegener’s disease and RA vasculitis...
6. ABBR NO DEF: abbreviations without full form
Example
AIDS, HIV
7. ABBR EN: English abbreviation with Dutch/English definition in local context
Example
endosonografie (EUS)
EN: endoscopic ultrasound (EUS)
8. DEF EN: English full form which accompanies an English abbreviation
Example
Mini Mental State Examination (MMSE)
⇒ Kappa score: 0.89
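The slides report the kappa score without naming the variant. As a minimal sketch, assuming Cohen's kappa for two annotators, agreement could be computed as below. The function name and the toy annotations are made up for illustration and are not taken from the corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy annotations (invented for illustration only).
annotator_1 = ["ABBR", "DEF", "ABBR NO DEF", "ABBR", "ABBR EN", "DEF"]
annotator_2 = ["ABBR", "DEF", "ABBR NO DEF", "ABBR DE", "ABBR EN", "DEF"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because a few labels dominate the distribution.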
Label              NTvG    TvG
ABBR              11.60  14.25
ABBR DE           30.62  22.55
ABBR IN COMP       7.14  22.43
ABBR IN COMP DE   16.85   4.96
ABBR NO DEF       27.65  29.12
ABBR EN            6.14   6.69
TOTAL %            3.36   2.19
Table: Labels and their frequencies in the corpus (%)
                  NTvG     TvG
def: loc        17.74%  20.94%
def: broad      47.47%  27.50%
def: loc/broad  65.21%  48.45%
Table: Abbreviations and defined abbreviations in the corpus
⇒ Between 45% and 52% of the abbreviations are undefined
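One possible reading of how the "def" rows relate to the label frequencies: "def: loc" groups the labels defined in the local context (ABBR, ABBR EN) and "def: broad" groups those defined elsewhere in the abstract (ABBR DE, ABBR IN COMP DE). This grouping is an assumption inferred from the numbers; the slides do not state it. The sketch below redoes the arithmetic under that assumption.

```python
# Label shares (%) copied from the label frequency table.
label_share = {
    "NTvG": {"ABBR": 11.60, "ABBR DE": 30.62, "ABBR IN COMP": 7.14,
             "ABBR IN COMP DE": 16.85, "ABBR NO DEF": 27.65, "ABBR EN": 6.14},
    "TvG": {"ABBR": 14.25, "ABBR DE": 22.55, "ABBR IN COMP": 22.43,
            "ABBR IN COMP DE": 4.96, "ABBR NO DEF": 29.12, "ABBR EN": 6.69},
}

# Assumed grouping (inferred, not stated on the slides).
LOCAL = ("ABBR", "ABBR EN")
BROAD = ("ABBR DE", "ABBR IN COMP DE")

def defined_share(journal):
    """Return (local, broad, local + broad) shares in percent."""
    shares = label_share[journal]
    local = sum(shares[label] for label in LOCAL)
    broad = sum(shares[label] for label in BROAD)
    return round(local, 2), round(broad, 2), round(local + broad, 2)

for journal in label_share:
    print(journal, defined_share(journal))
```

Under this grouping the sums agree with the reported rows to within 0.01 percentage points.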
Challenges
English abbreviations with Dutch full form: no match
Example
HAART = krachtige antiretrovirale therapie
EN: highly active antiretroviral therapy
Parenthetical patterns
Example
gunstige uitkomst (score 5)
EN: favourable outcome (score 5)
Syllabic initialization
Example
CVS = chronische-vermoeidheidssyndroom
EN: CFS = chronic fatigue syndrome

3 Pattern-based approach

Pattern-based approach - Related research
⇒ Use of patterns to detect abbreviations:
short uppercase words
typical patterns: “long form (short form)” or “short form (long form)”
identification of definitions:
- window of 2*N words (Taghva & Gilbreth, 1999) or 3*N words (Stanford Medical Abbreviation Method (Chang & Schütze, 2006))
- text markers: () “ =
- linguistic cues: “short”, “or” (Park & Byrd, 2001)
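The "long form (short form)" pattern with a word window can be sketched as follows. This is an illustration, not any of the cited systems: it assumes N is the number of characters in the short form (the slides do not define N), and the function name, the regular expression and the example sentence are hypothetical.

```python
import re

# A parenthesised token without internal spaces; a deliberately loose assumption.
PAREN = re.compile(r"\(([^()\s]+)\)")

def candidate_windows(text, factor=3):
    """Yield (short_form, window) pairs for 'long form (short form)' patterns.

    The window holds at most factor * N words preceding the parenthesis,
    where N is assumed to be the number of characters in the short form.
    """
    for match in PAREN.finditer(text):
        short_form = match.group(1)
        words_before = text[:match.start()].split()
        window = words_before[-factor * len(short_form):]
        yield short_form, window

sentence = "Bij patiënten werd hoge-resolutie-computertomografie (HRCT) uitgevoerd."
for short_form, window in candidate_windows(sentence, factor=2):
    print(short_form, window)
```

Because the pattern requires a single token between the parentheses, a parenthetical such as "gunstige uitkomst (score 5)" produces no candidate in this sketch.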
+ use of NLP tools to refine the search space of the definitions (Pustejovsky et al., 2001) and/or to tackle the problem of function word matching
Example
ADL = activiteiten van het dagelijkse leven
EN: daily life activities
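The function word problem in the ADL example can be illustrated with a small sketch: the initials only line up once "van" and "het" are skipped. The stop list and the function name are hypothetical; a real system would more likely take function words from a part-of-speech tagger.

```python
# Small illustrative stop list (hypothetical, not from the slides).
DUTCH_FUNCTION_WORDS = {"de", "het", "een", "van", "op", "en", "in", "voor"}

def initials_match(abbreviation, definition, skip=DUTCH_FUNCTION_WORDS):
    """True if the abbreviation equals the initials of the non-skipped words."""
    content_words = [w for w in definition.lower().split() if w not in skip]
    initials = "".join(w[0] for w in content_words)
    return initials == abbreviation.lower()

definition = "activiteiten van het dagelijkse leven"
print(initials_match("ADL", definition))              # function words skipped
print(initials_match("ADL", definition, skip=set()))  # every word counted
```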
2 steps:
Abbreviation detection
Definition matching
Step 1: abbreviation detection:
capital letters / combinations of capital letters with 1-3 lowercased letters or numbers
Example
QSRL
pANCA
CDG1A
window of 3*N words
text markers () = “ ’ ⇒ list of candidate definitions
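A minimal sketch of the orthographic test in step 1. It is one possible reading of the rule, not the authors' implementation: a token counts as a candidate if it is all capitals, or if it has at least two capitals plus one to three lowercase letters or digits. The function names and the example sentence are hypothetical.

```python
import re

def looks_like_abbreviation(token):
    """Orthographic test: capitals only, or >= 2 capitals with 1-3 other characters."""
    upper = sum(ch.isupper() for ch in token)
    lower_or_digit = sum(ch.islower() or ch.isdigit() for ch in token)
    if upper + lower_or_digit != len(token):
        return False  # token contains something other than letters and digits
    if lower_or_digit == 0:
        return upper >= 1
    return upper >= 2 and lower_or_digit <= 3

def detect_abbreviations(text):
    # Split on anything that is not a letter or digit (hyphens split compounds).
    tokens = re.findall(r"[^\W_]+", text)
    return [t for t in tokens if looks_like_abbreviation(t)]

print(detect_abbreviations("Patiënten met pANCA of CDG1A kregen de QSRL voorgelegd."))
```

Under this reading, ordinary capitalised words such as "Patiënten" are rejected because they contain only one capital, while tokens such as "mmHg" or "min" are never proposed.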
Step 2: definition matching:
list of candidate definitions
matching: first letter of abbreviation - words in candidate definition
⇒ matching word + rest of the 3*N sequence = definition
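Step 2 can be sketched as a scan over the candidate window. The sketch assumes that the leftmost word starting with the abbreviation's first letter is taken as the start of the definition; the slides do not say which match is preferred, and the function name is hypothetical.

```python
def match_definition(abbreviation, window):
    """Return the candidate definition for an abbreviation, or None.

    window: the words preceding the abbreviation (at most 3*N of them).
    The definition runs from the first word that starts with the
    abbreviation's first letter to the end of the window.
    """
    if not abbreviation:
        return None
    first = abbreviation[0].lower()
    for i, word in enumerate(window):
        if word.lower().startswith(first):
            return " ".join(window[i:])
    return None

print(match_definition("HRCT", ["Bij", "patiënten", "werd", "hoge-resolutie-computertomografie"]))
print(match_definition("HAV", ["het", "hepatitis-A-virus"]))
print(match_definition("HAART", ["krachtige", "antiretrovirale", "therapie"]))
```

With a leftmost match, the article "het" is absorbed into the definition of HAV, and an English abbreviation with a Dutch full form such as HAART finds no match at all.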
Abbreviations
        precision  recall    FB1
TvG         83.89   78.64  81.18
NTvG        82.05   83.07  82.56
Definitions
        precision  recall    FB1
TvG         74.49   83.36  78.68
NTvG        68.03   85.50  75.77
Table: Results of the pattern-based approach
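Assuming FB1 denotes the F-score with β = 1 (the harmonic mean of precision and recall), the FB1 column can be recomputed from the other two columns. The function name is hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; with beta = 1 this is the harmonic mean of P and R."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall (%) copied from the results table.
results = {
    ("Abbreviations", "TvG"): (83.89, 78.64),
    ("Abbreviations", "NTvG"): (82.05, 83.07),
    ("Definitions", "TvG"): (74.49, 83.36),
    ("Definitions", "NTvG"): (68.03, 85.50),
}

for (task, journal), (p, r) in results.items():
    print(task, journal, round(f_beta(p, r), 2))
```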
Error Analysis
Errors in abbreviation detection step
- Titles printed in capital letters
- Roman numerals confused with capitalized i, v or x
- single letters which are not abbreviations (e.g. hepatitis A)
- abbreviations with word-internal capital letters (e.g. mmHg (EN: Torr))
- abbreviations with no typical orthographical characteristics (e.g. min)
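To illustrate the Roman numeral confusion: a token such as IV passes any all-capitals test and is also a well-formed Roman numeral. The check below is a hypothetical helper, not part of the described system, using the common regular expression for numerals up to 3999.

```python
import re

# Standard-form Roman numerals from 1 to 3999.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def is_roman_numeral(token):
    """True if the token is a non-empty, well-formed uppercase Roman numeral."""
    return bool(token) and ROMAN.fullmatch(token) is not None

for token in ["IV", "XI", "CD", "MI", "HIV"]:
    print(token, is_roman_numeral(token))
```

Letter sequences such as CD or MI are well-formed numerals too, so a purely orthographic filter that drops Roman numerals would also drop any abbreviation spelled that way.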
Errors in definition matching step
- error percolation- mislinked words (e.g. het hepatitis-A-virus (HAV))- function words (e.g. op evidentie gebaseerde zorg (EBZ)
(EN: evidence-based medicine (EBM))
- English abbreviations with a Dutch definition- contractions (e.g. therapiegebonden secundaire
myelodysplasie (t - MDS) en acute leukemie (t - AL).(EN: the incidence of therapy-related secondary myelodysplasia(t-MDS) and acute leukemia (t-AL).)
LT3, Language and Translation Technology TeamUniversity College Ghent
{klaar.vanopstal,bart.desmet,veronique.hoste}@hogent.beDepartment of Applied Mathematics & Computer Science
Ghent UniversityKrijgslaan 281 (S9), 9000 Gent, Belgium
Page 57
BackgroundAnnotation
Pattern-based approachLearning-based approach
Conclusions and future work
Related researchOwn approachResults
Errors in definition matching step
- error percolation- mislinked words (e.g. het hepatitis-A-virus (HAV))- function words (e.g. op evidentie gebaseerde zorg (EBZ)
(EN: evidence-based medicine (EBM))- English abbreviations with a Dutch definition
- contractions (e.g. therapiegebonden secundairemyelodysplasie (t - MDS) en acute leukemie (t - AL).(EN: the incidence of therapy-related secondary myelodysplasia(t-MDS) and acute leukemia (t-AL).)
LT3, Language and Translation Technology TeamUniversity College Ghent
{klaar.vanopstal,bart.desmet,veronique.hoste}@hogent.beDepartment of Applied Mathematics & Computer Science
Ghent UniversityKrijgslaan 281 (S9), 9000 Gent, Belgium
Page 58
BackgroundAnnotation
Pattern-based approachLearning-based approach
Conclusions and future work
Related researchOwn approachResults
Errors in definition matching step
- error percolation- mislinked words (e.g. het hepatitis-A-virus (HAV))- function words (e.g. op evidentie gebaseerde zorg (EBZ)
(EN: evidence-based medicine (EBM))- English abbreviations with a Dutch definition- contractions (e.g. therapiegebonden secundaire
myelodysplasie (t - MDS) en acute leukemie (t - AL).(EN: the incidence of therapy-related secondary myelodysplasia(t-MDS) and acute leukemia (t-AL).)
LT3, Language and Translation Technology TeamUniversity College Ghent
{klaar.vanopstal,bart.desmet,veronique.hoste}@hogent.beDepartment of Applied Mathematics & Computer Science
Ghent UniversityKrijgslaan 281 (S9), 9000 Gent, Belgium
Page 59
BackgroundAnnotation
Pattern-based approachLearning-based approach
Conclusions and future work
Related researchOwn approachResults
Learning-based approach - Related research
Often in combination with pattern-based techniques, e.g. Stanford Medical Abbreviation Method (2006), Chang et al. (2002)
Pattern-based detection of abbreviations + learning-based matching with definitions
Examples of features:
- % of characters aligned at beginning of word
- % of characters aligned on syllable boundary
- number of words that were skipped (negative weight)
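A rough sketch of how two of these features could be computed for one abbreviation/definition pair. The greedy left-to-right alignment, the function name and the returned keys are assumptions made for illustration; they are not taken from the cited systems, and the syllable-boundary feature is left out.

```python
def alignment_features(abbreviation, definition):
    """Greedily align abbreviation letters to word-initial characters.

    Assumption: each letter is matched to the first not-yet-passed word
    that starts with it; letters without such a word stay unaligned.
    """
    words = definition.lower().split()
    letters = [c for c in abbreviation.lower() if c.isalnum()]
    used = []      # indices of words that received a letter
    position = 0   # first word index still available for matching
    for letter in letters:
        for i in range(position, len(words)):
            if words[i].startswith(letter):
                used.append(i)
                position = i + 1
                break
    aligned = len(used) / len(letters) if letters else 0.0
    # Words inside the aligned span that received no letter count as skipped.
    skipped = (used[-1] - used[0] + 1 - len(used)) if used else 0
    return {"aligned_at_word_start": aligned, "words_skipped": skipped}


print(alignment_features("HAV", "hepatitis A virus"))
print(alignment_features("EBZ", "op evidentie gebaseerde zorg"))
```

In the second call the B of EBZ has no word-initial match, so only two of the three letters align and one word inside the span is skipped. A learner would receive such values as features, with the skipped-word count weighted negatively.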
Own approach
Preprocessing steps:
- tokenization
- POS tagging + NP chunking (Daelemans & van den Bosch, 2005)
Learning experiments
YamCha (Kudo & Matsumoto, 2003): open source sequence tagger using SVM
10-fold cross-validation
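A minimal sketch of a 10-fold split, written in plain Python rather than against YamCha itself. Splitting at sentence level, the round-robin assignment to folds and the name `k_fold_splits` are assumptions; the slides do not say how the folds were built.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs so that each item is tested exactly once."""
    # Round-robin assignment: item n goes to fold n % k.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder data standing in for the annotated sentences.
sentences = [f"sentence_{n}" for n in range(25)]
splits = list(k_fold_splits(sentences, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

Each fold would be used once for testing while the tagger is trained on the other nine, and the scores are then aggregated over the ten runs.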
Feature vector:
- token
- POS
- name initials
- sentence-initial position
- morphological features (initial capital letter, completely capitalized, internal capital letters, lowercased, Roman numeral, punctuation, hyphens, exclusively consonants)
- prefix and suffix information
- symbolic word shape feature: all morphological (binary) features
- feature to match 1st letter of abbreviation against words in 3*N sequence
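The binary morphological features and the word shape built from them could look roughly as follows. The feature names, the vowel set (including y), the Roman numeral pattern and the bit-string encoding of the word shape are assumptions made for this sketch; the slides only name the features.

```python
import re
import string

# Standard Roman numeral pattern; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[IVXLCDM]+$)M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
VOWELS = set("aeiouy")  # assumption: y is treated as a vowel


def morphological_features(token):
    """Return the eight binary morphological features for one token."""
    letters = [c for c in token if c.isalpha()]
    return {
        "initial_capital": token[:1].isupper(),
        "all_capitals": bool(letters) and all(c.isupper() for c in letters),
        "internal_capital": any(c.isupper() for c in token[1:]),
        "lowercased": bool(letters) and all(c.islower() for c in letters),
        "roman_numeral": bool(ROMAN.match(token)),
        "punctuation": any(c in string.punctuation for c in token),
        "hyphen": "-" in token,
        "only_consonants": bool(letters)
        and not any(c.lower() in VOWELS for c in letters),
    }


def word_shape(token):
    """Concatenate the binary features into one symbolic shape string."""
    return "".join("1" if v else "0" for v in morphological_features(token).values())


for tok in ["mmHg", "HAV", "XI", "t-MDS", "min"]:
    print(tok, word_shape(tok))
```

Combining the flags into one symbol lets the learner treat a recurring combination, such as internal capital plus only consonants for mmHg, as a single feature value.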
Results

Abbreviations
        precision  recall  FB1
TvG     95.31      92.26   93.76
NTvG    96.76      90.97   93.78

Definitions
        precision  recall  FB1
TvG     86.92      78.18   82.32
NTvG    87.19      78.00   82.34

Table: Ten-fold cross-validation results of the learning experiments.
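The FB1 column is consistent with the usual F-score at beta = 1, the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the table; the function name `f_beta` is an assumption.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs taken from the table above.
rows = {
    "abbreviations TvG": (95.31, 92.26),
    "abbreviations NTvG": (96.76, 90.97),
    "definitions TvG": (86.92, 78.18),
    "definitions NTvG": (87.19, 78.00),
}
for name, (p, r) in rows.items():
    print(f"{name}: {f_beta(p, r):.2f}")
```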
Conclusions
annotated dataset of +/- 67,000 words (Dutch, medical)
2 approaches: pattern-based and classification-based
classification-based approach outperforms the pattern-based approach on both tasks:
- abbreviation detection: 93% F-score
- definition matching: 82% F-score
Future work
incorporate information from error analysis into learning approach
apply decompounding techniques (syllabic initializations)
cross-lingual matching: external sources + MT techniques
undefined abbreviations: external sources
F-scores per label (now focus on abbreviations and definitions)
English corpus