Supporting e-learning with automatic glossary extraction: Experiments with Portuguese
Post on 20-Jan-2016
Rosa Del Gaudio, António Branco
RANLP, Borovets 2007
Presentation Plan
● LT4eL project
● ILIAS
● Corpus
● Tool
● Grammars
  ● Copula
  ● Other Verbs
  ● Punctuation
● Results
● Conclusion
LT4eL
● Improve the retrieval and accessibility of learning objects (LOs) in learning management systems.
● Employ language technology resources and tools for the semi-automatic generation of descriptive metadata.
● Develop new functionalities, such as a keyword extractor, a glossary candidate detector, and semantic search, tuned for the languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian).
ILIAS
Objective
● Build a glossary automatically to support the e-learning process. In practice, this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages).
● Better access to information for students
● Accelerate the work of the tutor
ILIAS: Glossary Candidate Detector
The Corpus
• 274,000 tokens
• Tutorials
• PhD theses
• Scientific papers
• 3 domains, evenly represented
  • e-learning
  • Technology for non-experts
  • Calimera
XML format
<definingText continue="y" def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
  </markedTerm>
  <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
  <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
  <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
  <tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs" sp="y">desenvolvida</tok>
  <tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
  <tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms" sp="y">processamento</tok>
  <tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
  <tok base="informação" class="word" ctag="CN" id="t9040" msd="fp" sp="y">informações</tok>
  <tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
  <tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
  <tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
  <tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
  <tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
  <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
</definingText>
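To make the annotation format concrete, here is a minimal Python sketch that reads a `definingText` element and pulls out the definition type, the marked term, and the tokens. The sample string is a shortened copy of the element above (only five of its tokens are kept), and the helper name `read_definition` is an illustrative assumption, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above: only five of the original tokens are kept.
SAMPLE = """<definingText continue="y" def="m147" def_type1="is_def" id="d5">
<markedTerm dt="y" id="m147" kw="y"><tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok></markedTerm>
<tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
<tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
<tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
<tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
</definingText>"""


def read_definition(xml_text):
    """Return the definition type, the marked term, and (text, ctag, base) per token."""
    root = ET.fromstring(xml_text)
    # The defined term is the text of the tokens wrapped in <markedTerm>.
    term = " ".join(tok.text for tok in root.findall("./markedTerm/tok"))
    # iter() walks all <tok> descendants in document order.
    tokens = [(tok.text, tok.get("ctag"), tok.get("base")) for tok in root.iter("tok")]
    return {"type": root.get("def_type1"), "term": term, "tokens": tokens}


info = read_definition(SAMPLE)
print(info["type"], info["term"])
print([text for text, _, _ in info["tokens"]])
```

The grammars on the following slides match on the same attributes (`ctag`, `base`, `msd`) that this sketch exposes.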
LxTransduce
• Input: plain text or XML
• Regular expressions
• Substitution and markup
• Output: the same file with the changes applied
• Match tree using elements
• Quick
• Unicode friendly
• Freeware
• Easy to integrate in other tools (Java)
Rules in lxtransduce
<rule name="Conj">
  <query match="tok[@ctag = 'CJ']"/>
</rule>

<rule name="Coor">
  <!-- Conjunctions or comma -->
  <first>
    <query match="tok[. = ',']"/>
    <ref name="Conj" mult="+"/>
  </first>
</rule>

<rule name="PARopen">
  <query match="tok[.~'^\($']"/>
</rule>

<rule name="PARcl">
  <query match="tok[.~'^\)$']"/>
</rule>

<rule name="parenthetic">
  <seq>
    <ref name="PARopen"/>
    <repeat-until name="tok">
      <ref name="PARcl"/>
    </repeat-until>
    <ref name="PARcl"/>
  </seq>
</rule>
First development phase
● Less than 50% of the corpus
● Focus on the verb
● Precision: correct automatic / all automatic
● Recall: correct automatic / manually marked
● F2 = 3 * (precision * recall) / (2 * precision + recall)
        Precision  Recall  F2
Gr 00   0.14       0.44    0.26
Gr 01   0.31       0.20    0.22
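The F2 measure on this slide can be checked with a few lines of Python. This is a sketch under one assumption: the slide's formula is read as the weighted F-measure (1 + w) * P * R / (w * P + R) with weight w = 2, which gives 3 * P * R / (2 * P + R). The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, weight=2.0):
    """Weighted F-measure: (1 + weight) * P * R / (weight * P + R).

    With weight=2 this is the slide's F2 = 3 * P * R / (2 * P + R).
    """
    denominator = weight * precision + recall
    if denominator == 0:
        return 0.0  # no correct extractions at all
    return (1 + weight) * precision * recall / denominator


# Gr 00 figures from the slide: precision 0.14, recall 0.44.
print(round(f_measure(0.14, 0.44), 2))  # 0.26
```

Applied to the rounded Gr 01 figures (precision 0.31, recall 0.20), the same function gives roughly 0.227, slightly above the 0.22 shown on the slide; the gap may come from the slide's score being computed on unrounded values.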
Second development phase
• 75% of the corpus for development
• 25% of the corpus for testing
• Specific grammar/rules for each type
Copula baseline grammar
<rule name="euristic">
  <seq>
    <repeat-until name="tok">
      <ref name="SERdef" mult="+"/>
    </repeat-until>
    <ref name="SERdef" mult="+"/>
    <not><ref name="PPA"/></not>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>
Verb "to be", third person singular or plural, present indicative:
<rule name="SERdef">
  <best>
    <ref name="Ser3"/>
    <ref name="PoderSer"/>
  </best>
</rule>
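The idea behind the baseline heuristic can be sketched in Python: a sentence is a definition candidate if it contains a third-person present-indicative form of "ser" that is not immediately followed by a past participle (`PPA`). This is a simplified illustration, not the lxtransduce grammar itself. It models only the `Ser3` alternative of `SERdef` and leaves out `PoderSer`. Tokens are assumed to be dicts carrying the `base`, `ctag`, and `msd` attributes of the corpus, and the second example sentence is hypothetical.

```python
def is_ser3(tok):
    """True for 'ser' in the third person (singular or plural), present indicative."""
    return (
        tok["ctag"] == "V"
        and tok["base"] == "ser"
        and tok.get("msd", "").startswith("pi-3")
    )


def copula_candidate(tokens):
    """Baseline check: first run of 'ser' forms, then a token that is not a PPA."""
    for i, tok in enumerate(tokens):
        if is_ser3(tok):
            j = i
            while j < len(tokens) and is_ser3(tokens[j]):
                j += 1  # skip the whole run of copula forms
            # Assumption: a sentence that ends right after the verb is rejected.
            return j < len(tokens) and tokens[j]["ctag"] != "PPA"
    return False


definition = [
    {"base": "intranet", "ctag": "PNM"},
    {"base": "ser", "ctag": "V", "msd": "pi-3s"},
    {"base": "uma", "ctag": "UM", "msd": "fs"},
    {"base": "rede", "ctag": "CN", "msd": "fs"},
]
# Hypothetical passive sentence: "rede é desenvolvida".
passive = [
    {"base": "rede", "ctag": "CN", "msd": "fs"},
    {"base": "ser", "ctag": "V", "msd": "pi-3s"},
    {"base": "desenvolver,desenvolvido", "ctag": "PPA", "msd": "fs"},
]
print(copula_candidate(definition), copula_candidate(passive))  # True False
```

Excluding the participle filters out passive constructions, which use the same verb without introducing a definition.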
Copula baseline results
• Sentence level results
• Problem with precision
Copula Grammar
Rules for is_type
<!-- To be, 3rd person plural and singular -->
<rule name="Serdef">
  <query match="tok[@ctag = 'V' and @base = 'ser' and (@msd[starts-with(., 'fi-3')] or @msd[starts-with(., 'pi-3')])]"/>
</rule>
....
<rule name="copula1">
  <seq>
    <ref name="SERdef"/>
    <best>
      <seq>
        <ref name="Art"/>
        <ref name="adj|adv|prep|" mult="*"/>
        <ref name="Noun" mult="+"/>
      </seq>
      ....
    </best>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>
Comparing Results
Include the patterns that were excluded
Try to gather the syntactic patterns of non-definitions and compare them with the syntactic patterns of definitions.
Other_Verbs grammar
• Collect verbs in a lexicon
• Three different categories: reflexive, active, passive
• 22 different verbs
<lex word="chamar"><cat>ref</cat></lex>
<lex word="chamar,chamado"><cat>pas</cat></lex>
<rule name="Vpas">
  <seq>
    <ref name="tok"/>
    <not><ref name="not"/></not>
    <ref name="tok" mult="?"/>
    <query match="tok[mylex(@base) and (@ctag='PPA')]" constraint="mylex(@base)/cat='pas'"/>
  </seq>
</rule>
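The lexicon lookup used by the `Vpas` query can be illustrated with a small Python sketch. It covers only the final test of the rule (a past participle whose base form is listed with category `pas`), not the preceding context tokens. The dictionary holds just the two entries shown on the slide, and the names `VERB_LEXICON` and `is_passive_definition_verb` are illustrative assumptions.

```python
# Only the two entries shown on the slide; the full lexicon has 22 verbs.
VERB_LEXICON = {
    "chamar": {"ref"},          # reflexive use
    "chamar,chamado": {"pas"},  # passive use (participle base form)
}


def is_passive_definition_verb(tok):
    """True if the token is a past participle listed as 'pas' in the lexicon."""
    categories = VERB_LEXICON.get(tok["base"], set())
    return tok["ctag"] == "PPA" and "pas" in categories


print(is_passive_definition_verb({"base": "chamar,chamado", "ctag": "PPA"}))
print(is_passive_definition_verb({"base": "desenvolver,desenvolvido", "ctag": "PPA"}))
```

Keeping the verbs in a lexicon rather than in the rules means a new verb can be added without touching the grammar.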
Results for verb_type
• Analyze each verb separately, as with is_type
• Richer syntactic patterns
Punctuation Grammar
<rule name="punct_def">
  <seq>
    <start/>
    <ref name="CompmylexSN" mult="+"/>
    <query match="tok[.~'^:$']"/>
    <ref name="tok" mult="+"/>
    <end/>
  </seq>
</rule>
● Preliminary work
● Definitions introduced by a colon (the most frequent case)
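A colon-introduced definition can be sketched in Python as "nominal material, a colon token, then at least one more token". This is only an approximation of the rule above. The contents of `CompmylexSN` are not shown on the slide, so the check that every token before the colon carries a nominal tag (`CN` or `PNM`) is a stand-in assumption, and the example sentence is hypothetical.

```python
NOMINAL_TAGS = {"CN", "PNM"}  # assumption: stand-in for the CompmylexSN rule


def split_colon_definition(tokens):
    """Return (term, definition) for a 'term : definition' sentence, else None.

    `tokens` is a list of (text, ctag) pairs.
    """
    for i, (text, _) in enumerate(tokens):
        if text == ":":
            left, right = tokens[:i], tokens[i + 1:]
            if left and right and all(tag in NOMINAL_TAGS for _, tag in left):
                term = " ".join(word for word, _ in left)
                definition = " ".join(word for word, _ in right)
                return term, definition
            return None  # only the first colon is considered
    return None


# Hypothetical sentence: "Intranet : rede de informações"
sentence = [
    ("Intranet", "PNM"),
    (":", "PNT"),
    ("rede", "CN"),
    ("de", "PREP"),
    ("informações", "CN"),
]
print(split_colon_definition(sentence))
```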
All-in-one
• Combination of the previous grammars
• The definition type is not taken into account when calculating precision and recall
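Type-insensitive scoring can be sketched as follows: a sentence counts as correct whenever it is marked as a definition in both the manual and the automatic annotation, whatever type each side assigned. The sentence identifiers and type labels in the example are made up for illustration, and `evaluate` is a hypothetical helper.

```python
def evaluate(gold, predicted):
    """Sentence-level precision and recall, ignoring the definition type.

    Both arguments map a sentence id to a type label; only the ids are compared.
    """
    gold_ids = set(gold)
    predicted_ids = set(predicted)
    correct = len(gold_ids & predicted_ids)
    precision = correct / len(predicted_ids) if predicted_ids else 0.0
    recall = correct / len(gold_ids) if gold_ids else 0.0
    return precision, recall


# Made-up annotations: s1 is found with a different type and still counts as correct.
gold = {"s1": "is_def", "s2": "verb_def", "s3": "punct_def"}
predicted = {"s1": "verb_def", "s2": "verb_def", "s4": "is_def", "s5": "is_def"}
precision, recall = evaluate(gold, predicted)
print(precision, round(recall, 2))  # 0.5 0.67
```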
Conclusions and Future Work
• Overall results: Recall 86%, Precision 14%
• Differences among domains: the style of a document influences the results.
• Improve the rules for verb_type and punc_type
• Combine with other techniques, such as machine learning (ML)