A simple method for citation metadata extraction using hidden Markov models Erik Hetzner (California Digital Library) JCDL 2008

Feb 25, 2022

Page 1

A simple method for citation metadata extraction using hidden Markov models

Erik Hetzner (California Digital Library)

JCDL 2008

Page 2

Advantages of our method

⊲ Good performance on homogeneous citations.

⊲ Reasonable performance on heterogeneous citations.

⊲ Extractor can be implemented in a few pages of code.

Page 3

Improving HMM performance

⊲ Reduce the size of the alphabet by mapping words to a smaller set of symbols.

⊲ Use two states for each label: first & rest.

⊲ Use ‘separator states’, one for each possible transition between labels.

Page 4

Hidden Markov models

[Figure: diagram of an example hidden Markov model; visible labels include the symbols a and b and the probabilities .25, .5, and .75.]
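A minimal sketch of how a most likely state sequence can be decoded from a model like this one. It uses the standard Viterbi algorithm, which the slide itself does not name. The two states `s0` and `s1` and every probability below are made up for illustration and are not read off the diagram.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a sequence of observed symbols."""
    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] + math.log(trans_p[p][s]) + math.log(emit_p[s][obs]), p)
                for p in states
            )
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical toy model: all probabilities are nonzero so the logs are defined.
STATES = ["s0", "s1"]
START = {"s0": 0.5, "s1": 0.5}
TRANS = {"s0": {"s0": 0.75, "s1": 0.25}, "s1": {"s0": 0.25, "s1": 0.75}}
EMIT = {"s0": {"a": 0.75, "b": 0.25}, "s1": {"a": 0.25, "b": 0.75}}

path = viterbi(list("aabb"), STATES, START, TRANS, EMIT)
```

Working in log space avoids numeric underflow on long symbol sequences such as full citations.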

Page 5

Alphabet of symbols: words?

exorcised throed deposed roil vaporized rattletrap mocking prohibit sleetier effectual tweeter decremented atrophied nearby captor earn oboe ticked inoculate algorithmic extremist inherited burping silenced harassment doctrinaire emptiest tarting freewheeled parqueting gentlewoman optimal dashboard taskmaster acceptance mucky prototyping virtual recapture perpetrate junking rewrote goody cooperated mottling yahoo gridiron successfully bumper siphoned witchcraft jettison capering grouchier disallowed eyeballing medic sullen certitude tearier parlor becoming morphological cognomen saddening apprenticed signpost lignite wishing boldface postage audibility jingoistic lousy reacted rivulet arboreal primping eddy belatedly necessity ordinance retrogressed perverting sponging neutralizer deadlier inferential easel aptly trapeze circumlocution descanted caressing redeemable entice thunderstruck lectured postmarking twanged bellowing rainier grouching cozier flimsiest grizzly decorously jawboning tinier crookeder liberation sleeting heehawed puffin paisley daunt screenwriter …

Page 6

Alphabet of symbols: keywords

wAND wAPPEAR wCOMMUNICATIONS wCONFERENCE wDE wDISSERTATION wEDITOR wIN wINC wJOURNAL wNOTICES wNUMBER wPAGES wPHD wPRESS wPROCEEDINGS wREPORT wSUBMITTED wTECHNICAL wTHESIS wTRANSACTIONS wUNIVERSITY wVAN wVOLUME

Page 7

Alphabet of symbols: punctuation

pPERIOD pCOMMA pLEFTPAREN pRIGHTPAREN pLEFTBRACKET pRIGHTBRACKET pHYPEN pCOLON pSEMICOLON pQUESTIONMARK pMISC pAPOSTROPHE pDOUBLEQUOTE pSINGLEQUOTE

Page 8

Alphabet of symbols: word classes

wMONTH wSEASON

Page 9

Alphabet of symbols: features

fINITIAL fTC fUPPER fLOWER fNUMERAL4 fNUMERAL fMIXED

Page 10

Tokens → symbols

^[aA][nN][dD]$ → wAND
^[Jj]an(uary)?$ → cMONTH
^\.$ → pPERIOD
^,$ → pCOMMA
^[A-Z]$ → fINITIAL
^[A-Z][A-Z]+$ → fUPPER
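The rules on this slide can be sketched as an ordered table of regular expressions in which the first matching rule decides the symbol. The six patterns come from the slide; the function name `to_symbol` and the `UNKNOWN` fallback are hypothetical additions for illustration.

```python
import re

# Ordered (pattern, symbol) pairs from the slide; the first match wins.
RULES = [
    (re.compile(r"^[aA][nN][dD]$"), "wAND"),
    (re.compile(r"^[Jj]an(uary)?$"), "cMONTH"),
    (re.compile(r"^\.$"), "pPERIOD"),
    (re.compile(r"^,$"), "pCOMMA"),
    (re.compile(r"^[A-Z]$"), "fINITIAL"),
    (re.compile(r"^[A-Z][A-Z]+$"), "fUPPER"),
]

def to_symbol(token, default="UNKNOWN"):
    """Map one token to its symbol; `default` is a hypothetical fallback."""
    for pattern, symbol in RULES:
        if pattern.match(token):
            return symbol
    return default

symbols = [to_symbol(t) for t in ["AND", "MIT", "P", ".", ",", "Jan"]]
```

Rule order matters here: `AND` also matches the fUPPER pattern, but the keyword rule comes first and wins.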

Page 11

Tokens → symbols

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

Page 12

Tokens → symbols

fTC, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

Page 13

Tokens → symbols

fTCpCOMMA Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

Page 14

Tokens → symbols

fTCpCOMMA fTC fINITIALpPERIODpCOMMA wAND fTC fTCpPERIOD wTHE fTC fTCpPERIOD fMIXED wEDITIONpPERIOD fTCpCOMMA fTCpPERIODpCOLON wTHE fUPPER fTCpCOMMA fNUMERAL4pPERIOD
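A rough sketch of the whole token-to-symbol pass over the example citation. Words and punctuation marks are treated as separate tokens, and each token is replaced in place so the output keeps the citation's spacing, as on the slides. The keyword subset, the feature patterns (for example, title case for fTC), and all function names are assumptions made for illustration; the tokenizer handles ASCII only.

```python
import re

CITATION = ("Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. "
            "4th Edition. Cambridge, Mass.: The MIT Press, 1995.")

TOKEN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")  # a word, or one punctuation mark

PUNCT = {".": "pPERIOD", ",": "pCOMMA", ":": "pCOLON"}
KEYWORDS = {"and": "wAND", "the": "wTHE", "edition": "wEDITION"}  # illustrative subset
FEATURES = [  # assumed definitions, checked in order
    (re.compile(r"^[A-Z]$"), "fINITIAL"),
    (re.compile(r"^[A-Z][A-Z]+$"), "fUPPER"),
    (re.compile(r"^[A-Z][a-z]+$"), "fTC"),
    (re.compile(r"^[a-z]+$"), "fLOWER"),
    (re.compile(r"^[0-9]{4}$"), "fNUMERAL4"),
    (re.compile(r"^[0-9]+$"), "fNUMERAL"),
]

def symbol_for(token):
    """Map one token to a symbol: punctuation, then keywords, then features."""
    if not token.isalnum():
        return PUNCT.get(token, "pMISC")
    if token.lower() in KEYWORDS:
        return KEYWORDS[token.lower()]
    for pattern, symbol in FEATURES:
        if pattern.match(token):
            return symbol
    return "fMIXED"

def tokenize(citation):
    return TOKEN.findall(citation)

def render(citation):
    """Replace each token with its symbol, leaving whitespace untouched."""
    return TOKEN.sub(lambda m: symbol_for(m.group()), citation)

symbols = [symbol_for(t) for t in tokenize(CITATION)]
rendered = render(CITATION)
```

Here `symbols` is the flat sequence an HMM would consume, while `rendered` is only a display form in which punctuation symbols sit directly after the preceding word symbol.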

Page 15

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: state a:f; current symbol fTC]

Page 16

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r; current symbol pCOMMA]

Page 17

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r; current symbol fTC]

Page 18

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r; current symbol fINITIAL]

Page 19

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r; current symbol pPERIOD]

Page 20

Separator states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r, a|a; current symbol pCOMMA]

Page 21

Separator states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r, a|a; current symbol wAND]

Page 22

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r, a|a; current symbol fTC]

Page 23

Label states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r, a|a; current symbol fTC]

Page 24

Separator states

Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

[Diagram: states a:f, a:r, a|a, a|t; current symbol pPERIOD]
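The state layout walked through on the preceding slides can be sketched as follows: two states per label (`:f` for the first token, `:r` for the rest) plus one separator state for each ordered pair of labels. The function name `build_states` is hypothetical, `a` and `t` are read here as author and title, and `d` is a made-up third label. The example path is one plausible reading of the slide sequence, not a confirmed output of the author's system.

```python
from itertools import product

def build_states(labels):
    """Two states per label (first, rest) plus one separator state per ordered label pair."""
    label_states = [f"{label}:{part}" for label in labels for part in ("f", "r")]
    separator_states = [f"{a}|{b}" for a, b in product(labels, repeat=2)]
    return label_states + separator_states

# "a" and "t" appear on the slides; "d" is an invented label for illustration.
STATES = build_states(["a", "t", "d"])

# One plausible reading of the slides for "Friedman, Daniel P., and Matthias Felleisen."
SYMBOLS = ["fTC", "pCOMMA", "fTC", "fINITIAL", "pPERIOD",
           "pCOMMA", "wAND", "fTC", "fTC", "pPERIOD"]
PATH = ["a:f", "a:r", "a:r", "a:r", "a:r",
        "a|a", "a|a", "a:f", "a:r", "a|t"]
```

With n labels this gives 2n label states and n² separator states, so the state space stays small even though every label-to-label transition has its own state.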

Page 25

Results on the Cora dataset

token           .944
field           .892
whole instance  .613
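The slide does not define the three measures, so the sketch below rests on assumed definitions: token accuracy is the share of tokens with the correct label, field accuracy is the share of gold fields (maximal runs of one label) reproduced with the same span and label, and whole-instance accuracy is the share of citations labeled without any error. The function names and the toy label sequences are invented for illustration and are not Cora data.

```python
def token_accuracy(gold, pred):
    """Share of tokens whose predicted label equals the gold label."""
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)

def fields(labels):
    """Return the set of (start, end, label) runs of identical adjacent labels."""
    spans, start = set(), 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            spans.add((start, i, labels[start]))
            start = i
    return spans

def field_accuracy(gold, pred):
    """Share of gold fields that appear in the prediction with the same span and label."""
    total = correct = 0
    for gs, ps in zip(gold, pred):
        gold_fields = fields(gs)
        total += len(gold_fields)
        correct += len(gold_fields & fields(ps))
    return correct / total

def instance_accuracy(gold, pred):
    """Share of citations whose label sequence is entirely correct."""
    return sum(gs == ps for gs, ps in zip(gold, pred)) / len(gold)

# Toy data: two citations, one of them with a single mislabeled token.
GOLD = [["author", "author", "title", "title", "date"], ["author", "title", "title"]]
PRED = [["author", "author", "title", "date", "date"], ["author", "title", "title"]]
```

On the toy data one wrong token costs two fields and the whole instance, which shows why the three figures can diverge as widely as they do on the slide.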

Page 26

Improving HMM performance

⊲ Reduce the size of the alphabet by mapping words to a smaller set of symbols.

⊲ Use two states for each label: first & rest.

⊲ Use ‘separator states’, one for each possible transition between labels.

Page 27

Erik Hetzner

[email protected]

http://purl.net/net/egh/hmm cite parser/