Advanced citation matching and large-scale cited reference extraction Nees Jan van Eck Centre for Science and Technology Studies (CWTS), Leiden University EXCITE Workshop 2017: “Challenges in Extracting and Managing References” Cologne, Germany, March 30, 2017
Advanced citation matching and large-scale cited reference extraction
Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
EXCITE Workshop 2017: “Challenges in Extracting and Managing References”
Cologne, Germany, March 30, 2017
Outline
• Citation matching
– Comparison of the accuracy of the Web of Science, CWTS, and iFQ citation matching algorithms
• Cited reference extraction
– Assessment of the accuracy of cited references in Web of Science based on Elsevier ScienceDirect data
Accuracy of the WoS, CWTS, and iFQ citation matching algorithms
Citation matching problem
References
…
[1] Hirsch, JE (2005) PNAS, 102, p.16569
[2] Egghe, L (2006) Scientist, 20, p.15
…

Bibliographic database (WoS, Scopus)
• An index to quantify an individual's scientific research output
  Hirsch, JE
  PNAS, 102(46), p.16569-72
  UT: 000233462900010
  Abstract…
• How to improve the h-index
  Egghe, L
  The Scientist, 20(3), p.15
  UT: 000235634200013
  Abstract…
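The matching problem above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of WoS, CWTS, or iFQ: the regular expression `REF_PATTERN`, the `CitedRef` type, and the in-memory `DATABASE` dictionary are hypothetical, and the publication years in the database keys are assumed to equal the years given in the references. The source title is deliberately left out of the lookup key, because "Scientist" and "The Scientist" would not compare equal.

```python
import re
from typing import NamedTuple, Optional


class CitedRef(NamedTuple):
    """Fields extracted from one cited reference string (hypothetical type)."""
    author: str      # first author, upper-cased: "HIRSCH JE"
    year: int
    source: str      # source title as written in the reference
    volume: str
    first_page: str


# Hypothetical pattern for the short style used on the slide, e.g.
# "[1] Hirsch, JE (2005)PNAS, 102, p.16569"
REF_PATTERN = re.compile(
    r"(?:\[\d+\]\s*)?"                                  # optional "[1]" label
    r"(?P<last>[^,\[\]]+),\s*(?P<initials>[A-Z]+)\s*"   # "Hirsch, JE"
    r"\((?P<year>\d{4})\)\s*"                           # "(2005)"
    r"(?P<source>[^,]+),\s*(?P<volume>\d+),\s*"         # "PNAS, 102,"
    r"p\.\s*(?P<page>\d+)"                              # "p.16569"
)

# Toy stand-in for the bibliographic database: lookup key -> UT identifier.
# Keys are (first author, publication year, volume, first page).
DATABASE = {
    ("HIRSCH JE", 2005, "102", "16569"): "000233462900010",
    ("EGGHE L", 2006, "20", "15"): "000235634200013",
}


def parse_reference(text: str) -> Optional[CitedRef]:
    """Extract the matching fields, or return None if the style is unknown."""
    m = REF_PATTERN.search(text)
    if m is None:
        return None
    return CitedRef(
        author=f"{m.group('last').strip()} {m.group('initials')}".upper(),
        year=int(m.group("year")),
        source=m.group("source").strip().upper(),
        volume=m.group("volume"),
        first_page=m.group("page"),
    )


def match_reference(ref: Optional[CitedRef]) -> Optional[str]:
    """Return the UT of the exactly matching record, or None."""
    if ref is None:
        return None
    return DATABASE.get((ref.author, ref.year, ref.volume, ref.first_page))


if __name__ == "__main__":
    for line in ("[1] Hirsch, JE (2005)PNAS, 102, p.16569",
                 "[2] Egghe, L (2006)Scientist, 20, p.15"):
        print(line, "->", match_reference(parse_reference(line)))
```

An exact lookup of this kind fails as soon as one field is wrong in the reference, which is the situation the following slides deal with.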
Why is citation matching difficult?
• ‘Big data’ problem
– No. of publications: 50 million
– No. of cited references: 1 billion
• Little data available on cited references in WoS
– First author (last name and initials)
– Source title (abbreviated)
– Publication year
– Volume number
– First page number
– (DOI)
• Errors in data
– Citation extraction errors
  • OCR errors
  • Interpretation errors due to different citation styles
• Little is known about the citation matching algorithm used in WoS
• Larsen et al. (2007) concluded from their investigation of missed matches in WoS that the algorithm is quite conservative and does not allow for any variations
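To make the contrast between a conservative matcher and a more tolerant one concrete, here is a minimal sketch that uses the five fields listed above. It is not the WoS or CWTS algorithm, whose rules are not given in this transcript. The `Ref` type, the field-agreement score, and the `max_mismatches` threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Ref:
    """The fields available for a cited reference (hypothetical type)."""
    author: str       # first author: last name and initials
    source: str       # abbreviated source title
    year: int
    volume: str
    first_page: str


FIELDS = ("author", "source", "year", "volume", "first_page")


def agreement(cited: Ref, record: Ref) -> int:
    """Number of fields on which the cited reference and the record agree."""
    return sum(getattr(cited, f) == getattr(record, f) for f in FIELDS)


def exact_match(cited: Ref, records: List[Ref]) -> Optional[Ref]:
    """Conservative matching: every field must agree, no variation allowed."""
    hits = [r for r in records if agreement(cited, r) == len(FIELDS)]
    return hits[0] if len(hits) == 1 else None


def tolerant_match(cited: Ref, records: List[Ref],
                   max_mismatches: int = 1) -> Optional[Ref]:
    """Accept the best-scoring record if it disagrees on at most
    `max_mismatches` fields and no other record scores equally well."""
    threshold = len(FIELDS) - max_mismatches
    scored = [(agreement(cited, r), r) for r in records]
    best = max((score for score, _ in scored), default=0)
    if best < threshold:
        return None
    top = [r for score, r in scored if score == best]
    return top[0] if len(top) == 1 else None  # ambiguous ties are rejected


RECORDS = [
    Ref("HIRSCH JE", "PNAS", 2005, "102", "16569"),
    Ref("EGGHE L", "SCIENTIST", 2006, "20", "15"),
]

if __name__ == "__main__":
    # A cited reference with an incorrect publication year.
    cited = Ref("HIRSCH JE", "PNAS", 2006, "102", "16569")
    print(exact_match(cited, RECORDS))     # no match under the strict rule
    print(tolerant_match(cited, RECORDS))  # the Hirsch record is recovered
```

Raising `max_mismatches` recovers more missed matches but also increases the risk of false matches, so any real threshold would need to be validated against manually checked data.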
Citation matching algorithm of CWTS
• The aim is to overcome the problem of missed citation matches in WoS
– Error in metadata of cited reference (e.g., incorrect publication year or incorrect volume number): 16 (26.7%)
– Correct cited reference: 1 (1.5%)
Missing cited references in WoS (1) to (9) [slides 24 to 32; example content not captured in this transcript]
Incorrect cited references in WoS (1) to (5) [slides 33 to 37; example content not captured in this transcript]
Incorrect cited references in WoS (6)
WoS cited reference: WANG J, 2006, CHINESE CHEM LETT, V17, P49
Original cited reference in publication: J. Wang, J.K. Carson, M.F. North, D.J. Cleland, Int. J. Heat Mass Transfer 49 (17) (2006) 3075–3083.

WoS cited reference: KANBER B, 2013, CEREBROVASC DIS S2, V35, P21
Original cited reference in publication: Kanber B, Hartshorne TC, Horsfield MA, Naylor AR, Robinson TG, Ramnarine KV. Dynamic variations in the ultrasound gray-scale median of carotid artery plaques. Cardiovasc Ultrasound 2013a;11:21.

WoS cited reference: EVANS P, 2010, TLS-TIMES LIT S 0326, P30
Original cited reference in publication: Evans PD, Chowdhury MJA. Photoprotection of wood using polyester-type UV absorbers derived from the reaction of 2 hydroxy-4(2,3-epoxypropoxy)-benzophenone with dicarboxylic acid anhydrides. J Wood Chem Technol 2010;30:186–204.
Incorrect cited references in WoS (7)
WoS cited reference: CAO X, 2010, IEEE GLOBECOMM 2010, V2010, P1
Original cited reference in publication: Cao, X., Zong, Z., Ju, X., Sun, Y., Dai, C., Liu, Q., Jiang, J., 2010. Molecular cloning, characterization and function analysis of the gene encoding HMG-CoA reductase from Euphorbia Pekinensis Rupr. Mol. Biol. Rep. 37, 1559–1567.

WoS cited reference: LI XY, 2013, NANJING NONGYE DAXUE, V36, P36
Original cited reference in publication: X. Li, S. Wang, Y. Chen, G. Liu, X. Yang, Overexpression of CD40 in sacral chordomas and its correlation with low tumor recurrence, Onkologie 36 (10) (2013) 567–571

WoS cited reference: ZHANG K, 2014, IEEE T PATTERN ANAL, V1, P1
Original cited reference in publication: K. Zhang, H. Chen, G. Wu, K. Chen, H. Yang, High expression of SPHK1 in sacral chordoma and association with patients’ poor prognosis, Med. Oncol. 31 (11) (2014) 247.
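A comparison like the ones above can be partly automated. The sketch below splits a WoS-style cited reference string into fields and reports which of year, volume, and first page differ from values read off the original reference. The function names are hypothetical, the original-reference values are entered by hand here, and source titles are not compared because that would require normalising journal abbreviations, which this sketch does not attempt.

```python
import re
from typing import Dict, List, Optional


def parse_wos_cited_reference(text: str) -> Dict[str, Optional[str]]:
    """Split a string such as 'WANG J, 2006, CHINESE CHEM LETT, V17, P49'
    into author, year, source, volume, and page (missing fields stay None)."""
    parts = [p.strip() for p in text.split(",")]
    ref: Dict[str, Optional[str]] = {
        "author": parts[0], "year": None, "source": None,
        "volume": None, "page": None,
    }
    for part in parts[1:]:
        if ref["year"] is None and re.fullmatch(r"\d{4}", part):
            ref["year"] = part
        elif re.fullmatch(r"V\d+", part):
            ref["volume"] = part[1:]   # drop the leading 'V'
        elif re.fullmatch(r"P\d+", part):
            ref["page"] = part[1:]     # drop the leading 'P'
        else:
            ref["source"] = part
    return ref


def differing_fields(wos: Dict[str, Optional[str]],
                     original: Dict[str, Optional[str]],
                     fields=("year", "volume", "page")) -> List[str]:
    """Names of the compared fields whose values are not identical."""
    return [f for f in fields if wos.get(f) != original.get(f)]


if __name__ == "__main__":
    wos = parse_wos_cited_reference("WANG J, 2006, CHINESE CHEM LETT, V17, P49")
    # Values read by hand from the original reference:
    # Int. J. Heat Mass Transfer 49 (17) (2006) 3075-3083.
    original = {"year": "2006", "volume": "49", "page": "3075"}
    print(differing_fields(wos, original))
```

For the first example, the differing fields are the volume and the first page: the WoS record carries the issue number (17) as the volume and the volume number (49) as the first page.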
More cited references in WoS (1) to (6) [slides 40 to 45; example content not captured in this transcript]
Conclusions
• About 0.3% of cited references are missing in WoS
• About 0.2% of cited references in WoS have minor errors (e.g., incorrect publication year or volume number)
• About 0.1% of cited references in WoS have major errors (i.e., reference to completely incorrect target document)
• WoS does a good job in handling references pointing to multiple target documents
• These results are based on Elsevier publications only; publications from other publishers may yield different outcomes