Intelligent bibliography creation and markup for authors:
The missing link between Google Scholar and plagiarism prevention?
Bettina Berendt
KU Leuven, Dept. of Computer Science, Hypermedia & Databases
www.cs.kuleuven.be/~berendt
3 December 2007 [updated version]
Three related problems
(1) Why is literature search sub-optimal?
(2) How can we improve our knowledge sources?
(3) Why are academic standards degrading (and what can we do about this)?
... and even more related questions
[Diagram: related areas – literature search / understanding science; quality of scientific writing (and data); quality of learning & teaching / scientific method; ranking of scientists; patents; quality assurance for Open Access]
See Erik's talk on Thu
[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]
Context: Research areas and The Big Questions
What is & what are we doing with our data / information / knowledge?
The privilege (and responsibility) of Computer Scientists
The approach
Requirements analysis
Knowledge
Acknowledgements – work presented today (HU = Humboldt-Univ. zu Berlin, IIS = Inst. Information Systems)
Elke Brenstein – ex HU, Inst. Pedagogy and Informatics, now Lernen & Gestalten Consulting
Kai Dingel – ex HU IIS
Christoph Hanser – ex HU IIS
Frank Havemann – HU Inst. of Library Science
Sebastian Kolbe – HU IIS / TU, Comp.Sci.
Beate Krause – ex HU IIS; now Inst. of Knowledge Engineering, Univ. of Kassel & Research Center L3S, Hannover
Bert Wendland – ex HU, Digital Publishing Group, now Bibliothèque Nationale de France
The Citeseer and Citebase teams
+ many (other) students + colleagues!
Why are the problems related, and why should we care?
"Garbage in, garbage out" (or: quality in, quality out)
[Diagram: quality of citation metadata and quality of information-extraction algorithms feed into literature search / understanding science; quality of scientific writing (and data); quality of learning & teaching / scientific method]
Background: Search functionalities available in Web-based DLs
Searching and navigating from the search result
Keyword search
Related by text similarity
Related by linkage
Similarity measures for determining neighbourhoods
are based on
links (citations)
text
usage
Similarity measures: (some) roots in bibliometrics / scientometrics
Co-citation analysis (Small, 1973, 1977; Small & Greenlee, 1980)
"specialties" (Small & Griffith, 1974)
– cluster of co-cited documents = the knowledge base of a specialty
"research fronts" (Garfield & Small, 1989)
Bibliographic coupling (Kessler 1963)
Co-word analysis (Callon et al., 1983, 1986)
Combinations (e.g., Braam, Moed, & van Raan, 1991; Glenisson, Glänzel, Janssens, and De Moor, various 200x)
[PageRank: Pinski & Narin, 1976]
Scientometric mappings
Choice of figures based on [Chen, Mapping Scientific Frontiers, Springer 2003]
Link-based similarity measures: basic forms
[Diagram: three patterns between documents A, B, C – direct citation (A cites B), bibliographic coupling (A and B both cite C), co-citation (C cites both A and B)]
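The three basic patterns can be computed in a few lines; a minimal illustrative sketch (the toy graph and function names are mine, not from any of the tools discussed):

```python
# Toy citation graph (hypothetical): cites[d] = set of documents that
# d's reference list points to.
cites = {
    "A": {"C"},        # A cites C
    "B": {"C"},        # B also cites C
    "D": {"A", "B"},   # D cites both A and B
    "C": set(),
}

# Invert the graph: cited_by[d] = documents that cite d.
cited_by = {}
for doc, refs in cites.items():
    for ref in refs:
        cited_by.setdefault(ref, set()).add(doc)

def direct_citation(a, b):
    # a cites b, or b cites a
    return b in cites[a] or a in cites[b]

def bibliographic_coupling(a, b):
    # number of sources cited by BOTH a and b
    return len(cites[a] & cites[b])

def co_citation(a, b):
    # number of documents that cite BOTH a and b
    return len(cited_by.get(a, set()) & cited_by.get(b, set()))
```

In this toy graph, A directly cites C, A and B are bibliographically coupled through C, and A and B are co-cited by D.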
Link-based similarity: citing documents
Link-based similarity: cited documents
Link-based similarity: local co-citation neighbourhood
Link-based similarity: local bibliographic-coupling neighbourhood
Link-based similarity: local bibliographic-coupling neighbourhood (cont.)
Active Bibliography: sources also cited by others
Active Bibliography Score = Common Citation Inverse Document Frequency
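The CC-IDF idea behind the score can be sketched as follows; a hedged illustration (toy data and normalisation are assumptions, not CiteSeer's actual code):

```python
# Toy data (hypothetical): cites[d] = sources cited by document d.
cites = {
    "A": {"X", "Y"},
    "B": {"X", "Y"},
    "C": {"X"},
}

# How often each source is cited across the collection.
cite_count = {}
for refs in cites.values():
    for s in refs:
        cite_count[s] = cite_count.get(s, 0) + 1

def ccidf(a, b):
    # Each common citation counts inversely to its overall frequency:
    # sharing a rarely cited source says more about relatedness than
    # sharing a "classic" that everyone cites.
    return sum(1.0 / cite_count[s] for s in cites[a] & cites[b])
```

Here A and B share the frequent source X (weight 1/3) and the rarer Y (weight 1/2), so their score exceeds that of A and C, who share only X.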
Text-based similarity (I)
Similarity at the sentence level:
respects sentence structure (sequence, minus some data cleaning)
matches are usually revisions of the document under consideration
Similarity at the text level:
based on bag-of-words and TF.IDF
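Text-level similarity with bag-of-words and TF.IDF can be sketched as follows (toy corpus; real systems add tokenisation, stemming, stopword removal, etc.):

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = {
    "d1": "citation analysis of scientific literature",
    "d2": "citation networks and scientific communities",
    "d3": "cooking recipes for busy people",
}

# Term frequencies per document and document frequencies per term.
tf = {d: Counter(text.split()) for d, text in docs.items()}
n = len(docs)
df = Counter(w for counts in tf.values() for w in counts)

def tfidf(doc):
    # weight = term frequency * log(inverse document frequency)
    return {w: c * math.log(n / df[w]) for w, c in tf[doc].items()}

def cosine(a, b):
    va, vb = tfidf(a), tfidf(b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    norm = (math.sqrt(sum(x * x for x in va.values()))
            * math.sqrt(sum(x * x for x in vb.values())))
    return dot / norm if norm else 0.0
```

With this corpus, d1 and d2 share the terms "citation" and "scientific" and score above zero, while d1 and d3 share nothing and score 0.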
Text-based similarity (II)
Similarity based on the text of metadata
Usage-based similarity (here: community-based)
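One common way to operationalise usage-based similarity is overlap of the user communities that access two documents; a hedged sketch only (the slide's actual measure is not shown in this version):

```python
# Hypothetical access logs: views[d] = users who accessed document d.
views = {
    "d1": {"user1", "user2", "user3"},
    "d2": {"user2", "user3"},
    "d3": {"user9"},
}

def usage_similarity(a, b):
    # Overlap of the two user sets, normalised by their union:
    # documents read by the same community are considered related.
    union = views[a] | views[b]
    return len(views[a] & views[b]) / len(union) if union else 0.0
```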
Interactive citation analysis tools: Pros and cons
+ Sources are immediately accessible (if they are Open Access)
+ Relationships become visible (esp. when looking at the full text)
- Not available for all disciplines in the same quality
- Incomplete and incorrect document analysis
- Frustration when documents are not Open Access
- Algorithms are not always understandable
e.g., Google Scholar is proprietary; Citeseer is open source
- Intractability I: only local search, starting from one document
- Intractability II: no ranking or unclear ranking in result sets
[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]
Problem: "intractability due to local search"
Local search in the neighbourhood of 1 document
No "top-down" grouping of documents
Why are groups useful?
Citation indices must be formed with reference to groups
Understanding a scientific field includes forming groups of concepts
Assumption: Concepts are represented by groups
Our approach: Search + grouping + interactivity
Build a tool that is
o user-friendly
o intelligent
o modular and extensible
[Berendt, Proc. AAAI Symposium KCVC 2005]
[Berendt, Dingel, & Hanser, Proc. ECDL 2006]
[Berendt & Krause, submitted]
[Berendt & Kolbe, in preparation]
System architecture
[Diagram: Web services connect text & link mining / information extraction tools; databases (local and/or mirrored); other web services and information sources; VBA macro]
Search; Retrieval [slides not included in the online version]
Organisation of the literature / bibliography construction [slide not included in the online version] – here's the old interface
Citation analysis and text analysis
[Diagram: direct citation, bibliographic coupling, and co-citation patterns between documents A, B, C]
& similarity measure – e.g., the Jaccard coefficient for co-citation (analogous for bibliographic coupling):
  no. of sources cited in both documents / no. of sources cited in at least one document
& keywords (title & abstract, TF.IDF)
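The Jaccard coefficient above, sketched for bibliographic coupling (the same function covers co-citation by passing in the sets of citing documents instead):

```python
def jaccard(cited_a, cited_b):
    # |intersection| / |union| of the two reference lists:
    # sources cited in both documents over sources cited in at least one.
    union = cited_a | cited_b
    return len(cited_a & cited_b) / len(union) if union else 0.0

# Hypothetical example: two documents sharing 2 of 4 distinct sources.
refs_1 = {"X", "Y", "Z"}
refs_2 = {"Y", "Z", "W"}
# jaccard(refs_1, refs_2) == 2 / 4 == 0.5
```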
Current architecture of the clustering tool (partial view) [slide not included in the online version]
Enter full-text indexing (example: D. Mladenič's publications @ IST-WORLD; Pajntar & Ferlež, 200x; cf. DocumentAtlas / Ontogen: Fortuna, Mladenič & Grobelnik, 2005+)
Questions
How to make this scale better?
How to best combine link & text analysis?
Best cluster quality measures?
What else is needed to turn this into real ontology learning?
How to best support this as an interactive learning process?
Discussion
How to best support this as a collaborative learning process?
Writing
corrected, XML-annotated, and formatted
Information extraction: Reference parsing in 3 tools
... In our tool: involving the author (for higher IE quality)
How to best learn regular expressions?
How to best support this as an interactive learning process?
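Hand-written regular expressions are one way to bootstrap reference parsing; an illustrative sketch (the pattern and field names are hypothetical, not the actual rules of any of the three tools):

```python
import re

# Hypothetical pattern for one common reference style:
# "Author(s). (Year). Title. Venue."
REF = re.compile(
    r"(?P<authors>[^()]+)\s*"      # everything up to the year
    r"\((?P<year>\d{4})\)\.\s*"    # "(1963)."
    r"(?P<title>[^.]+)\.\s*"       # title up to the next period
    r"(?P<venue>.+)\."             # venue, trailing period
)

m = REF.match("Kessler, M. M. (1963). Bibliographic coupling "
              "between scientific papers. American Documentation.")
```

A single pattern like this covers only one style; learning such expressions from author corrections (rather than writing them by hand) is exactly the open question above.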
(Other) uses in education
How can these ideas be used in education?
Tool + evaluations
In addition, tasks like these, and Google as an example of why citation networks are useful
Auxiliary (Web-based) materials on why + how to cite
Classes on these topics
Plagiarism detection market study
[Berendt, Humboldt-Universität CMS Journal, 2007]
A final question
Should we delegate to Google Scholar and its proprietary algorithms, or ...?
Thank you … for your attention!