Open and self-sustaining digital library services: the example of NEP. Thomas Krichel 2005-06-29

Open and self-sustaining digital library services: the example of NEP.

Thomas Krichel

2005-06-29


introduction

• The title "Open and self-sustaining digital libraries" was chosen before I was really aware of the needs of the audience.

• I read in the announcement that I am supposed to talk about "information retrieval and automatic text processing". This is an area I don't know that much about, but I hope to ask some interesting questions.

• I hope to find someone who is interested enough in some of them to work with me.


my background

• I am a trained economist. An economist knows the price of everything and the value of nothing.

• I am interested in free digital libraries.

• "Free" can mean "бесплатный" (free of charge) or "свободный" (free as in freedom). I am interested more in the former than in the latter.

• My work has mainly been on building such digital libraries. I am less concerned with the usage of such libraries.

• The building and maintenance of the library will generate costs. How can it be given away for $0?


automation

• Digital libraries could be entirely automated.

• This is true if the purpose of the digital library is mainly to retrieve information.

• Generally speaking, for information retrieval an automated system is quite sufficient. Examples are Google and CiteSeer.


limit to automation

• This comes in when the library is used to assess underlying facts.

• If we say "Thomas Krichel wrote paper X" the computer will not understand who Thomas Krichel is. Only a human can know for sure.

• When the library is used for evaluative purposes, it needs some controlled human intervention.

• By evaluative purpose I mean the purpose of saying how well a person or institution has behaved.


evaluative purpose

• This seems vague, but here are some evaluative issues in academic libraries:

– which journal is the most cited in field X?

– who has written the most papers in field Y?

– which institution has the most researchers in field Z?

• Human intervention is critical because of

– the identification problems that we have discussed

– the problem of abuse and fraud


why bother with evaluation?

• For a self-sustaining freely available digital library, the problem of contribution is critical.

• Providers of data will have good incentives, if the data that they contribute is used to evaluate performance.

• In academic digital libraries a crucial ingredient that helps performance is visibility. Publish (in the sense of "make public") or perish, quite literally.


role of automated means

• Ideally a digital library will use a mixture of automated and human activity.

• We push automation as far as we can, and let humans do the rest.

• The design and successful implementation of such digital libraries is a complex long-run task.

• It can be helped if the digital library is also open.


Example: RePEc

• This is what I am most famous for. I founded the RePEc digital library. In fact its creation in 1997 goes back to efforts that I made as early as 1993.

• RePEc is a digital library that aims to document key aspects of the discipline of Economics.

• It is essentially a metadata collection. But it goes beyond document and collection metadata to collect data about academic authors and institutions.

• These data on authors and institutions stand in relation to the document metadata.


RePEc is based on 440+ archives

• WoPEc

• EconWPA

• DEGREE

• S-WoPEc

• NBER

• CEPR

• US Fed in Print

• IMF

• OECD

• MIT

• University of Surrey

• CO PAH


to form a 300+k item dataset

146,000 working papers

154,000 journal articles

1,600 software components

900 book and chapter listings

6,400 author contact and publication listings

8,400 institutional contact listings


RePEc is used in many services

• EconPapers

• NEP: New Economics Papers

• Inomics

• RePEc author service

• Z39.50 service by the DEGREE partners

• IDEAS

• RuPEc

• EDIRC

• LogEc

• CitEc


institutional registration

• This works through a system called EDIRC.

• Christian Zimmermann started it as a list of departments that have a web site.

• I persuaded him that his data would be more widely used if integrated into the RePEc database.

• Now he is a crucial RePEc leader.


LogEc

• It is a service by Sune Karlsson that tracks usage of items in the RePEc database:

– abstract views

– downloads

• Christian Zimmermann sends a mail containing a monthly usage summary to

– archive maintainers

– RAS (RePEc author service) registrants


authors' incentives

• Authors perceive the registration as a way to achieve common advertising for their papers.

• Author records are used to aggregate usage logs across RePEc user services for all papers of an author.

• This stimulates an "I am bigger than you are" mentality. Size matters!


NEP: New Economics Papers

• NEP is a current awareness service for new working papers in RePEc.

• Working papers are accounts of recent research findings prior to formal publication.

• Formal publication takes about four years in Economics, so no formal paper is new.


NEP reports

• NEP is a collection of subject-specific reports.

• Each report is a serial. It has issues, usually every week.

• Each report has

– a code, e.g. nep-mic

– a subject, e.g. microeconomics

– an editor, i.e. a human who controls the contents.

• A special NEP report, nep-all, contains all new papers.


history

• Initially, I opened NEP in 1998. John S. Irons agreed to be the general editor.

• The general editor is the person who

– prepares nep-all

– oversees the lists

• In early 2005, the command structure was changed to

– a general editor who prepares nep-all

– a managing director who opens new reports and communicates with the editors

– a controller who watches what the editors are doing


edition control

• In the years 1999 to 2001 I took a rather peripheral interest in NEP. At this time many reports developed long editorial delays or were not issued at all.

• Despite that, the number of reports still grew.

• But there is no organization of reports into lines of subjects in economics.

• The report subject space is linear, with most subjects being covered.


coverage ratio analysis

• In a paper by Krichel & Bakkalbasi, there is an effort to analyze the coverage ratio of NEP issues. This is the share of papers in a nep-all issue that make it into at least one subject report.

• Historical data show that the mean coverage ratio is not improving over time. Rather, it stays constant at around 70%.

• There are two theories that can help to explain the static nature of the coverage ratio.
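The coverage ratio defined above can be computed per nep-all issue along these lines. This is a minimal sketch: the function name, the paper identifiers and the shape of the input (a mapping from report code to the set of papers it announced) are assumptions made for the example, not the data structures used in the paper.

```python
def coverage_ratio(nep_all_issue, subject_reports):
    """Share of papers in one nep-all issue that appear in at least one subject report.

    nep_all_issue: iterable of paper identifiers (hypothetical handles).
    subject_reports: mapping from report code to the set of papers it announced.
    """
    papers = set(nep_all_issue)
    if not papers:
        raise ValueError("nep-all issue is empty")
    announced = set().union(*subject_reports.values())
    return len(papers & announced) / len(papers)


# Toy issue with made-up identifiers, for illustration only.
issue = ["p1", "p2", "p3", "p4", "p5"]
reports = {"nep-mic": {"p1", "p2"}, "nep-mac": {"p2", "p4"}}
ratio = coverage_ratio(issue, reports)
```

A paper announced in several reports (here `p2`) counts only once, since the ratio asks whether a paper reached any subject report at all.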


coverage ratio theory I: target size

• When editors compose the subject report, they have an implicit report size in mind. When nep-all is large, then the editors will be more selective. That is, they will take a narrow view of the subject area.

• The chance that a paper is included in a subject report is therefore likely to be smaller when the nep-all issue is large.


coverage ratio theory II: quality

• Papers in RePEc differ in quality.

• Some papers have problems with "substantive quality". They

– come from authors that are unknown

– come from institutions that have an unenviable research reputation

– appear in collections that are unknown.

• Some papers have problems with "descriptive quality". They have

– a language other than English

– no abstract

– no keywords

• Editors also filter for quality.


empirical study

• Krichel & Bakkalbasi investigate this by using a binary logistic regression analysis. This estimates, for every paper that appeared in nep-all, the probability that it will get announced in any subject report.

• They find support for both target size and quality theories. There is strong empirical support that the series matters. There is also some empirical support that author prolificacy matters.

• These results have been greeted with protests by the editors, who claim that they only consider the subject when making decisions.
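To make the regression idea concrete, here is a minimal sketch of a binary logistic regression fitted by plain gradient ascent. It is not the specification used by Krichel & Bakkalbasi: the two explanatory variables (`issue_size` for the target size theory, `has_abstract` for descriptive quality), the fitting routine and its settings are assumptions, and the ten observations are synthetic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, steps=5000, rate=0.1):
    """Fit a binary logistic regression by batch gradient ascent on the log-likelihood.

    Returns the weights with the intercept first.
    """
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            inputs = [1.0, *features]
            error = label - sigmoid(sum(w * v for w, v in zip(weights, inputs)))
            for j, value in enumerate(inputs):
                gradient[j] += error * value
        weights = [w + rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


def announce_probability(weights, features):
    """Estimated probability that a paper is announced in some subject report."""
    return sigmoid(sum(w * v for w, v in zip(weights, [1.0, *features])))


# Synthetic observations, invented for illustration only:
# (issue_size scaled to [0, 1], has_abstract) -> announced in a subject report?
rows = [(0.2, 1), (0.3, 1), (0.5, 1), (0.7, 1), (0.9, 1),
        (0.3, 0), (0.4, 0), (0.6, 0), (0.8, 0), (1.0, 0)]
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

weights = fit_logistic(rows, labels)
intercept, size_effect, abstract_effect = weights
```

On this toy data a negative `size_effect` would correspond to the target size theory and a positive `abstract_effect` to the quality theory. A real analysis would use a statistics package that also reports standard errors.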


pre-sorting reports

• As RePEc grows, the increasing size of nep-all threatens the survival of NEP.

• Editors simply don't want to cope with it.

• In 2001 I developed an idea to pre-sort the report for the editors. A computer program would look at past issues of the report, extract features, and make forecasts about the most likely papers.

• Editors would then only need to look at the top part of the pre-sorted nep-all issue, not at the bottom.


current state of play

• I extract the following features:

– author names

– title

– abstract

– keywords

– Journal of Economic Literature (JEL) classifications

– series

• I remove punctuation, convert to lowercase, and normalize using the L2 norm.

• I submit the result to svm_light for classification.

• I test using 300 records, and use the rest for training.
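The preprocessing steps listed above might look roughly like this. It is a sketch under assumptions: the record fields, the bag-of-words treatment of all fields together, and the helper names are hypothetical. Only the output line format, `<target> <index>:<value> ...` with ascending feature indices starting at 1, follows the svm_light input convention.

```python
import math
import string
from collections import Counter

PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase the text, delete ASCII punctuation, and split on whitespace."""
    return text.lower().translate(PUNCTUATION_TABLE).split()


def l2_normalize(counts):
    """Scale term counts so that the vector has Euclidean length 1."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {term: v / norm for term, v in counts.items()} if norm else {}


def to_svmlight_line(label, vector, vocabulary):
    """Format one example as '<target> <index>:<value> ...' with ascending indices."""
    for term in vector:
        vocabulary.setdefault(term, len(vocabulary) + 1)  # feature indices start at 1
    pairs = sorted((vocabulary[term], value) for term, value in vector.items())
    return f"{label:+d} " + " ".join(f"{index}:{value:.6f}" for index, value in pairs)


# Hypothetical record; the field names are assumptions, not the RePEc schema.
record = {"title": "Monetary policy and inflation",
          "keywords": "Inflation; monetary policy"}
tokens = tokenize(" ".join(record.values()))
vector = l2_normalize(Counter(tokens))
vocabulary = {}
line = to_svmlight_line(+1, vector, vocabulary)  # +1: the editor included the paper
```

The `vocabulary` dictionary has to be shared between the training and test files so that the same term always maps to the same feature index.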


How well am I doing?

• This is not a trivial question. Precision and recall are useless. It does not matter which documents the system judges relevant. Only the ordering matters. We know the best and worst outcomes.

• Some measures have been proposed that do take ordering into account. But they still need to be applied to our case.

• Ideally I would have a measure that evaluates instant outcomes and that has some normalization properties:

– The value of the measure at the best outcome should be 1.

– The expected value of the measure, under random ordering, should be 0.


the hiking measure

• One measure that I have developed is what I call the hiking measure.

– I define a step as a swap of two adjacent documents in the outcome vector.

– I denote the number of steps that it takes to get from an outcome x to be evaluated to the best outcome as s(x).

– Then the hiking measure is h(x) = 1 - 2 s(x) / (r (n - r)),

– where n is the total number of documents and r is the number of relevant documents.
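A small sketch of the hiking measure follows. It assumes that a step swaps two neighbouring documents and that the normalizing constant is r (n - r), the largest possible number of steps. Those two assumptions make the best outcome score 1 and the worst outcome score -1, and they are consistent with the r=2, n=5 example. The function names are hypothetical.

```python
from fractions import Fraction


def steps_to_best(outcome):
    """Adjacent swaps needed to move every relevant document (1) above every other (0)."""
    steps = 0
    irrelevant_seen = 0
    for flag in outcome:
        if flag:
            # A relevant document must hop over each irrelevant document above it.
            steps += irrelevant_seen
        else:
            irrelevant_seen += 1
    return steps


def hike(outcome):
    """Hiking measure h(x) = 1 - 2 s(x) / (r (n - r)), as an exact fraction."""
    n = len(outcome)
    r = sum(outcome)
    if r == 0 or r == n:
        raise ValueError("need at least one relevant and one irrelevant document")
    return 1 - Fraction(2 * steps_to_best(outcome), r * (n - r))


best = hike([1, 1, 0, 0, 0])
worst = hike([0, 0, 0, 1, 1])
```

Exact fractions are used so that values such as 2/3 are not blurred by floating point rounding.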


example r=2 n=5

• Here is the complete table of outcomes x and their hikes h(x):

x           h(x)     x           h(x)
1,1,0,0,0   1.0      1,0,0,0,1   0.0
1,0,1,0,0   2/3      0,0,1,1,0   -1/3
0,1,1,0,0   1/3      0,1,0,0,1   -1/3
1,0,0,1,0   1/3      0,0,1,0,1   -2/3
0,1,0,1,0   0.0      0,0,0,1,1   -1.0

• Problems:

– no strict ordering: different outcomes have the same hike

– violation of a "natural order of outcomes"


natural order

• A conscientious editor will be concerned by how low the last relevant paper sinks. Thus, comparing two outcomes, the one that has the last relevant paper nearer to the top should be preferred.

• If two outcomes have the last relevant paper at the same position, the second-to-last relevant papers should be compared.

• This leads to a complete ordering of outcomes.
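The comparison rule can be written as a sort key. This is a sketch, with hypothetical function names: the key lists the positions of the relevant papers starting from the last one, so that ordinary tuple comparison looks at the last relevant paper first and falls back to the second-to-last on a tie.

```python
from itertools import combinations


def natural_key(outcome):
    """Positions (1 = top) of the relevant papers, listed from the last one upwards."""
    return tuple(pos for pos in range(len(outcome), 0, -1) if outcome[pos - 1])


def all_outcomes(n, r):
    """Every way to place r relevant papers among n positions."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


# Smaller keys are preferred, so sorting puts the best outcome first.
ranked = sorted(all_outcomes(5, 2), key=natural_key)
```

Because two different outcomes never share a key, no ties remain, which is what makes the ordering complete.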


conjecture

• A rational editor faces two penalties when composing the report:

– examining a new paper

– risking the loss of a relevant paper

• I claim that under a large class of formulations of the editor's choice, ranking outcomes by the natural order is consistent with minimizing the loss experienced by the editor.

• But I cannot show this.


one way for the computational implementation of natural order

• Derive an algorithm that will associate consecutive natural numbers with each of the outcomes, ordered by the natural order.

• The expected value is then trivial to compute, and a measure can be defined.

• Does anyone know such an algorithm?
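One candidate is the combinatorial number system, which numbers the r-element sets of positions consecutively in an order that compares the largest position first. Under the reading that this is exactly the natural order (last relevant paper first, then the second-to-last), the sketch below assigns the numbers 0, 1, 2, ... to the outcomes. It is offered as a suggestion, not as the author's solution; `natural_rank` and the derived `natural_measure` are hypothetical names.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def natural_rank(outcome):
    """0-based number of an outcome in the natural order (0 = best outcome)."""
    positions = [i for i, flag in enumerate(outcome) if flag]  # 0-based, ascending
    return sum(comb(p, k) for k, p in enumerate(positions, start=1))


def natural_measure(outcome):
    """Hypothetical measure: 1 at the best outcome, -1 at the worst, mean 0."""
    worst = comb(len(outcome), sum(outcome)) - 1
    if worst == 0:
        raise ValueError("only one possible outcome")
    return 1 - Fraction(2 * natural_rank(outcome), worst)


def all_outcomes(n, r):
    """Every way to place r relevant papers among n positions."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


ranks = sorted(natural_rank(x) for x in all_outcomes(5, 2))
```

Since the ranks are the consecutive integers from 0 to C(n, r) - 1, their mean under equally likely outcomes is half the worst rank, which is why the rescaled measure has expected value 0.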


a more flexible way for the computational implementation of natural order

• Pick y > 1.

• Then evaluate any outcome as

– sum(y**p * i),

– where p is the position, starting from the right

– i=1 if relevant

– i=0 if not

• example: for y=2, interpret x as a binary number

• example for y=3:

– 0 1 1 0 0 --> 3**1*0 + 3**2*0 + 3**3*1 + 3**4*1 + 3**5*0 = 108

• Does anybody know the expected value?
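A possible answer, assuming that every ordering of the n documents is equally likely: each position then holds a relevant document with probability r/n, so by linearity of expectation the expected value is (r/n) times the sum of y**p over p = 1..n, which equals (r/n) * y * (y**n - 1) / (y - 1). The sketch below computes the value of an outcome with p = 1 for the rightmost position, as in the y=3 example, and checks the formula by enumeration. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def positional_value(outcome, y):
    """Sum of y**p over the relevant documents, with p = 1 at the rightmost position."""
    n = len(outcome)
    return sum(y ** (n - index) for index, flag in enumerate(outcome) if flag)


def expected_value(n, r, y):
    """Mean of positional_value when all placements of r relevant documents are equally likely."""
    return Fraction(r, n) * sum(Fraction(y) ** p for p in range(1, n + 1))


def all_outcomes(n, r):
    """Every way to place r relevant documents among n positions."""
    for chosen in combinations(range(n), r):
        yield tuple(1 if i in chosen else 0 for i in range(n))


example = positional_value((0, 1, 1, 0, 0), 3)
outcomes = list(all_outcomes(5, 2))
brute_force_mean = Fraction(sum(positional_value(x, 3) for x in outcomes), len(outcomes))
```

With the expected value in hand, an outcome could be rescaled so that the best outcome maps to 1 and random ordering to 0 on average, which are the normalization properties asked for earlier.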


outcome: average hike, 30 trials

exp 98.66   cis 98.35   spo 96.08   ets 95.75   tra 95.61
hea 95.50   dcm 94.76   geo 94.56   int 94.43   ecm 94.27
gth 94.09   dge 92.94   mon 92.54   eff 91.48   ene 91.46
ifn 90.64   ino 90.31   cba 90.04   fmk 89.90   ure 89.86
hpe 88.91   agr 88.89   evo 87.90   law 87.84   env 87.22
cul 86.39   cbe 85.76   ent 85.07   com 84.52   net 84.20
edu 83.80   lab 83.58   dev 83.55   cfn 82.84   res 82.62
sea 82.25   ias 81.45   cmp 81.11   tur 80.50   fin 80.47
tid 80.29   pbe 78.99   pol 78.75   mfd 78.07   eec 78.01
mac 77.03   rmg 76.22   cdm 76.12   cwa 75.38   pub 74.60
his 71.90   ltv 71.23   afr 69.72   acc 68.72   ind 67.56
lam 66.20   mic 61.17   reg 59.12   pke 58.85   bec 57.76


some remarks

• There is a great diversity in the results.

• Some topics are easier to classify automatically than others. The value of the report lies in what the human says that goes beyond the recognition by the machine.

• Unfortunately, manual inspection of poorly forecast results suggests that the reason for the poor result may lie more in the inconsistency of editor decision making than in the forecasting technique.

• This suggests that this could be used as an evaluation device for the editors. This was not intended when I started this work!


how to improve

• Clearly word ordering is important in this area, since different classes don't differ that much by word choice.

• I can use all the keyword data in the RePEc database to find phrases to add to my feature set.

• There may also be a way to automatically deduce significant word combinations from titles and abstracts.

• Finally a combination with the quality criteria mentioned may be good but it does not appear obvious how to do it.
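One simple way to look for word combinations is to count adjacent word pairs and keep the ones that recur. The sketch below is only an illustration of that idea: the threshold, the function names and the three titles are invented, and a serious attempt would use a statistical association measure rather than raw counts.

```python
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs, in order of appearance."""
    return zip(tokens, tokens[1:])


def frequent_phrases(texts, min_count=2):
    """Two-word phrases that occur at least min_count times across the texts."""
    counts = Counter()
    for text in texts:
        counts.update(bigrams(text.lower().split()))
    return {" ".join(pair) for pair, count in counts.items() if count >= min_count}


# Made-up titles, for illustration only.
titles = ["Monetary policy under uncertainty",
          "Optimal monetary policy rules",
          "Labor market policy"]
phrases = frequent_phrases(titles)
```

Each phrase found this way could be added to the feature set as one extra term, next to the single words.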


conclusions

• To provide high quality digital library services, human intervention still appears to be desirable.

• However, we need ways to monitor how well the humans are doing. If they take bad decisions

• Forecastability can be one criterion.

• Timeliness and usage can be others.

• I will have to work further to develop better monitoring systems for editor behavior.


http://openlib.org/home/krichel

Thank you for your attention!