1 Writeslike.us Em Tonkin, Andrew Hewson [email protected] [email protected].

1

Writeslike.us
Em Tonkin, Andrew Hewson

[email protected]

[email protected]

2

Background
Relevant research themes:

• Metadata harvesting and reuse
• Automatic metadata extraction
• Text analysis
• Social network analysis
• Scholarly communication, particularly informal communication

3

Aim
Helping people to find each other:

• Finding other researchers with similar interests to yourself in your geographic area
• Or in your area of research
• Not everybody with similar interests will attend the same conferences!
• Helping students find potential research supervisors
• Encouraging serendipity.

4

Relevant technologies
In fact there are an awful lot of these. Social network analysis:

• Requires a very large dataset
• Solvable either by a) being Facebook or similar (but adoption rates are far from 100%) or b) automated analysis of relevant data
• Solution b) is cheap, simple, and very fallible.
• Not a new approach: it is at the core of bibliometrics

5

Relevant technical problems

• Author identity disambiguation
• Formal social networks disambiguate between instances of individual names (for example, if there are many people called 'John Smith', the system can tell you which is which).
• Needs to be solved to acceptable level.
• Need to define how good 'acceptable' is.
• Formal solutions usually depend on unique identifiers + registries
• Cheap, moderately effective solution: disambiguate via textual characteristics + metadata
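
The "cheap, moderately effective" route can be illustrated with a minimal sketch. It is not the project's actual code: the record layout (`name`, `terms`, `source`), the helper names and the 0.7/0.3 weights are all assumptions made for illustration. Two records are only compared if their names reduce to the same surname and first initial; the score then combines overlap in descriptive terms (a textual characteristic) with a shared repository (metadata).

```python
from __future__ import annotations

import re
import unicodedata


def normalise_name(name: str) -> tuple[str, str]:
    """Reduce a free-text author name to (surname, first initial)."""
    # Strip accents so that 'José' and 'Jose' compare equal.
    plain = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    plain = plain.strip()
    if "," in plain:
        # 'Surname, Forenames' form
        surname, _, forenames = plain.partition(",")
    else:
        # 'Forenames Surname' form: the last word is taken as the surname
        forenames, _, surname = plain.rpartition(" ")
    surname = re.sub(r"[^a-z]", "", surname.lower())
    initial = re.sub(r"[^a-z]", "", forenames.lower())[:1]
    return surname, initial


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of terms the two sets have in common (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_author_score(rec_a: dict, rec_b: dict) -> float:
    """Heuristic score in [0, 1] that two records describe the same person."""
    if normalise_name(rec_a["name"]) != normalise_name(rec_b["name"]):
        return 0.0
    text_similarity = jaccard(set(rec_a["terms"]), set(rec_b["terms"]))
    same_source = 1.0 if rec_a["source"] == rec_b["source"] else 0.0
    # Illustrative weights only (an assumption, not tuned on real data).
    return 0.7 * text_similarity + 0.3 * same_source


# Hypothetical records: three 'John Smith' variants from two repositories.
a = {"name": "Smith, John", "terms": ["metadata", "harvesting", "repositories"],
     "source": "repo-a.example.org"}
b = {"name": "J. Smith", "terms": ["metadata", "repositories", "oai"],
     "source": "repo-a.example.org"}
c = {"name": "John Smith", "terms": ["protein", "folding"],
     "source": "repo-b.example.org"}

print(same_author_score(a, b))  # shared terms and shared source
print(same_author_score(a, c))  # same name, nothing else in common
```

Deciding where to cut the score is the "how good is acceptable" question from the slide. The name handling here is deliberately naive: multi-word surnames such as "van der Berg" would be mis-split.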

6

Methodology

• Harvest OAI metadata: captures large list of:
  • Author names (somewhat randomly formatted)
  • Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
  • Citations (sometimes)
• Spider digital objects and analyse them for formal metadata: retrieve email addresses, etc.
• Retain OAI source: useful clue regarding author affiliations (sometimes)
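
As a rough sketch of the harvesting step, the snippet below parses an OAI-PMH `ListRecords` response in `oai_dc` format using only the Python standard library. The three namespace URIs are the standard OAI-PMH and Dublin Core ones; the sample response, the repository name and the `parse_records` helper are invented for illustration. A real harvester would also fetch pages over HTTP and follow resumption tokens, which is omitted here.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body; a real one would come from a repository's
# OAI-PMH endpoint.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:123</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
          <dc:creator>Smith, John</dc:creator>
          <dc:creator>J. Doe</dc:creator>
          <dc:date>2010</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def _values(dc_element, tag):
    """All non-empty text values of one Dublin Core element, e.g. 'creator'."""
    return [el.text.strip() for el in dc_element.findall(f"dc:{tag}", NS) if el.text]


def parse_records(xml_text, source):
    """Turn one ListRecords response into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            # Deleted records carry a header but no metadata.
            continue
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier",
                                       default="", namespaces=NS),
            # Keep the OAI source: a clue (sometimes) to author affiliation.
            "source": source,
            "creators": _values(dc, "creator"),
            "titles": _values(dc, "title"),
            "descriptions": _values(dc, "description"),  # often empty
            "dates": _values(dc, "date"),
        })
    return records


records = parse_records(SAMPLE_RESPONSE, "repository.example.org")
print(records[0]["creators"])
```

Every field except the identifier is returned as a list that may be empty, which mirrors the "sometimes" qualifiers above.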

7

Methodology (II)

• Analyse text for noun-phrase-like structures: a useful clue as to theme
• Background information required, such as: institution name, domains/URLs associated with each institution
  • Retrieved via harvesting from Wikipedia
  • Much of this information is not well-structured, so unavailable via DBPedia
• Poorly structured information needs filtering: for example, author names are not consistently structured between repositories. This is a machine learning problem.
• Search with contextual network graph algorithm
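
The slides do not say how the contextual network graph search was implemented. One common reading of the term is spreading activation over a bipartite graph, here with authors on one side and extracted terms on the other. The sketch below follows that reading; the function names, the decay factor, the threshold and the sample data are all assumptions.

```python
from collections import defaultdict, deque


def build_graph(author_terms):
    """Bipartite graph: author nodes are linked to the terms they use."""
    graph = defaultdict(set)
    for author, terms in author_terms.items():
        for term in terms:
            graph[("author", author)].add(("term", term))
            graph[("term", term)].add(("author", author))
    return graph


def spread_activation(graph, start, decay=0.5, threshold=0.01):
    """Push energy outwards from `start`, splitting it among neighbours.

    Each hop multiplies the energy by `decay` and divides it by the number
    of neighbours; a branch stops once its share falls below `threshold`,
    so the loop always terminates.
    """
    activation = defaultdict(float)
    queue = deque([(start, 1.0)])
    while queue:
        node, energy = queue.popleft()
        activation[node] += energy
        neighbours = graph.get(node, ())
        if not neighbours:
            continue
        share = energy * decay / len(neighbours)
        if share < threshold:
            continue
        for neighbour in neighbours:
            queue.append((neighbour, share))
    return activation


def similar_authors(graph, author):
    """Other authors reached by the activation, strongest first."""
    start = ("author", author)
    activation = spread_activation(graph, start)
    scores = [(name, score) for (kind, name), score in activation.items()
              if kind == "author" and (kind, name) != start]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: terms extracted from each author's documents.
author_terms = {
    "A": ["metadata", "harvesting"],
    "B": ["metadata", "text analysis"],
    "C": ["protein folding"],
}
graph = build_graph(author_terms)
print(similar_authors(graph, "A"))  # B shares a term with A; C is never reached
```

Because energy is divided among neighbours, a term used by many authors passes on less to each of them than a rare one, which gives a crude form of term weighting without a separate step.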

8

'Sometimes' and 'usually'

• Statistics are:
  • Cheap
  • Imperfect
  • Available
• Rapid innovation philosophy:
  • Cheap is good
  • Simple is good
  • Solutions requiring novel/additional uptake of infrastructure are out of reach

9

Results

• Basic concept worked well
• Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in dataset (minor niggles!)
• Interface development actually required more time than the dataset development, and exceeded project length...
• But useful dataset can be released as linked data, reused for various purposes
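
To give a feel for what a linked data release might look like, the sketch below writes a few N-Triples lines for one author. It is an illustration only: the slides do not specify a vocabulary or URI scheme. The FOAF and RDF URIs are the real ones; the `example.org` base URI, the `similarTo` predicate and the function names are hypothetical.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/author/"                # hypothetical URI scheme
SIMILAR_TO = "http://example.org/vocab/similarTo"  # hypothetical predicate


def literal(text):
    """Quote a string as an N-Triples literal, escaping backslash and quote."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def author_triples(author_id, name, similar_ids):
    """N-Triples lines for one author and the authors judged similar."""
    subject = f"<{BASE}{author_id}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{FOAF}Person> .",
        f"{subject} <{FOAF}name> {literal(name)} .",
    ]
    for other in similar_ids:
        lines.append(f"{subject} <{SIMILAR_TO}> <{BASE}{other}> .")
    return lines


triples = author_triples("a1", 'John "JS" Smith', ["b2"])
print("\n".join(triples))
```

A dedicated similarity predicate is used rather than `foaf:knows`, since two authors writing on similar topics need not know each other.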

10

Walkthrough: Basic search (the harder method!)

11

Advanced search

16

Walkthrough

17

Conclusion

• OAI-DC (and Wikipedia!) is a good source for 'semi-structured' data
• There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social network-like graphs
• Application of this type of data for the purpose of encouraging informal academic communication/collaboration is an interesting research field with many potential applications
