Dec 30, 2015
Background
Relevant research themes:
• Metadata harvesting and reuse
• Automatic metadata extraction
• Text analysis
• Social network analysis
• Scholarly communication, particularly informal communication
Aim
Helping people to find each other:
• Finding other researchers with similar interests to yourself in your geographic area
• Or in your area of research
• Not everybody with similar interests will attend the same conferences!
• Helping students find potential research supervisors
• Encouraging serendipity
Relevant technologies
In fact there are an awful lot of these. Social network analysis:
• Requires a very large dataset
• Solvable either by a) being Facebook or similar (but adoption rates are far from 100%), or b) automated analysis of relevant data
• Solution b) is cheap, simple, and very fallible
• Not a new approach: it is at the core of bibliometrics
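As a rough illustration of option b), a co-authorship network can be derived from nothing more than the author lists on harvested records. The sketch below is a minimal example under that assumption; the function name `coauthor_edges` and the input shape (one list of author names per record) are hypothetical and not taken from the project.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(records):
    """Count how often each pair of authors appears on the same record.

    `records` is an iterable of author-name lists (one list per record).
    Returns a Counter keyed by alphabetically ordered (author, author) pairs.
    """
    edges = Counter()
    for authors in records:
        # De-duplicate and sort so that (A, B) and (B, A) count as one edge.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges
```

The edge weights are simple co-occurrence counts, which is cheap to compute but inherits every error in the underlying author names.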
Relevant technical problems
• Author identity disambiguation
  • Formal social networks disambiguate between instances of individual names (for example, if there are many people called 'John Smith', the system can tell you which is which)
  • Needs to be solved to an acceptable level
  • Need to define how good 'acceptable' is
• Formal solutions usually depend on unique identifiers + registries
• Cheap, moderately effective solution: disambiguate via textual characteristics + metadata
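The cheap approach can be sketched as follows: records that share an author name are grouped into likely individuals according to how much their vocabulary overlaps. This is only a minimal illustration, not the project's actual method; the record layout (`name`, `terms`), the greedy clustering strategy, and the 0.2 threshold are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def disambiguate(records, threshold=0.2):
    """Greedily group records that share a name into likely individuals.

    Each record is a dict with 'name' (str) and 'terms' (set of keywords).
    Returns a list of clusters, each a list of records.
    """
    clusters = []  # each cluster: {"name": str, "terms": set, "records": list}
    for rec in records:
        best, best_score = None, threshold
        for cluster in clusters:
            if cluster["name"] != rec["name"]:
                continue
            score = jaccard(cluster["terms"], rec["terms"])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            # No sufficiently similar cluster: treat as a new individual.
            clusters.append(
                {"name": rec["name"], "terms": set(rec["terms"]), "records": [rec]}
            )
        else:
            best["terms"] |= rec["terms"]
            best["records"].append(rec)
    return [c["records"] for c in clusters]
```

Where the threshold is set decides how good 'acceptable' is in practice: a higher value splits one person into several, while a lower one merges different people.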
Methodology
• Harvest OAI metadata: captures a large list of:
  • Author names (somewhat randomly formatted)
  • Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
  • Citations (sometimes)
• Spider digital objects and analyse them for formal metadata: retrieve email addresses, etc.
• Retain OAI source: useful clue regarding author affiliations (sometimes)
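The harvesting step can be sketched as parsing an OAI-PMH `ListRecords` response in `oai_dc` format. The example below handles only the parsing, on the assumption that the XML has already been fetched from a repository; the function names and the output dictionary layout are illustrative. Optional fields come back as empty lists, which mirrors the 'sometimes' above, and the `source` label is kept on every record.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def _values(dc, tag):
    """Return the non-empty text values of every dc:<tag> child element."""
    return [
        e.text.strip()
        for e in dc.findall(f"dc:{tag}", NS)
        if e.text and e.text.strip()
    ]


def parse_records(xml_text, source):
    """Extract creators, titles, descriptions and dates from a ListRecords reply.

    `source` is a label for the repository the XML came from; it is retained
    on each record as a possible clue to author affiliation.
    """
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find(".//oai_dc:dc", NS)
        if dc is None:  # e.g. deleted records carry a header but no metadata
            continue
        records.append(
            {
                "source": source,
                "creators": _values(dc, "creator"),
                "titles": _values(dc, "title"),
                "descriptions": _values(dc, "description"),
                "dates": _values(dc, "date"),
            }
        )
    return records
```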
Methodology (II)
• Analyse text for noun-phrase-like structures: a useful clue as to theme
• Background information required, such as: institution name, domains/URLs associated with each institution
  • Retrieved via harvesting from Wikipedia
  • Much of this information is not well-structured, so unavailable via DBpedia
• Poorly structured information needs filtering: for example, author names are not consistently structured between repositories. This is a machine learning problem.
• Search with contextual network graph algorithm
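One common reading of contextual network graph search is spreading activation over a bipartite graph of items and terms: energy starts at the query node and is passed on, weakening at each hop, until it falls below a cut-off. The sketch below assumes that reading and should not be taken as the project's implementation; the `decay` and `threshold` values, the node labelling, and all function names are illustrative choices.

```python
from collections import defaultdict


def build_graph(docs):
    """Build an undirected bipartite graph linking each author to their terms.

    `docs` maps an author (or document) id to an iterable of terms.
    """
    graph = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            graph[("doc", doc_id)].add(("term", term))
            graph[("term", term)].add(("doc", doc_id))
    return graph


def spread(graph, node, energy=1.0, decay=0.5, threshold=0.01, scores=None):
    """Pass energy outward from `node`, splitting it among neighbours.

    Each hop multiplies the energy by `decay` and divides it by the node's
    degree; propagation stops once the share drops below `threshold`.
    """
    if scores is None:
        scores = defaultdict(float)
    scores[node] += energy
    neighbours = graph.get(node, ())
    if not neighbours:
        return scores
    share = energy * decay / len(neighbours)
    if share < threshold:
        return scores
    for neighbour in neighbours:
        spread(graph, neighbour, share, decay, threshold, scores)
    return scores


def search(docs, query_term):
    """Rank author/document ids by the energy they receive from a query term."""
    scores = spread(build_graph(docs), ("term", query_term))
    hits = [(n[1], s) for n, s in scores.items() if n[0] == "doc"]
    return sorted(hits, key=lambda kv: (-kv[1], kv[0]))
```

Because energy flows through shared terms, an author can be returned for a query term they never used themselves, which is the kind of indirect match that supports serendipity.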
'Sometimes' and 'usually'
• Statistics are:
  • Cheap
  • Imperfect
  • Available
• Rapid innovation philosophy:
  • Cheap is good
  • Simple is good
  • Solutions requiring novel/additional uptake of infrastructure are out of reach
Results
• Basic concept worked well
• Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in the dataset (minor niggles!)
• Interface development actually required more time than the dataset development, and exceeded project length...
• But a useful dataset can be released as linked data and reused for various purposes
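As a sketch of what a linked data release might look like, the example below serialises one person as N-Triples using the FOAF vocabulary. The slides do not say which vocabulary or serialisation was used, so FOAF, N-Triples, the example URIs, and the function names are all assumptions made for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def person_triples(uri, name, interests):
    """Return N-Triples lines describing a person and their topic interests.

    `uri` identifies the person; `interests` is a list of topic URIs.
    """
    lines = [
        f"<{uri}> <{RDF_TYPE}> <{FOAF}Person> .",
        f"<{uri}> <{FOAF}name> {literal(name)} .",
    ]
    for topic_uri in interests:
        lines.append(f"<{uri}> <{FOAF}topic_interest> <{topic_uri}> .")
    return lines
```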
Conclusion
• OAI-DC (and Wikipedia!) is a good source for 'semi-structured' data
• There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social network-like graphs
• Application of this type of data for the purpose of encouraging informal academic communication/collaboration is an interesting research field with many potential applications