OR, BIG DATA IN HUNGARY - ARCHIVING AND MINING THE ACADEMIC WEB GEORGE KAMPIS, CEO, PETABYTE NONPROFIT RESEARCH LTD. INNOVATION ACCELERATION BY PUBLIC DATA ANALYSIS

Dec 16, 2015


OR, BIG DATA IN HUNGARY - ARCHIVING AND MINING THE ACADEMIC WEB

GEORGE KAMPIS, CEO, PETABYTE NONPROFIT RESEARCH LTD.

INNOVATION ACCELERATION BY PUBLIC DATA ANALYSIS


PETABYTE NONPROFIT RESEARCH LTD.

www.petabyte-research.org

www.hungarianscience.org

www.textrend.org

www.dynanets.org

www.futurict.szte.hu


CONTEXT IN FUTURICT

• „Innovation Accelerator“

• .. to help (scientific) innovation with [...] social media as well as data services

Helbing, D., & Balietti, S. (2011). How to create an innovation accelerator. The European Physical Journal Special Topics, 195(1), 101-136.

Leydesdorff, L., Rotolo, D., & De Nooy, W. (2012). Innovation as a Nonlinear Process, the Scientometric Perspective, and the Specification of an Innovation Opportunities Explorer. Technology Analysis & Strategic Management (Forthcoming).

van Harmelen, F., Kampis, G., Börner, K., van den Besselaar, P., Schultes, E., Goble, C., ... & Helbing, D. (2012). Theoretical and technological building blocks for an innovation accelerator. The European Physical Journal Special Topics, 214(1), 183-214.

„BIG DATA“


PARTIALLY SIMILAR DEVELOPMENTS

• Mendeley
  • Reference manager and collaboration network

• ResearchGate
  • Research network and publications portal w/ quality assessment

• Altmetrics
  • Article-level online metrics

• VIVO
  • Connect, share, discover


BIG (WEB) DATA IS A KEY

• Big Data in Google trends
  • „deep data“
  • controversy...

• Massive Web Data: harvesting / archiving
  • Google itself...
  • The Internet Archive
  • UK web archive, British Library


WEB ARCHIVING IN HUNGARY

• None. Nope.

• „MIA“ (Magyar Internet Archivum, HU Internet Archive)
  • Various documents, plans and small-scale pilots
  • Since 2006

• Our ambition: to archive and mine HU academia = „HUA“
  • 500 NIIF institutions (NIIF = Nat'l Information Infrastructure Dev't.)
  • 42 HAS (HU Acad Sci) research institutes
  • 47 higher education entities (universities and polytechnics)

• Now in collaboration with: OSZK (National Library), NIIF...


A RUNNING „HUA“ PILOT IN PETABYTE/FUTURICT.HU

• Hardware: Dell T710 server (2x4 core Xeon E5520, 48GB RAM, 2TB HDD)

• Software: Heritrix crawlers called via API and cURL, spawned from timed scripts...

• Not downloaded: exe, gz, iso, jar, mp3, ogg, ppt, rar, wav, xls, xlsx, zip

• Many technical issues: Flash pages, portlet containers (e.g. WebSphere), CMSs (e.g. Joomla)...

• Operation since April 2013.

• Longitudinal archiving in mirror format (2-week periods), using a form of „diff“ developed in-house
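
The slides do not describe how the in-house „diff“ works. As a minimal sketch of the general idea only, two mirror snapshots on disk could be compared by content hash, skipping the file types listed above as not downloaded. The function names, the directory layout and the choice of SHA-256 are assumptions made for illustration, not the pilot's actual implementation.

```python
import hashlib
from pathlib import Path

# File extensions the slide lists as not downloaded.
SKIPPED = {"exe", "gz", "iso", "jar", "mp3", "ogg",
           "ppt", "rar", "wav", "xls", "xlsx", "zip"}


def snapshot_index(root):
    """Map each relative file path under `root` to a SHA-256 digest.

    Hypothetical helper: assumes one snapshot is one mirror directory.
    """
    root = Path(root)
    index = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower().lstrip(".") in SKIPPED:
            continue  # mimic the crawl-time exclusion list
        rel = path.relative_to(root).as_posix()
        index[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return index


def diff_snapshots(old, new):
    """Compare two indexes and report added, removed and changed paths."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(p for p in old_paths & new_paths
                          if old[p] != new[p]),
    }
```

Calling `diff_snapshots(snapshot_index(dir_a), snapshot_index(dir_b))` on two consecutive 2-week mirrors would give the raw material for change statistics; storing only the changed files is one possible way to keep a longitudinal archive compact.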


THE PROCESSING OF RESULTS

• Future plans: keyword extraction, timed (dynamic) keyword nets, correlation with support programs and grant calls (to analyze ROI in publications, citations, ...terms)

• „The Science of Success“ (A.-L. Barabási)
  • http://www.eccs13.eu/index.php/satellites
  • http://barabasilab.com/success/
  • http://www.facebook.com/SuccessScience

• Bottleneck: availability of public funding data; open data initiatives need to be enforced

• In this pilot phase: basic statistics, turnover rates etc.
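
The keyword extraction and keyword nets above are listed only as future plans, so the following is a hedged sketch of one simple way they could be approached: take the most frequent terms of each document as its keywords and count how often two keywords occur in the same document. The stopword list, the frequency-based keyword choice and the function names are all assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption; a real one would be
# much longer and would need Hungarian entries as well).
STOPWORDS = {"the", "and", "of", "in", "to", "a", "for", "is", "on", "with"}


def keywords(text, top_k=5):
    """Return the `top_k` most frequent non-stopword terms of a text."""
    tokens = [t for t in re.findall(r"[^\W\d_]+", text.lower())
              if len(t) > 2 and t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_k)]


def keyword_network(docs, top_k=5):
    """Count, for every keyword pair, the documents in which both occur."""
    edges = Counter()
    for text in docs:
        for a, b in combinations(sorted(set(keywords(text, top_k))), 2):
            edges[(a, b)] += 1
    return edges
```

A timed (dynamic) network would then be one such edge counter per archiving period, for example `{period: keyword_network(docs) for period, docs in snapshots.items()}`, where `snapshots` is a hypothetical mapping from period label to the texts archived in that period.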


HOW BIG IS BIG?


QUICK RESULTS, BASIC STATS

• All 89 HU academic institutions: 86 GB total (text 42 GB)
  • Rank distributions (total)

[Figure: rank distributions of total size; panels: HAS, Higher Ed.]
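
A rank distribution like the ones on this slide orders the sites by size and pairs each size with its rank. The sketch below shows that computation; the site names and sizes in the usage note are invented placeholders, not values from the pilot.

```python
def rank_distribution(sizes):
    """Return (rank, site, size) tuples, largest site first (rank 1).

    `sizes` maps a site name to its archived size in any single unit.
    """
    ordered = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return [(rank, site, size)
            for rank, (site, size) in enumerate(ordered, start=1)]
```

For example, `rank_distribution({"site-a": 120, "site-b": 4500, "site-c": 35})` puts the hypothetical `site-b` at rank 1. Plotting size against rank on logarithmic axes is a common way to make a heavy tail visible.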


QUICK RESULTS, BASIC STATS 2.

• Rank distributions (text, i.e. html, doc, docx, rtf, pdf, ps)

[Figure: rank distributions of text size; panels: HAS, Higher Ed.]


QUICK FIRST INSIGHTS

• (Outliers are chemical catalogs and astronomy datasets)

• Average size: 974 MB per site (median: 137 MB [!])
• Average text size: 474 MB per site (median: 47 MB [!])

• For comparison:
  • Kampis website @ ELTE = 180 MB (text only)

• Hypothesis: useful comparisons and metrics possible
  • Add dynamic aspect...
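
The gap between average and median flagged with [!] is what a few very large outliers produce: with the figures on this slide the mean is roughly 7 times the median for total size (974 / 137) and roughly 10 times for text (474 / 47). The sketch below reproduces the effect on a small list of made-up sizes; the numbers in the usage note are illustrative only and are not pilot data.

```python
from statistics import mean, median


def summarize(sizes_mb):
    """Return the mean, the median and their ratio for a list of sizes."""
    avg = mean(sizes_mb)
    mid = median(sizes_mb)
    return {"mean": avg, "median": mid, "ratio": avg / mid}
```

With the hypothetical sizes `[20, 50, 100, 150, 4680]`, one large site pulls the mean up to 1000 MB while the median stays at 100 MB. A mean-to-median ratio far above 1 is a quick check for this kind of skew before any comparison between sites is attempted.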


CONCLUSIONS, SUGGESTIONS

• Very first steps, only 2 months into the pilot
• Data intensive, has natural timing

• Big (web) data are important for research assessment
• Big data are often small (also elsewhere...)
• Suggests itself for readily available indexes and derivative measures
• We have shown the simplest, yet instructive, case („size matters“)
• Caveat: need normalizations!


FUTURE WORK... IS LEFT TO THE (NEAR) FUTURE


THANK YOU!

• Coworkers: Laszlo Gulyas (PhD), Sandor Soos (PhD), Balazs Balint (MSc), Zsolt Juranyi (BSc), Attila Palmai (BSc student)

• This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TÁMOP-4.2.2.C-11/1/KONV-2012-0013).