HATHI TRUST A Shared Digital Repository Delivering Data For New Generations of Research Strategies and Challenges Jeremy York NISO/BISG Forum ALA 2010

Jan 23, 2016

Transcript
Page 1: Delivering Data For  New Generations of Research

HATHI TRUST A Shared Digital Repository

Delivering Data For New Generations of Research

Strategies and Challenges

Jeremy York

NISO/BISG Forum

ALA 2010

Page 2: Delivering Data For  New Generations of Research

Introduction

• Digital Repository
– Initial focus on digitized book and journal content
– “Light” archive

• Collections and Collaboration
– Comprehensive collection
– Shared strategies
– Local services
– Public Good

Page 3: Delivering Data For  New Generations of Research

Content Distribution

6,173,575 – Total
1,177,667 – Public Domain

* As of June 15, 2010
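The two figures above imply a public-domain share of the collection. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the slide does not state the unit being counted.

```python
# Figures copied from the slide (as of June 15, 2010).
total_items = 6_173_575      # "Total"
public_domain = 1_177_667    # "Public Domain"

# Fraction of the total that is public domain.
public_domain_share = public_domain / total_items
print(f"Public domain share: {public_domain_share:.1%}")
```

Under these figures, a little under one fifth of the repository was public domain at that date.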

Page 4: Delivering Data For  New Generations of Research

Language Distribution (1)

* As of June 15, 2010

Page 5: Delivering Data For  New Generations of Research

Language Distribution (2)

The next 40 languages make up ~13% of total

* As of June 15, 2010

Page 6: Delivering Data For  New Generations of Research

Originating Institution

* As of June 15, 2010

Page 7: Delivering Data For  New Generations of Research

Content over time

* As of June 15, 2010

Page 8: Delivering Data For  New Generations of Research

Content Growth

Page 9: Delivering Data For  New Generations of Research
Page 10: Delivering Data For  New Generations of Research

Data Distribution & APIs

• OAI-PMH
• Metadata files
• Bibliographic API
• Data API
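
As an illustration of the first item, OAI-PMH requests are plain HTTP queries carrying a `verb` argument plus verb-specific arguments. The sketch below only builds such a request URL and makes no network call. The base URL is a placeholder, not the repository's actual endpoint, and `build_oai_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); substitute the real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"

def build_oai_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)

# ListRecords with the Dublin Core metadata prefix defined by the protocol.
url = build_oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```

A harvester would typically issue this request, read the returned XML, and repeat with the `resumptionToken` argument until the list is complete.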

Page 11: Delivering Data For  New Generations of Research

Extended Services

• Community Development Environment
• Non-Google Ingest
• Non-Book/Non-Journal Ingest
• Computational Research

Page 12: Delivering Data For  New Generations of Research

Strategies for Computational Research

• Data distribution
• Protocol-based access
• Research Center

Page 13: Delivering Data For  New Generations of Research
Page 14: Delivering Data For  New Generations of Research

SEASR Architecture

[Architecture diagram. Recoverable labels: Components, Virtualization Infrastructure, Meandre Infrastructure, Visualization, Component Repository, Component Discovery, Meandre Data-Intensive Flows, Apps, Services, Plugins, Web Apps, Analytics, Data, Developer Tools, Repositories, Data, Analysis, Components, Flows, User Interfaces, Cloud Computing, Visualizations, Meandre Workbench]

Page 15: Delivering Data For  New Generations of Research

SEASR @ Work – Tag Cloud

• Count tokens

• Filter options supported

• Stem words
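
The three steps on this slide can be sketched in a few lines. This is not SEASR code: the stop-word list is a tiny illustrative sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer.

```python
import re
from collections import Counter

# Illustrative stop-word sample (an assumption, not a SEASR list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tag_cloud_counts(text, stop_words=STOP_WORDS, min_length=3):
    """Tokenize, filter, stem, and count: the weights behind a tag cloud."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    return Counter(crude_stem(t) for t in kept)

counts = tag_cloud_counts("The scanned books and the scanning of a book")
print(counts.most_common())
```

Each stem's count would then set the font size of the corresponding word in the cloud.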

Page 16: Delivering Data For  New Generations of Research

SEASR @ Work – Entity Mash-up

• Entity Extraction with OpenNLP or Stanford NER

• Locations viewed on Google Map

• Dates viewed on Simile Timeline

Page 17: Delivering Data For  New Generations of Research

SEASR @ Work – Entities To Network

• Identify entities
• Define relationships between entities within same sentence
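
One way to read these two steps is as a co-occurrence graph: two entities are linked whenever they appear in the same sentence. The sketch below assumes the entity names are already known and matches them by substring, which is a simplification; the sample text and the function name `entity_network` are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations

def entity_network(text, entities):
    """Count, for each pair of entities, the sentences in which both occur."""
    edges = Counter()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(e for e in entities if e in sentence)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

sample = "Alice met Bob in Paris. Bob left Paris. Alice stayed."
edges = entity_network(sample, {"Alice", "Bob", "Paris"})
for (left, right), weight in sorted(edges.items()):
    print(left, right, weight)
```

The edge weights can serve as line thickness when the network is drawn.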

Page 18: Delivering Data For  New Generations of Research

SEASR @ Work – Text Clustering

• Clustering of Text by token counts
• Filtering options for stop words, Part of Speech
• Dendrogram Visualization
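
A minimal sketch of clustering texts by token counts follows. It uses cosine distance and single-linkage agglomeration, both of which are assumptions chosen for brevity; the slide does not say which distance or linkage SEASR uses. The returned merge list is the data a dendrogram would draw.

```python
import math
import re
from collections import Counter

def token_counts(text, stop_words=frozenset()):
    """Bag-of-words vector for one document, with optional stop-word filter."""
    return Counter(t for t in re.findall(r"[a-z]+", text.lower())
                   if t not in stop_words)

def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def single_linkage(docs, stop_words=frozenset()):
    """Merge the two closest clusters repeatedly; return the merge history."""
    vectors = {name: token_counts(text, stop_words) for name, text in docs.items()}
    clusters = [(name,) for name in docs]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(cosine_distance(vectors[x], vectors[y])
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

docs = {
    "a": "ships sail the sea",
    "b": "ships sail the ocean",
    "c": "castles guard the hills",
}
merges = single_linkage(docs, stop_words=frozenset({"the"}))
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

Here documents "a" and "b" merge first because they share two of three tokens, and "c" joins last at the maximum distance.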

Page 19: Delivering Data For  New Generations of Research

SEASR @ Work – Audio Analysis

• NEMA: Executes a SEASR flow for each run

– Loads audio data

– Extracts features for every 10 sec moving window of audio

– Loads and applies the models

– Sends results back to the WebUI

• NESTER: Annotation of Audio via Spectral Analysis
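
The moving-window step can be sketched as follows. The 10-second window comes from the slide; the 5-second hop, the RMS-energy feature, and the function names are assumptions made for illustration, since the slide names neither the hop size nor the features extracted.

```python
import math

def moving_windows(samples, sample_rate, window_sec=10.0, hop_sec=5.0):
    """Yield (start_time_sec, window) pairs over a list of audio samples."""
    size = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    for start in range(0, len(samples) - size + 1, hop):
        yield start / sample_rate, samples[start:start + size]

def rms(window):
    """Root-mean-square energy, used here as a stand-in feature."""
    return math.sqrt(sum(x * x for x in window) / len(window))

# Toy signal: 20 s of constant amplitude followed by 10 s of silence,
# at a deliberately tiny sample rate so the example stays small.
sample_rate = 4
samples = [1.0] * (sample_rate * 20) + [0.0] * (sample_rate * 10)

features = [(t, rms(w)) for t, w in moving_windows(samples, sample_rate)]
for start_time, energy in features:
    print(start_time, round(energy, 3))
```

Each feature vector would then be passed to the loaded models, one prediction per window.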

Page 20: Delivering Data For  New Generations of Research

SEASR @ Work – Zotero

• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics

– Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR

– Zotero Export to Fedora through SEASR

– Saves results from SEASR Analytics to a Collection

• Launch MONK Processing
– MONK DB Ingestion Workflow
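
To make the citation-analysis item concrete, here is a sketch of ranking authors by importance in a citation network. It uses a plain PageRank power iteration as one example of a network importance measure; the slide does not say which JUNG algorithm is applied, and this Python code does not use JUNG (a Java library). The author names and edges are invented.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given (citing, cited) edge pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Nodes with no outgoing citations spread their rank evenly.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical citation network: each pair means "first cites second".
citations = [
    ("author_b", "author_a"),
    ("author_c", "author_a"),
    ("author_c", "author_b"),
]
ranks = pagerank(citations)
ranking = sorted(ranks, key=ranks.get, reverse=True)
print(ranking)
```

The most-cited author ends up first; the same ranking could be computed on the RDF export once it is parsed into citing and cited pairs.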

Page 21: Delivering Data For  New Generations of Research

SEASR @ Work – Emotion Tracking

Goal is to have this type of Visualization to track emotions across a text document (Leveraging flare.prefuse.org)

Page 22: Delivering Data For  New Generations of Research

Sentiment Analysis: Visualization

Page 23: Delivering Data For  New Generations of Research

Person Extraction: Scott's Waverley, Ivanhoe, and The Heart of Midlothian.

Page 24: Delivering Data For  New Generations of Research

Location Extraction:
Top: Walter Scott's Waverley
Bottom: Maria Edgeworth's Castle Rackrent

Page 25: Delivering Data For  New Generations of Research

Thank you!

[email protected]@umich.edu