T2D + DATA IDENTIFICATION, CURATION & DURATION

Maxine Tedesco

ACCOLEDS: December 2-4, 2009

TABLE TO DATA (T2D) PROJECT

Approved March/08 at the COPPUL director’s meeting as a collaborative project seeking to implement a system of linking articles & data in open access journals published at COPPUL institutions.

T2D ACTIVITIES TO DATE

May/08: Brainstorming at IASSIST conference

July/08: Drupal Wiki established & “Outline of Activities” disseminated to project members

Fall/08: Maxine undertook a Literature Search (building on work done by Jim Jacobs, Feb/08)

December/08: Maxine reported at ACCOLEDS and renewed effort to involve project members

Spring/09: Maxine investigated related project topics in connection with Study Leave research

Additionally, Chuck liaised/advocated for the project throughout the timeline & consultation with OA publishers was undertaken by some project members.

T2D PROJECT STAGES

1. Investigating

Literature Searches re: background, tools, etc.

2. Recruiting

Open access publishers amenable to a pilot project

Researchers willing to deposit data

3. Marking

Develop a set of descriptive tags for table content

Identify which parts of a data file “should” be linked and/or archived

4. Tooling (i.e., tools for markup, searching & display)

5. Evaluating/Reporting (i.e., HOW the project results contribute to research, teaching & learning)
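As a purely illustrative sketch of what the Marking and Tooling stages might produce, a tagged table record linked to its underlying data could look like the following. Everything here is an assumption for illustration: the `TableRecord` class, its field names, the `find_by_tag` helper, and the sample values are hypothetical, since no tag set had been agreed on by the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TableRecord:
    """Hypothetical markup for one table in an open access article."""
    article_id: str                                  # assumed article identifier
    table_label: str                                 # e.g. "Table 2"
    caption: str
    tags: List[str] = field(default_factory=list)    # descriptive tags for the table content
    dataset_url: Optional[str] = None                # link to the deposited data, if any

    def is_linked(self) -> bool:
        """True when the table points at a deposited dataset."""
        return self.dataset_url is not None


def find_by_tag(records: List[TableRecord], tag: str) -> List[TableRecord]:
    """Return the records carrying the given tag (case-insensitive match)."""
    wanted = tag.lower()
    return [r for r in records if wanted in (t.lower() for t in r.tags)]


# Made-up sample records, for illustration only.
records = [
    TableRecord("article-001", "Table 1", "Household income by province",
                tags=["income", "province"],
                dataset_url="https://example.org/data/001"),
    TableRecord("article-002", "Table 3", "Survey response rates",
                tags=["survey", "response rate"]),
]
```

Calling `find_by_tag(records, "income")` returns only the first record, and `is_linked()` separates tables whose data has been deposited from those whose data has not.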

SO … WHAT IS IN IT FOR US?

This seemed like a reasonable question to investigate further in the research in terms of “background information”.

TAKING INTO ACCOUNT RESEARCHERS’ DISCIPLINARY DIFFERENCES, TABLES/FIGURES ARE INCREASINGLY:

used as a more effective summary of the article’s content than subject headings or other descriptors

used as a quick means of identifying types of data, methodologies &/or results

used to assess article relevance before reading the entire article

less effective if completely extracted from the surrounding explanatory text and/or complementary tables/figures

DISAGGREGATION

Disaggregation of article components such as tables/figures facilitates searching at a greater level of granularity in order to:

Improve search precision (the share of retrieved items that are relevant) & recall (the share of relevant tables/figures actually retrieved, including those a traditional search would miss)

Facilitate the REAGGREGATION of a journal article’s components into new forms/formats
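For readers who want the two measures pinned down, here is a minimal sketch using the standard information-retrieval definitions of precision and recall. The function name and the item identifiers (`a1`, `t7`, and so on) are made up for illustration and do not come from any of the products discussed.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: three items are relevant, two of them tables ("t7", "t9").
relevant = {"a1", "t7", "t9"}

# An article-level search finds one relevant article plus three irrelevant ones.
article_level = precision_recall({"a1", "a2", "a3", "a4"}, relevant)

# A search over disaggregated tables/figures also surfaces the two tables.
table_level = precision_recall({"a1", "a2", "t7", "t9"}, relevant)
```

In this made-up case the article-level search scores a precision of 0.25 and a recall of one third, while the table-level search scores 0.75 and 1.0.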

REAGGREGATION?

Researchers wish to easily incorporate tabular information:

into new documents (to support original research)

into multimedia documents (to support presentations - classrooms or conferences)

into other contexts (utilize data in pre-existing tables rather than generate new time-consuming and/or expensive datasets)

into a comparison of similar information (to check one’s own work against other work)

SO … WHAT CAN MAKE IT EASIER TO RETRIEVE RELEVANT TABLES/FIGURES?

The research was decidedly sparse in this area or not quite as “on-topic” as one would have hoped.

OVERVIEW OF LITERATURE REVIEW

The research mostly dealt with such topics as:

Making T&F (tables/figures) more accessible to the visually impaired.

Improved graphical presentation of T&F.

Poor quality of T&F replication in electronic versions of documents.

Improved dissemination of statistical information.

Full-text does not necessarily mean the inclusion of T&F.

FORMAT-SPECIFIC DATABASES

TableBase (Gale; 1997+)

table title, table text, and descriptor fields are searchable

text that accompanies the table is not searchable or retrievable from the product

tables are directly downloadable to Excel

Statistical Universe (Lexis-Nexis PowerTables; 2000+)

users search by “criteria”

links to full-text documents in the CIS/LEXIS-NEXIS digital archive & on WWW sites

download a PDF file or an Excel spreadsheet

SEARCH RESULTS from TableBase

TYPICAL RECORD in TableBase

DATABASES WITH “DEEP INDEXING” FEATURES

Illustrata (ProQuest/CSA; 2006+)

assigns 7-8 index terms per image (these are searchable but not the table text itself)

thumbnail images for quick preview

links to full-text and other components within the product

Selected ProQuest Databases (Oct. 1, 2009+)

deep indexing of images added along with traditional abstracting & indexing of text (at no additional cost)

ILLUSTRATA RESULTS PAGE

ILLUSTRATA ARTICLE RECORD

ILLUSTRATA OBJECT RECORD

GEOREF DATABASE’S LINK TO “DEEP INDEXING”

ABSTRACT RETRIEVED FROM GEOREF FOR “AERONOMY” AND “MAPS”

PRODUCTS THAT INDEX TABLE CONTENT

TableSeer (search engine; 2006+)

automatically identifies tables in digital documents and extracts the contents in the cells of the tables

contents are stored in a queryable table in a database

extracts table metadata and uses a novel ranking function to search for tables relevant to user queries

BioText Search Engine (freely available web-based application; 2007+)

searches over 300 open access journals

ability to search for words within a table
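To make “contents stored in a queryable table in a database” concrete, here is a minimal sketch that stores table cells in an in-memory SQLite database and searches for a word within them. It is not how TableSeer or BioText work: the schema, the function names, and the ranking (a plain count of matching cells, standing in for TableSeer's unspecified ranking function) are all assumptions, and the sample tables are made up.

```python
import sqlite3
from typing import Dict, List, Tuple


def build_index(tables: Dict[str, List[List[str]]]) -> sqlite3.Connection:
    """Store every cell of every table as one row in a SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cells (table_id TEXT, row_idx INTEGER, col_idx INTEGER, content TEXT)"
    )
    for table_id, rows in tables.items():
        for r, row in enumerate(rows):
            for c, content in enumerate(row):
                conn.execute(
                    "INSERT INTO cells VALUES (?, ?, ?, ?)", (table_id, r, c, content)
                )
    return conn


def search(conn: sqlite3.Connection, term: str) -> List[Tuple[str, int]]:
    """Return (table_id, matching_cell_count), best match first.

    Ranking by number of matching cells is a deliberately naive stand-in.
    """
    cur = conn.execute(
        "SELECT table_id, COUNT(*) AS hits FROM cells "
        "WHERE LOWER(content) LIKE ? "
        "GROUP BY table_id ORDER BY hits DESC, table_id",
        (f"%{term.lower()}%",),
    )
    return cur.fetchall()


# Made-up tables, as they might look once extracted from two articles.
tables = {
    "t1": [["Drug", "LDL cholesterol"], ["Statin", "cholesterol reduced"]],
    "t2": [["Group", "Education"], ["A", "cholesterol"]],
}
conn = build_index(tables)
results = search(conn, "cholesterol")
```

Because each cell keeps its table identifier and position, a hit can be traced back to the table it came from, which is what lets a search reach inside tables instead of stopping at the article level.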

TABLESEER IS PART OF CHEMXSEER

http://chemxseer.ist.psu.edu/

BIOTEXT SEARCH IN ARTICLES FOR: “HYPERCHOLESTEROLEMIA” & “EDUCATION”

SAME BIOTEXT SEARCH IN “FIGURE CAPTIONS” – GRID VIEW

SAME BIOTEXT SEARCH IN “TABLES”

SO … WHAT DOES THIS ALL MEAN FOR THE T2D PROJECT?

Not exactly sure, but perhaps, in seeing this trend in the Abstracting & Indexing industry, we might investigate developing a “SocioText” type of product to index open access journals such as the Canadian Journal of Sociology?

SO … WHAT ELSE NEEDS TO BE “PUT ON THE TABLE”?

What if the table information is insufficient and I want to look at the entire dataset?

Where is the entire dataset?

Who owns the entire dataset?

When will it become available for me to use?

How can I get my hands on it?

IDENTIFIC/CUR/DUR-ATION!

Personal Websites

Institutional Repositories

Subject-specific Repositories such as:

Dryad - http://datadryad.org/repo

ExLab - http://exlab.bus.ucf.edu

AND THEN PERHAPS, there’s still:

Desk Drawers (aka: LOST)

SO . . . WHAT DO WE DO NOW?

Hopefully I’ve been able to provide some context and/or “food for thought” and, well . . . stay tuned for updates!
