Dutch Book Trade 1660-1750: using the STCN to gain insight into publishers’ strategies

Post on 19-Jun-2015

DESCRIPTION

Despite stagnating domestic demand near the end of the seventeenth century, Dutch book producers managed to maintain their international market position. In a so-called embedded research project, the Short Title Catalogue, Netherlands (STCN) was used to gain insight into the strategies and decisions of these publishers. The STCN is a retrospective bibliography of publications from 1540 to 1800, containing information on title, author, book producer, language, subject and collation. Historians and computer scientists collaborated to open up the STCN and to connect it to other relevant datasets. To explore the possibilities of, and difficulties in, opening up and linking the bibliography, attention was turned to one particular strategy: publishing scandalous books. Besides explaining the process of converting and querying the STCN data, the presentation deals with differences in handling data and the advantages of an Open Data approach in humanities research.

Transcript

e-Humanities Group Research Meeting: STCN

2013/10/10

Wouter Beek, Albert Meroño Peñuela, Rinke Hoekstra, Fernie Maas, Inger Leemans

‘OPENING’ THE STCN

LINKING THE STCN

Open data

Linked Open Data

• Connect to existing datasets
• Connect to services
• Queries/inferences run across datasets

– The Picarta topic hierarchy allows us to infer that certain publications cover related topics.

– GeoNames gives the latitude of publishing houses, allowing publishing decisions to be correlated to historical events.

– Lexvo / ISO standards allow translations to be traced via related languages (e.g. language families).

• Easy to create mashups / new applications.

[Diagram labels: “died in”, “Biografisch portaal”, “same as”]
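A minimal sketch of what a query across linked datasets amounts to, written in plain Python rather than against an RDF store. All identifiers, the topic hierarchy and the example facts below are hypothetical placeholders, not actual STCN, Picarta or Biografisch Portaal data; the hierarchy is assumed to contain no cycles.

```python
# Hypothetical triples; none of these identifiers or facts come from the real datasets.
STCN = {
    ("stcn:pub1", "topic", "topic:church_history"),
    ("stcn:pub2", "topic", "topic:dogmatics"),
    ("stcn:pub1", "author", "stcn:author1"),
}
# Assumed topic hierarchy: child topic -> broader topic (must be acyclic).
TOPICS = {
    "topic:church_history": "topic:theology",
    "topic:dogmatics": "topic:theology",
}
# Assumed linkset and external dataset.
LINKS = {("stcn:author1", "same_as", "bio:person42")}
BIO = {("bio:person42", "died_in", "place:leiden")}


def ancestors(topic):
    """Return the topic itself plus every broader topic above it."""
    chain = [topic]
    while chain[-1] in TOPICS:
        chain.append(TOPICS[chain[-1]])
    return set(chain)


def broader_topics(pub):
    """Collect all topics of a publication, including inferred broader ones."""
    topics = {o for s, p, o in STCN if s == pub and p == "topic"}
    return set().union(*(ancestors(t) for t in topics))


def related(pub_a, pub_b):
    """Two publications are related if they share a (possibly broader) topic."""
    return bool(broader_topics(pub_a) & broader_topics(pub_b))


def death_places(pub):
    """Follow author -> same_as -> died_in across the three datasets."""
    authors = {o for s, p, o in STCN if s == pub and p == "author"}
    linked = {o for s, p, o in LINKS if s in authors and p == "same_as"}
    return {o for s, p, o in BIO if s in linked and p == "died_in"}
```

Here `related("stcn:pub1", "stcn:pub2")` holds only because of the inferred broader topic, and `death_places` answers a question that no single dataset can answer on its own.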

Taking the STCN to the Semantic Web

• 139.817 publications (4M facts)
• 23.543 authors (120K facts)
• 9.959 printers (55K facts)
• 37K enriched concepts (DBpedia, Yago, Heidelberg Diglit, …)
• 105 topics (1K facts)
• Relate to international standards (GGC/OCLC/ISO/RFC/IANA)
• Making the schema explicit (vocabulary)
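One way to picture how a catalogue record becomes a set of "facts" under an explicit vocabulary is the sketch below. The field names follow the STCN description (title, author, book producer, language, subject, collation), but the property names, the subject identifier pattern and the example record are assumptions made for illustration only, not the vocabulary actually used in the project.

```python
# Hypothetical vocabulary: maps source field names to explicit property names.
VOCAB = {
    "title": "stcn:title",
    "author": "stcn:author",
    "book_producer": "stcn:printer",
    "language": "stcn:language",
    "subject": "stcn:topic",
    "collation": "stcn:collation",
}


def record_to_triples(record_id, record):
    """Turn one flat record into (subject, property, value) facts."""
    subject = f"stcn:publication/{record_id}"
    triples = []
    for field, value in record.items():
        if field not in VOCAB:
            continue  # fields outside the vocabulary are skipped, not guessed
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((subject, VOCAB[field], v))
    return triples


# Invented example record; "shelfmark" is deliberately not in the vocabulary.
example = {
    "title": "Example title",
    "author": ["Author A", "Author B"],
    "language": "lat",
    "shelfmark": "x",
}
facts = record_to_triples("0001", example)
```

A multi-valued field such as `author` yields one fact per value, which is why the number of facts is much larger than the number of records.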

[Diagram: conversion pipeline, labels with their annotations]
• Relational DB: domain knowledge
• RDF files
• Text files: ambiguous
• XML files: depends on structure
• domain knowledge
• Link to external sources (linksets): domain knowledge needed
• Domain-independent data conversions: fully automated
• Simple RDF
• Domain-dependent data conversions: domain knowledge needed
• Connect to services (e.g. query interface, maps): high level of reuse
• Fixing bad data: origin inconsistencies & inaccuracies

FROM THE LIBRARY TO THE LAB

“How many publications by Arminius?”

“How many publications by Gomarus?”

What happens to the average publication format after 1619?

Measured in terms of the number of folds:
• Works by Arminius: 5.6 5.7
• Works by Gomarus: 6.8 4.9
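The comparison above boils down to averaging a numeric format value over two periods. A minimal sketch, assuming each record carries a year and a numeric format value (e.g. 4 for quarto, 8 for octavo, which is one possible reading of "number of folds"); the records below are invented and the treatment of the cutoff year itself is an arbitrary choice.

```python
from statistics import mean

# Hypothetical (year, format_value) pairs; not taken from the STCN.
records = [(1605, 4), (1612, 8), (1618, 4), (1620, 8), (1625, 8), (1630, 12)]


def mean_format(records, cutoff=1619):
    """Return (mean before cutoff, mean from cutoff onwards).

    A side with no records yields None instead of raising an error.
    """
    before = [fmt for year, fmt in records if year < cutoff]
    after = [fmt for year, fmt in records if year >= cutoff]
    return (
        mean(before) if before else None,
        mean(after) if after else None,
    )


before_1619, from_1619 = mean_format(records)
```

Running the same function once per author gives the kind of two-by-two comparison shown on the slide.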

Distant reading!

Methodological implications

From searching for resources (librarian) to validating/refuting hypotheses (scientist)

humR

humanities + R (statistics processing software)

A WEB SERVICE FOR RESEARCH INVOLVING DISTANT READING

Open issues 0: institutional hurdles

• The products of publicly funded research should be publicly available (papers & datasets).
– Not everybody makes their data publicly available.

• Distant reading research is often restricted by the user interface.

Open issues 1: meaning

A large percentage of the data has no/unknown meaning:
• “before 1808”
• “This book was published between the Big Bang and 1808.”

Context-dependent:
• “The first dinosaur walked the earth before 300M years BC.”
• “Einstein came up with the idea of general relativity before 1937.”

Fuzziness:
• “James Joyce’s Ulysses was published before 1925.”
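One possible way to give an expression such as “before 1808” a usable meaning is to store it as a year interval whose lower bound comes from context. The sketch below is an assumption about how this could be modelled, not the project's approach; the function name and the two accepted input patterns are hypothetical.

```python
import re


def parse_date(text, earliest=None):
    """Parse 'YYYY' or 'before YYYY' into an inclusive (start, end) year interval.

    `earliest` is a context-dependent lower bound; None means it is unknown.
    """
    text = text.strip()
    match = re.fullmatch(r"before (\d{4})", text)
    if match:
        return (earliest, int(match.group(1)) - 1)
    if re.fullmatch(r"\d{4}", text):
        return (int(text), int(text))
    raise ValueError(f"unrecognised date expression: {text!r}")


# Without context the lower bound stays open ("since the Big Bang").
open_interval = parse_date("before 1808")
# With the catalogue's own coverage (publications from 1540) as context.
bounded_interval = parse_date("before 1808", earliest=1540)
```

The interval makes the context dependence explicit: the same phrase yields a different lower bound depending on which dataset it appears in.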

Open issues 2: statistics

• Which query results are statistically relevant?
• How to detect whether a statistically significant difference reflects reality and not the way in which the dataset was constructed?
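For the first question, one standard option is a permutation test on the difference between two group means. The sketch below is an exact version for small samples, with invented data; it is an illustration of the technique, not a method the presentation proposes, and it says nothing about the second question (bias from how the dataset was constructed).

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    n = len(pooled)
    extreme = total = 0
    for idx in combinations(range(n), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical format values for two groups of publications.
p = permutation_p_value([4, 4, 4], [8, 8, 8])
```

A small p-value only says the split is unlikely under random relabelling of these records; whether the records themselves are a fair sample is a separate matter.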
