Data Wrangling I: Exploring Programming in Digital Scholarship February 19, 2016 Paige Morgan Digital Humanities Librarian

Feb.2016 Demystifying Digital Humanities - Workshop 2

Apr 13, 2017

Transcript
Page 1: Feb.2016 Demystifying Digital Humanities - Workshop 2

Data Wrangling I: Exploring Programming in Digital Scholarship

February 19, 2016

Paige Morgan, Digital Humanities Librarian

Page 2: Feb.2016 Demystifying Digital Humanities - Workshop 2

Programming is complex enough that just figuring out what you want to do

and what sort of language you need is work.

Page 3: Feb.2016 Demystifying Digital Humanities - Workshop 2

Thinking that you ought to be able to do everything almost

immediately is a recipe for feeling terrible.

Page 4: Feb.2016 Demystifying Digital Humanities - Workshop 2

There will always be new programs and platforms

that you will want to experiment with.

Page 5: Feb.2016 Demystifying Digital Humanities - Workshop 2

Working with technology means periodically starting

from scratch -- a bit like working with a new time

period or culture; or figuring out how to teach a new

class.

Page 6: Feb.2016 Demystifying Digital Humanities - Workshop 2
Page 7: Feb.2016 Demystifying Digital Humanities - Workshop 2

Being able to effectively communicate about your

project as it relates to programming is a skill in

itself.

Page 8: Feb.2016 Demystifying Digital Humanities - Workshop 2

What can programming languages do?

Page 9: Feb.2016 Demystifying Digital Humanities - Workshop 2

Programming languages can...

• search for things

• match things

• read things

• write things

• receive information, and give it back, changed or unchanged

• count things

• do math

• arrange things in quantitative or random order

• respond: if x, do y OR do x until y happens

• compare things for similarity

• go to a file at a location, and retrieve readable text

• display things according to instructions that you provide

• draw points, lines, and shapes

Page 10: Feb.2016 Demystifying Digital Humanities - Workshop 2

They can also do many or all of these things in

combination.

Page 11: Feb.2016 Demystifying Digital Humanities - Workshop 2

Example #1

• find all the statements in quotes ("") from a novel

• count how many words are in each statement

• put the statements in order from fewest words to most

• write all the statements from the novel to a text file
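
The four steps in Example #1 can be sketched in Python. This is a minimal illustration, not a finished tool: the sample text, the function names, and the output filename statements.txt are all assumptions, and the pattern only recognizes straight double quotes.

```python
import re

# Hypothetical stand-in for the text of a novel.
novel_text = (
    'She paused. "I never said that," she replied. '
    '"No," he said. "But you thought it, and that is quite enough for me."'
)

def quoted_statements(text):
    """Find every run of text enclosed in straight double quotes."""
    return re.findall(r'"([^"]+)"', text)

def word_count(statement):
    """Count words by splitting on whitespace."""
    return len(statement.split())

def sort_by_word_count(statements):
    """Order statements from fewest words to most."""
    return sorted(statements, key=word_count)

def write_statements(statements, path):
    """Write one statement per line, prefixed with its word count."""
    with open(path, "w", encoding="utf-8") as out:
        for s in statements:
            out.write(f"{word_count(s)}\t{s}\n")

statements = sort_by_word_count(quoted_statements(novel_text))

if __name__ == "__main__":
    write_statements(statements, "statements.txt")  # assumed filename
```

Note what the sketch cannot do: it has no idea which character is speaking, and it would miss dialogue set in curly or single quotes.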

Page 12: Feb.2016 Demystifying Digital Humanities - Workshop 2

Example #2

• allow a user to type in some information, e.g., "Benedict Cumberbatch"

• compare “Benedict Cumberbatch” to a much larger file

• retrieve any data that matches the information

• print the retrieved information on screen
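
A rough Python sketch of Example #2 follows. The three-line list stands in for the "much larger file", the function name find_matches is hypothetical, and the query is hard-coded where an interactive program would read it from the user. Deciding that "the data that matches" means the whole line containing the name is itself an assumption that the programmer has to make.

```python
# Hypothetical stand-in for "a much larger file", one record per line.
records = [
    "Sherlock (2010): Benedict Cumberbatch, Martin Freeman",
    "Parade's End (2012): Benedict Cumberbatch, Rebecca Hall",
    "Wolf Hall (2015): Mark Rylance, Damian Lewis",
]

def find_matches(query, lines):
    """Return every line that contains the query, ignoring case."""
    needle = query.casefold()
    return [line for line in lines if needle in line.casefold()]

# In an interactive program this value would come from the user.
query = "Benedict Cumberbatch"

matches = find_matches(query, records)
for line in matches:
    print(line)
```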

Page 13: Feb.2016 Demystifying Digital Humanities - Workshop 2

Example #3

• "read" two texts -- say, two plays by Seneca

• search for any words that the two plays have in common

• print the words that they have in common on screen

• calculate what percentage of the words in each play are shared

• print that percentage onscreen
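
Example #3 might look like the sketch below. The two short strings are invented placeholders rather than text from Seneca, the function names are hypothetical, and two interpretive choices are baked in as assumptions: a "word" is a run of the letters a-z, and the percentage is measured against each text's distinct words rather than its running word count.

```python
import re

def vocabulary(text):
    """Return the set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def shared_vocabulary(text_a, text_b):
    """Return the shared words and the share of each vocabulary they cover."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    shared = vocab_a & vocab_b
    pct_a = 100 * len(shared) / len(vocab_a) if vocab_a else 0.0
    pct_b = 100 * len(shared) / len(vocab_b) if vocab_b else 0.0
    return shared, pct_a, pct_b

# Invented placeholder texts, not real play excerpts.
play_a = "The king speaks of fate"
play_b = "The queen speaks of revenge"

shared, pct_a, pct_b = shared_vocabulary(play_a, play_b)
print(sorted(shared))
print(f"{pct_a:.1f}% of play A's vocabulary is shared")
print(f"{pct_b:.1f}% of play B's vocabulary is shared")
```

Changing either assumption (for instance, counting every occurrence of a word, or grouping inflected Latin forms together) would change the percentages, which is a scholarly decision rather than a technical one.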

Page 14: Feb.2016 Demystifying Digital Humanities - Workshop 2

Example #4

• if the user is at geographic location Z (e.g., Blue Road & S Red Road), retrieve some text from an online location

• print that text on the user’s tablet screen

• receive input from the user and respond

Page 15: Feb.2016 Demystifying Digital Humanities - Workshop 2

However...

• In Example #1, the computer is focusing on things that characters say. But what if you want to isolate speeches from just one character?

• In Example #2, how does the computer know how much text to print? Will it just print "Benedict Cumberbatch" 379 times, because that's how often it appears in the larger file?

Page 16: Feb.2016 Demystifying Digital Humanities - Workshop 2

These are the areas of programming where critical thinking and

specialized disciplinary knowledge become vital.

Page 17: Feb.2016 Demystifying Digital Humanities - Workshop 2

The Difference

• Humans are good at differentiating between material in complex and sophisticated ways.

• Computers are good at not differentiating between material unless they’ve been specifically instructed to do so.

Page 18: Feb.2016 Demystifying Digital Humanities - Workshop 2

Computers work with data.

You work with data, too -- but you may have to do extra work to make your data readable by computer.

Page 19: Feb.2016 Demystifying Digital Humanities - Workshop 2

How to make your data machine-readable

• Annotate it with markup language

• Organize it in formats or structures that the computer can understand

• Add metadata that is not explicitly readable in the current format (e.g., hardbound/softbound binding; language: English; date of record creation)
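
As one small illustration of the last two points, the sketch below stores metadata of the kind listed above in a structured format (JSON) using Python's standard json module. The record and all of its field names and values are invented for illustration.

```python
import json

# Hypothetical catalog record; every field name and value is invented.
record = {
    "title": "Songs of Innocence",
    "binding": "hardbound",
    "language": "English",
    "record_created": "2016-02-19",
}

# Turn the structure into text that another program can read...
as_text = json.dumps(record, indent=2)

# ...and turn that text back into a structure.
restored = json.loads(as_text)

print(as_text)
print(restored["binding"])
```

Once the binding is an explicit field, a program can filter on it; as long as it is only visible to someone holding the book, it cannot.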

Page 20: Feb.2016 Demystifying Digital Humanities - Workshop 2

Depending on the data you have, and the way

you annotate or structure it, different things become

possible.

Page 21: Feb.2016 Demystifying Digital Humanities - Workshop 2

Your goal is to make the data As Simple As

Possible -- but not so simple that it stops being

useful.

Page 22: Feb.2016 Demystifying Digital Humanities - Workshop 2

Depending on the data you work with, the work of structuring or annotating

becomes more challenging, but also

more useful.

Page 23: Feb.2016 Demystifying Digital Humanities - Workshop 2

The work of creating data is social.

Page 24: Feb.2016 Demystifying Digital Humanities - Workshop 2

Many programming and markup languages have governing bodies that establish

standards for their use:

• World Wide Web Consortium (W3C) (www.w3.org/standards/)

• TEI Technical Council (www.tei-c.org)

Page 25: Feb.2016 Demystifying Digital Humanities - Workshop 2

Data Examples

• Annotated (Markup Languages: HTML, TEI)

• Formatted (JSON)

• Structured data (tabular, relational, non-relational)

• Object-Oriented Programming (Java, Python, Ruby)

Page 26: Feb.2016 Demystifying Digital Humanities - Workshop 2

Markup: HTML

<i> This text is italic.</i> =

This text is italic.

Page 27: Feb.2016 Demystifying Digital Humanities - Workshop 2

Markup: HTML

<a href="http://www.paigemorgan.net">This text</a> will take you to a webpage.

= This text will take you to a webpage.

Page 28: Feb.2016 Demystifying Digital Humanities - Workshop 2

Markup: HTML

Anything can be data -- and markup languages provide instructions for how

computers should treat that data.
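
To make this concrete, here is a minimal sketch of a program treating HTML markup as data: it reads a link like the one on the earlier slide and pulls out the address it points to. It uses Python's standard html.parser module; the class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is read."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

snippet = '<a href="http://www.paigemorgan.net">This text</a> will take you to a webpage.'

collector = LinkCollector()
collector.feed(snippet)
print(collector.links)
```

The program never "understands" the sentence; it only follows the instruction that the tag and its attribute provide.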

Page 29: Feb.2016 Demystifying Digital Humanities - Workshop 2

Markup: HTML

HTML is used to format text on webpages.

<p> separates text into paragraphs.

<em> marks text as emphasized (browsers usually display it in italics); <strong> is the tag usually displayed in bold.

These are just a few of the HTML formatting instructions that you can use.

Page 30: Feb.2016 Demystifying Digital Humanities - Workshop 2

HTML Syntax Rules

• Opening and closing tags: <> and </>

• Attributes (2nd-level information) defined using =""

Page 31: Feb.2016 Demystifying Digital Humanities - Workshop 2

Markup languages are popular in digital

humanities because lots of humanists work with

texts.

Page 32: Feb.2016 Demystifying Digital Humanities - Workshop 2

Without markup languages, the things that a computer can search for

are limited.

Page 33: Feb.2016 Demystifying Digital Humanities - Workshop 2

Ctrl + F: any text in iambic pentameter.

Page 34: Feb.2016 Demystifying Digital Humanities - Workshop 2

With markup, the things you can search for are

only limited by your interpretation.

Markup: TEI

Page 35: Feb.2016 Demystifying Digital Humanities - Workshop 2

TEI (Text Encoding Initiative)

Markup: TEI

Page 36: Feb.2016 Demystifying Digital Humanities - Workshop 2

Poetry w/ TEI

<text xmlns="http://www.tei-c.org/ns/1.0" xml:id="d1">
  <body xml:id="d2">
    <div1 type="book" xml:id="d3">
      <head>Songs of Innocence</head>
      <pb n="4"/>
      <div2 type="poem" xml:id="d4">
        <head>Introduction</head>
        <lg type="stanza">
          <l>Piping down the valleys wild, </l>
          <l>Piping songs of pleasant glee, </l>
          <l>On a cloud I saw a child, </l>
          <l>And he laughing said to me: </l>
        </lg>

Page 37: Feb.2016 Demystifying Digital Humanities - Workshop 2

Grammar w/ TEI

<entry>
  <form>
    <orth>pamplemousse</orth>
  </form>
  <gramGrp>
    <gram type="pos">noun</gram>
    <gram type="gen">masculine</gram>
  </gramGrp>
</entry>

Page 38: Feb.2016 Demystifying Digital Humanities - Workshop 2

TEI’s syntax rules closely resemble HTML’s (TEI is written in XML, which is stricter about closing tags) -- though your normal

browser can’t work with TEI the way it works with

HTML.

Page 39: Feb.2016 Demystifying Digital Humanities - Workshop 2

TEI is meant to be a highly social language

that anyone can use and adapt for new purposes.

Page 40: Feb.2016 Demystifying Digital Humanities - Workshop 2

In order for TEI to successfully encode texts, it has to be adaptable to

individual projects.

Page 41: Feb.2016 Demystifying Digital Humanities - Workshop 2

Anything that you can isolate (and put in brackets) can

(theoretically) be pulled out and displayed for a reader.
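
As a sketch of what "pulled out" can mean in practice, the Python below reads the Blake fragment from the earlier slide and extracts the poem's title and its lines of verse, using the standard xml.etree.ElementTree module. The closing tags at the end of the fragment were added here so that it parses as complete XML; they are an assumption about how the slide's excerpt continues.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this namespace, so searches must name it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Fragment from the slide; the final four closing tags are added
# here (an assumption) so that the excerpt is well-formed XML.
tei_fragment = """
<text xmlns="http://www.tei-c.org/ns/1.0" xml:id="d1">
  <body xml:id="d2">
    <div1 type="book" xml:id="d3">
      <head>Songs of Innocence</head>
      <pb n="4"/>
      <div2 type="poem" xml:id="d4">
        <head>Introduction</head>
        <lg type="stanza">
          <l>Piping down the valleys wild, </l>
          <l>Piping songs of pleasant glee, </l>
          <l>On a cloud I saw a child, </l>
          <l>And he laughing said to me: </l>
        </lg>
      </div2>
    </div1>
  </body>
</text>
"""

root = ET.fromstring(tei_fragment)

# The poem title is the <head> directly inside the <div2>.
poem_title = root.find(".//tei:div2/tei:head", TEI_NS).text

# Every <l> element is one line of verse.
verse_lines = [l.text.strip() for l in root.iterfind(".//tei:l", TEI_NS)]

print(poem_title)
for line in verse_lines:
    print(line)
```

The same pattern works for any element the encoder chose to mark: stanzas, speakers, page breaks, or categories a project defines for itself.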

Page 42: Feb.2016 Demystifying Digital Humanities - Workshop 2

TEI can be used to encode more than just text:

<div type="shot">
  <view>BBC World symbol</view>
  <sp>
    <speaker>Voice Over</speaker>
    <p>Monty Python's Flying Circus tonight comes to you live from the Grillomat Snack Bar, Paignton.</p>
  </sp>
</div>
<div type="shot">
  <view>Interior of a nasty snack bar. Customers around, preferably real people. Linkman sitting at one of the plastic tables.</view>
  <sp>
    <speaker>Linkman</speaker>
    <p>Hello to you live from the Grillomat Snack Bar.</p>
  </sp>
</div>

Page 43: Feb.2016 Demystifying Digital Humanities - Workshop 2

Or, you could encode transcriptions of

presidential debates according to their

emotional register.

Page 44: Feb.2016 Demystifying Digital Humanities - Workshop 2

Whether you include or exclude some aspect of the text in your markup can be very important

from an academic perspective.

Page 45: Feb.2016 Demystifying Digital Humanities - Workshop 2

The challenge of creating good data is one reason that collaboration is so

important to digital scholarship.

Page 46: Feb.2016 Demystifying Digital Humanities - Workshop 2

Wise Data Collaboration

• Avoid reinventing the wheel (has someone else already created an effective method for working with this data?)

• Consider the labor involved vs. the outcome (and the future use of the data you create).

Page 47: Feb.2016 Demystifying Digital Humanities - Workshop 2

Structured Data

Page 48: Feb.2016 Demystifying Digital Humanities - Workshop 2

Study Scenario #1

• You study urban espresso stands: their hours, brands of coffee, whether or not they sell pastries, and how far the espresso stands are from major roadways.

Page 49: Feb.2016 Demystifying Digital Humanities - Workshop 2

Study Scenario #2

• You study female characters in novels written between 1700 and 1850. Encoding a whole novel just to study female characters isn’t practical for you.

Page 50: Feb.2016 Demystifying Digital Humanities - Workshop 2

Both scenarios involve aggregating information, rather than encoding it.

Page 51: Feb.2016 Demystifying Digital Humanities - Workshop 2

Structured Data: Example #1 (Tabular Data)

ID | Name | Location | Hours | Coffee Brand | Pastries (Y/N) | Distance from Street
008 | Java the Hut | 56 Farringdon Road, London, UK | 7:00 a.m.-2:00 p.m. | Square Mile Roasters | N | 25 meters
009 | Prufrock Coffee | 18 Shoreditch High Street | 7:00 a.m.-10:00 p.m. | Monmouth | Y | 10 meters
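
Once the espresso stands are in tabular form, a program can ask questions of them. The sketch below holds the two rows above as CSV text and filters them with Python's standard csv module. The lowercase column names, the choice to store distance as a bare number of meters, and the 15-meter threshold are all assumptions made for illustration.

```python
import csv
import io

# The two rows from the table, as CSV. Column names are assumptions;
# the location containing commas must be wrapped in quotes.
csv_text = """id,name,location,hours,coffee_brand,pastries,distance_m
008,Java the Hut,"56 Farringdon Road, London, UK",7:00 a.m.-2:00 p.m.,Square Mile Roasters,N,25
009,Prufrock Coffee,18 Shoreditch High Street,7:00 a.m.-10:00 p.m.,Monmouth,Y,10
"""

# Each row becomes a dictionary keyed by column name.
stands = list(csv.DictReader(io.StringIO(csv_text)))

# Which stands sell pastries?
with_pastries = [s["name"] for s in stands if s["pastries"] == "Y"]

# Which stands are within 15 meters of the street? (arbitrary threshold)
near_street = [s["name"] for s in stands if int(s["distance_m"]) <= 15]

print(with_pastries)
print(near_street)
```

Storing "25" rather than "25 meters" is a small structuring decision that makes the distance comparable as a number.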

Page 52: Feb.2016 Demystifying Digital Humanities - Workshop 2
Page 53: Feb.2016 Demystifying Digital Humanities - Workshop 2

Structured Data: Example #2 (RDF)

Page 54: Feb.2016 Demystifying Digital Humanities - Workshop 2

Object-Oriented Programming

• Java, Python, C++, Perl, PHP, Ruby, etc.

• Widely used, highly flexible, very powerful

Page 55: Feb.2016 Demystifying Digital Humanities - Workshop 2

What’s an “object”?

• An object is a structure that contains data in one or more forms.

• Common forms include strings, integers, and arrays (groups of data).

• Example (handout)
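
The handout is not reproduced in this transcript. As a stand-in, here is a minimal Python sketch of an object that holds a string, an integer, and a group of values (Python calls its arrays "lists"). The class name EspressoStand and its fields are hypothetical, borrowed from the espresso stand scenario.

```python
class EspressoStand:
    """An object bundling several pieces of data about one stand."""

    def __init__(self, name, distance_m, brands):
        self.name = name              # a string
        self.distance_m = distance_m  # an integer
        self.brands = brands          # a list (group) of strings

    def describe(self):
        """Build a one-sentence description from the object's data."""
        brand_list = ", ".join(self.brands)
        return (
            f"{self.name} is {self.distance_m} meters from the street "
            f"and serves {brand_list}."
        )

stand = EspressoStand("Java the Hut", 25, ["Square Mile Roasters"])
print(stand.describe())
```

The object keeps the data and the things you can do with that data (here, describe) in one place.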

Page 56: Feb.2016 Demystifying Digital Humanities - Workshop 2

Object-oriented programming, cont’d

• Learning a bit about an OOP language can help you become accustomed to working with programming

• Reading OOP code can also be useful

• Many free tutorials are available

• Goal: to be able to converse more effectively with professional programmers, rather than become an expert yourself.

Page 57: Feb.2016 Demystifying Digital Humanities - Workshop 2

How your data is structured will influence the technology that you (can) use to work with it.

Page 58: Feb.2016 Demystifying Digital Humanities - Workshop 2

Digital scholars see creating machine-readable data as valuable scholarship.

Page 59: Feb.2016 Demystifying Digital Humanities - Workshop 2

Exercise: You Create the Data!

Page 60: Feb.2016 Demystifying Digital Humanities - Workshop 2

Your data determines your project.

Page 61: Feb.2016 Demystifying Digital Humanities - Workshop 2

Every project has data.

Text objects, images, tags, geographical coordinates, categories, records, creator

metadata, etc.

Page 62: Feb.2016 Demystifying Digital Humanities - Workshop 2

Even if you’re not planning to learn any programming skills,

you are still working with data.

Page 63: Feb.2016 Demystifying Digital Humanities - Workshop 2

Next time: Programming on the Whiteboard

• Cleaning data before you work with it!

• Identifying specific programming tasks

• How access affects your project idea

• Flash project development