LIS510 lecture 12 Thomas Krichel 2006-12-13. today Leftovers from last time. I discuss some elements of Bill Arms’ book on Digital Libraries. –It’s introductory.

Post on 26-Dec-2015


LIS510 lecture 12

Thomas Krichel

2006-12-13

today

• Leftovers from last time.

• I discuss some elements of Bill Arms’ book on Digital Libraries.
– It is an introductory book that is general, but smartly written.
– It is not a book to teach someone to become a digital librarian.
– LIS650 and LIS651 are for that. They really deal with the introduction to digital information.

• I also talk generally about understanding some digital contents.

definition

• An informal definition of a digital library is a “managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network.”

• “managed” is the key word here.

benefits of digital libraries

• The digital library brings the library to the user.

• Computer power is used for searching and browsing.

• Information can be shared.

• Information is easier to keep current.

• The information is always available.

• New forms of information become possible.

costs

• Non-digital libraries are very expensive.

• Digital libraries are also expensive. Many publishers charge more for online editions than for traditional print.

• However the cost of the infrastructure is dropping.

• And there are potentials for changes in the way information is supplied in digital libraries.

technical change

• Electronic storage is becoming cheaper than paper.

• Personal computer displays are becoming more pleasant to use.

• High-speed networks are becoming widespread.

• Computers have become portable.

libraries adapt

• Libraries get wired

• They offer electronic access, even to the home user.

• Other actions depend on the library type
– Some shift from information access to community center.
– Some adopt digital reference with 24/7 asynchronous help.
– Some get involved in digital archiving of institutional assets.

digital library cost

• The digital library material will cost more initially because publishers want to see a return on the extra functionality they have developed.

• In the longer run, digital library costs may be lower than in print
– lower storage cost
– less risk to the items
– fewer staff (but differently trained) requirements

classic roles for the library with digital material

• Investigation of what to buy

• Negotiation of the purchase

• Acquisition of access to a service

• Installation of access devices

• Training of users

• Maintenance: update, migrate, replace

beyond the library

• The classic roles will at best be a stagnating, if not declining, source of work for information professionals.

• The rise of open access will mean that fewer assets than before will have to be purchased. Today’s example: http://dme.mozarteum.at

• Training needs of users decline as digital media are getting easier to use.

new roles for information professionals

• The information age does not happen without information professionals.

• There is a huge demand for tech-savvy information professionals out there. Examples include
– web site maintenance
– digital archiving

impact of technology on staff

• Information professionals that are technologically savvy will thrive better than those who are not.

• Fortunately the Palmer School offers LIS508, LIS650, LIS651.

• It still does not have a system administration class, but that may come as well.

impact of technology on staff

• Constant computer use can cause serious health problems

• Problem areas are
– posture problems at the desk
– eye strain

• The use of the mouse is particularly bad. Learn how to avoid using it.

• Injuries take a long time to heal.

digital libraries are hard

• In digital libraries terminology is a bad problem. Basic concepts are hard to find.

• These definition problems also hurt efforts to build sophisticated information systems by semi-automated means.

• We live in the age of the brute-force calculation, not the age of artificial intelligence.

data and metadata

• Metadata is data about data. The distinction between data and metadata depends often on the context.

• Metadata is often divided into
– descriptive metadata
– structural metadata
– administrative metadata

what’s in the digital library?

• Items ?

• Material ?

• Documents ?

• Objects?

• Digital Items ?

• Digital Material ?

• Digital Documents ?

• Digital Objects ?

storage and dissemination

• Items are stored in digital format in a way we can call the stored form of the item.

• When the item is shown to the user, it is shown as a “presentation” or “dissemination”. This is the way the object leaves the server.

• When it arrives at the users’ machines, they have to “render” the presentation.

users and clients

• A user is someone who uses a digital library. Many times, the user is anonymous and can not be identified.

• A client is a piece of software that the user runs to use the digital library. Sometimes this is called a user agent. Many times people refer to it as a browser.

work and contents

• These are difficult things to discuss. Look at the example of the song “Der Lindenbaum”. It could mean
– song as sound and words
– score
– performance
– recording
– mp3 file containing the recording

repositories

• This is a general term used to talk about a computer system that has primarily the function of storing contents.

• When long-run storage is involved a repository becomes an archive.

• A server is a computer that is switched on constantly to provide services to the public.

an example of terminology

• “A data model is an abstraction (or an extra level of indirection) for digital objects such that each digital object can be seen as an instance of the class defined by the data model.”

• “A surrogate is a transmittable serialization or representation of a digital object that can be passed back and forth so we can do things with it. Possible serialization techniques include XML and RDF/XML.”

a digital library from scratch

• Much of the data that is stored in digital libraries is text.

• Most other material that is not textual in nature, such as
– sound files
– graphics
needs textual metadata in order to be found. Current technology is not able to find it otherwise.

Information

• Information is best understood as “what it takes to answer a question”.

• The simplest question has a “yes” or “no” answer. Therefore a bit is the natural measure of information.

• The term “bit” was first used by John Tukey in 1946.

• It is a contraction of “binary digit”.

Usage of bits

• Computers are sometimes classified by the number of bits they can process at one time. "32 bit processor"

• Graphics are also often described by the number of bits used to represent each dot.

bits and bytes

• a bit can take the values 0 or 1, thus it can describe 2 possibilities

• two bits can take the values 00, 01, 10, 11, thus they can describe 2×2 = 4 possibilities

• n bits can encode 2 to the power n possibilities.

• The first chips used to process 8 bits at a time. It became customary to refer to a group of 8 bits as a byte. A byte can encode 2 to the power 8, i.e. 256, possibilities.

• We can use binary numbers just as decimal numbers.
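The counting argument on this slide can be checked with a short Python sketch. The helper name `possibilities` is an assumption made for illustration and does not come from the lecture.

```python
def possibilities(n_bits):
    """Number of distinct values that n_bits bits can encode."""
    return 2 ** n_bits

# All patterns that two bits can take, written as binary strings.
two_bit_patterns = [format(i, "02b") for i in range(possibilities(2))]

# Binary numbers work like decimal numbers, only with base 2:
# 101 in binary means 1*4 + 0*2 + 1*1.
five = int("101", 2)

print(two_bit_patterns)   # ['00', '01', '10', '11']
print(possibilities(8))   # 256, the number of values in one byte
print(five, bin(five))    # 5 0b101
```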

application of bytes

• IP (Internet Protocol) numbers are used as the addresses of computers on the Internet.

• In IP version 4 (the one that is most commonly used), each IP number has 4 bytes.

• It is represented as x.x.x.x where x is a number between 0 and 255 (why?)

• How many computers can there be on the Internet at any one time?
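A possible way to explore the two questions above is sketched here with Python's standard `ipaddress` module. The address 192.0.2.17 is an arbitrary example chosen for illustration. The total is only an upper bound on 4-byte addresses; it ignores reserved ranges and address sharing.

```python
import ipaddress

# An IPv4 address is 4 bytes, i.e. 32 bits.
addr = ipaddress.IPv4Address("192.0.2.17")  # example address (assumption)
parts = list(addr.packed)                   # the 4 bytes as numbers

# One byte holds 2**8 = 256 values, so each x runs from 0 to 255.
max_per_part = 2 ** 8 - 1

# Upper bound on the number of distinct 4-byte addresses.
total_addresses = 2 ** 32

print(parts)             # [192, 0, 2, 17]
print(max_per_part)      # 255
print(total_addresses)   # 4294967296
```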

Many bytes

• Larger units are
– Kilo byte is 2 power 10 bytes (=1024 bytes)
– Mega byte is 2 power 20 bytes
– Giga byte is 2 power 30 bytes
– Tera byte is 2 power 40 bytes

• From ancient Greek words for "thousand", "large", "giant", and "monster", respectively. The prefix kilo dates back to the metric system of the French revolution; the others were adopted later.
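As a rough illustration, the unit sizes above can be computed as powers of 2. The dictionary `UNITS` and the function `bytes_in` are names made up for this sketch.

```python
# Exponent of 2 for each unit, as given on the slide.
UNITS = {"kilo": 10, "mega": 20, "giga": 30, "tera": 40}

def bytes_in(unit):
    """Number of bytes in one kilo, mega, giga or tera byte."""
    return 2 ** UNITS[unit]

for name in UNITS:
    print(name, bytes_in(name))
```

Each unit is 1024 times the previous one, since the exponent grows by 10 at each step.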

Hex numbers

• A byte is often represented by two hex digits.

• Each hex digit can encode 16 values

• Written 0 to 9, then A B C D E F. F is 15.

• Conventionally prefixed with 0x

• Use the Microsoft calculator in scientific mode to convert.
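The conversion can also be done in a few lines of Python, shown here as a minimal sketch. The function names `byte_to_hex` and `hex_to_int` are assumptions for illustration.

```python
def byte_to_hex(value):
    """Return a byte value (0 to 255) as two hex digits."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return format(value, "02X")

def hex_to_int(text):
    """Convert hex text, with or without the 0x prefix, to a number."""
    return int(text, 16)

print(hex_to_int("F"))      # 15
print(byte_to_hex(255))     # FF
print(hex_to_int("0xA9"))   # 169
print(byte_to_hex(8))       # 08
```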

applications of hex numbers

• Media Access Control (MAC) addresses of hardware that allows access to computer networks. They are 6-byte numbers, each byte written as 2 hex digits, e.g. 00:60:08:F5:20:A9

• Character numbers that you see when you are inserting a special symbol in Microsoft software, e.g. PowerPoint.

• Color codes on web pages use 6 hex digits.
– 000000 is black
– FFFFFF is white
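To make the two notations concrete, here is a small sketch that decodes the MAC address from the slide and the two color codes. The helper names are hypothetical. Reading the 6 hex digits of a color as red, green and blue bytes is the usual web convention, which the slide does not spell out.

```python
def mac_to_bytes(mac):
    """Turn a MAC address such as 00:60:08:F5:20:A9 into its 6 bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))

def color_to_rgb(code):
    """Split a 6-digit hex color code into (red, green, blue) values."""
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

print(list(mac_to_bytes("00:60:08:F5:20:A9")))  # [0, 96, 8, 245, 32, 169]
print(color_to_rgb("000000"))                   # (0, 0, 0), black
print(color_to_rgb("FFFFFF"))                   # (255, 255, 255), white
```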

Information in a computer file

• A file is a piece of data stored on a computer.

• Any file contains a sequence of 0s and 1s, like 1010100101010011110101010101…

• For a computer to make sense of a file, it has to know what type of file it is.

executable files

• Files that are executable are files that make the computer do something. For example the file starts a program, say powerpoint. An executable on one computer may not run on another one.

• Non-executable files hold data that is used by an executable file. We will call them data files. Example: powerpoint slides file.

Characters

• Much of the information processed by computers is in the form of characters.

• From Wikipedia
– A character is a unit of information that roughly corresponds to a grapheme, or written symbol, of a natural language, such as a letter, numeral, or punctuation mark.

• A character is not a grapheme because there are ligatures.

control characters

• The concept also includes control characters, which do not correspond to natural language symbols but to other bits of information used to process texts of the language, such as instructions to printers or other devices that display such texts.

• An example for such a control character is the newline character.

text files

• Many data files contain textual data.

• Textual data is a sequence of characters.

• A character is an elementary symbol that has some meaning
– alphabet letter
– hieroglyph

• Example: email file

• Text files can be read by many computer programs.

non-text files

• Examples for non-text files are
– graphics files
– movie files
– sound files

• Non-text files are of minor significance in library settings
– There is no way to organize information retrieval for non-text files. They have to be retrieved using a textual surrogate.
– Traditional library material is textual.

• We will talk about this later.

Representing characters

• Computers don't understand text, they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a character set.

• Examples for characters are
– a
– c
– ë
– €
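The correspondence between numbers and characters can be looked at directly in Python. This is only a sketch: Python strings use Unicode code points, a character set discussed later in this lecture, so the numbers shown are Unicode numbers. The last two characters are written as escapes to avoid any doubt about how this text itself is encoded.

```python
# The four example characters: a, c, e with diaeresis, euro sign.
characters = ["a", "c", "\u00eb", "\u20ac"]

# ord() gives the number that stands for a character ...
numbers = [ord(ch) for ch in characters]

# ... and chr() goes back from the number to the character.
round_trip = [chr(n) for n in numbers]

print(numbers)                    # [97, 99, 235, 8364]
print(round_trip == characters)   # True
```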

Legacy character sets

• In early days, computers were a lot less powerful than they are today.

• They could only deal with the characters that are most commonly used.

• Such sets are
– ASCII
– ISO-8859-1
– cp1252

ASCII

• American Standard Code for Information Interchange

• 7-bit character set. There is no such thing as 8-bit ASCII

• 95 printable symbols

• 33 control characters (0-31, 127)

• http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127

some ASCII control characters

• CR (13, ^M) is the carriage return

• LF (10, ^J) is the linefeed

• FF (12, ^L) is the form feed (new page)

• BS (8, ^H) is the backspace

• DEL (127, ALT-127) is delete

• ESC (27, ^[) escape
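The numbers on the two ASCII slides can be reproduced with a short sketch. The names `CONTROL` and `caret_code` are made up for illustration. The caret notation (^M, ^J and so on) is computed by flipping one bit of the code of the character after the caret, which is the usual convention.

```python
# Codes of the control characters listed above.
CONTROL = {"BS": 8, "LF": 10, "FF": 12, "CR": 13, "ESC": 27, "DEL": 127}

def caret_code(symbol):
    """Code of the control character written ^symbol in caret notation."""
    return ord(symbol) ^ 64

# ASCII is a 7-bit set, so the codes run from 0 to 127.
control_codes = [c for c in range(128) if c < 32 or c == 127]
printable_codes = [c for c in range(128) if 32 <= c <= 126]

print(caret_code("M"), caret_code("J"), caret_code("["))   # 13 10 27
print(len(control_codes), len(printable_codes))           # 33 95
print("\n" == chr(CONTROL["LF"]), "\r" == chr(CONTROL["CR"]))  # True True
```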

ISO-8859-1

• ISO-8859-1, aka ISO-latin-1 extends ASCII with characters that are commonly used by the western European languages.

• It is the default character set of html.

• Positions 128 to 159 are not used.

• Cp1252 fills these with graphic chars. It is a Microsoft character set.
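The difference between the two sets shows up when the same bytes are decoded both ways. The sketch below uses Python's built-in `iso-8859-1` and `cp1252` codecs. One caveat: Python's `iso-8859-1` codec maps positions 128 to 159 to invisible control characters instead of rejecting them.

```python
# Two bytes: 0xE9 lies in the shared range, 0x80 lies in the range 128 to 159.
data = bytes([0xE9, 0x80])

as_latin1 = data.decode("iso-8859-1")
as_cp1252 = data.decode("cp1252")

# Positions 160 to 255 mean the same thing in both sets.
same_upper_range = all(
    bytes([b]).decode("iso-8859-1") == bytes([b]).decode("cp1252")
    for b in range(160, 256)
)

print(repr(as_latin1))    # 'é\x80'  (0x80 is not a graphic character)
print(repr(as_cp1252))    # 'é€'     (cp1252 puts the euro sign at 0x80)
print(same_upper_range)   # True
```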

This is not enough

• There are around 6800 different languages around.

• Some of these languages use character sets that are not finite, i.e. folks can make up new characters out of existing ones!

• Setting up a character set for all languages is almost impossible.

ISO 10646-1

• Defines the Universal Character Set (UCS)

• UCS contains the characters required to represent characters used by many known languages, even the likes of Oriya, Telugu, Bopomofo, Runic.

• ISO 10646 defines formally a 31-bit character set. They are represented as 32 bits, i.e. 4 bytes, or 8 hex chars.

• Not finished.
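The 4-byte, 8-hex-digit representation can be illustrated in Python. This sketch uses the `utf-32-be` codec as a stand-in for the 4-byte form described above; that choice is an assumption of the example, not something stated in the lecture. The function name `four_byte_hex` is hypothetical.

```python
def four_byte_hex(ch):
    """Write one character as 4 bytes, shown as 8 hex digits."""
    return ch.encode("utf-32-be").hex()

print(four_byte_hex("a"))        # 00000061
print(four_byte_hex("\u20ac"))   # 000020ac, the euro sign
print(four_byte_hex("\u16a0"))   # 000016a0, a Runic letter
```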


Unicode

• ISO is an inter-government agency. Slow and bureaucratic.

• Industry has come together to work on Unicode, originally designed as a 2-byte character set.

• With some minor exceptions, the Unicode characters are the same as the first 65536 characters in UCS.

• Much better documented standard.

Unicode and legacy sets

• The first 128 characters are identical to those in ASCII

• The next 128 characters are identical to ISO 8859-1 (Latin-1).

• Unicode is well documented and the Unicode book can be downloaded from the Internet. A must-have for the serious digital librarian.
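The first two points can be checked mechanically, as in this sketch: decode every byte value with the legacy codec and compare the result with the Unicode character of the same number. It relies on Python's `ascii` and `iso-8859-1` codecs; the variable names are made up.

```python
# Positions 0 to 127: ASCII against the Unicode character with the same number.
ascii_matches = all(
    bytes([i]).decode("ascii") == chr(i) for i in range(128)
)

# Positions 0 to 255: ISO 8859-1 against Unicode.
latin1_matches = all(
    bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256)
)

print(ascii_matches, latin1_matches)   # True True
```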

Beyond characters

• There is more to text than a string of characters.

• There is layout
– titles
– abstracts
– mathematical formula spacing

Layout

• Layout can be conveyed by additional text that has special meaning. Examples
– LaTeX
– HTML
– PostScript

• Another way is to do non-textual layout by adding some other digital signals. Examples
– DVI
– MS Word
– MS Powerpoint
These can not be shown in these slides!

Example: LaTeX

\bigskip\textbf{Class structure}

Classes will be held in the computer lab in the Palmer School between 18:15 and 20:45. An optional practice session will last until 21:15.

\begin{tabular}{@{}llll@{}}

0&2006--09--12&introduction to the course &\\

1&2006--09--19&libraries and food &\\

2&2006--09--26&introduction to shushing &\\

Example: HTML

<p><strong>Class structure</strong><p>Classes will be held in the computer lab in the Palmer School between 18:15 and 20:45. An optional practice session will last until 21:15.<p>Class details:

<p><center><table width=100% border=1>

<tr><td align=left> 0 </td><td align=left> 2006&#8211;09&#8211;12 </td><td align=left><a href="lis510w06a-00.ppt">introduction to the course</a> </td></tr><tr><td align=left> 1 </td><td align=left> 2006&#8211;09&#8211;19 </td><td align=left><a href="lis510w06a-01.ppt">libraries and food</a> </td>

Example: PostScript

Fc(Class)g(structur)o(e)-104 3956 y Fd(Classes)26b(will)g(be)e(held)g(in)h(the)f(computer)f(lab)i(in)f(the)h(P)o(almer)f(School)g(between)f(18:15)h(and)g(20:45.)36 b(An)25 b(optional)e(practice)h(session)-104 4055 y(will)d(last)g(until)f(21:15.)-104 4155 y(Class)i(details:)-104 4307 y(0)141 b(2003\22609\22623)94b(introduction)18 b(to)i(the)h(course)-104 4407 y(1)141 b(2002\22609\22630)94 b(bits)21 b(bytes)f(and)g(characters)-104 4507 y(2)141 b(2003\22610\22607)94 b(databases)20 b(and)g(markup)e(languages)-

DVI (rendition, "class structure")

1659: fntnum27 current font is ptmb8t
1660: setchar67 h:=-820459+473168=-347291, hh:=-22
1661: setchar108 h:=-347291+182183=-165108, hh:=-10
1662: setchar97 h:=-165108+327680=162572, hh:=11
1663: setchar115 h:=162572+254928=417500, hh:=27
1664: setchar115 h:=417500+254928=672428, hh:=43
1665: right3 163840 h:=672428+163840=836268, hh:=53
1669: setchar115 h:=836268+254928=1091196, hh:=69
1670: setchar116 h:=1091196+218232=1309428, hh:=83
1671: setchar114 h:=1309428+290976=1600404, hh:=101
1672: setchar117 h:=1600404+364376=1964780, hh:=124
1673: setchar99 h:=1964780+290976=2255756, hh:=142
1674: setchar116 h:=2255756+218232=2473988, hh:=156
1675: setchar117 h:=2473988+364376=2838364, hh:=179
1676: setchar114 h:=2838364+290976=3129340, hh:=197

XML

• XML is the extensible markup language. It has become the lingua franca for structured textual data.

• It is also increasingly used on the web.

Databases

• Databases are collections of data with some organization to them.

• The classic example is the relational database.

• But not all databases need to be relational databases.

Relational databases

• A relational database is a set of tables. There may be relations between the tables.

• Each table has a number of records. Each record has a number of fields.

• When the database is being set up, we fix
– the size of each field
– relationships between tables

Example: Movie database

ID | title | director | date

M1 | Gone with the wind | F. Ford Coppola | 1963

M2 | Room with a view | Coppola, F Ford | 1985

M3 | High Noon | Woody Allan | 1974

M4 | Star Wars | Steve Spielberg | 1993

M5 | Alien | Allen, Woody | 1987

M6 | Blowing in the Wind | Spielberg, Steven | 1962

• Single table

• No relations between tables, of course

Problem with this database

• All the data is wrong, but this is just for illustration.

• Names are recorded inconsistently. There is no way to find films by Woody Allan without having to go through all spelling variations.

• Mistakes are difficult to correct. We have to wade through all records, a masochist’s pleasure.

Better movie database

ID | title | director | year

M1 | Gone with the wind | D1 | 1963

M2 | Room with a view | D1 | 1985

M3 | High Noon | D2 | 1974

M4 | Star Wars | D3 | 1993

M5 | Alien | D2 | 1987

M6 | Blowing in the Wind | D3 | 1962

ID | director name | birth year

D1 | Ford Coppola, Francis | 1942

D2 | Allan, Woody | 1957

D3 | Spielberg, Steven | 1942

Relational database

• We have a one-to-many relationship between directors and films
– Each film has one director
– Each director has directed many films

• Here it becomes possible for the computer
– To know which films have been directed by Woody Allen
– To find which films have been directed by a director born in 1942
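The two tables and the two lookups can be tried out with SQLite, which ships with Python. This is a sketch under assumptions: the table names, column names and function names are invented for the example, and the rows are the deliberately wrong illustration data from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (id TEXT PRIMARY KEY, title TEXT,
                    director_id TEXT REFERENCES director(id), year INTEGER);
""")
conn.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974),
    ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987),
    ("M6", "Blowing in the Wind", "D3", 1962),
])

def films_by_director(director_id):
    """Titles of all films that point to one director record."""
    rows = conn.execute(
        "SELECT title FROM movie WHERE director_id = ? ORDER BY id",
        (director_id,))
    return [row[0] for row in rows]

def films_by_birth_year(year):
    """Titles of films whose director was born in the given year."""
    rows = conn.execute(
        "SELECT movie.title FROM movie "
        "JOIN director ON movie.director_id = director.id "
        "WHERE director.birth_year = ? ORDER BY movie.id",
        (year,))
    return [row[0] for row in rows]

print(films_by_director("D2"))
print(films_by_birth_year(1942))
```

Because each director name is stored once, a spelling mistake only has to be corrected in one record.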

Many-to-many relationships

• Each film has one director, but many actors star in it. Relationship between actors and films is a many to many relationship.

• Here are a few actors

ID | sex | actor name | birth year

A1 | f | Brigitte Bardot | 1972

A2 | m | George Clooney | 1927

A3 | f | Marilyn Monroe | 1934

Actor/Movie table

actor id | movie id

A1 | M4

A2 | M3

A3 | M2

A1 | M5

A1 | M3

A2 | M6

A3 | M4

… as many lines as required

SQL

• Once we have the relational database, we can ask sophisticated questions:
– Which director has had the most female actors working for him?
– In which years were films shot that starred actors born between 1926 and 1935?

• Such questions can be encoded in a language known as “structured query language” or SQL. All relational database vendors implement a dialect of SQL.
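As an illustration, the two questions can be written in the SQLite dialect of SQL and run from Python. Everything schema-related here (table names, column names, the `actor_movie` link table name) is an assumption of this sketch; the rows are the illustration data from the earlier slides, limited to the listed actor/movie pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (id TEXT PRIMARY KEY, title TEXT, director_id TEXT, year INTEGER);
CREATE TABLE actor (id TEXT PRIMARY KEY, sex TEXT, name TEXT, birth_year INTEGER);
CREATE TABLE actor_movie (actor_id TEXT, movie_id TEXT);
""")
conn.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974),
    ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987),
    ("M6", "Blowing in the Wind", "D3", 1962),
])
conn.executemany("INSERT INTO actor VALUES (?, ?, ?, ?)", [
    ("A1", "f", "Brigitte Bardot", 1972),
    ("A2", "m", "George Clooney", 1927),
    ("A3", "f", "Marilyn Monroe", 1934),
])
conn.executemany("INSERT INTO actor_movie VALUES (?, ?)", [
    ("A1", "M4"), ("A2", "M3"), ("A3", "M2"), ("A1", "M5"),
    ("A1", "M3"), ("A2", "M6"), ("A3", "M4"),
])

# Which director has had the most (distinct) female actors?
most_female = conn.execute("""
    SELECT d.name, COUNT(DISTINCT a.id) AS n
    FROM director d
    JOIN movie m ON m.director_id = d.id
    JOIN actor_movie am ON am.movie_id = m.id
    JOIN actor a ON a.id = am.actor_id
    WHERE a.sex = 'f'
    GROUP BY d.id, d.name
    ORDER BY n DESC
    LIMIT 1
""").fetchone()

# In which years were films shot that starred actors born 1926 to 1935?
years = [row[0] for row in conn.execute("""
    SELECT DISTINCT m.year
    FROM movie m
    JOIN actor_movie am ON am.movie_id = m.id
    JOIN actor a ON a.id = am.actor_id
    WHERE a.birth_year BETWEEN 1926 AND 1935
    ORDER BY m.year
""")]

print(most_female)   # ('Spielberg, Steven', 2)
print(years)         # [1962, 1974, 1985, 1993]
```

The link table is what carries the many-to-many relationship: each of its rows pairs one actor with one film.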

databases in libraries

• Relational databases dominate the world of structured data.

• But they are not so popular in libraries
– They are slow on very large databases (such as catalogs).
– Library data has nasty ad-hoc relationships, e.g.
  • translation of the first edition of a book
  • CD supplement that comes with the print version
– Such relationships are difficult to deal with in a system where all relations and fields have to be set up at the start and cannot be changed easily later.

http://openlib.org/home/krichel

Thank you for your attention!
