LIS510 lecture 2

LIS510 lecture 12
Thomas Krichel
2006-12-13
today
• Leftovers from last time.
• I discuss some elements of Bill Arms’ book
on Digital Libraries.
– It is an introductory book that is general, but
smartly written.
– It is not a book to teach someone to become a
digital librarian.
– LIS650 and LIS651 are for that. They really
deal with the introduction to digital information.
• I also talk generally about understanding
some digital contents.
definition
• An informal definition of a digital library is a
“managed collection of information, with
associated services, where the information
is stored in digital formats and accessible
over a network.”
• “managed” is the key word here.
benefits of digital libraries
• The digital library brings the library to the
user.
• Computer power is used for searching and
browsing.
• Information can be shared.
• Information is easier to keep current.
• The information is always available.
• New forms of information become possible.
costs
• Non-digital libraries are very expensive.
• Digital libraries are also expensive. Many
publishers charge more for online editions
than for traditional print.
• However the cost of the infrastructure is
dropping.
• And there are potentials for changes in the
way information is supplied in digital
libraries.
technical change
• Electronic storage is becoming cheaper than
paper.
• Personal computer displays are becoming
more pleasant to use.
• High-speed networks are becoming
widespread.
• Computers have become portable.
libraries adapt
• Libraries get wired
• They offer electronic access, even to the
home user.
• Other actions depend on the library type
– Some shift from information access to
community center.
– Some adopt digital reference with 24/7
asynchronous help.
– Some get involved in digital archiving of
institutional assets.
digital library cost
• Digital library material will cost more
initially because publishers want to see a
return on the extra functionality they have
developed.
• In the longer run, digital library costs may
be lower than in print
– lower storage cost
– less risk to the items
– fewer (but differently trained) staff required
classic roles for the library with
digital material
• Investigation of what to buy
• Negotiation of the purchase
• Acquisition of access to a service
• Installation of access devices
• Training of users
• Maintenance: update, migrate, replace
beyond the library
• The classic roles will be at best a stagnating, if
not declining, source of work for information
professionals.
• The rise of open access will mean that fewer
assets than before will have to be
purchased. Today’s example:
http://dme.mozarteum.at
• Training needs of users decline as digital
media are getting easier to use.
new roles for information professionals
• The information age does not happen
without information professionals.
• There is a huge demand for tech-savvy
information professionals out there.
Examples include
– web site maintenance
– digital archiving
impact of technology on staff
• Information professionals that are
technologically savvy will thrive better than
those who are not.
• Fortunately the Palmer School offers
LIS508, LIS650, LIS651.
• It still does not have a system
administration class, but that may come as
well.
impact of technology on staff
• Constant computer use can cause serious
health problems
• Problem areas are
– bad posture at the desk
– eye strain
• Use of the mouse is particularly bad. Learn
how to avoid using it.
• Injuries take a long time to heal.
digital libraries are hard
• In digital libraries, terminology is a serious
problem. Basic concepts are hard to define.
• These definition problems also hurt efforts
to build sophisticated information systems
by semi-automated means.
• We live in the age of the brute-force
calculation, not the age of artificial
intelligence.
data and metadata
• Metadata is data about data. The distinction
between data and metadata often depends
on the context.
• Metadata is often divided into
– descriptive metadata
– structural metadata
– administrative metadata
what’s in the digital library?
• Items ?
• Material ?
• Documents ?
• Objects ?
• Digital Items ?
• Digital Material ?
• Digital Documents ?
• Digital Objects ?
storage and dissemination
• Items are stored in a digital format. We
can call this the stored form of the item.
• When the item is shown to the user, it is
shown as a “presentation” or
“dissemination”. This is the way the object
leaves the server.
• When it arrives at the user’s machine, the
machine has to “render” the presentation.
users and clients
• A user is someone who uses a digital
library. Many times, the user is anonymous
and cannot be identified.
• A client is software that the user runs to
use the digital library. Sometimes this is
called a user agent. Most people refer to
it as a browser.
work and contents
• These are difficult things to discuss. Look at
the example of the song “Der Lindenbaum”.
It could mean
– song as sound and words
– score
– performance
– recording
– mp3 file containing the recording
repositories
• This is a general term used to talk about a
computer system whose primary function
is storing contents.
• When long-run storage is involved, a
repository becomes an archive.
• A server is a computer that is switched on
constantly to provide services to the public.
an example of terminology
• “A data model is an abstraction (or an extra
level of indirection) for digital objects such
that each digital object can be seen as an
instance of the class defined by the data
model.”
• “A surrogate is a transmittable serialization
or representation of a digital object that can
be passed back and forth so we can do
things with it. Possible serialization
techniques include XML and RDF/XML.”
a digital library from scratch
• Much of the data that is stored in digital
libraries is text.
• Most other material that is not textual in
nature, such as
– sound files
– graphics
needs textual metadata in order to be found.
Current technology is not able to find it
otherwise.
Information
• Information is best understood as “what it
takes to answer a question”.
• The simplest question has a “yes” or “no”
answer. Therefore a bit is the natural
measure of information.
• Term first used by John Tukey in 1946.
• Contraction of “binary digit”.
Usage of bits
• Computers are sometimes classified by the
number of bits they can process at one
time. "32 bit processor"
• Graphics are also often described by the
number of bits used to represent each dot.
bits and bytes
• A bit can take the values 0 or 1, thus it can
describe 2 possibilities.
• Two bits can take the values 00, 01, 10, 11, thus
they can describe 2×2 = 4 possibilities.
• n bits can encode 2 to the power n possibilities.
• Early chips processed 8 bits at a time. It
became customary to refer to a group of 8 bits as
a byte. A byte can encode 2 to the power 8 = 256
possibilities.
• We can use binary numbers just like decimal
numbers.
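The counting rule for bits can be checked with a small Python sketch. The function name `bit_patterns` is made up for this illustration; it simply lists every pattern that n bits can take.

```python
from itertools import product

def bit_patterns(n):
    """Return every pattern that n bits can take, as strings of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n)]

for n in (1, 2, 8):
    # The number of patterns always equals 2 to the power n.
    print(n, "bits:", len(bit_patterns(n)), "possibilities")

# Binary numbers work like decimal numbers, only with base 2.
ten = int("1010", 2)      # read the binary digits 1010 as a number
ten_in_binary = bin(10)   # write the number ten in binary
print(ten, ten_in_binary)
```

Listing all patterns is only practical for small n, since the list doubles in length with every extra bit.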
application of bytes
• IP (Internet Protocol) numbers are used as
the addresses of computers on the Internet.
• In IP version 4 (the one that is most
commonly used), each IP number has 4
bytes.
• It is represented as x.x.x.x where x is a
number between 0 and 255 (why?)
• How many computers can there be on the
Internet at any one time?
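The two questions on this slide come down to byte arithmetic, which the following Python sketch works through. The address 192.168.0.1 is an arbitrary example, not one from the lecture, and the total is only an upper bound because some address ranges are reserved.

```python
import ipaddress

# Each of the four parts is one byte, so it ranges from 0 to 2**8 - 1.
largest_part = 2 ** 8 - 1

# Four bytes are 32 bits, which bounds the number of distinct addresses.
total_addresses = 2 ** 32

# An arbitrary example address, seen as 4 raw bytes and as one number.
address = ipaddress.IPv4Address("192.168.0.1")
as_bytes = list(address.packed)
as_number = int(address)

print(largest_part, total_addresses)
print(as_bytes, as_number)
```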
Many bytes
• Larger units are
– Kilobyte is 2 to the power 10 bytes (= 1024 bytes)
– Megabyte is 2 to the power 20 bytes
– Gigabyte is 2 to the power 30 bytes
– Terabyte is 2 to the power 40 bytes
• From ancient Greek words for "thousand",
"large", "giant", and "monster", respectively.
Kilo dates back to the French revolution; the
other prefixes were adopted later.
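As a quick check of the sizes, here is a minimal Python sketch using the powers of two given on this slide. The names `UNITS` and `bytes_in` are made up for the illustration.

```python
# Powers of two for each unit, as given on the slide.
UNITS = {"kilo": 10, "mega": 20, "giga": 30, "tera": 40}

def bytes_in(unit):
    """Return the number of bytes in one kilo-, mega-, giga- or terabyte."""
    return 2 ** UNITS[unit]

for unit in UNITS:
    print(unit + "byte:", bytes_in(unit), "bytes")
```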
Hex numbers
• A byte is often represented by two hex
digits.
• Each hex digit can encode 16 values.
• Written 0 to 9, then A B C D E F. F is 15.
• Conventionally prefixed with 0x.
• Use the Microsoft calculator in scientific
mode to convert.
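Conversion can also be done in a few lines of Python instead of a calculator. This is a sketch only; the helper names `byte_to_hex` and `hex_to_number` are invented for the example.

```python
def byte_to_hex(value):
    """Write one byte (0 to 255) as two hex digits with the 0x prefix."""
    if not 0 <= value <= 255:
        raise ValueError("a byte must be between 0 and 255")
    return "0x{:02X}".format(value)

def hex_to_number(text):
    """Read a hex string, with or without the 0x prefix, as a number."""
    return int(text, 16)

print(byte_to_hex(255), byte_to_hex(10))
print(hex_to_number("F"), hex_to_number("0xFF"))
```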
applications of hex numbers
• Media Access Control (MAC) addresses of
hardware that allows access to computer
networks. They are 6-byte numbers, each
byte written as 2 hex digits, e.g.
00:60:08:F5:20:A9
• Character numbers that you see when you
are inserting a special symbol in Microsoft
software, e.g. PowerPoint.
• Color codes on web pages use 6 hex digits.
– 000000 is black
– FFFFFF is white
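Both the MAC address and the color codes above are just bytes written in hex, as the following Python sketch shows. The function names are made up for this example, and reading the three color bytes as red, green and blue is the usual web convention rather than something stated on the slide.

```python
def mac_to_bytes(mac):
    """Split a MAC address like 00:60:08:F5:20:A9 into its 6 byte values."""
    return [int(part, 16) for part in mac.split(":")]

def color_to_rgb(code):
    """Split a 6-digit hex color code into three byte values."""
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

print(mac_to_bytes("00:60:08:F5:20:A9"))
print(color_to_rgb("000000"), color_to_rgb("FFFFFF"))
```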
Information in a computer file
• A file is a piece of data stored on a
computer.
• Any file contains a sequence of 0s and 1s,
like 1010100101010011110101010101…
• For a computer to make sense of a file, it
has to know what type of file it is.
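To see the 0s and 1s inside a file, one can read it byte by byte and print each byte as 8 bits. The sketch below writes a tiny throwaway file containing the two letters "Hi" first; the file contents and the name `file_bits` are assumptions made for the illustration.

```python
import os
import tempfile

def file_bits(path):
    """Return the contents of a file as one string of 0s and 1s."""
    with open(path, "rb") as handle:
        data = handle.read()
    return "".join("{:08b}".format(byte) for byte in data)

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Hi")
    path = tmp.name

bits = file_bits(path)
os.remove(path)
print(bits)
```

Nothing in the bit string itself says that these two bytes are letters; that depends on knowing the file type.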
executable files
• Executable files are files that make
the computer do something. For example,
an executable file starts a program, say
PowerPoint. An executable on one computer
may not run on another one.
• Non-executable files hold data that is used
by an executable file. We will call them data
files. Example: a PowerPoint slides file.
Characters
• Much of the information processed by
computers is in the form of characters.
• From Wikipedia:
– A character is a unit of information that roughly
corresponds to a grapheme, or written symbol,
of a natural language, such as a letter,
numeral, or punctuation mark.
• A character is not the same as a grapheme,
because there are ligatures, where one written
symbol combines several letters.
control characters
• The concept also includes control
characters, which do not correspond to
natural language symbols but to other bits
of information used to process texts of the
language, such as instructions to printers or
other devices that display such texts.
• An example of such a control character is
the newline character.
text files
• Many data files contain textual data.
• Textual data is a sequence of characters.
• A character is an elementary symbol that
has some meaning
– alphabet letter
– hieroglyph
• Example: email file
• Text files can be read by many computer
programs.
non-text files
• Examples of non-text files are
– graphics files
– movie files
– sound files
• Non-text files are of minor significance in
library settings
– There is no way to organize information
retrieval for non-text files. They have to be
retrieved using a textual surrogate.
– Traditional library material is textual.
• We will talk about this later.
Representing characters
• Computers don't understand text; they only
understand numbers. For computers to be able to
treat text, there must be a correspondence
between numbers and text characters. Such a
correspondence is called a character set.
• Examples of characters are
– a
– c
– ë
– €
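The correspondence between characters and numbers can be inspected directly. The Python sketch below uses the built-in `ord` and `chr`, which follow the Unicode numbering, one particular character set among those covered in this lecture. The last two characters are written as escapes so the example does not depend on how the source file is saved.

```python
# The four example characters: a, c, e with diaeresis, euro sign.
examples = ["a", "c", "\u00eb", "\u20ac"]

# ord() gives the number assigned to a character, chr() goes back.
code_points = {ch: ord(ch) for ch in examples}

for ch, number in code_points.items():
    print(ch, number, hex(number), chr(number) == ch)
```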
Legacy character sets
• In the early days, computers were a lot less
powerful than they are today.
• They could only deal with the characters that
are most commonly used.
• Such sets are
– ASCII
– ISO-8859-1
– cp1252
ASCII
• American Standard Code for Information
Interchange
• 7-bit character set. There is no such thing
as 8-bit ASCII.
• 95 printable symbols
• 33 control characters (0-31, 127)
• http://www.ccmr.cornell.edu/helpful_data/ascii2.html
has a list up to 127
some ASCII control characters
• CR (13, ^M) is the carriage return
• LF (10, ^J) is the linefeed
• FF (12, ^L) is the form feed (new page)
• BS (8, ^H) is the backspace
• DEL (127, ALT-127) is delete
• ESC (27, ^[) is escape
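The numbers in this list, and the counts of 95 printable symbols and 33 control characters, can be verified with a short Python sketch. The caret notation follows a simple rule: flip bit 0x40 of the letter after the caret. The names `caret_code` and `is_ascii_control` are invented for this example.

```python
def caret_code(symbol):
    """Return the control code written as ^symbol, e.g. ^M is 13."""
    return ord(symbol) ^ 0x40

def is_ascii_control(code):
    """ASCII control characters are positions 0 to 31, plus 127."""
    return 0 <= code <= 31 or code == 127

controls = [c for c in range(128) if is_ascii_control(c)]
printable = [c for c in range(128) if not is_ascii_control(c)]

print(len(controls), "control characters,", len(printable), "printable")
for symbol in "MJLH[":
    print("^" + symbol, "is", caret_code(symbol))
```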
ISO-8859-1
• ISO-8859-1, aka ISO-latin-1, extends ASCII
with characters that are commonly used by
the western European languages.
• It is the default character set of HTML.
• Positions 128 to 159 are not used.
• Cp1252 fills these with graphic chars. It is
a Microsoft character set.
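The difference between the two sets shows up with the euro sign, which cp1252 places in the range that ISO-8859-1 leaves unused. The sketch below uses the codecs that ship with Python under the names "iso-8859-1" and "cp1252".

```python
e_diaeresis = "\u00eb"   # the letter e with diaeresis
euro = "\u20ac"          # the euro sign

# Both sets put the accented letter at the same position, 235 (0xEB).
latin1_byte = e_diaeresis.encode("iso-8859-1")
cp1252_byte = e_diaeresis.encode("cp1252")

# cp1252 places the euro sign at position 128 (0x80).
euro_cp1252 = euro.encode("cp1252")

# ISO-8859-1 has no position for the euro sign at all.
try:
    euro.encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(latin1_byte, cp1252_byte, euro_cp1252, euro_in_latin1)
```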
This is not enough
• There are around 6800 different languages
around.
• Some of these languages use character
sets that are not finite, i.e. folks can make
up new characters out of existing ones!
• Setting up a character set for all languages
is almost impossible.
ISO 10646-1
• Defines the Universal Character Set (UCS).
• UCS contains the characters required to
represent many known languages, even the
likes of Oriya, Telugu, Bopomofo, Runic.
• ISO 10646 formally defines a 31-bit
character set. Characters are represented in
32 bits, i.e. 4 bytes, or 8 hex digits.
• Not finished.
Unicode
• ISO is an inter-government agency. Slow
and bureaucratic.
• Industry has come together to work on
Unicode, originally a 2-byte character set.
• With some minor exceptions, the Unicode
characters are the same as the first 65536
characters in UCS.
• Much better documented standard.
Unicode and legacy sets
• The first 128 characters are identical to
those in ASCII
• The next 128 characters are identical to
ISO 8859-1 (Latin-1).
• Unicode is well documented and the
Unicode book can be downloaded from the
Internet. A must-have for the serious digital
librarian.
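The overlap between Unicode and the legacy sets can be tested mechanically: decoding each byte with a legacy codec should give the Unicode character with the same number. This Python sketch relies on the built-in "ascii" and "iso-8859-1" codecs; the variable names are made up.

```python
# Positions 0 to 127: ASCII and Unicode assign the same numbers.
ascii_match = all(
    ord(bytes([position]).decode("ascii")) == position
    for position in range(128)
)

# Positions 0 to 255: ISO 8859-1 and Unicode assign the same numbers.
latin1_match = all(
    ord(bytes([position]).decode("iso-8859-1")) == position
    for position in range(256)
)

print(ascii_match, latin1_match)
```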
Beyond characters
• There is more to text than a string of
characters.
• There is layout
– titles
– abstracts
– mathematical formula spacing
Layout
• Layout can be conveyed by additional text that
has special meaning. Examples
– LaTeX
– HTML
– PostScript
• Another way is to do non-textual layout by adding
some other digital signals. Examples
– DVI
– MS Word
– MS Powerpoint
These cannot be shown in these slides!
Example: LaTeX
\bigskip\textbf{Class structure}
Classes will be held in the computer lab in the
Palmer School between 18:15 and 20:45. An
optional practice session will last until 21:15.
\begin{tabular}{@{}llll@{}}
0&2006--09--12&introduction to the course &\\
1&2006--09--19&libraries and food &\\
2&2006--09--26&introduction to shushing &\\
Example: HTML
<p><strong>Class structure</strong><p>Classes will be
held in the computer lab in the Palmer School between
18:15 and 20:45. An optional practice session will last until
21:15.<p>Class details:
<p><center><table width=100% border=1>
<tr><td align=left> 0 </td><td align=left>
2006&#8211;09&#8211;12 </td><td align=left><a
href="lis510w06a-00.ppt">introduction to the course</a>
</td></tr><tr><td align=left> 1 </td><td align=left>
2006&#8211;09&#8211;19 </td><td align=left><a
href="lis510w06a-01.ppt">libraries and food</a> </td>
Example: PostScript
Fc(Class)g(structur)o(e)-104 3956 y
Fd(Classes)26b(will)g(be)e(held)g(in)h(the)f(com
puter)f(lab)i(in)f(the)h(P)o(almer)f(School)g(betwe
en)f(18:15)h(and)g(20:45.)36 b(An)25
b(optional)e(practice)h(session)-104 4055
y(will)d(last)g(until)f(21:15.)-104 4155
y(Class)i(details:)-104 4307 y(0)141
b(2003\22609\22623)94b(introduction)18
b(to)i(the)h(course)-104 4407 y(1)141
b(2002\22609\22630)94 b(bits)21
b(bytes)f(and)g(characters)-104 4507 y(2)141
b(2003\22610\22607)94 b(databases)20
b(and)g(markup)e(languages)-
DVI (rendition, "class structure")
1659: fntnum27 current font is ptmb8t
1660: setchar67 h:=-820459+473168=-347291, hh:=-22
1661: setchar108 h:=-347291+182183=-165108, hh:=-10
1662: setchar97 h:=-165108+327680=162572, hh:=11
1663: setchar115 h:=162572+254928=417500, hh:=27
1664: setchar115 h:=417500+254928=672428, hh:=43
1665: right3 163840 h:=672428+163840=836268, hh:=53
1669: setchar115 h:=836268+254928=1091196, hh:=69
1670: setchar116 h:=1091196+218232=1309428, hh:=83
1671: setchar114 h:=1309428+290976=1600404, hh:=101
1672: setchar117 h:=1600404+364376=1964780, hh:=124
1673: setchar99 h:=1964780+290976=2255756, hh:=142
1674: setchar116 h:=2255756+218232=2473988, hh:=156
1675: setchar117 h:=2473988+364376=2838364, hh:=179
1676: setchar114 h:=2838364+290976=3129340, hh:=197
XML
• XML is the Extensible Markup Language. It
has become the lingua franca for
structured textual data.
• It is also increasingly used on the web.
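To give an idea of what structured textual data in XML looks like, here is a small Python sketch that parses one record with the standard `xml.etree.ElementTree` module. The element and attribute names (`film`, `title`, `year`, `id`) are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# A made-up record; the markup labels each piece of the text.
RECORD = """<film id="M3">
  <title>High Noon</title>
  <year>1974</year>
</film>"""

root = ET.fromstring(RECORD)
film_id = root.get("id")             # attribute of the outer element
title = root.find("title").text      # text inside <title>
year = int(root.find("year").text)   # text inside <year>, as a number

print(film_id, title, year)
```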
Databases
• Databases are collections of data with some
organization to them.
• The classic example is the relational
database.
• But not all databases need to be relational
databases.
Relational databases
• A relational database is a set of tables.
There may be relations between the tables.
• Each table has a number of records. Each
record has a number of fields.
• When the database is being set up, we fix
– the size of each field
– relationships between tables
Example: Movie database
ID | title               | director          | date
M1 | Gone with the wind  | F. Ford Coppola   | 1963
M2 | Room with a view    | Coppola, F Ford   | 1985
M3 | High Noon           | Woody Allan       | 1974
M4 | Star Wars           | Steve Spielberg   | 1993
M5 | Alien               | Allen, Woody      | 1987
M6 | Blowing in the Wind | Spielberg, Steven | 1962
• Single table
• No relations between tables, of course
Problem with this database
• All the data is wrong, but this is just for
illustration.
• Names are recorded inconsistently. There is no
way to find films by Woody Allen without
having to go through all spelling variations.
• Mistakes are difficult to correct. We have to
wade through all records, a masochist’s
pleasure.
Better movie database
ID | title               | director | year
M1 | Gone with the wind  | D1       | 1963
M2 | Room with a view    | D1       | 1985
M3 | High Noon           | D2       | 1974
M4 | Star Wars           | D3       | 1993
M5 | Alien               | D2       | 1987
M6 | Blowing in the Wind | D3       | 1962

ID | director name         | birth year
D1 | Ford Coppola, Francis | 1942
D2 | Allan, Woody          | 1957
D3 | Spielberg, Steven     | 1942
Relational database
• We have a one-to-many relationship
between directors and films
– Each film has one director
– Each director may have directed many films
• Here it becomes possible for the computer
– To know which films have been directed by
Woody Allen
– To find which films have been directed by a
director born in 1942
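The two questions above can be put to a real relational database. The following is a sketch using SQLite through Python's standard `sqlite3` module, with the (deliberately wrong) data from the two tables. The table and column names are assumptions made for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (id TEXT PRIMARY KEY, title TEXT,
                    director_id TEXT REFERENCES director(id), year INTEGER);
""")
db.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
db.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974),
    ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987),
    ("M6", "Blowing in the Wind", "D3", 1962),
])

def films_by_director(director_id):
    """Titles of all films that point at one director record."""
    rows = db.execute(
        "SELECT title FROM movie WHERE director_id = ? ORDER BY id",
        (director_id,))
    return [row[0] for row in rows]

def films_by_birth_year(year):
    """Titles of films whose director was born in the given year."""
    rows = db.execute("""
        SELECT movie.title
        FROM movie JOIN director ON movie.director_id = director.id
        WHERE director.birth_year = ?
        ORDER BY movie.id""", (year,))
    return [row[0] for row in rows]

print(films_by_director("D2"))
print(films_by_birth_year(1942))
```

Because each director is stored once, a spelling mistake in a name has to be corrected in only one record.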
Many-to-many relationships
• Each film has one director, but many actors
star in it. The relationship between actors and
films is a many-to-many relationship.
• Here are a few actors
ID | sex | actor name      | birth year
A1 | f   | Brigitte Bardot | 1972
A2 | m   | George Clooney  | 1927
A3 | f   | Marilyn Monroe  | 1934
Actor/Movie table
actor id | movie id
A1       | M4
A2       | M3
A3       | M2
A1       | M5
A1       | M3
A2       | M6
A3       | M4
… as many lines as required
SQL
• Once we have the relational database, we
can ask sophisticated questions:
– Which director has had the most female actors
working for him?
– In which years were films shot that
starred actors born between 1926 and 1935?
• Such questions can be encoded in a
language known as “structured query
language” or SQL. All relational database
vendors implement a dialect of SQL.
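As an illustration, the two questions might be written as follows in the SQLite dialect, run through Python's standard `sqlite3` module. This is a sketch under assumptions: the table names (`director`, `movie`, `actor`, `actor_movie`) and column names are invented here, and the rows are the illustrative data from the slides.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE director (id TEXT PRIMARY KEY, name TEXT, birth_year INTEGER);
CREATE TABLE movie (id TEXT PRIMARY KEY, title TEXT,
                    director_id TEXT, year INTEGER);
CREATE TABLE actor (id TEXT PRIMARY KEY, sex TEXT, name TEXT,
                    birth_year INTEGER);
CREATE TABLE actor_movie (actor_id TEXT, movie_id TEXT);
""")
db.executemany("INSERT INTO director VALUES (?, ?, ?)", [
    ("D1", "Ford Coppola, Francis", 1942),
    ("D2", "Allan, Woody", 1957),
    ("D3", "Spielberg, Steven", 1942),
])
db.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", [
    ("M1", "Gone with the wind", "D1", 1963),
    ("M2", "Room with a view", "D1", 1985),
    ("M3", "High Noon", "D2", 1974),
    ("M4", "Star Wars", "D3", 1993),
    ("M5", "Alien", "D2", 1987),
    ("M6", "Blowing in the Wind", "D3", 1962),
])
db.executemany("INSERT INTO actor VALUES (?, ?, ?, ?)", [
    ("A1", "f", "Brigitte Bardot", 1972),
    ("A2", "m", "George Clooney", 1927),
    ("A3", "f", "Marilyn Monroe", 1934),
])
db.executemany("INSERT INTO actor_movie VALUES (?, ?)", [
    ("A1", "M4"), ("A2", "M3"), ("A3", "M2"), ("A1", "M5"),
    ("A1", "M3"), ("A2", "M6"), ("A3", "M4"),
])

def director_with_most_actresses():
    """Director who has worked with the most distinct female actors."""
    return db.execute("""
        SELECT director.name, COUNT(DISTINCT actor.id) AS actresses
        FROM director
        JOIN movie ON movie.director_id = director.id
        JOIN actor_movie ON actor_movie.movie_id = movie.id
        JOIN actor ON actor.id = actor_movie.actor_id
        WHERE actor.sex = 'f'
        GROUP BY director.id, director.name
        ORDER BY actresses DESC
        LIMIT 1""").fetchone()

def years_with_actors_born_between(first, last):
    """Years of films starring an actor born in the given range."""
    rows = db.execute("""
        SELECT DISTINCT movie.year
        FROM movie
        JOIN actor_movie ON actor_movie.movie_id = movie.id
        JOIN actor ON actor.id = actor_movie.actor_id
        WHERE actor.birth_year BETWEEN ? AND ?
        ORDER BY movie.year""", (first, last))
    return [row[0] for row in rows]

print(director_with_most_actresses())
print(years_with_actors_born_between(1926, 1935))
```

The actor/movie table is what makes the many-to-many relationship queryable: each join step follows one ID column to the next table.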
databases in libraries
• Relational databases dominate the world of
structured data.
• But they are not so popular in libraries
– Slow on very large databases (such as catalogs)
– Library data has nasty ad-hoc relationships, e.g.
• Translation of the first edition of a book
• CD supplement that comes with the print version
These are difficult to deal with in a system where
all relations and fields have to be set up at the
start and cannot be changed easily later.
http://openlib.org/home/krichel
Thank you for your attention!