Slides

Digital Media Technology
Week 3: Introduction to TEI
Peter Verhaar
eXtensible Markup Language
<title>La Biblioteca de
Babel</title> is a short story
written by <persName>Jorge Luis
Borges</persName>.
□ General rules which determine well-formedness
(e.g. proper nesting, single root element, case sensitivity)
□ Rules for a particular language which determine validity
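The general XML rules (well-formedness, in the terms of the XML specification) can be checked by any XML parser, without a DTD or schema. The following is a minimal sketch using Python's standard library; the function name `is_well_formed` and the sample strings are illustrative assumptions, not part of the course material. Note that `xml.etree.ElementTree` does not validate against a DTD, so checking the rules of a particular language would need a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text satisfies the general XML rules (it parses)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per general rule named on the slide.
good = "<letter><salute>Dear Sirs,</salute></letter>"
bad_nesting = "<letter><salute>Dear Sirs,</letter></salute>"  # improper nesting
bad_case = "<letter><salute>Dear Sirs,</Salute></letter>"     # case mismatch
two_roots = "<salute>Dear Sirs,</salute><body/>"              # no single root

for sample in (good, bad_nesting, bad_case, two_roots):
    print(is_well_formed(sample), sample)
```

Only the first sample parses; the other three each break one of the general rules.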
Validation
rules
DTD or XML
Schema
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI SYSTEM "tei.dtd">
<TEI>
  <text>
    <salute>Gentlemen,</salute>
    <body>
      I reply to your letter of the
      <date>29th Ulto</date>,
      offering 30 £ for an early
      copy of the novel
      (…)
    </body>
  </text>
</TEI>
Document Instance
Deconstruction
Textual aspects
□ Lexical codes
□ Logical structure
□ Typography
□ Literary devices
□ Grammar and syntax
□ Semantic contents
□ Physical structure
Ontologies
□ Models are based on an
ontology
□ The ontology determines which properties
of the original are represented in the
model
□ Models “inevitably lie, by
omission at least”
□ A DTD can be viewed as an
ontology
John Unsworth, 'What is Humanities Computing and What is Not?', in: Melissa Terras,
Julianne Nyhan, & Edward Vanhoutte (eds.), Defining digital humanities: a reader,
2013, pp. 36–37.
R. Davis, H. Shrobe & P. Szolovits, 'What is a Knowledge Representation?', AI
Magazine, 14:1 (1993).
Dear Sirs,
I will accept £10 for the
rights to make a
translation into Dutch of
my novel entitled
Wanda
Printers will send you
entire
proofs from London
instantly. Please to
send money on
receipt of this /
Address Madame
Ouida. ~c. 2 words
illegible~ ~c. 1 word
illegible~ Ouida L. de
la Ramée
letter
salute
body
closer
p
persName
title
<?xml version="1.0" encoding="UTF-8"?>
<letter>
<salute> Dear Sirs,</salute>
<body>
<p> I will accept £10 for the rights to make a
translation into Dutch of my novel entitled
<title>Wanda</title>
</p>
<p> Printers will send you entire proofs from London
instantly. Please to send money on
receipt of this / Address Madame Ouida. ~c. 2 words
illegible~ ~c. 1 word illegible~
</p>
</body>
<closer> Ouida L. de la Ramée </closer>
</letter>
Text Encoding Initiative
□ More than 500 elements
□ Developed by a consortium of
scholars
□ First established in 1987
□ Text in general: “texts in any
natural language, of any date, in
any literary genre”
<choice>
<orig>Impressions</orig>
<reg>Impressions of
Theophrastus Such</reg>
</choice>
<choice>
<abbr>Yrs.</abbr>
<expan>Yours</expan>
</choice>
<unclear reason="illegible">London</unclear>
Madame Ouida <gap reason="illegible" extent="2 words" />
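Because <choice> records two alternative readings side by side, software can decide which one to show. The sketch below assumes a hypothetical helper `reading` and a hypothetical <p> wrapper around the two <choice> examples from the slide; it also ignores the TEI namespace that real TEI files declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical <p> wrapper around the two <choice> examples above.
SNIPPET = (
    "<p><choice><abbr>Yrs.</abbr><expan>Yours</expan></choice> "
    "<choice><orig>Impressions</orig>"
    "<reg>Impressions of Theophrastus Such</reg></choice></p>"
)


def reading(element, prefer):
    """Flatten element to text, keeping only the preferred children of <choice>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "choice":
            # Keep only the alternatives whose tag is in `prefer`.
            for option in child:
                if option.tag in prefer:
                    parts.append(reading(option, prefer))
        else:
            parts.append(reading(child, prefer))
        # Text following the child belongs to the parent element.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SNIPPET)
diplomatic = reading(paragraph, {"abbr", "orig"})
normalised = reading(paragraph, {"expan", "reg"})
print(diplomatic)   # Yrs. Impressions
print(normalised)   # Yours Impressions of Theophrastus Such
```

The same encoded text thus yields both a reading close to the source and a regularised reading.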
Unicode
<p>En r&#xE9;ponse &#xE0; votre
lettre du 30 Janvier nous avons <lb/>
l'honneur de vous informer que nous
avons pay&#xE9; Mon-<lb/> sieur
Midderigh d&#xE9;j&#xE0; depuis
longtemps et presque toujours <lb/>
d'avance.</p>
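The codes &#xE9; and &#xE0; above are numeric character references: the hexadecimal Unicode code point of the character, wrapped in &#x and a semicolon. As a small sketch (the variable names are illustrative), an XML parser resolves them to the actual characters, and Python can produce such references when text has to be written as plain ASCII:

```python
import xml.etree.ElementTree as ET

# A shortened version of the paragraph above.
fragment = "<p>En r&#xE9;ponse &#xE0; votre lettre</p>"
text = ET.fromstring(fragment).text
print(text)  # En réponse à votre lettre

# The reverse direction: Python emits decimal references (233 == 0xE9).
as_reference = "payé".encode("ascii", "xmlcharrefreplace")
print(as_reference)  # b'pay&#233;'
```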
Digital information
□ Digital information is numerical information.
Cf. Latin word ‘digitus’
E.g. words for ‘digital’ in Romance languages:
Digital Studies or ‘Le champ numérique’
‘Studium Librorum et Instrumentorum
Communicationis Numericorum’
□ Digital information is information represented
as combinations of 1s and 0s
A “byte” (by eight) is a sequence of eight bits
0
1
10
11
100
101
110
111
1000
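The column of numbers counts upwards in binary notation. A minimal sketch that reproduces the count from zero to eight, and shows how many values fit into one byte:

```python
# Binary notation of the decimal numbers 0 to 8.
binary = [format(n, "b") for n in range(9)]
for n, bits in enumerate(binary):
    print(n, bits)

# Eight bits (one byte) can take 2 ** 8 different values: 0 up to 255.
values_per_byte = 2 ** 8
largest_byte = format(values_per_byte - 1, "08b")
print(values_per_byte, largest_byte)  # 256 11111111
```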
ASCII
□ Character encoding scheme
□ e.g. ASCII: A = 01000001
□ Uses 7 bits (128 characters)
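A character encoding scheme maps each character to a number, which is then stored as bits. The sketch below uses a hypothetical helper `ascii_bits` to print the ASCII code of a character, padded to one byte:

```python
def ascii_bits(ch: str) -> str:
    """Return the ASCII code of ch as eight binary digits."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "08b")


print(ascii_bits("A"))  # 01000001 (decimal 65)
print(ascii_bits("a"))  # 01100001 (decimal 97)
print(2 ** 7)           # 128 characters fit in 7 bits
```

Upper and lower case letters have different codes, which is one reason why XML element names are case sensitive.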
Unicode
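□ Originally 16 bits; code points now range from U+0000 to U+10FFFF
□ UTF-8: variable-length encoding (1 to 4 bytes per character)
□ 1,112,064 encodable code points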
α:
&#x3B1;
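The relation between a character, its code point, its character reference and its UTF-8 bytes can be sketched as follows (variable names are illustrative). The last lines show where the figure 1,112,064 comes from: all code points up to U+10FFFF, minus the 2,048 surrogate code points that UTF-8 cannot encode.

```python
alpha = "\u03b1"  # Greek small letter alpha
code_point = ord(alpha)

# Numeric character reference as used in XML.
reference = f"&#x{code_point:X};"
print(reference)  # &#x3B1;

# UTF-8 stores ASCII characters in one byte and alpha in two bytes.
utf8_bytes = alpha.encode("utf-8")
print(utf8_bytes)                                  # b'\xce\xb1'
print(len("A".encode("utf-8")), len(utf8_bytes))   # 1 2

# Number of code points that can be encoded.
total = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1
encodable = total - surrogates
print(encodable)  # 1112064
```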
I <3 Digital Media Technology
Calvin & Hobbes
If the premises “A > B” and “A” are true, we can conclude B
Entities
<p>This sentence is in the &lt;p&gt;
element.</p>
&gt;    Greater than
&lt;    Less than
&quot;  Quotation mark
&amp;   Ampersand
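Characters that have a special meaning in XML must be replaced by these entities before text is placed inside an element. A minimal sketch with Python's standard library follows; the sample sentences are taken from the slides, with straight quotation marks assumed in the third one so that &quot; applies.

```python
from xml.sax.saxutils import escape, unescape

samples = [
    "I <3 Digital Media Technology",
    "Calvin & Hobbes",
    'If the premises "A > B" and "A" are true, we can conclude B',
]

# escape() handles &, < and > by default; quotation marks are added here.
escaped = [escape(s, {'"': "&quot;"}) for s in samples]
for line in escaped:
    print(line)

# unescape() reverses the replacement.
restored = [unescape(s, {"&quot;": '"'}) for s in escaped]
print(restored == samples)  # True
```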
Comments
Used to improve the readability of the
XML document:
<!-- The next section contains the
transcription -->