Digital Humanities for Computer Scientists … or: How I

Digital Humanities for Computer Scientists … or:
How I became infected with the Indiana Jones virus
Marco Büchler
eTRAP Research Group
Göttingen Centre for Digital Humanities
Institute of Computer Science
Georg August University Göttingen, Germany
Big Humanities Data
24. November 2015
Who am I?
• 2001/2
• 2006
•
•
•
•
•
20072008
2011
2013
2014-
Big Humanities Data
Head of Quality Assurance department in a
software company
Diploma in Computer Science on big scale cooccurrence analysis
Consultant for several SME in IT sector
Technical project management of eAQUA project
PI and project manager eTRACES project
PhD in „Digital Humanities“ on Text Reuse
Head of Early Career Research Group eTRAP at
Göttingen Centre for Digital Humanities
24. November 2015
The Indiana Jones virus?
Big Humanities Data
24. November 2015
Overview
•
•
•
•
Big Humanities Data
Filling gaps in inscriptions
Uncovering unexpected relations
Text Reuse
Big Humanities Data
24. November 2015
Big (Humanities) Data
• 3 aspects (by Ulrike Rieß, Big Data bestimmt die IT-Welt):
– Huge amount of data that can‘t be processed and analyzed
manually
– Less structured data; e. g. in comparison to databases and data
warehouse systems
– Linked data between heterogeneous and distributed resources
• The fastest growing sources of Big Data are text and images.
• Researchers easily get lost in the information overload (Big
Data) and in the information poverty (Humanities Data).
Big Humanities Data
24. November 2015
Filling (Guessing) gaps in inscriptions and Papyri
Big Humanities Data
24. November 2015
The data
Big Humanities Data
24. November 2015
Damnatio Memoriae
Big Humanities Data
24. November 2015
The papyrus
Big Humanities Data
24. November 2015
Transcribed data
Big Humanities Data
24. November 2015
Input form
Big Humanities Data
24. November 2015
Parsed input (parsed for Leiden conventions)
Big Humanities Data
24. November 2015
Strategy 1: use only information of the word
Big Humanities Data
24. November 2015
Word length + „survived“ pattern
Big Humanities Data
24. November 2015
Strategy 2: use only of context information
Big Humanities Data
24. November 2015
Word bigrams + co-occurrences + classification
Big Humanities Data
24. November 2015
The „real“ Strategy 2: use only of context information and
removing any information about the damaged word
Big Humanities Data
24. November 2015
Reparsing
Big Humanities Data
24. November 2015
Word bigrams + co-occurrences + classification
Big Humanities Data
24. November 2015
Strategy 3: Complete strategy
Big Humanities Data
24. November 2015
Multiple options
Big Humanities Data
24. November 2015
Search & Find
Big Humanities Data
24. November 2015
How to find relevant information in massive data?
Big Humanities Data
24. November 2015
How to find relevant information in massive data?
Big Humanities Data
24. November 2015
Definition of co-occurrence
Definition of co-occurrences:
• Common occurrence of at least two objects/events within a
dedicated window
• Possible windows in Humanities: line, sentence, paragraph,
document, author, century
Motivation:
• Psycholinguistic experiments: Given a word: What is the first
word that comes to mind for test subjects?
Big Humanities Data
24. November 2015
Definition of co-occurrence
Motivation:
• Psycholinguistic experiments: Given a word: What is the first
word that comes to mind for test subjects?
Stimulus
butter
Response Prob.
Bread
soft
Milk
Margarine
Cheese
Fat
yellow
Bread and butter
Box / can
eat
Big Humanities Data
# of Prob.'s
60
40
32
27
20
16
14
8
6
6
Co-occurrence
Bread
Cheese
Sugar
Milk
Margarine
Farina
Eggs
Pound
Meat
Significance
51
49
29
23
22
18
16
14
13
24. November 2015
How to find relevant information in massive data?
3.
1.
The
The search
search results
results for
for
"biomass"
"biomass" are
are displayed
displayed
as
as a
a 3D
3D knowledge
knowledge
network.
network. All
All displayed
displayed
topics
topics (blue
(blue spheres)
spheres)
have
have a
a semantic
semantic relation
relation
to
to the
the search
search term
term
"biomass".
"biomass".
II have
have already
seen
seen the
the topic
topic
"procedures"
"procedures" and
and
want
want to
to know
know
more
more about
about it.
it.
2.
Big Humanities Data
27
24. November 2015
How to find relevant information in massive data?
Big Humanities Data
24. November 2015
How to find relevant information in massive data?
Big Humanities Data
24. November 2015
Contrastive Semantics
Big Humanities Data
24. November 2015
Visualisation of Contrastive semantics
Big Humanities Data
24. November 2015
Main properties
• Contrast:
• Locality:
• Frequency range of contrastive semantic relations:
– Generally less than 10 times of common occurrences
Big Humanities Data
24. November 2015
Connectivity?
Big Humanities Data
24. November 2015
Some observations
• Identified clusters:
–
–
–
–
As shown in examples comedy
Sarcasm
Cynicism
Artificial ambiguity like „Michael Schumacher the red king“ (translated from a
German corpus)
– Scope to gnomology
•
Is there a relation between contrastive semantics and textual
reuse?
– Clearly, yes.
– „Evaluation results“: More than 90% of the contrastive semantics have a
relation to text reuse
Big Humanities Data
24. November 2015
Historical Text Reuse
Big Humanities Data
24. November 2015
Text Reuse for Humanities and Computer Science
• Question: Why is Text Reuse so relevant for Humanities and Computer
Science?
• Premise: The amount of digitally available data is growing exponentially (Big
Data)
• Humanities:
– Lines of transmission and textual criticism
– Transmissions of ideas/thoughts under different circumstances and
conditions
• Computer Science:
– Text Decontamination for stylometry and authorship attribution, dating of texts
– gen. Text Mining, Corpus Linguistics
Big Humanities Data
24. November 2015
Temperature Map
Big Humanities Data
24. November 2015
Visualisation problem: Dotplot view
Big Humanities Data
24. November 2015
Dotplot view
Big Humanities Data
24. November 2015
Visualisation problem: Macro view
Middle Platonism
Neoplatonism
Big Humanities Data
24. November 2015
eTRAP – Electronic Text Reuse Acquisition Project
Big Humanities Data
24. November 2015
Thank you!
"Stealing from one is plagiarism, stealing from many is
research" (Wilson Mitzner, 1876-1933)
Visit us at http://etrap.gcdh.de
Big Humanities Data
24. November 2015