Digital Humanities for Computer Scientists … or: How I became infected with the Indiana Jones virus Marco Büchler eTRAP Research Group Göttingen Centre for Digital Humanities Institute of Computer Science Georg August University Göttingen, Germany Big Humanities Data 24. November 2015 Who am I? • 2001/2 • 2006 • • • • • 20072008 2011 2013 2014- Big Humanities Data Head of Quality Assurance department in a software company Diploma in Computer Science on big scale cooccurrence analysis Consultant for several SME in IT sector Technical project management of eAQUA project PI and project manager eTRACES project PhD in „Digital Humanities“ on Text Reuse Head of Early Career Research Group eTRAP at Göttingen Centre for Digital Humanities 24. November 2015 The Indiana Jones virus? Big Humanities Data 24. November 2015 Overview • • • • Big Humanities Data Filling gaps in inscriptions Uncovering unexpected relations Text Reuse Big Humanities Data 24. November 2015 Big (Humanities) Data • 3 aspects (by Ulrike Rieß, Big Data bestimmt die IT-Welt): – Huge amount of data that can‘t be processed and analyzed manually – Less structured data; e. g. in comparison to databases and data warehouse systems – Linked data between heterogeneous and distributed resources • The fastest growing sources of Big Data are text and images. • Researchers easily get lost in the information overload (Big Data) and in the information poverty (Humanities Data). Big Humanities Data 24. November 2015 Filling (Guessing) gaps in inscriptions and Papyri Big Humanities Data 24. November 2015 The data Big Humanities Data 24. November 2015 Damnatio Memoriae Big Humanities Data 24. November 2015 The papyrus Big Humanities Data 24. November 2015 Transcribed data Big Humanities Data 24. November 2015 Input form Big Humanities Data 24. November 2015 Parsed input (parsed for Leiden conventions) Big Humanities Data 24. November 2015 Strategy 1: use only information of the word Big Humanities Data 24. November 2015 Word length + „survived“ pattern Big Humanities Data 24. November 2015 Strategy 2: use only of context information Big Humanities Data 24. November 2015 Word bigrams + co-occurrences + classification Big Humanities Data 24. November 2015 The „real“ Strategy 2: use only of context information and removing any information about the damaged word Big Humanities Data 24. November 2015 Reparsing Big Humanities Data 24. November 2015 Word bigrams + co-occurrences + classification Big Humanities Data 24. November 2015 Strategy 3: Complete strategy Big Humanities Data 24. November 2015 Multiple options Big Humanities Data 24. November 2015 Search & Find Big Humanities Data 24. November 2015 How to find relevant information in massive data? Big Humanities Data 24. November 2015 How to find relevant information in massive data? Big Humanities Data 24. November 2015 Definition of co-occurrence Definition of co-occurrences: • Common occurrence of at least two objects/events within a dedicated window • Possible windows in Humanities: line, sentence, paragraph, document, author, century Motivation: • Psycholinguistic experiments: Given a word: What is the first word that comes to mind for test subjects? Big Humanities Data 24. November 2015 Definition of co-occurrence Motivation: • Psycholinguistic experiments: Given a word: What is the first word that comes to mind for test subjects? Stimulus butter Response Prob. Bread soft Milk Margarine Cheese Fat yellow Bread and butter Box / can eat Big Humanities Data # of Prob.'s 60 40 32 27 20 16 14 8 6 6 Co-occurrence Bread Cheese Sugar Milk Margarine Farina Eggs Pound Meat Significance 51 49 29 23 22 18 16 14 13 24. November 2015 How to find relevant information in massive data? 3. 1. The The search search results results for for "biomass" "biomass" are are displayed displayed as as a a 3D 3D knowledge knowledge network. network. All All displayed displayed topics topics (blue (blue spheres) spheres) have have a a semantic semantic relation relation to to the the search search term term "biomass". "biomass". II have have already seen seen the the topic topic "procedures" "procedures" and and want want to to know know more more about about it. it. 2. Big Humanities Data 27 24. November 2015 How to find relevant information in massive data? Big Humanities Data 24. November 2015 How to find relevant information in massive data? Big Humanities Data 24. November 2015 Contrastive Semantics Big Humanities Data 24. November 2015 Visualisation of Contrastive semantics Big Humanities Data 24. November 2015 Main properties • Contrast: • Locality: • Frequency range of contrastive semantic relations: – Generally less than 10 times of common occurrences Big Humanities Data 24. November 2015 Connectivity? Big Humanities Data 24. November 2015 Some observations • Identified clusters: – – – – As shown in examples comedy Sarcasm Cynicism Artificial ambiguity like „Michael Schumacher the red king“ (translated from a German corpus) – Scope to gnomology • Is there a relation between contrastive semantics and textual reuse? – Clearly, yes. – „Evaluation results“: More than 90% of the contrastive semantics have a relation to text reuse Big Humanities Data 24. November 2015 Historical Text Reuse Big Humanities Data 24. November 2015 Text Reuse for Humanities and Computer Science • Question: Why is Text Reuse so relevant for Humanities and Computer Science? • Premise: The amount of digitally available data is growing exponentially (Big Data) • Humanities: – Lines of transmission and textual criticism – Transmissions of ideas/thoughts under different circumstances and conditions • Computer Science: – Text Decontamination for stylometry and authorship attribution, dating of texts – gen. Text Mining, Corpus Linguistics Big Humanities Data 24. November 2015 Temperature Map Big Humanities Data 24. November 2015 Visualisation problem: Dotplot view Big Humanities Data 24. November 2015 Dotplot view Big Humanities Data 24. November 2015 Visualisation problem: Macro view Middle Platonism Neoplatonism Big Humanities Data 24. November 2015 eTRAP – Electronic Text Reuse Acquisition Project Big Humanities Data 24. November 2015 Thank you! "Stealing from one is plagiarism, stealing from many is research" (Wilson Mitzner, 1876-1933) Visit us at http://etrap.gcdh.de Big Humanities Data 24. November 2015
© Copyright 2026 Paperzz