Infield database course1

Databases in language documentation

Introductions
•  Why are you here?
•  What language are you working on / planning on working on?
•  What database experience do you have?
•  What do you want to use a database for?

Choose your DBMS
•  We will present you with an overview of databases and their uses in tracking outputs of linguistic fieldwork
•  There are a number of DataBase Management Systems (DBMS) but all have certain characteristics in common
•  We will illustrate with spreadsheets, FMPro and Open Office Base

Sources of information
•  Many tutorials on the web, e.g.,
  –  General DBMS
     http://www.geekgirls.com/databasics_01.htm
  –  Open Office Base
     http://www.tutorialsforopenoffice.org/category_index/base.html
     http://showmedo.com/videotutorials/series?name=AXggL6j0a
  –  FMPro
     http://www.filemaker.com/fmdemos
     http://www.vtc.com/databases.htm

Database management systems
•  Various systems that linguists have used
  –  for lexicon management
  –  for typological / comparative work
  –  for maintaining collections of texts
  –  for tracking data processing

Database management systems
•  Various systems that linguists have used
  –  pre computers
    •  card files (hence shoebox)
    •  'system cards', Cope-Chat cards

Cope-Chat cards
1930s onwards card index (the hole represents a predefined research category, topic, author etc).
www.strychnine.co.uk/copechat.html

Kibrik 1977:60
K-6 type, 105-147 mm, with 74 double perforations
Kibrik, A. E. 1977. The methodology of field investigations in linguistics. Setting up the problem. The Hague, Paris: Mouton.
This example from 1987:
Navajo lexical database
Werner, Oswald, G. Mark Schoepfle and Julie Ahern. 1987. Systematic fieldwork. Newbury Park, Calif.: Sage Publications. P.59
Computer-based DBMS
•  Various software tools for data management from the earliest computers
•  Development from flat databases to relational databases, early 1970s

1940s, 50s: Initial use of computers as calculators
1960s: Business uses. Organizational data
1970s: Relational model, tables related by keys
1980s: Personal computers
1990s: WWW, internet-based database systems
2000s: 'Web 2.0', RDF stores …

Databases we use
•  Catalogs
  –  libraries
  –  iTunes
•  Linguistic databases
  –  WALS
  –  LLMap
  –  Rosetta
  –  OLAC

Databases we use: WALS
http://wals.info/

Databases we use: LLMap
http://www.llmap.org/

Databases we use: Rosetta
http://rosettapanglossia.longnow.org/

Databases we use: OLAC
http://www.language-archives.org/static/language/llp.html

Basics: Records, fields and tables
A TABLE is a grid of rows arranged in columns. Each row of a TABLE contains data about one person, event, item of warehouse stock, book, etc. Each column gives information about a single attribute of each record. A TABLE is like a spreadsheet. A telephone book is a simple example of the sort of data that a DBMS can manage well.

Basics: Records, fields and tables
A simple phone book table has three ATTRIBUTES: Name, Address, Number. (The "Name" attribute might well be split into FirstName and LastName attributes, but the principle is the same.) These attributes are stored in FIELDS in the table. The elements of each row of the TABLE, taken together, constitute one RECORD. The columns are called FIELDS. A RECORD in a phone book might be:
Stella Brown .... 1 Main St .... 919-555-1212

Database basics
•  A database consists of tables with records which are made up of fields
  –  e.g., a record for a speaker could have fields for their name, date of birth, gender and other relevant information (clan, place of birth, languages spoken …)
  –  But, if we want to sort by speaker last names, then we need a field for last names separate from first names

Error correction in a DBMS
159,Apay Tang
160,Paulina Yourupi
161,Paulina Yourupi
162,"Kirby, Oscar"
163,Jason Lobel
164,Shoko Kubotera
165,Geoffrey White
166,Joel Bradshaw
167,Robert Blust
168,Emily Bartelson
169,Matthew Loui
170,James Hafford
171,Jake Terrell
172,Lauren Gawne
173,Fred Dagg
174,Barry Alpher
175,"Cochrane, Percy"
176,George Grace
177,""
178,Olga Lovick
179,"Cochrane, Percy"
180,"Dunlop, Ros"
181,Sebastien Lacrampe
182,Deborah Hill

Data visualization
•  Automatic view of trends in data
  –  number of languages in a collection
  –  date ranges of recordings
  –  age range of participants
  –  number of texts and stages of completion of analysis of them

Fieldwork
•  Fieldnotes
  –  diary
  –  notes
  –  formal elicitation sessions
  –  transcripts
•  Field recordings
•  Field recording results in related objects, minimally:
  –  Recordings (audio / video / images)
    •  attributes: location, genre, date, participants
  –  Speakers
    •  attributes: age, clan, family, sex, participant in recordings
  –  Derived texts
    •  attributes: genres, speakers, date, on a recording

Why not use a text file?
•  inefficient way to store and locate information
•  includes lots of repetition
  –  same names recur, same locations, same topics, same data types, same languages etc.
•  each is stored as an individual item in a text file, but is stored only once in a database

Why not use a text file?
•  Doesn't capture relationships
  –  e.g., how to find everything spoken by a female over 30 from this village
•  Doesn't control data entry
  –  typos, errors, misrepresentation: same name entered in several ways, e.g., Stella Brown; Estella Brown; Brown, Estella; Brown, Stella.

Spreadsheet
•  A spreadsheet can allow structured data entry

Database and DBMS
•  A database is a set of data that is structured in some way
•  A Database Management System (DBMS) is the tool that allows you to work with a database
  –  e.g., MySQL, FMPro, OpenOffice Base, MS Access

A database can
•  Track relationships between objects
•  Track processing you have done
•  Locate all files, e.g., those associated with a particular speaker
•  Allow you to quickly assess what you have finished and what remains to be done

Database
•  Speakers
  –  age, sex, clan …
•  Recordings
  –  Date, location, speaker
•  Derived items
  –  Transcripts
  –  Texts

Toolbox as a database
•  Toolbox keeps data in separate fields
  –  \lx headword
  –  \ps part of speech
  –  \de definition
  –  etc
•  But it is not a fully featured DBMS
  –  flat, not relational

Toolbox example

Toolbox example
Parse function

Toolbox as a flat database
•  This means that the same data will be stored in different places in the files
•  e.g., each gloss/meaning pair is simply looked up and copied into the text
•  Any change to a definition does not change all gloss/meaning pairs

A DBMS can
•  Track relationships between objects
•  Locate all files, e.g., those associated with a particular speaker

A DBMS can
•  Speed up data entry and ensure accuracy by using dropdown menus drawn from a controlled set or from existing data

A DBMS can
•  Track processing you have done (is this file transcribed, is it interlinearized, is it archived …)
•  Allow you to quickly assess what you have finished and what remains to be done

Let's Make a Flat-file Database
•  "Flat-file Database"
  –  consists of a single table
  –  good to keep track of a single kind of information
•  Data:
  –  metadata about textual materials
•  Steps:
  –  categorize information
  –  set up a table so that information is categorized by column

Flat-file Database!
Text ID  Text Name          Speaker          Tape   Text Type   Date Recorded  Archived
001      Inglis Polis       Namaf, Kalsarap  005    Life story  1998-09-14     X
002      Plantation Days    Kaltaf, Kaloros  98017  Life story  1998-10-05     X
003      Erakor Island      Iokopeth         28     History     1998-09-17
004      Sokfal             Namaf, Kalsarap  005    Kastom      1998-09-17     X
005      Ririel and Ririal  Kalfau, John     19     Life story  1999-04-15
006      Making Thatch      Takau, Toukelau  19     Kastom      1999-04-15     X
007      Ririel and Ririal  Takau, Harris    19     Life story  1999-06-05     X

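
The flat-file idea can be tried out without any DBMS at all. The following is a minimal sketch, assuming the first four rows of the table are typed into a CSV text; the column names (text_id, text_name, and so on) and the two helper functions are our own illustrative choices, not part of the course materials. Each CSV line is one RECORD and each column is one FIELD.

```python
import csv
import io

# First four rows of the flat-file table, one line per RECORD, one column per FIELD.
# Column names are illustrative spellings of the slide's headers.
FLAT_FILE = """\
text_id,text_name,speaker,tape,text_type,date_recorded,archived
001,Inglis Polis,"Namaf, Kalsarap",005,Life story,1998-09-14,X
002,Plantation Days,"Kaltaf, Kaloros",98017,Life story,1998-10-05,X
003,Erakor Island,Iokopeth,28,History,1998-09-17,
004,Sokfal,"Namaf, Kalsarap",005,Kastom,1998-09-17,X
"""


def load_records(text):
    """Parse CSV text into a list of records (dicts keyed by field name)."""
    return list(csv.DictReader(io.StringIO(text)))


def not_archived(records):
    """Return the IDs of texts whose 'archived' field is not marked X."""
    return [r["text_id"] for r in records if r["archived"] != "X"]


def texts_by_speaker(records, speaker):
    """Return the names of all texts told by the given speaker."""
    return [r["text_name"] for r in records if r["speaker"] == speaker]


records = load_records(FLAT_FILE)
print(not_archived(records))
print(texts_by_speaker(records, "Namaf, Kalsarap"))
```

Because every value sits in a named field, questions such as "what is not yet archived?" become one-line filters. The speaker lookup only works if the name is typed identically in every row, which is exactly the consistency problem discussed next.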
Consistency within a database
•  Consistency is very important in a database.
•  Within a database, each column (field) should contain a single type of information.
•  You should be conscious of the type of data you are dealing with.

Common data types
•  Text
•  Numeric
•  Date / Time
•  Currency
•  Boolean (True/False)
•  Enumerated list (= controlled vocabulary = Range Set)
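
As a rough illustration of these data types, the sketch below declares one typed record in Python. The class and field names are hypothetical, and speaker_fee is an invented field included only to show a currency type; a real DBMS would declare the same types in its own table-definition screen.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from enum import Enum


class TextType(Enum):
    """Enumerated list (controlled vocabulary) of text types."""
    LIFE_STORY = "Life story"
    HISTORY = "History"
    KASTOM = "Kastom"


@dataclass
class TextRecord:
    text_name: str        # Text
    tape_number: int      # Numeric
    date_recorded: date   # Date / Time
    speaker_fee: Decimal  # Currency (hypothetical field)
    archived: bool        # Boolean
    text_type: TextType   # Enumerated list


def parse_record(name, tape, recorded, fee, archived, text_type):
    """Convert raw strings into typed values; bad input raises an error."""
    return TextRecord(
        text_name=name,
        tape_number=int(tape),
        date_recorded=date.fromisoformat(recorded),
        speaker_fee=Decimal(fee),
        archived=(archived == "X"),
        text_type=TextType(text_type),
    )


rec = parse_record("Inglis Polis", "005", "1998-09-14", "0.00", "X", "Life story")
print(rec)
```

Note that converting the tape label "005" to a number drops the leading zeros; if labels like that must be preserved exactly, Text is the safer type for that column. A value outside the enumerated list, such as TextType("Myth"), is rejected with a ValueError, which is the point of a controlled vocabulary.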
Datatyping
Text ID  Text Name          Speaker          Tape   Text Type   Date Recorded  Archived
001      Inglis Polis       Namaf, Kalsarap  005    Life story  1998-09-14     X
002      Plantation Days    Kaltaf, Kaloros  98017  Life story  1998-10-05     X
003      Erakor Island      Iokopeth         28     History     1998-09-17
004      Sokfal             Namaf, Kalsarap  005    Kastom      1998-09-17     X

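
One way to see why datatyping matters is to check every value in a column against the type that column is supposed to hold. The sketch below is an assumption-laden illustration: the checks, the field names and the second sample row (with a lower-case "history" and a day-first date, both invented to trigger errors) are ours, not taken from the slides.

```python
from datetime import date

# Controlled vocabulary observed in the Text Type column.
TEXT_TYPES = {"Life story", "History", "Kastom"}


def is_iso_date(value):
    """True if the value is a date written as YYYY-MM-DD."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# One check per column; each returns True when the value fits the column's type.
CHECKS = {
    "text_id": lambda v: v.isdigit() and len(v) == 3,
    "tape": str.isdigit,
    "text_type": lambda v: v in TEXT_TYPES,
    "date_recorded": is_iso_date,
    "archived": lambda v: v in ("", "X"),
}


def find_problems(rows):
    """Return (text_id, field, value) for every value that fails its check."""
    problems = []
    for row in rows:
        for field, check in CHECKS.items():
            if not check(row[field]):
                problems.append((row["text_id"], field, row[field]))
    return problems


rows = [
    {"text_id": "001", "tape": "005", "text_type": "Life story",
     "date_recorded": "1998-09-14", "archived": "X"},
    # Hypothetical badly entered row: wrong capitalisation and date format.
    {"text_id": "003", "tape": "28", "text_type": "history",
     "date_recorded": "17/09/1998", "archived": ""},
]
print(find_problems(rows))
```

A DBMS performs this kind of check at data entry, so inconsistent values never get into the table in the first place.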
Limitations of Flat-file DBs
•  Not great when you need to keep track of multiple kinds of information
•  E.g., suppose you want to keep track of info about tapes and speakers, as well as info about textual materials

Workaround 1
•  Put everything in one table

ID  Title          RecDate     Speaker   Sex  DOB         Tape   Format  Recorder
01  Ghost story    2009/11/02  John Doe  M    1932/02/11  PA005  wav     Sony DM50
02  Bedtime story  2009/11/02  Mary Doe  F    1940/01/11  PA005  wav     Sony DM50
03  War story      2010/12/15  John Doe  M    1932/02/11  PA008  wav     Edirol R09HR

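
The cost of putting everything in one table is that facts about a speaker or a tape are repeated in every row that mentions them. The sketch below, with field names of our own choosing and a made-up correction to one date of birth, shows how a change made in only one row leaves the table contradicting itself.

```python
# Three rows from the single-table workaround, keeping only the fields needed here.
rows = [
    {"id": "01", "title": "Ghost story", "speaker": "John Doe",
     "dob": "1932/02/11", "tape": "PA005"},
    {"id": "02", "title": "Bedtime story", "speaker": "Mary Doe",
     "dob": "1940/01/11", "tape": "PA005"},
    {"id": "03", "title": "War story", "speaker": "John Doe",
     "dob": "1932/02/11", "tape": "PA008"},
]


def conflicting_dobs(rows):
    """Return speakers who appear with more than one date of birth."""
    seen = {}
    for row in rows:
        seen.setdefault(row["speaker"], set()).add(row["dob"])
    return {name: sorted(dobs) for name, dobs in seen.items() if len(dobs) > 1}


before = conflicting_dobs(rows)

# Hypothetical correction applied to one row only; row 03 still has the old value.
rows[0]["dob"] = "1933/02/11"
after = conflicting_dobs(rows)

print(before)
print(after)
```

Before the edit there are no conflicts; afterwards John Doe has two dates of birth, and nothing in the table says which one is right.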
Workaround 2
•  Use multiple tables
  –  Textual Materials table:
    •  Text title, speaker, rec. date, genre, etc.
  –  Tape table:
    •  Tape title, recorder, mic, format, etc.
  –  Speaker table:
    •  Name, sex, DOB, clan, education, etc.
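
A minimal sketch of the multiple-table approach, using Python's built-in sqlite3 module rather than FMPro or OpenOffice Base. The table and column names are our own assumptions, and the sample values are borrowed from the single-table example with dates rewritten as YYYY-MM-DD. Each speaker and each tape is stored once, and the textual materials table points to them by key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE speaker (
    speaker_id INTEGER PRIMARY KEY,
    name TEXT, sex TEXT, dob TEXT
);
CREATE TABLE tape (
    tape_id TEXT PRIMARY KEY,
    recorder TEXT, format TEXT
);
CREATE TABLE text_material (
    text_id TEXT PRIMARY KEY,
    title TEXT, rec_date TEXT,
    speaker_id INTEGER REFERENCES speaker(speaker_id),
    tape_id TEXT REFERENCES tape(tape_id)
);
""")
conn.executemany("INSERT INTO speaker VALUES (?, ?, ?, ?)", [
    (1, "John Doe", "M", "1932-02-11"),
    (2, "Mary Doe", "F", "1940-01-11"),
])
conn.executemany("INSERT INTO tape VALUES (?, ?, ?)", [
    ("PA005", "Sony DM50", "wav"),
    ("PA008", "Edirol R09HR", "wav"),
])
conn.executemany("INSERT INTO text_material VALUES (?, ?, ?, ?, ?)", [
    ("01", "Ghost story", "2009-11-02", 1, "PA005"),
    ("02", "Bedtime story", "2009-11-02", 2, "PA005"),
    ("03", "War story", "2010-12-15", 1, "PA008"),
])


def titles_by_sex(conn, sex):
    """Titles of all texts told by speakers of the given sex, via a join on the key."""
    cursor = conn.execute(
        "SELECT t.title FROM text_material AS t "
        "JOIN speaker AS s ON s.speaker_id = t.speaker_id "
        "WHERE s.sex = ? ORDER BY t.text_id",
        (sex,),
    )
    return [title for (title,) in cursor]


print(titles_by_sex(conn, "M"))
```

Because John Doe's details live in a single row of the speaker table, correcting his date of birth is one change that every related text picks up automatically.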