LSA Summer Institute 2015

Development of
Linguistic Linked Open
Data (LLOD) Resources
for Collaborative Data-Intensive Research in
the Language Sciences
July 25-26, 2015, University of Chicago
Organizers: María Blume, Antonio Pareja Lora, Barbara Lust
Linguistic theory in a world
of big data: LSA 2015
“Highlight a growing interest within
the field of linguistics to test theory
with increasingly larger data sets,
such as data from extensive
linguistic fieldwork and
documentation, data from crowdsourcing over the web, and corpus
data from archival recordings
and/or written sources”: integrate
large and diverse data sources
Linked Open Data (LOD)
 “Linked Open Data”, pursued by exploiting internet and cloud
resources, represents the current approach to such data
integration.
 Linked Open Data in Linguistics (LLOD) represents the
attempt to exploit this paradigm within linguistics and the
language sciences, now pursued intensely in Europe through
the Open Linguistics Working Group (OLWG)
 The LLOD purpose is to “discuss types of resources, strategies to
address issues of interoperability between them, protocols
to distribute, access and integrate this information and
technologies and infrastructures developed on this basis”
and to develop a community committed to a “Linked
Open Data in Linguistics” (LLOD) agenda
(http://ldl2014.org/index.html).
Our Purpose
 Develop the European-inspired LLOD vision, applied
to actual and concrete research needs in the
language sciences. Use the study of language
acquisition as a case study here.
 Cultivate an emerging community in this area
 Begin to define the challenges to realizing this
vision, both research challenges and technical
challenges
 Inform and cultivate synergies between the
technical and research-based scholars
 Together begin to chart a path to future solutions to
the most pressing current challenges
Our motivation
 The challenges, both in active research needs and
technical, are pressing
Leading Research Questions
in Language Sciences
 How is language acquired?
 How are multiple languages acquired, leading to
multilingualism?
Language sciences: the
challenges
 Cognitive science is interdisciplinary: data from
diverse disciplines
 Intensely collaborative
 Cross-linguistic: monolingual or multilingual
 Data from diverse types of sources (e.g.,
experimental or observational)
 Data in many formats: audio, video, transcript
 Infinitely expandable data analyses and relevant
coding of even a single utterance
Language Sciences: Diverse
data forms
 Organized in different labs under different metadata
standards, often with unstable infrastructures.
This makes comparability, and consequent linking,
difficult if not impossible.
 Often annotated with complex, domain-specific
markup, frequently involving cross-linguistic
comparisons across various cultures,
languages and types of speakers.
Technical challenges
 “The lack of interoperability between linguistic
and language resources represents a major
challenge that needs to be addressed if
information from different sources is to be
combined…”; “…commonly accepted
strategies to distribute, access and integrate their
information have yet to be established, and
technologies and infrastructures to address both
aspects are still under development” (LLOD).
Cross-cutting Challenges to
confront
 Various linguistic theories can be applied for data description
and analysis. Different disciplines speak different ‘languages’:
there is a need to interface theoretical vocabularies (by means of
ontologies)
 Annotation schemas resulting from specific ontologies can
vary widely with specific research agendas. We need precise,
theoretically driven data markup and general
knowledge-provider frameworks, delivered in a “computationally
practical” manner; hence the need to develop metadata standards
 Cybertools must be developed in order to provide individual
researchers with structured infrastructure for creating data
which can efficiently become interoperable.
Cross-cutting Challenges to
confront
Human Issues
 Intellectual property rights of the researcher who
creates the data
 Human subjects protections for any natural
language data
 Legal issues regarding data ownership and
dissemination
Cross-cutting Challenges to
confront
 Sustainability (e.g., Berman and Cerf 2013,
Science: “Who will pay for public access to
research data?”).
Vision of this Workshop
 “further growth of an open community relying on
scientific collaboration beyond insular solutions
and national boundaries” (Christian Chiarcos, p.c., 9/5/14; Open
Linguistics Working Group, Linked Data in Linguistics, Johann Wolfgang Goethe-Universität, Frankfurt am Main, Germany).
 We cannot confront all challenges; this is a beginning
A Few Details about workshop
structure
 We have purposely integrated presentations by those pursuing
technical challenges in LOD with those by researchers designing data
representation for active research in the language sciences, with
multilingualism as a case study.
 We have inserted regular discussion sessions not only to facilitate exchange
among participants but also to welcome audience participation.
 As a concrete case study, a hands-on session tomorrow (Sunday) will explore a
developing cyberinfrastructure for language data representation, the
Data Transcription and Analysis (DTA) Tool, developed by a collaborative
community with shared research needs. We will investigate what is
required to convert such a local, proprietary research tool, and the
resulting relational database that serves a specific community, to a
LOD framework.
 Presentation by Nan Ratner on CHILDES extensions to archiving and
dissemination: new interdisciplinary extensions
 In initial pursuit of sustainability challenges, we will integrate University Library
representatives (Cornell and U of Chicago). The University Library is an essential
infrastructure.
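To make the relational-to-LOD conversion mentioned for the hands-on session more concrete, here is a minimal sketch, not the DTA Tool's actual procedure. It assumes a row is available as a Python dict and serializes it as N-Triples, one subject-predicate-object statement per column. The base URI, the table and column names, and the example row are all hypothetical, and the sketch assumes table and column names are already safe to use inside a URI.

```python
# Hypothetical base URI; a real project would choose and publish its own.
BASE = "http://example.org/dta/"


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def row_to_ntriples(table, key, row):
    """Turn one relational row (a dict) into a list of N-Triples lines.

    The primary key becomes the subject URI, every other column becomes
    a predicate, and every non-null value becomes a plain string literal.
    """
    subject = f"<{BASE}{table}/{row[key]}>"
    lines = []
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is already in the subject; skip NULLs
        predicate = f"<{BASE}schema/{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return lines


# Invented example row, not taken from any real dataset.
utterance = {"id": 17, "speaker": "CHI", "language": "es", "text": 'quiero "agua"'}
triples = row_to_ntriples("utterance", "id", utterance)
print("\n".join(triples))
```

A direct column-to-predicate mapping like this only makes the data addressable. Linking it to other resources would additionally require replacing the local predicates and values with terms from shared ontologies, which is where the metadata-standards challenge above comes in.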
Product of this Workshop
 A workshop, not simply paper presentations, to
accomplish knowledge exchange and shared
confrontation of challenges
 Convene at the end to draft realistic directions for the
future
 Ideal: a community draft of a sufficiently formal
model (or set of alternative models) for LOD
creation in terms of principles, standards and
procedures, capable of integrating currently diverse,
incompatible forms and of integrating
technical advances with real research needs in
the language sciences.
Acknowledgements
 Emily Bednarski, project manager
 Carissa Kang, graduate student assistant
 Jonathan Masci, undergraduate student assistant
 LSA Institute Directors (Karlos Arregi); support staff Laura Staum
Casasanto
 NSF Workshop grant BCS-14631965; NSF CI TEAM grant CI-0753415
 Cognitive Science travel grant for our undergrad student assistant, Jon
Masci
 Cornell Institute for Social Science grant supplementing administrative
costs
 LSA Institute Fellowship grant for graduate student, Carissa Kang.