Development of Linguistic Linked Open Data (LLOD) Resources for Collaborative Data-Intensive Research in the Language Sciences
LSA Summer Institute 2015
July 25-26, 2015
University of Chicago
Organizers: María Blume, Antonio Pareja Lora, Barbara Lust

Linguistic theory in a world of big data: LSA 2015
  “Highlight a growing interest within the field of linguistics to test theory with increasingly larger data sets, such as data from extensive linguistic fieldwork and documentation, data from crowdsourcing over the web, and corpus data from archival recordings and/or written sources”
  Integrate large and diverse data sources

Linked Open Data (LOD)
  The pursuit of “Linked Open Data”, achieved by exploiting the internet and cloud resources, represents the current approach to such data integration.
  Linked Open Data in Linguistics (LLOD) represents the attempt to exploit this paradigm within linguistics and the language sciences, now pursued intensively in Europe through the Open Linguistics Working Group (OLWG).
  The purpose of LLOD is to “discuss types of resources, strategies to address issues of interoperability between them, protocols to distribute, access and integrate this information and technologies and infrastructures developed on this basis” and to develop a community committed to a “Linked Open Data in Linguistics” (LLOD) agenda (http://ldl2014.org/index.html).

Our Purpose
  Develop the European-inspired LLOD vision, applied to actual and concrete research needs in the language sciences.
  Use the study of language acquisition as a case study.
  Cultivate an emerging community in this area.
  Begin to define the challenges to realizing this vision, both research challenges and technical challenges.
  Inform and cultivate synergies between the technical and the research-based scholars.
  Together, begin to chart a path to future solutions to the most pressing current challenges.

Our motivation
  The challenges, both active research needs and technical ones, are pressing.

Leading Research Questions in Language Sciences
  How is language acquired?
  How are multiple languages acquired, leading to multilingualism?

Language sciences: the challenges
  Cognitive Science:
    Interdisciplinary: data from diverse disciplines
    Intensely collaborative
    Cross-linguistic: monolingual or multilingual
    Data from diverse types of sources (e.g., experimental or observational)
    Data in many formats: audio, video, transcript
    Infinitely expandable data analyses and relevant coding of even a single utterance

Language Sciences: Diverse data forms
  Organized in different labs by different metadata standards, often with unstable infrastructures. This makes comparability, and consequent linking, difficult if not impossible.
  Often annotated with complex domain-specific markups, frequently involving cross-linguistic comparisons across various cultures, languages, and types of speakers.

Technical challenges
  “The lack of interoperability between linguistic and language resources represents a major challenge that needs to be addressed if information from different sources is to be combined…”; “…commonly accepted strategies to distribute, access and integrate their information have yet to be established, and technologies and infrastructures to address both aspects are still under development”. (LLOD)

Cross-cutting Challenges to confront
  Various linguistic theories can be applied for data description and analysis.
  Different disciplines speak different ‘languages’: a need to interface theoretical vocabularies (by means of ontologies).
  Annotation schemas resulting from specific ontologies can vary widely with specific research agendas: a need for precise and specific theoretically driven data markup and general knowledge provider frameworks in a “computationally practical” manner; a need to develop metadata standards.
  Cybertools must be developed to provide individual researchers with structured infrastructure for creating data that can efficiently become interoperable.

Cross-cutting Challenges to confront: Human Issues
  Intellectual property rights of the researcher who creates the data
  Human subjects protections for any natural language data
  Legal issues regarding data ownership and dissemination

Cross-cutting Challenges to confront: Sustainability
  “Who will pay for public access to research data?” (e.g., Berman and Cerf 2013, Science)

Vision of this Workshop
  “further growth of an open community relying on scientific collaboration beyond insular solutions and national boundaries” (Christian Chiarcos, pc 9/5/14; Open Linguistics Working Group, Linked Data in Linguistics, Johann Wolfgang Goethe-Universität, Frankfurt am Main, Germany)
  We cannot confront all challenges; this is a beginning.

A Few Details about workshop structure
  We have purposely attempted to integrate presentations by those pursuing technical challenges in LOD with those attempting to design data representation in active research in the language sciences; the case study is multilingualism.
  We have inserted continual discussion sections to facilitate exchange among participants, and we also welcome audience participation.
  As a concrete case study, a hands-on session tomorrow (Sunday) will explore a developing cyberinfrastructure for language data representation, the Data Transcription and Analysis (DTA) Tool, developed by a collaborative community with shared research needs.
  We will investigate the requirements for converting such a local, proprietary research tool, and the resulting relational database that serves a specific community, to a LOD framework.
  Presentation by Nan Ratner on CHILDES extensions to archiving and dissemination: new interdisciplinary extensions.
  In initial pursuit of sustainability challenges, we will integrate University Library representatives (Cornell and U of Chicago). The University Library is an essential infrastructure.

Product of this Workshop
  A Workshop, not simply paper presentations, to accomplish knowledge exchange and shared confrontation of challenges
  Convene at the end to draft realistic directions for the future
  Ideal: a community draft of a sufficiently formal model (or set of alternative models) for LOD creation in terms of principles, standards and procedures, capable of integrating now diverse, incompatible forms and of integrating technical advances with real research needs in the language sciences

Acknowledgements
  Emily Bednarski, project manager
  Carissa Kang, graduate student assistant
  Jonathan Masci, undergraduate student assistant
  LSA Institute Directors (Karlos Arregi); support staff Laura Staum Casasanto
  NSF Workshop grant BCS-14631965; NSF CI TEAM grant CI-0753415
  Cognitive Science travel grant for our undergraduate student assistant, Jon Masci
  Cornell Institute for Social Science grant supplementing administrative costs
  LSA Institute Fellowship grant for graduate student Carissa Kang