University of Bergen
Department of Linguistic, Literary and Aesthetic Studies (LLE)

Sharing research data and results
A research ethics study on language data and intellectual property

Theory of Science and Ethics paper
Gyri Smørdal Losnegaard
November 10, 2013

1 Introduction

In this paper I argue that rights holders of source materials are often given too much influence over the distribution and use of language resources (LRs). Although copyright legislation protects the rights of both creators and users, experience from LR exchange projects gives the general impression that common practice in rights clearance¹ tends to favour the rights of intellectual property rights (IPR) holders at the expense of LR users. I have thus investigated to what extent IPR holders are actually entitled to determine the use of LRs derived from their work, taking into consideration both legal issues and other relevant factors such as the type of source data and the type of LR.

The discussion is based on considerations made during the planning of the data compilation for an inventory of Norwegian multiword expressions (MWEs).² The creation of this lexical resource will exploit a wide range of data such as raw text (novels, short stories, essays, news articles, blogs, etc.), corpora, electronic dictionaries, and other types of lexical and semantic resources. Several factors argue in favour of making this inventory publicly available with an open license distribution (CC-BY).³

2 Language data in scientific research

2.1 Language resources

Language data is fundamental to research in the humanities, in particular philology, linguistics and language technology (LT) research and development. Researchers collect and organise data sets for their own purposes, or they make use of existing LRs. One common type of LR is text corpora, which are large collections of written materials marked with linguistic and other relevant information and made searchable.
Other examples are electronic dictionaries, thesauri (conceptual dictionaries), terminology lists and other lexical inventories compiled for specific purposes. In computational linguistics and LT research, it is crucial to have access to large amounts of language data and to lexical and semantic resources, which are fundamental in the development of applications such as spelling and grammar checkers, information retrieval systems and machine translation.

LRs are valuable and re-usable resources, and there are many reasons for supporting the exchange of such data. They exploit potentially copyrighted materials, however, and sharing LRs thus often conflicts with other important issues such as personal data protection and copyright.

2.2 Communalism

Sharing research data and results is a moral responsibility, a social responsibility, and a prerequisite for scientific activity. Science communication is a specialised task that contributes to “the maintenance and development of cultural traditions, to the informed formation of public opinion and to the dissemination of socially relevant knowledge” [3].⁴ As such, it has an even more fundamental role, as an “expression of one of the requirements for democracy”. The community often invests heavily in research, and publicly funded research should, where possible, be made available to the public so that they may benefit from the results [3].

The exchange of scientific data and results can be seen in light of what Merton [2] refers to as communism in science.⁵ Merton groups the moral principles of science into four universal scientific mores: while universalism, disinterestedness and organized skepticism concern obligations towards objectivity, honesty, integrity and acknowledging one’s own fallibility [3], communalism primarily hinges on the idea that science, in the sense of obtaining knowledge by the application of scientific methods, can be viewed as a collaborative effort that belongs to the community.
The scientific enterprise rests, culturally and historically, on traditions established through previous scientific activity and practice, a practice according to which certified knowledge is granted by methodologically ratified verification and validation of theories, data and results. Knowledge can only be validated, or certified, if other scientists are allowed to examine data and project design. Granting other scientists access to research data and results is thus a prerequisite for continued research and for the development of knowledge, and is ultimately an acknowledgement that all research is based on the work of others.

¹ Cf. section 2.3.
² E.g. bring up, grab a bite, to and fro.
³ Open distribution, attribution required.
⁴ All quotations from this document are my own translations from the Norwegian.
⁵ Due to the present-day connotations of the term “communism”, “communalism” is often used instead, a convention I henceforth adopt.

2.3 Current practice in rights clearance for LRs

Copyright laws grant authors extensive and exclusive rights to control their intellectual property (IP). In accordance with international conventions, the Norwegian Copyright Act (NCA) gives the author as the IPR holder exclusive rights to earn money from his work, to disseminate copies of his work, and to object to any negative use and modification (infringements).⁶ Authors cannot prevent others from making derivatives based on their work, but are entitled to put restrictions on their distribution and use.

In research data exchange, rights clearance is a key task and refers to the process of obtaining authorisation from rights holders to distribute, use and re-use data and research results, and to negotiate the terms of use for such resources. The results of the negotiations are formalised in legally binding agreements where an end user license is set up for the resource in question.
LRs are, in compliance with standard copyright legislation, treated as derivative resources dependent on the language data used to create them. As a consequence, their distribution and use are conditioned by the rights holders of the source data, who have the moral and legal rights to authorise the use of their work and to determine the terms of use for LRs based on their material. This often results in licenses that limit the use of LRs considerably. For instance, the application of the condition “no derivatives” in an end user license formally prevents researchers from distributing new and enhanced versions of the LR to end users. IPR holders sometimes also prohibit physical download of an LR as a means of regulating its use, precluding the use of the LR for most LT development tasks.

One reason why rights holders would object to making an LR derived from their work available for download could be that they want to prevent users from reconstructing the source material, thus obtaining the opportunity to distribute it and benefit financially from its distribution. However, this is already prohibited by copyright law, which gives the author the exclusive right to distribute and exploit his work financially. The distribution of a derivative does not alter the fundamental rights of the author.

Although there are restrictions enforced by copyright legislation, legal systems also offer opportunities to relax these restrictions, in the form of statutory exceptions to the copyright and as judicial precedent where external factors have been judged to modify or overrule restrictions on the use of copyrighted work (also known as fair use [4, p. 68] in the US and UK legal systems). An example of fair use is the use of copyrighted material for purposes such as public education.
Whether “some use made of a work in any particular case is a fair use” will depend on the assessment of factors such as (1) the purpose and character of use (commercial, non-profit educational), (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used, and (4) the effect of the use upon the market or value of the copyrighted work [4, p. 68]. The use of language data for research purposes, in order to extend knowledge on some specific area, verify data and results, and develop LT and new LRs, should be a strong candidate for fair use.

However, European copyright legislation has no legal principle corresponding to fair use. In the Norwegian Copyright Act (NCA), special provisions are made for private use, licensing and citation and for purposes such as education, news reporting, scientific work, the inclusion of disabled people, etc.⁷ There are also concepts in the NCA, such as “intellectual property”, “substantial part” and “purpose of use”, that may be used to refine the overall picture. These limitations to the copyright seem to be somewhat underexploited in rights clearance. One reason for this could be that resource creators do not want to infringe on the moral and legal rights of IPR holders, or to risk liability in case of legal prosecution. As a result of precautions taken by both IPR holders and resource creators, LRs are often licensed under what seem to be unnecessarily strict user terms.

Although the basic moral and legal rights of authors should be acknowledged and respected, I nevertheless argue that the author of a text does not under all circumstances have unrestricted rights to determine the terms of use for an LR that in some way exploits this text.
To what extent the distribution and use of an LR should require authorisation by IPR holders of source data will depend on several factors, such as the nature and type of the source material, the number of different source materials used, the degree of exploitation and/or transformation of the original work, the nature of the new LR, and its intended use. I will not here investigate whether IPR holders have a disproportionately strong influence on the use of LRs, but rather try to determine to what extent they do have the right to control the licensing of LRs based on their work, given the factors mentioned above.

⁶ NCA articles 2–5.
⁷ http://www.lovdata.no/all/hl-19610512-002.html

3 LRs and copyright

3.1 Basic concepts and definitions relating to copyright

Copyright legislation essentially regulates the use of copyrighted material. The word “use” may in this context refer to two different scenarios. The first is the exploitation of the original itself, by reference or by dissemination, which I will refer to as direct use. The second form of use is the derivation of secondary work from an original, also called transformative use, which involves any kind of alteration of pre-existing work such as the translation of a text. Note that the term “original” is also ambiguous: it can either mean “unique”, “independent” work, which is the defining characteristic of intellectual property, or “pre-existing” in the sense of being the source of a derivative. To avoid confusion I will use the terms primary and secondary work to distinguish between originals and derivatives. LRs are normally secondary work.

The NCA defines IP as “literary, scientific or artistic work of any kind and of any mode of expression or creation”, including translations or adaptations of such work [1].⁸ To deserve copyright protection, a work must be original, i.e. it must qualify as IP in virtue of being the result of a minimum of creative and original investment.
What this minimum is will normally be a question for judicial assessment in the individual case: the term threshold of originality is used for “the degree of individual, creative effort that a work or production must have in order to be defined as intellectual work and consequently earn protection by copyright law” [my translation].⁹

3.2 Types of copyrighted work

Derivatives are “the translations, adaptations, arrangements and similar alterations of preexisting works” [6].¹⁰ A typical example is movie adaptations of novels. Someone who adapts an original has copyright to the new work without prejudice to the copyright of the original [1, 6].¹¹ This means that the creator of the derivative cannot dispose of it in a way that infringes on the copyright of the original work, and must obtain authorisation from the creator(s) of the primary work(s) in order to make the secondary work available to the public [5].

Collections are compilations of several works, or parts of such works. Examples are encyclopedic works, dictionaries and anthologies [7].¹² They are protected because they “by reason of the selection and arrangement of their contents, constitute intellectual creations” (ibid.). Collections and compilations are considered a kind of derivative and thus subject to the same rights and limitations [1, 6].

Databases are protected along with forms, catalogues, tables, programmes and similar work that “collocates a significant amount of information, or which is the result of a significant investment”.¹³ The compilation of databases does not necessarily involve originality and creativity, which is the primary criterion for a work to be considered IP. However, they often require significant investments on behalf of the creator in terms of efforts and competence [5]. Consequently, databases are copyrighted either in virtue of being creative, original work, or else as a result of investments in the data compilation and organisation.
It is not always clear whether databases are primary or secondary (derivative) works. This will probably depend on the nature of the data in the database, and on the degree of exploitation of the source materials.

3.3 Common LRs and copyright

In terms of language data, raw material is text that has not been enhanced in any significant way. It is the original work in its original form, “as was” when made public, and thus corresponds to primary work. According to the NCA, direct use of such data requires attribution of the author, while the re-use of secondary work derived from such data must be authorised by the author.

⁸ We here only consider written material.
⁹ http://no.wikipedia.org/wiki/Verksh%C3%B8yde
¹⁰ All quotations from this document are my own translations from the Norwegian.
¹¹ Article 5.
¹² Article 2(5).
¹³ The NCA article 43, known as the “catalogue provision”, is based on the EU directive on statutory protection of databases.

Corpora can be regarded as both linguistic databases and collections. They include original works, either complete works or substantial parts of such works. In rights clearance, corpora are treated as a kind of collection, although they are not traditional compilations of raw materials. Corpora texts are annotated and usually not very human-readable, and most corpora either contain only part of complete works, or the text is only a very small part of the total collection of texts, since the purpose is to document language use, either in general or particular to some specific genre, period, etc. This raises some highly relevant questions concerning the degree of exploitation and transformation of the source materials of corpora in general, and whether this could be argued to reduce the influence of the rights holders on this particular type of LR. Unfortunately, the scope of this paper does not allow us to explore this in more detail.
Limiting our attention to corpora as a data source in the creation of new LRs, we have to accept the terms of use specified in the licenses for the individual corpora.

Dictionaries and other lexical and semantic resources (reference works) are also situated somewhere in between database and collection. WIPO defines dictionaries as a type of collection, although dictionaries, thesauri and other lexical and semantic resources, unlike corpora, do not collocate text. On the contrary, they are inventories of lexical or semantic items supplied with relevant linguistic information. In that respect, they could be argued to be databases. When using such LRs as source data, this distinction seems not to be very relevant, since both types of work are copyright protected. As with corpora, it is the degree of exploitation that seems most relevant.

4 Discussion

The main question in this paper is whether IPR holders always have unlimited rights to restrict the use of secondary work based on their materials. I will here discuss factors that may have a bearing on this: the concept of originality in copyright, the types of source materials and the ways in which these materials are used in developing the new LR, and the type of this LR. I limit the discussion to the development and use of lexical resources, which is the kind of LR that I will create in my PhD project.

In the use of original works, there is arguably a difference between exploiting a text as an “artistic expression”, “narrative” or “expression of ideas”, and using the text in scientific work. In linguistics, the term text is used about sequences of words, phrases and sentences, demarcated by punctuation. The object of linguistics and related disciplines is not to exploit the text as a literary work, but to study language itself. In linguistic research, language data is used as evidence of how language is actually used by speakers and writers.
This kind of use could be argued to be less exploitative than for instance the adaptation of a literary work to a motion picture, which is itself an artistic expression that builds on the ideas of the original. It is thus contestable whether text, when being used as evidence for language use, meets the threshold of originality.

Pursuing the concept of threshold of originality even further, I will argue that language, in particular words and other lexical items, cannot be intellectual property. Language is made up of lexical units that are combined according to combinatory rules: words combine into phrases, which in turn combine into sentences. Words and syntactic patterns are conventions and have the meaning they have because language users collaboratively have come to associate meaning with them. Individuals can hardly claim authorship to the words of a language, since language itself is public domain. Clearly, combining words into phrases and sentences involves some degree of creative activity, but there are limitations to how much original content will fit into sentence format. I will not here try to determine the limit as concerns linguistic utterances and creativity: sequences of n characters or words (n-grams), phrases, clauses, sentences, paragraphs, chapters, or the text as a whole? A common opinion among language researchers, however, seems to be that anything shorter than a sentence is exempt from copyright.

Moving on to the specific case of my inventory of MWEs, I will build a database of lexical units extracted from many different sources. Each source will serve to exemplify instances of lexical items, and it will in many cases be difficult to single out concrete contributions from the individual sources. It thus seems unreasonable to claim that this one inventory is a derivative of every single source, which may amount to tens or even hundreds if we count the individual texts in the corpora.
The use of several sources will, in general, argue against treating an LR as a derivative. In the extraction of specific linguistic phenomena from language data, the data sources are not transformed or exploited in any substantial way, but rather used to find instances, or evidence, of the linguistic or lexical phenomena. Besides the question of threshold of originality, one may ask whether this kind of use involves any significant degree of exploitation of the source materials. Unless the new lexical resource is an updated version of a dictionary, or re-uses entries and definitions from other lexical resources, the new resource will usually not contain a substantial part of its sources. This kind of exploitation of source materials can thus be argued to be more related to direct use of language data (citation and reference) than to transformational use.

Finally, I will mention relevant aspects relating to each main type of source materials. When using raw text as source data in the creation of lexical inventories, the degree of exploitation of the source data can be argued to be closer to reference than to derivation. The resulting resource can thus be argued not to be a derivative, but rather a compilation of language units that belong to the public domain. It thus seems unreasonable that the rights holders of such materials should be entitled to restrict the distribution and use of this lexical database.

Corpora are usually created for purposes such as research and LR development, and are licensed exactly for such use. Although such resources sometimes have user terms that prohibit the distribution of derivatives, it is arguable whether a lexical inventory can be regarded as a direct derivative of a corpus. As in the case of raw text, the exploitation of the source material could be argued to be more referential than transformational.
As reference works, dictionaries and other kinds of lexical and semantic resources provide a systematic documentation of the vocabulary of a language. The nature of such materials provides LR developers with structured and quality-assured data. In this respect, LRs that exploit similar types of resources seem to be particularly reliant on the source data. However, lexicographers themselves use corpora in their work to an increasing degree, determining which lexical items belong to the vocabulary of a given language and consequently deserve a place in the dictionary. This illustrates how all linguistic work and resource development are to some extent based on the work of others, either the work of authors or other researchers or professionals.

Importantly, being reference works, dictionaries are after all meant to be used for reference. Extracting all items of a particular kind from a reference work, or list of words, can be argued to be more exploitative than the extraction of such items from raw text or corpora, in which lexical items occur “naturally” or “spontaneously” in a context. Unlike in dictionaries, the words and other lexical items in corpora were not selected for inclusion in the corpus based on their properties or lexical status; they are only there because they happened to occur in a text. For the same reasons, the public domain argument can more feasibly be applied to text and text collections than to systematically compiled inventories of lexical units.

The degree of exploitation makes dictionaries and other reference works a more difficult case than other types of LR. Since the public domain argument is somewhat weaker for this kind of source data, and the degree of exploitation is also somewhat higher, it seems motivated to obtain permission from the IPR holders of such materials to license the resource to end users.

5 Conclusion

In the case of my lexical inventory of Norwegian MWEs, this database will “re-use” lexical items that are arguably in the public domain. Several data sources will be exploited, and I will not include definitions and other information from dictionary entries in the new resource, only the lexical items themselves. The creation of the LR will thus not exploit any of the sources in any substantial way. The type of resource, its nature and its intended use all seem to suggest that the different rights holders of the source data are entitled to only minimal influence over its distribution and use. Except for the dictionaries, for which rights should probably be cleared with the rights holders of the originals, reference to the sources seems to be sufficient for the distribution and use of this lexical inventory. Making the database available to end users under a CC-BY license or an ACA license,¹⁴ even without obtaining authorisation, cannot be said to cause any major infringements on the rights of the rights holders.

¹⁴ Use for academic purposes.

References

[1] Kulturdepartementet. Lov om opphavsrett til åndsverk m.v. (Åndsverksloven.) [the Norwegian Copyright Act]. http://www.lovdata.no/all/nl-19610512-002.html.
[2] Robert K. Merton. The Sociology of Science, chapter The Normative Structure of Science. The University of Chicago Press, Chicago and London, 1942.
[3] National Committees for Research Ethics in Norway. Guidelines for research ethics in the social sciences, law and the humanities. http://www.etikkom.no/Documents/Publikasjoner-som-PDF/Guidelines%20for%20research%20ethics%20in%20the%20social%20sciences,%20law%20and%20the%20humanities%20(2006).pdf, 2005.
[4] Sam Ricketson. WIPO Study on Limitations and Exceptions of Copyright and Related Rights in the Digital Environment. http://www.wipo.int/meetings/en/doc_details.jsp?doc_id=16805, April 2003.
[5] Olav Torvund. Opphavsrett – en introduksjon [Copyright – an introduction]. http://www.torvund.net/index.php?page=opph-innl, 2010.
[6] World Intellectual Property Organization. WIPO Glossary. http://www.wipo.int/tk/en/resources/glossary.html#19.
[7] World Intellectual Property Organization. Berne Convention for the Protection of Literary and Artistic Works. http://www.wipo.int/treaties/en/ip/berne/trtdocs_wo001.html, 1886/1979.