An analysis of type inflation and error in the Legal Maori corpus Contents Error analysis..................................................................................................................................... 2 Sampling ....................................................................................................................................... 2 Scanning the wordlist for error ...................................................................................................... 5 English ........................................................................................................................................... 5 Next steps ........................................................................................ Error! Bookmark not defined. Orthographic variation ...................................................................................................................... 7 Double vowels ............................................................................................................................... 9 Lexical variation............................................................................................................................... 10 Dialect variation .......................................................................................................................... 10 Alternative morphology ............................................................................................................... 10 Appendix 1: Letter substitution errors .................................................. Error! Bookmark not defined. References ...................................................................................................................................... 13 The Legal Māori Corpus spans some of the earliest writing in Māori to contemporary legal and political texts. These texts have been divided into three corpora based on the period in which they were written: pre-1910, 1910-1969 and post-1970. The pre-1910 corpus was compiled using printed texts digitised by the New Zealand Electronic Text Centre (Darwin & Stephens, 2009), while the post1970 corpus consists of a mix of texts that were ‘born digital’ and were retrieved from online publications, and other published texts which were rekeyed by typists with expertise in digitising texts but without knowledge of te reo Māori. The corpora contain the largest structured digital collection of written Māori, with around 8 million words, but they also contain a large number of errors and variant forms of words. These errors and variant words mean that the total wordlist for the corpora yields around 60,000 unique words (known as types), which is considerably more types than one might expect to find. This report summarises the causes of error and other forms of type inflation (where one term is represented by a number of different forms, not as a result of error but as a result of orthographic, dialectal or lexical variation). Errors in the corpus include both original spelling errors, particularly in the variable use of macrons, and errors which were introduced in the digitisation and rekeying processes. Type inflation from language variation is due to the use of different orthographic conventions (chiefly differences in the use of macrons, double vowels and unmarked long vowels), and to variations in the language used by speakers of modern Māori. This variation can be attributed to both the use of reo-a-iwi, or dialectal differences in te reo Māori and also to variations in usage that reflect the development of modern Māori. This report focusses primarily on the post-1970 corpus because a more detailed study was made of these texts in order to evaluate the accuracy of the rekeying process. This corpus of post-1970 texts in the legal Māori corpus contains just under two million words (tokens), which make up 23,457 unique types across 577 texts. However, this report also gives an overview of error and type inflation in the two earlier corpora in order to give a picture of the rate of error and other forms of type inflation in the whole corpus. The error analysis was done using the Māori-only portion of the Legal Māori Corpus1 except where indicated. Error analysis Error identification was undertaken for two main reasons. Firstly, error identification is important for ascertaining the nature of the errors in the corpus by assessing the frequency of errors, the processes that caused errors to be introduced into the corpus and whether particular texts were more prone to error than others. Secondly, error identification is necessary in order to improve the quality of the wordlist by estimating the number of meaningful types in the corpus and identifying likely ‘junk’ types. In order to assess the nature of error in both the corpus and the wordlist, two processes of error identification were used. The first approach was to take a sample of high-frequency words which were likely to be prone to error and to identify the errors associated with these words in order to identify patterns in errors and texts which were particularly prone to error. This process was directed at identifying patterns of error in the corpus. The second process was to scan the wordlist and identify obvious errors (junk types) for removal in order to clean up the wordlist. Sampling Initial error analysis was concentrated on the pre-1910 corpus because the high number of types in this corpus, and in particular the high number of types that occurred only once in the corpus, indicated that it was likely this corpus contained a high number of errors. Analysis of the pre-1910 corpus indicated that these errors typically took two forms. The first of these was letter substitution errors, where one letter had been swapped with another, for example iēnei rather than tēnei. The second area of focus was word boundary errors, for example itēneiwā rather than i tēnei wā. In order to analyse the patterns of error in the post-1970 corpus, the researcher scanned the wordlist to ascertain whether errors in the post-1970 texts also tended to be letter substitution and word boundary errors. Reading the wordlist in the second stage of the error identification process confirmed that these two patterns accounted for the majority of error in the corpus, and thus it was possible to look for errors in the post-1970 corpus using the same sampling strategies that had been used to identify errors in the pre-1910 corpus. In order to identify likely errors in the post-1970 corpus, I began by identifying high frequency words of five or more letters on the grounds that there were likely to be more examples of misspellings of high frequency terms. Restricting the search terms to words of five or more letters excluded the 1 The full Legal Māori Corpus contains texts that include both English and Māori. The corpus was subsequently marked up to distinguish the two languages. This analysis was done using the Māori-only portion of the corpus except for when investigating error types that were removed as part of the process of marking up the English portion of the corpus. This report also comments on the small number of English types in the Māori-only corpus which were not correctly identified in the marking up process. large number of two-syllable words which differ from other legitimate types in the corpus by one letter, or which form shorter parts of longer legitimate words. For example, the highest-frequency four-letter term in the post-1970 corpus is mahi; searching for letter variations of mahi returns 27 types, of which four appear to be erroneous spellings of mahi and 23 are either legitimate words or erroneous spellings of other words. In contrast, searching for letter variations of the highestfrequency five-letter term Māori returns seven types, of which three are erroneous spellings, one is the alternate form Maori with no macron and three are either legitimate words or erroneous spellings of other words. Working with terms of five letters or more thus results in a much more concentrated collection of errors. I created a word list of the post-1970 corpus using WordSmith Tools version 5.0 (Scott, 2010), then identified the highest frequency words of five letters or longer using the length function in Excel. The 100 highest-frequency words of five or more letters were then inputted into the customised Excel macro Wordsearch2. For example, for the search term kaupapa, Wordsearch returns the following letter variations: raupapa, haupapa, kuupapa, kaupopa, kaurapa, koupapa and naupapa, along with their relative positions in the word list of the corpus. The word list is ordered by frequency so words which appear more often in the corpus have a higher ranking. In this case, the search term kaupapa returns: KAUPAPA (Letter Variation) RAUPAPA HAUPAPA KUUPAPA KAUPOPA KAURAPA KOUPAPA NAUPAPA Rank number 1408 5296 5837 15192 15196 15796 16868 Frequency 73 7 6 1 1 1 1 We can assume that high-frequency words are almost certainly genuine words rather than errors. For example, the term raupapa is ranked at 1408 in the wordlist and appears 73 times in the corpus. Similarly, haupapa and kuupapa are also genuine words. The remaining four terms each appear once only in the corpus. The next step is to ascertain whether these words are legitimate, low-frequency words or errors. To do this, the context in which these words appear is examined using the concord function in Wordsmith and the terms may also be cross-checked against existing dictionaries of Māori. Williams’ (2004) dictionary gives an entry for kaurapa that is consistent with the context in which this word appears but the remaining three terms kaupopa, koupapa and naupapa all appear in contexts where kaupapa would make more sense, and Williams does not give entries for these words. From this, it may be assumed that these words are errors. Where it was not certain whether the terms returned were errors, a spot check was made of the original sentence the term occurred in using the Concord function of WordSmith Tools. 2 Wordsearch is an Excel macro developed by Neal Osborne for the Legal Māori Project in order to assist in identifying letter substitution and word boundary errors. The macro compares words in a search list against the whole wordlist and identifies words which differ from the search terms by one character (letter variation function) and words which contain the search word (substring function). Similarly, if the search term kaupapa is searched using the substring function, the following results are returned. KAUPAPA (Substring) KAUPAPAHERE WHAKAKAUPAPATANGA WHAKAKAUPAPA WHAKAKAUPAPATIA KAUPAPATIA KAUPAPAO NGĀKAUPAPA TEKAUPAPA TĒNEIKAUPAPA WHAKAKAUPAPAHIA ĀKAUPAPA KAUPAPAAROTURUKI KAUPAPAHEREATE KAUPAPAPA KAUPAPAPAHERE KAUPAPATANGA KURAKAUPAPA RAKAUPAPA WĀHANGAKAUPAPA WHAKAKAUPAPANGIA WHAKAKAUPAPATIAHEI Rank number 1097 1865 2318 3150 7171 10265 10830 11932 11945 12532 13022 15180 15181 15182 15183 15184 15869 18702 21505 22141 22142 Frequency 106 46 32 19 4 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 Again, the higher frequency terms kaupapahere, whakakaupapatanga, whakakaupapa, whakakaupapatia and kaupapa can immediately be identified as legitimate types and the single occurrence kaupapatanga also falls into this category. Of the remaining terms kaupapao, ngākaupapa, tekaupapa and tēneikaupapa appear twice each in the corpus and can clearly be identified as word boundary errors between kaupapa and the grammatical words o, ngā, te and tēnei respectively. The single occurrences ākaupapa and kaupapahereate, rakaupapa and whakakaupapatiahei are likewise word boundary errors of ā kaupapa, kaupapahere a te, ra kaupapa and whakakaupapatia hei. Although they are not word boundary errors, kaupapapa and kaupapapahere also are clearly erroneous words containing a repeated –pa- syllable. The remaining words demonstrate some of the difficulties in deciding what to classify as errors. Whakakaupapahia, appearing twice in the corpus, and whakakaupapangia, appearing once, contain alternative passive endings for whakakaupapatia. While these variations are clearly used less frequently than whakakaupapatia (19 occurrences across 16 texts), they are most likely are a faithful representation of the language used by speakers of te reo Māori. This variation bears out Harlow’s (2007, pp. 115-118) observation that although –tia is the most productive of the passive suffixes, being most frequently used as the default suffix with transitivised verbs such as whakakaupapa or to transliterated words, some Eastern dialects use –ngia or –hia for this purpose. There are also a small number of words where different passive or nominal suffixes are used for different senses of a word, such as the nominalisations of ako: ākonga (student), akomanga (class, classroom) or akoranga (subject, course of learning). For these reason, I have taken a conservative approach to identifying error and have excluded low-frequency passive and nominal endings from the error list. Similarly, where there is confusion about whether a type may represent a word boundary error or a compound word, I have not categorised the type as an error. The search for kaupapa returned one occurrence each of the terms kaupapaaroturuki, kurakaupapa, wāhangakaupapa. While these are clearly not compound words in common usage, it is not possible to say that the original author did not intend to treat these as compound words. Harlow’s (2007, pp. 129-131) indicate several orthographic patterns used in forming compound words including one word with no spacing, hyphenated words and two words. In light of this orthographic diversity, I have not treated as errors those terms which could conceivably be compound words of the noun+modifier variety. It is difficult, even with reference to the original text, to determine the authors’ intent in many cases and this falls outside the scope of estimating the type count in the corpus. The user of the wordlist should therefore be aware that the wordlist contains a large number of types which are not in common usage but which can be considered true types rather than errors. Searching for letter variation associated with sample search words (the 100 highest frequency terms of five letters or more) and sorting for errors returned 345 letter substitution errors, of which there were 205 unique errors. This process also returned 537 word boundary errors, of which 412 were unique errors. The letter variation errors were sorted by the original, intended letter and the error letter. This process showed that most of the errors appeared to result from confusion between similar-looking letters. The most common errors in the sample were: 57 41 34 29 26 23 a to o t to r a to e o to a o to e e to a Scanning the wordlist for error The second step in the process of error identification process was to scan the wordlist for types that were clearly erroneous. Limitations of this process were: looking at the words out of context, the limited vocabulary of the reviewer and the large number of types to be reviewed (~23,500). The reviewer identified a list of 2,409 errors including misspelled words, run-on words, macron errors and English words. It is likely that if more detailed analysis of the wordlist were undertaken, more errors would be found. As it stands, these types represent around 10% of the number of types in the overall corpus. The overall scan of the wordlist confirmed that looking for one-letter variations and word boundary errors was a sound process because these types accounted for the majority of errors identified. Other errors included a small number of errors involving missing letters or letters that had been added in (as in whakpapa for whakapapa and enngari for engari respectively) or occasionally letters that had been swapped around. English There were 225 English types identified in the Māori-only corpus. Most of these are concentrated in a few texts and so it should be a straightforward matter to correct the mark-up process to exclude these passages from the Māori-only corpus. However, a smaller number of types point to problems distinguishing English and Māori in the mark up process. For example, when I investigated the type who, it became apparent that the process of marking up English for removal did not remove English words that could superficially appear to be Māori. These were words which contained only letters that appeared in the Māori alphabet and which contained only open (consonant-vowel or vowel only) syllables. For example, the file SetTeUrioHauHCS2000M.txt contained the following text in the Englishretained corpus: The Crown acknowledges that the operation and impact of the Native land laws (including the laws governing the operation of the Validation Court) had a prejudicial effect on those of Te Uri o Hau who wished to retain their land and that this was a breach of Te Tiriti o Waitangi/the Treaty of Waitangi and its principles. The Crown also acknowledges that the awarding of reserves exclusively to (Bold type added for emphasis) And the following text in the Māori-only corpus: a Te Uri o Hau who to a Te Tiriti o Waitangi/ Waitangi to Te Uri o Hau to a Te Uri o Hau; The remaining text in the Māori-only corpus indicates that the process of removing English from the texts sometimes results in residual gibberish. Users of the text who encounter such passages may wish to compare the texts they are working with against the corpus with English retained. A further implication of this is that the type/token ratio for certain types which occur in both English and Māori (such as a and to) may be distorted in the corpus, although because these are very highfrequency words it is not practical to determine to what extent this has occurred; a, for example, appears in the top ten high-frequency words in both te reo Māori and English (Boyce, 2006; Kilgarriff, 2006). Origins of the errors An initial look at the distribution of errors in the corpus indicates that some texts have a much higher concentration of errors than others. Furthermore, the kinds of errors identified in the sampling process showed that many of the errors were likely to have been introduced into the corpus in the digitisation process. Reasons for believing this are that the common letter substitution errors appear to concentrated around letters that are visually similar such as a and o, and t and r. Furthermore, high frequency words are not words which one would expect a speaker of te reo to misspell, yet there are quite a large number of errors associated with these high frequency terms. A sample of texts that contained a high numbers of errors was selected for checking. The sample contained texts that were identified as having a very high number of errors and texts that were representative of the different categories of texts in the corpus, including texts that had been scanned and digitised, texts that were originally published in .pdf format and Hansard records retrieved from web pages. Analysis of this sample confirmed that many of the errors identified were present in the digitised texts but not in the original scanned texts. Many word boundary errors corresponded with line breaks in the original texts, where the original text gave two words and the digitised text combined the last word of one line with the first word of the next. Letter variation errors also appear to have been introduced in the digitisation but there were no clearly identifiable factors that led to these errors. It does seem that a small number of these errors were introduced when the scanned document was of poorer quality, but this does not seem to account for the majority of the letter variation errors. It is possible that the error was introduced when the texts were rekeyed by typists who were unfamiliar with te reo Māori. An exception to this pattern of error introduced in the digitisation process was the error profile in the Hansard texts. The word boundary errors in the Hansard texts were also found in the online publications of these speeches and the reasons for the errors in these texts are unclear. The corpus with English retained The majority of the error analysis was conducted using the Māori-only corpus with English removed. However, examining the full wordlist containing types that were identified as both English and Māori revealed that there were a large number of error types which had been removed in the process of marking up the languages in the corpus. In order to determine the extent of this issue, the wordlist of the Māori-only corpus was stoplisted against the full corpus so that only the terms which were not found in the Māori-only corpus were retained. This list contained 18,636 types, the majority of which were English words. However, 1,813 words or 10% of the types were identified as potentially being Māori words, either errors or words which had been eliminated from the Māori-only corpus because, for example, they were personal names which appeared in a passage of text in English. Errors were predominantly letter variation (e.g. whakatlka for whakatikia) or missing letter errors (e.g. whakhaere for whakahaere) that meant the resulting error type violated one of the phonetic rules of Māori and were consequently eliminated as being English types. Also eliminated were words with medial upper case letters, which occurred in word boundary errors involving proper nouns, such as anaTe Karauna for ana Te Karauna or NgaiMāorie for Ngai Māori e. It is unclear as to why these words appear to have been flagged as English in the mark-up process. In all, words which violated the phonetic rules of Māori (letter variation and missing letter errors) accounted for 794 of the possible error types. These additional error types would take the total number of errors identified in the process of scanning the post-1970 wordlist to 3,203, and represent 25% of the error types identified in the corpus. Orthographic variation There are three systems used to represent long vowels in Māori: macrons, double vowels and unmarked long vowels. Vowel marking systems have changed over time; long vowels are typically unmarked in earlier texts but macrons predominate in contemporary texts and in their orthographic guidelines, Te Taura Whiri treats macrons as the “established means of indicating a long vowel” (2012, p. 4). However, this convention is not universal and many writers of texts in the post-1970 corpus use unmarked long vowels, which contributes significantly to the high number of types in the corpus. This is because the corpus contains both the macronised and unmarked variants of the same word, for example, Māori and Maori. Double vowels (as in Maaori) are used by only a few writers in the contemporary corpus but still contribute somewhat to the large number of types. In order to understand to what degree alternative long vowel orthographies lead to inflation of the type count in the corpus, we can start firstly by looking at the distribution of the different vowel marking systems in the corpus. Then we can estimate the number of additional types in the corpus that occur as a result of alternative orthographic systems. The great majority of type inflation due to orthographic difference in the corpus occurs in relation to variable use of macrons and unmarked long vowels. Some texts are written using macrons and some without, meaning that the same spoken word will be represented as two different written types in the corpus. Furthermore, there are a high number of errors associated with the use of macrons, which occur both when a writer using macrons has omitted macrons (for example the use of *nga where the writer typically uses ngā), and when writers have used macrons in words which are typically written with short vowels (such as *māhi). The high frequency of such errors may reflect the finding that younger speakers of te reo Māori differentiate the pronunciation of long and short vowels far less than speakers in the late 1800s-early 1900s (Harlow, 2007, pp. 80-81). In all, the corpus contains 6,774 types with macrons, of which 3,133 correspond to types in the corpus which are identical apart from the macrons. Of course not all of these words are errors. We can categorise these macron/no macron types in three (overlapping) ways: Types which represent different morphemes such as ki ‘to’ or ‘at’ and kī ‘say’ or ‘full’. Types which represent the use of different orthographies such as nga and ngā. Types which represent clear errors such as *Māōri. These categories overlap because words such as nga may be used intentionally in one text where the writer is using unmarked long vowels, but used erroneously in another text where the writer is using macrons to represent long vowels. We can examine the distribution of the different long vowel orthographies in the corpus by looking at the distribution of the highest-frequency word which contains a long vowel, namely ngā (alternatively nga or ngaa)3. Of the 577 texts in the post-1970 corpus, 570 contain at least one instance of ngā, nga or ngaa. Because nearly all texts in the corpus use at least one variant of ngā, nga or ngaa (the remaining seven texts that do not are very short), we can use it to assess the use of different vowel systems in the contemporary corpus. The distribution of ngā/nga/ngaa throughout different texts in the corpus is as follows: All texts containing ngā All texts containing nga All texts containing ngaa Texts with both ngā and nga Texts with both ngā and ngaa Texts with both nga and ngaa 412 343 13 185 4 13 From this distribution we can see that macrons appear to be the most widely-used orthographic system in the corpus. This assumption is supported by the token count: ngā occurs 64,769 times in the corpus, approximately five times as often as nga, which occurs 12,038 times. Ngaa occurs 1,446 times. What is striking is the high proportion of texts that contain both ngā and nga. We can ascertain that nga is likely to be an error by looking at the how frequently it occurs in a text. In texts that use 3 In doing this, I am assuming that the distribution of ngā/nga/ngaa reflects the typical distribution of long vowel orthography throughout the corpus. The advantage of doing this is that this process captures most of the texts in the corpus in a straightforward way. However, the distribution of ngā/nga/ngaa may not be a completely accurate reflection of orthographic systems in the corpus, especially in the case of ngaa, where the number of texts involved is small and many of the instances of ngaa occur as part of the iwi name Ngaa Rauru Kitahi. macrons ngā appears on average 38 times per thousand words. It could be expected that where authors are intentionally using unmarked long vowels, the distribution of nga would be similar, but where nga appears in error in texts that use macrons, it would appear as a low-frequency term. This assumption is validated by looking at the distribution of ngā and nga in the corpus, as illustrated in figure 1. The frequencies of each term in the texts were grouped into bands of five words per thousand. The blue series shows that in texts containing ngā the term appeared most frequently between 35-40 times per thousand words. Texts that contained very high or low rates of ngā were typically shorter texts where a few instances of the term could markedly affect the rate per thousand words. In contrast to the smooth, unimodal distribution of ngā, however, the red series representing nga spikes sharply in the 0-5 words per thousand band. That a large number of texts contain very few instances of nga indicates that this term is likely to slipped in due to the omission of the macron in texts that otherwise use ngā. Frequency /1000 words of ngā and nga in post1970 texts 180 160 Number of texts 140 120 100 ngā 80 nga 60 40 20 0 0 20 40 60 80 100 120 140 160 180 Frequency of term per thousand words Figure 1 Frequency of ngā and nga per thousand words in post-1970 texts Scanning the wordlist for macron errors resulted in the identification of 1,361 possible error types (around 20% of the types with macrons and just under 6% of the total number of types). It is likely that a more detailed examination of the wordlist, which would need to include looking at words in context, would yield a greater number of errors. Double vowels In order to determine the degree to which double vowel orthography was used in the corpus, I searched for the highest frequency term in the corpus that could be written either with double vowels or macrons, namely ngā/ngaa. The search retrieved 13 files out of 577 which contained ngaa, of which ten made extensive use of double vowel orthography. One of the remaining files contained only a few words with double vowels and two of the files only used ngaa as part of the tribal name Ngaa Rauru Kitahi. The texts using double vowels extensively appeared to be clustered around Tainui and and Ngaa Rauru Kitahi, suggesting that double vowels are utilised by only a few isolated writers within the corpus. I then compiled a wordlist from these files and cross-checked it against the full corpus wordlist for unique types that only occurred in these ten texts. The double vowel texts contained 935 unique types, of which 482 types contained double vowels. These 482 types with double vowels thus make up around 2% of the types in the corpus. Lexical variation Dialect variation Dialect variation does not appear to have been responsible for a great increase in the number of types in the corpus. It appears that regional differences in pronunciation such as the merger of *n and *ŋ to /n/ in the Bay of Plenty, and the South Island merger of *k and *ŋ to /k/ are not typically reflected in the orthography of post-1970 texts in the corpus (although three texts do contain brief examples of South Island dialect reflected in the orthography). Users of the corpus interested in finding texts which reflect characteristics of particular dialects may be able to find texts of interest by looking for alternate forms of high-frequency words associated with a particular dialect, such as the variant forms of determiners such as tēnei or ētahi (e.g. tēneki for tēnei, or wētahi for ētahi). Alternative morphology There are a number of base words in the corpus that have a large number of associated passive and nominal forms. This variety, combined with reduplication, means that in some cases one root morpheme may be represented by many words in the corpus. This appears to be particularly true of words that have acquired expanded meanings in modern Māori, especially when these words were not previously used as verbs. For example, for the term arotake “review, evaluate, audit” which occurs 926 times in the post-1970 corpus, there are the following passives: Arotakengia Arotakehia Arotakea Arotaketia 50 occurrences 46 occurrences 36 occurrences 19 occurrences There are also the following nominalisations: Arotakenga Arotakehanga Arotaketanga 216 occurrences 11 occurrences 3 occurrences While there is an established but not absolute preference for the nominalisation arotakenga, there are a number of accepted forms of the passive with no one term dominating. Future research may be able to identify factors (for example, dialect, date of publication or individual variation) that may affect the distribution of these allomorphic suffixes. In the case of at least a small number of words, different nominal forms can carry different shades of meaning. Te Aka Māori-English dictionary (Moorfield, 2013) distinguishes between akomanga “classroom, class”, akoranga “learning, subject, discipline”, and ākonga “student, pupil, learner” (similarly, it lists akonga without a macron in the phrase akonga a mua “alumnus”). It is not clear how many of the terms in the Māori Legal Corpus with alternative suffixes may fall into this category of words with differentiated meanings. Conclusion Upwards of 10% of the types in the post-1970 corpus appear to be errors. These errors are a combination of misplaced macrons, most of which are likely to exist in the original texts, and errors that have been introduced in the process of digitising texts. This estimate was made by reviewing word list without checking the context for each word; because of the length of the list and my incomplete knowledge of te reo Māori it is likely that some errors will have been overlooked and other non-error types included mistakenly. The majority of non-macron errors appear to have been introduced either as run-on words or through the substitution of similar-looking characters. A more detailed analysis of these introduced errors using high-frequency words indicated that word boundary errors accounted for around 60% of the errors in the sample and letter substitution accounted for about 40%. Because a conservative approach was taken to categorising run-on words that could possibly have been compound words, it is likely that the true number of introduced runon errors in higher than this. This could be checked by looking at the original texts to determine whether the author did use a compound word or not, but this would be a very time-consuming process. Areas for further research This initial examination of the corpus has focussed on what can be said about errors and orthographic variation in the corpus. It has not examined in any level of detail the reasons for other forms of language diversity such as reo-a-iwi or change in the language through time. These areas of non-error variation in the language may be of interest for further study. In terms of the diversity of types in the corpus, analysis of compound words and transliterations may be particularly fruitful. The research has examined te reo Māori at the level of individual words. But as Biggs (1973) identifies, the phrase rather than the word is the natural unit of te reo Māori. Although analysis of the corpus at a phrasal level was outside the scope of this project, the software (WordSmith Tools) used to analyse the corpus also offers powerful tools for analysing the language at a phrasal level, and this kind of analysis may be of interest to future researchers. Another avenue for investigation would be an examination of the number of hepaxlegomena (words that only occur once) as a means of estimating rates of error. This could be done through comparing the number of hepaxlegomena in the Legal Māori Corpus to similar corpora such as the Māori Broadcast Corpus, and also looking at the comparative rate of hepaxlegomena in legal language compared to in general corpora in other languages such as English. References Biggs, B. (1973). Let's learn Maori : a guide to the study of the Maori language. Wellington, New Zealand: Reed Education. Boyce, M. (2006). A corpus of modern spoken Māori. (Ph.D.), Victoria University of Wellington, Wellington, New Zealand. (1001855, 11177831) Darwin, J., & Stephens, M. (2009). The Legal Maori Archive: Construction of a large digital collection. The New Zealand Library & Information Management Journal, 51(3), 161-171. Harlow, R. (2007). Māori: A linguistic introduction. Cambridge, England: Cambridge. Kilgarriff, A. (2006). BNC database and word frequency lists. Retrieved from: http://www.kilgarriff.co.uk/bnc-readme.html#lemmatised Moorfield, J. C. (Ed.) (2013) Te aka Māori-English, English-Māori dictionary and index [online version]. Scott, M. (2010). WordSmith Tools (Version 5th). Te Taura Whiri i te Reo Māori. (2012). Guidelines for Māori language orthography. Te Taura Whiri i te Reo Māori. Retrieved from http://www.tetaurawhiri.govt.nz/english/pub_e/downloads/Guidelines_for_Maori_Languag e_Orthography.pdf
© Copyright 2026 Paperzz