Overview of Corpus by COCA
September 04, 2015
http://Corpus.byu.edu/coca/
Please register in advance!

[1] Question: Do you think these are grammatical? Which one is ungrammatical?
a. 미미가 수학을 공부를 하였다.
b. 그 책을 나는 김이 읽었음을 믿는다.
c. 바지가 찢어도 졌다.
d. 바지들이 찢어들 지었다.

[2] What do we do through this corpus site?

Full-text
Download 440 million words of full-text data for COCA (190,000 texts), or 1.8 billion words for GloWbE (1,800,000 texts). With this data, you will have the texts from the corpora on your own computer, rather than having to use the web interface.

Wikipedia corpus (NEW)
Quickly and easily create "virtual" corpora from the 4.4 million articles of Wikipedia (1.9 billion words) on almost any topic -- biology, investments, cars, Buddhism, etc. Search these virtual corpora, compare them to each other, and create keyword/frequency lists from your corpora.

Word and Phrase (analyze texts)
Enter entire texts and see detailed frequency information on the words in the text, and create word lists based on your text.

Word and Phrase (frequency lists)
Search and browse the most complete frequency dictionary of English. Each entry gives the definition, frequency by genre, collocates (nearby words), concordance lines, synonyms, and WordNet-related words, all with useful links from one resource to another.

Word Frequency
You can also download lists showing the frequency of the top 60,000 lemmas by genre (and sub-genre). Free list of the top 5,000 lemmas in COCA. Download the 100,000-word integrated list from COCA, COHA, BNC, and SOAP -- the largest, corrected frequency list of English.

Collocates
Download lists with the top 200-300 collocates (nearby words) for 60,000 different lemmas -- 4,300,000 node/collocate pairs in all.

N-grams
Download free lists containing the top 1,000,000 2-grams (two-word sequences), 3-grams, 4-grams, and 5-grams in COCA. There are also other lists that contain the frequency of all 2-, 3-, and 4-grams (up to 155 million rows of data).

Academic vocabulary
Download free lists containing "core" academic words in 120 million words of COCA-Academic texts (including grouping by word families), as well as the top 20,000 words overall in COCA-Academic. See the Applied Linguistics article, or compare to the Academic Word List (Coxhead, 2000).

Word and Phrase (academic)
Similar to the two Word and Phrase resources above, but limited strictly to the 120 million words of academic texts in COCA. Get detailed information on words and phrases, frequency by sub-genre (e.g. Law, Medicine, Science, Business, Humanities), and concordances and collocates in just the academic texts. Also, analyze entire academic texts.

Related corpora
Besides the 450 million words of American English from 1990-2012 in COCA, you can also use COHA to look at the last 200 years of American English, the BNC to look at British English, or our version of Google Books to look at 155 billion words of data.

1.1 Collocates

Collocates are words that occur near a given word (the node word), and they can provide very useful insight into the meaning and usage of the words near which they occur.

Search form fields:
  [1] WORD(S)
  [2] COLLOCATES
  [3] number of words to the left
  [4] number of words to the right

The search finds [2] within [3] words to the left and [4] words to the right of [1]. If nothing is entered in the [COLLOCATES] field, the search returns any words within that span. On the site, click on any of the sample queries to run it.
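
As a rough illustration of the span logic described above (collocates within [3] words to the left and [4] words to the right of the node word), the following Python sketch counts window collocates in a tokenized text. It is not how the corpus site itself is implemented; the function name collocates, its parameters, and the toy sentence are assumptions made up for this handout.

```python
from collections import Counter


def collocates(tokens, node, left, right):
    """Count words within `left` tokens before and `right` tokens after
    every occurrence of `node` (a hypothetical helper, not the site's code)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)               # do not run past the start of the text
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# Toy sentence invented for illustration only.
tokens = "she gave a hearty laugh and then a scornful laugh at the joke".split()

before_only = collocates(tokens, "laugh", left=1, right=0)
wider = collocates(tokens, "laugh", left=2, right=1)
print(before_only)
print(wider)
```

With left=1 and right=0 the sketch keeps only the word immediately before laugh, which corresponds to a 1/0 span on the search form; widening the span also pulls in function words such as a, which is one reason the site offers sorting by relevance as well as by raw frequency.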

Sample queries (the explanation of each query is on the indented line below it):

[1] WORD(S)     [2] COLLOCATES                  [3]/[4]  SORT BY    GROUP BY    Examples
[thick]         [nn*]                           0/4      Frequency  Collocates  glasses, smoke
    A form of thick followed by a noun
laugh.[n*]      [j*]                            5/5      Frequency  Collocates  good, little, big
    Adjectives within five words of the noun laugh
laugh.[n*]      (empty)                         5/5      Relevance  Collocates  hearty, scornful
    Any words within five words of the noun laugh (sorted by relevance)
look into       [nn*]                           0/6      Frequency  Collocates  eyes, future
    Nouns after a form of look + into
eyes            clos*                           5/5      Frequency  Collocates  closed, close
    Words starting with clos* within five words of eyes
work/job        hard/tough/difficult            4/0      Frequency  Both words  hard//work, tough//job
    Work or job preceded by hard or tough or difficult
[feel] like     [*vvg*]                         0/4      Frequency  Collocates  crying, taking
    A form of feel followed by a gerund
find            time                            0/4      Frequency  Collocates  time
    Find followed by time
[=gorgeous]     [n*]                            0/4      Frequency  Collocates  woman, face
    Nouns after a synonym of gorgeous
[=gorgeous]     [n*]                            0/4      Frequency  Both words  attractive woman, beautiful day
    Nouns after a synonym of gorgeous
[=expensive]    [[[email protected]:clothes]]   0/5      Frequency  Collocates  shoes, short
    Synonym of white followed by a form of a word in the clothes list created by davies
[=expensive]    [[[email protected]:clothes]]   0/5      Frequency  Both words  expensive//shoes, pricey shirt
    Synonym of expensive followed by a form of a word in the clothes list created by davies
[=beautiful]    [=face].[n*]                    5/5      Frequency  Collocates  happy, delighted
    Synonym of beautiful before synonym of the noun face
[=beautiful]    [=face].[n*]                    5/5      Frequency  Both words  happy//child, delighted//boy
    Synonym of beautiful before synonym of the noun face

NOTES:
1. The [COLLOCATES] line of the search form must be visible in order to do a COLLOCATES search. Otherwise, it will simply look for the string in the WORD(S) field.
2. Nearly any search string that is possible for a simple, non-context search is possible for either the WORD(S) or COLLOCATES fields.
3. For queries that have two or more lemmas that are possible for both the WORD(S) and COLLOCATES fields (e.g. those with synonyms, word alternates, or customized lists), you will probably want to set [GROUP BY] to [BOTH WORDS] or [BOTH LEMMAS].

1.2 Synonym

You can easily find the frequency and distribution of the synonyms of a given word, and see which synonyms are used more in different registers or historical periods. You can also include a synonyms list as part of a longer search string.

NEW: You can also now look at "synonym chains". When you do a single word search, all of the matching synonyms will appear with a [S] after the word. Click on any of these to see the synonyms for that word, and then do it again, and again... This will allow you to follow a particular concept through chains of related meanings (for example, beautiful, then exquisite, then delicate, then sensitive, then mild, etc).

Examples                                Sample words
[=beautiful]                            attractive, charming
    Synonyms of beautiful
[=clean].[v*]                           clean, wipe, dust
    Synonyms of clean as a verb (base form of the verb)
[[=clean]].[v*]                         cleaning, wiped, mopping
    Synonyms of clean as a verb (all forms of the verb)
[[=clean]].[v*] the [n*]                wiped the seat, mopping the floor
    All forms of synonyms of clean as a verb + the + a noun
[=smart] SEC1 = [ACAD] SEC2 = [SPOK]    (SPOK) ritzy, brainy; (ACAD) vigorous, energetic
    Comparison of the frequency of synonyms of smart in ACADEMIC vs SPOKEN
[=strong] SEC1 = [ACAD] SEC2 = [FIC]    (ACAD) effective, persuasive; (FIC) burly, sturdy
    Comparison of the frequency of synonyms of strong in ACADEMIC vs FICTION

Chart: CLICK ON BARS FOR [ WEB ] IN CONTEXT

These are the results for [ web ]:

SECTION      PER MIL   SIZE (MW)    FREQ
SPOKEN          81.9        73.1    5992
FICTION         18.2        72.3    1315
MAGAZINE       121.4        74.0    8987
NEWSPAPER      106.7        72.7    7763
ACADEMIC        75.6        71.4    5399
1990-1994        8.6       101.9     877
1995-1999       78.8       100.7    7943
2000-2004      131.6       101.7   13391
2005-2009      122.4        59.2    7245

1  Each bar indicates the relative frequency of the word, group of words, phrase, or grammatical construction in each of the sections of the corpus -- the registers (SPOKEN, FICTION, MAGAZINES, NEWSPAPERS, and ACADEMIC) and the time periods (1990-1994, 1995-1999, 2000-2004, and 2005-2009). If the search was for an exact word or string, then click on this bar to see the Keyword in Context display. If not, then you can click on this bar to see a list of matching words or strings, sorted by the section on which you click.
2  PER MIL: The frequency of the matching strings per million words (normalized, to permit comparison across registers)
3  SIZE (MW): The size of the section in millions of words
4  FREQ: The raw frequency of the matching strings in each register.

2. Syntax

Consider the following three examples.

• [like] for [p*] to [v*] (I'd really like for you to stay)
There are 5 tokens in the BNC, but 330 tokens in COCA. With the BNC there aren't enough examples to see if this is a feature of informal or formal English, but the data from COCA show that it is clearly a feature of spoken English. The data also show that it is increasing slowly over time, when compared as a ratio to the construction [ like -- him to V ].
• Is it excel in V-ing, or excel at V-ing? (she excels in/at playing the piano)
Granted, this is a very narrow issue, but it is precisely the thing that translators and non-native speakers are interested in. With the BNC there are 5 tokens with at and 6 with in -- probably not enough to say which is more common. In COCA, however, there are 122 with at and 42 with in.
This is enough to begin to see which genres prefer one or the other, as well as which subordinate clause verbs occur with each. Such granularity is not possible with the BNC.
• [have] been being [vvn] (she had been being watched)
There are 2 tokens in the BNC (1 spoken, 1 fiction), and this is not enough data to see any possible genre variation. In COCA, on the other hand, there are 13 tokens (10 spoken, 2 fiction, 1 news). This is enough to show that this is a feature of spoken English, and the data also show that it has been increasing since 1990. (By the way, most native speakers of both dialects will cringe at sentences like this, but they are in the corpora.)

SORTING [1]   [TYPE] [2]

1  Selects how the results will be sorted. The default is sorting by raw FREQUENCY, where the most frequent results appear first. You can also sort by RELEVANCE. Generally relevance is defined by the Mutual Information score, which is a measure of how "tightly" linked two words are. This takes into account the overall frequency of collocates, and sorts out high-frequency "noise" words. Compare the regular listing for hard * or * himself and the relevance-sorted listings of the same: hard * or * himself. If you are doing word comparisons, then relevance shows which collocates occur with Word 1 but not Word 2 and vice versa. If you are comparing two sections of the corpus, then relevance shows what words are in Section 1 but not Section 2 and vice versa. Note: There is a special page for [SORT BY]

GROUP BY (form default: WORDS)
  WORD (default): Entries are not separated by part of speech (e.g. there is only one entry for clean, combining its use as an adjective and a verb).
  LEMMA: All results are grouped by lemma (e.g. groups together swim, swimming, swam, etc). Useful when comparing results in two sections, where the specific conjugation of a verb (for example) doesn't matter.
  NONE: This will show separate entries when a word has two or more part of speech tags (e.g. clean = [vvi], [j], etc). Usually you would not use this setting, but you do need to if you want the Keyword in Context entries to be limited to a particular part of speech.
  BOTH WORDS / BOTH LEMMA: This is the best option for COLLOCATES searches where both the WORD(S) and COLLOCATES fields have several possibilities. For example, if you are looking for synonyms of flower near synonyms of pretty, then it will list all of the pairs: pretty flower, beautiful roses, etc. Otherwise, it would list only the COLLOCATE (pretty, beautiful, etc).

DISPLAY (form default: RAW FREQ)
  RAW FREQ (default): The number of tokens in each section of the corpus
  PER/MIL: Tokens per million words; allows better comparison across sections of different sizes.
  RAW FREQ+: Raw frequency + per million
  PER/MIL+: Per million + raw frequency
  In any case, the coloring of the cells in the results table is a function of the normalized frequency (tokens per million words).

SAVE LISTS (form default: NO)
  NO (default)
  YES: This allows you to save entries from a results set into a customized, user-defined list, which you can then retrieve later on and use as part of subsequent queries. For example, you could search for synonyms of beautiful, select certain entries, add more forms that you feel are missing, and then save this as your own [beautiful] list.

# HITS (form default: 100)
  Number of "hits" to see in the results set. Default = 100.

Q1 Analyze the following sentence, and judge whether we can extract a general grammatical pattern from it.
a. Chef Christian Hermsdorf cooks his way through the menu for $30 a person. (COCA)

More…

[WORDS]
fascis*: fascist, fascists, fascisti, fascism, fascismo
calm* down
[screw] up: screwed, screwing
frustrating.[j*]: frustrating, a frustrating mixture of brilliance and stupidity

[GRAMMATICAL CONSTRUCTION]
[end] up [vvg*]: end up V-ing
going|gon to|na [v*]: going to V
need [x*] [v*]: needn't V
[get] [vvn*]: get passive (get tired)
. hopefully: sentence-initial hopefully
was|be|is|are|am|is|were being [*vvn*]: progressive passive (was being considered)
[n*] [be] that|those of: Noun be that of

[ROOTS, PREFIXES, SUFFIXES]
*heart*: heart, hearts, hearty, heartily, hearth, sweetheart, heartless, heartbeat, kind-hearted, brokenhearted...
home*: home, homes, homer, homeward, homely, homeless, homestead, homeland, homework, homesick, hometown, homespun, homecoming...
*able.[j*]: -able adjectives: able, considerable, available, remarkable, unable, valuable, reasonable, inevitable, probable, favorable, desirable...
*ware.[nn*]: -ware nouns: software, hardware, ware, silverware, earthenware, glassware, tinware, chinaware, spyware...
*-free: tax-free, duty-free, care-free, scot-free, fat-free, rent-free, dairy-free, interest-free, drug-free, ice-free, risk-free, trouble-free, nuclear-free, snow-free, worry-free...

You can also have the corpus generate a list of words that were used more in one period than another, even when you don't know what the specific words might be. For example, you can compare verbs in the 1970s-2000s (left) to the 1930s-1960s (right), adjectives in the 1970s-2000s (left) and the 1930s-1960s (right), or -ly adverbs in the 1900s (left) to the 1800s (right).

The corpus can also help to show how the meaning or usage of words has changed over time, by looking at changes in collocates (co-occurring words). For example, the collocates of sexual, gay, chip, engine, or web have changed over time. Notice also how this can signal cultural changes over time, such as nouns used with woman in the 1930s-50s compared to the 1960s-80s, or nouns used with problem 1920-present (left) compared to 1810-1920 (right).
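
The PER/MIL display option and the relevance sort described under SORTING both come down to simple arithmetic, which the Python sketch below tries to make concrete. It is only an illustration under stated assumptions: the function names per_million and pmi are invented here, the mutual information formula is the textbook pointwise version (the site's own relevance formula may differ, for example by adjusting for span size), and the counts in the second half are hypothetical. The two frequency/size pairs in the first half are taken from the [ web ] results listing earlier in this handout.

```python
import math


def per_million(raw_freq, section_size_mw):
    """Normalize a raw frequency by section size (in millions of words)."""
    return raw_freq / section_size_mw


def pmi(pair_freq, node_freq, collocate_freq, corpus_size):
    """Textbook pointwise mutual information (base 2) for a node/collocate pair.

    Assumed for illustration; not necessarily the formula the site uses.
    """
    return math.log2(pair_freq * corpus_size / (node_freq * collocate_freq))


# Raw frequency and section size (MW) pairs from the [ web ] results listing.
low = per_million(1315, 72.3)    # about 18.2 per million
high = per_million(8987, 74.0)   # about 121.4 per million

# Hypothetical counts: a 1,000,000-word corpus, node word occurring 1,000 times.
rare_but_tight = pmi(pair_freq=50, node_freq=1000,
                     collocate_freq=2000, corpus_size=1_000_000)
frequent_noise = pmi(pair_freq=500, node_freq=1000,
                     collocate_freq=500_000, corpus_size=1_000_000)

print(round(low, 1), round(high, 1))
print(round(rare_but_tight, 2), round(frequent_noise, 2))
```

In the hypothetical counts, the second collocate co-occurs with the node ten times as often as the first, yet its score is lower because it is so frequent overall that the co-occurrence is no more than chance would predict. That is the sense in which a relevance sort filters out high-frequency "noise" words that a raw-frequency sort would put at the top.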

Note on advanced queries involving variable length between words

One "slot": Make sure there is no space, or it will be interpreted as two consecutive words. (The meaning of each syntax is on the indented line below it.)

Syntax        Example                       Sample matches
word          lad                           lad
    One exact word
[pos]         [cs]                          going, using
    Part of speech (exact)
[pos*]        [v*]                          find, does, keeping, started
    Part of speech (wildcard)
[lemma]       [sing]                        sing, singing, sang
              [tall]                        tall, taller, tallest
    Lemmas (all forms of a word)
[=word]       [=strong]                     formidable, muscular, fervent
    Synonyms
[user:list]   [[email protected]:clothes]   tie, shirt, blouse
    Customized lists
word|word     stunning|gorgeous|charming    stunning, charming, gorgeous
    Any of these words
*xx           un*ly                         unlikely, unusually
x?xx          s?ng                          sing, sang, song
x?xx*         s?ng*                         song, singer, songbirds
    Wildcards: * = any number of letters; ? = one letter
-word         -[nn*]                        the, in, is
    NOT (followed by PoS, lemma, word, etc. Most useful for "multiple slot" queries; see below)

Combinations of preceding (samples)

You can limit to a particular part of speech by adding a period (full stop) and then the part of speech tag in brackets. This is always optional. Make sure there is no space before or after the period (full stop), or it will be interpreted as two consecutive words.

Syntax          Example         Sample matches
word.[pos]      strike.[v*]     strike (only as a verb)
    Exact word and part of speech
word*.[pos]     dis*.[vvd]      discovered, disappeared, discussed
    Substring and part of speech
[lemma].[pos]   [strike].[v*]   strike, struck, striking
    Lemma and part of speech
[=word].[pos]   [=beat].[v*]    hit, strike, defeat (but not nouns, like rhythm or drumming)
    Synonym and part of speech

You can add "lemma" to any other type of search, such as synonym or customized list, to see all forms of the matching words. Just use an extra set of brackets.

Syntax                Example                              Sample matches
[[=word]]             [[=publish]]                         announced, circulating, publishes, issue
    Synonym and lemma (no part of speech specified, so some noun uses)
[[user:list]]         [[[email protected]:clothes]]        tie, tying, socks, socked, shirt, blouses
    Customized list and lemma (no part of speech specified, hence tying)

You can also choose lemma and part of speech by combining the preceding symbols.

Syntax                Example                              Sample matches
[[=word]].[pos]       [[=clean]].[v*]                      mop, scrubs, polishing
    Synonym and lemma and part of speech
[[user:list]].[pos]   [[[email protected]:clothes]].[n*]   tie, ties, sock, socks (i.e. just nouns)
    Customized list and lemma and part of speech

Multiple "slots": Create sequences of words, using any of the preceding query types. Note that in each case, there is a space between the word "slots" in the query. These are just a few examples, from an unlimited number of combinations.

Note on advanced queries involving variable length between words.

Query                                             Sample matches
of no little                                      of no little
fast|quick|rapid [nn*]                            fast food, rapid transit
pretty -[nn*]                                     pretty smart, pretty as (but not pretty girl, pretty picture, etc)
[v] [p*] into [vvg*]                              talk him into staying, coerced them into buying
.|,|; nevertheless [p*] [v*]                      . Nevertheless it is
                                                  ; nevertheless he said
    (Notice that punctuation can be used like any "word"; just make sure that it is separated from words by a space)
[break] the [nn*]                                 break the law, broke the story
[[beat]].[v*] * [nn*]                             beat the Yankees, beaten to death
[=gorgeous] [nn*]                                 beautiful woman, attractive wife
[put] on [ap*] [[email protected]:clothes].[n*]   put on her hat, putting on my pants
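
To get a feel for the * and ? wildcards in the syntax table (un*ly, s?ng, s?ng*), the Python sketch below applies the same patterns to a small word list using the standard library's fnmatch module. This is an analogy rather than the corpus site's own matching code: the helper name match_wildcard and the word list are made up for illustration, and fnmatch gives square brackets a different meaning, so the sketch covers only plain word patterns, not tags such as [nn*] or lemmas such as [sing].

```python
from fnmatch import fnmatchcase


def match_wildcard(pattern, words):
    """Return the words matching a pattern where * is any number of
    characters and ? is exactly one character (hypothetical helper)."""
    return [w for w in words if fnmatchcase(w, pattern)]


# Small word list invented for illustration.
words = ["unlikely", "unusually", "unusual", "sing", "sang", "song",
         "string", "singer", "songbirds"]

print(match_wildcard("un*ly", words))   # words starting with un and ending in ly
print(match_wildcard("s?ng", words))    # s + one letter + ng
print(match_wildcard("s?ng*", words))   # the same, plus any continuation
```

Note that s?ng does not match string, because ? stands for exactly one letter, while s?ng* matches both the four-letter forms and longer words such as singer and songbirds.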