Overview of Corpus by COCA

September 04, 2015
http://Corpus.byu.edu/coca/
Please register in advance!
[1] Question: Do you think these sentences are grammatical? Which one is ungrammatical?
a. 미미가 수학을 공부를 하였다.
b. 그 책을 나는 김이 읽었음을 믿는다.
c. 바지가 찢어도 졌다.
d. 바지들이 찢어들 지었다.
[2] What can we do through this corpus site?
Full-text
Download 440 million words of full-text data for COCA (190,000 texts), or 1.8 billion
words for GloWbE (1,800,000 texts). With this data, you will have the texts from the
corpora on your own computer, rather than having to use the web interface.
Wikipedia corpus (NEW)
Quickly and easily create "virtual" corpora from the 4.4 million articles of Wikipedia (1.9
billion words) on almost any topic -- biology, investments, cars, Buddhism, etc. Search
these virtual corpora, compare them to each other, and create keyword/frequency lists
from your corpora.
Word and Phrase (analyze texts)
Enter entire texts and see detailed frequency information on the words in the text, and
create word lists based on your text.
Word and Phrase (frequency lists)
Search and browse the most complete frequency dictionary of English: definitions,
frequency by genre, collocates (nearby words), concordance lines, synonyms, and
WordNet-related words, all with useful links from one resource to another.
Word Frequency
You can also download lists showing the frequency of the top 60,000 lemmas by genre
(and sub-genre). A free list of the top 5,000 lemmas in COCA is available. You can also
download the integrated 100,000-word list from COCA, COHA, BNC, and SOAP -- the largest,
corrected frequency list of English.
Collocates
Download lists with the top 200-300 collocates (nearby words) for 60,000 different
lemmas -- 4,300,000 node/collocate pairs in all.
N-grams
Download free lists containing the top 1,000,000 2-grams (two-word sequences), 3-grams,
4-grams, and 5-grams in COCA. There are also other lists that contain the
frequency of all 2-, 3-, and 4-grams (up to 155 million rows of data).
Academic vocabulary
Download free lists containing "core" academic words in 120 million words of COCA-Academic
texts (including grouping by word families), as well as the top 20,000 words
overall in COCA-Academic. See the Applied Linguistics article, or compare to the Academic
Word List (Coxhead, 2000).
Word and Phrase (academic)
Similar to the two Word and Phrase resources above, but limited strictly to the 120 million
words of academic texts in COCA. Get detailed information on words and phrases,
frequency by sub-genre (e.g. Law, Medicine, Science, Business, Humanities), and
concordances and collocates in just the academic text. Also, analyze entire academic
texts.
Related corpora
Besides the 450 million words of American English from 1990-2012 in COCA, you can
also use COHA to look at the last 200 years of American English, the BNC to look at
British English, or our version of Google Books to look at 155 billion words of data.
1.1 Collocates
Collocates are words that occur near a given word (the node word), and they can provide very useful insight into
the meaning and usage of the words near which they occur.
Search form: WORD(S) is field [1], COLLOCATES is field [2], and the two span selectors are [3] (left) and [4] (right).
See an explanation of what happens if you don't enter anything in the [COLLOCATES] field.
Finds [2] within [3] words to the left and [4] words to the right of [1]. Click on any of the links below to run the query.
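Example queries ([1] = WORD(S), [2] = COLLOCATES, [3]/[4] = span left/right):
WORD(S) [thick], COLLOCATES [nn*], span 0/4
    Explanation: A form of thick followed by a noun
    SORT BY Frequency, GROUP BY Collocates
    Examples: glasses, smoke
WORD(S) laugh.[n*], COLLOCATES [j*], span 5/5
    Explanation: Adjectives within five words of the noun laugh
    SORT BY Frequency, GROUP BY Collocates
    Examples: good, little, big
WORD(S) laugh.[n*], COLLOCATES (empty), span 5/5
    Explanation: Any words within five words of the noun laugh (sorted by relevance)
    SORT BY Relevance, GROUP BY Collocates
    Examples: hearty, scornful
WORD(S) look into, COLLOCATES [nn*], span 0/6
    Explanation: Nouns after a form of look + into
    SORT BY Frequency, GROUP BY Collocates
    Examples: eyes, future
WORD(S) eyes, COLLOCATES clos*, span 5/5
    Explanation: Words starting with clos* within five words of eyes
    SORT BY Frequency, GROUP BY Collocates
    Examples: closed, close
WORD(S) work/job, COLLOCATES hard/tough/difficult, span 4/0
    Explanation: Work or job preceded by hard or tough or difficult
    SORT BY Frequency, GROUP BY Both words
    Examples: hard//work, tough//job
WORD(S) [feel] like, COLLOCATES [*vvg*], span 0/4
    Explanation: A form of feel followed by a gerund
    SORT BY Frequency, GROUP BY Collocates
    Examples: crying, taking
WORD(S) find, COLLOCATES time, span 0/4
    Explanation: Find followed by time
    SORT BY Frequency, GROUP BY Collocates
    Examples: time
WORD(S) [=gorgeous], COLLOCATES [n*], span 0/4
    Explanation: Nouns after a synonym of gorgeous
    SORT BY Frequency, GROUP BY Collocates
    Examples: woman, face
WORD(S) [=gorgeous], COLLOCATES [n*], span 0/4
    Explanation: Nouns after a synonym of gorgeous
    SORT BY Frequency, GROUP BY Both words
    Examples: attractive woman, beautiful day
WORD(S) [=expensive], COLLOCATES [[[email protected]:clothes]], span 0/5
    Explanation: Synonym of expensive followed by a form of a word in the clothes list created by davies
    SORT BY Frequency, GROUP BY Collocates
    Examples: shoes, short
WORD(S) [=expensive], COLLOCATES [[[email protected]:clothes]], span 0/5
    Explanation: Synonym of expensive followed by a form of a word in the clothes list created by davies
    SORT BY Frequency, GROUP BY Both words
    Examples: expensive//shoes, pricey shirt
WORD(S) [=beautiful], COLLOCATES [=face].[n*], span 5/5
    Explanation: Synonym of beautiful before synonym of the noun face
    SORT BY Frequency, GROUP BY Collocates
    Examples: happy, delighted
WORD(S) [=beautiful], COLLOCATES [=face].[n*], span 5/5
    Explanation: Synonym of beautiful before synonym of the noun face
    SORT BY Frequency, GROUP BY Both words
    Examples: happy//child, delighted//boy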
NOTES:
1. The [COLLOCATES] line of the search form must be visible in order to do a COLLOCATES search. Otherwise, it will simply look for
the string in the WORD(S) field.
2. Nearly any search string that is possible for a simple, non-context search is possible for either the WORD(S) or COLLOCATES fields.
3. For queries that have two or more lemmas that are possible for both the WORD(S) and COLLOCATES fields (e.g. those with synonyms,
word alternates, or customized lists), you will probably want to set [GROUP BY] to [BOTH WORDS] or [BOTH LEMMAS].
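The span logic described above ("within [3] words to the left and [4] words to the right of [1]") can be sketched in a few lines of Python. This is only an illustration on a made-up sentence: the function name `collocates` and the sample text are assumptions, and the sketch matches exact word forms only, not lemmas or part-of-speech tags as the corpus site does.

```python
from collections import Counter

def collocates(tokens, node, left=4, right=4):
    """Count the words found within `left` words before and `right` words after `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        # Words to the left of the node plus words to the right of it
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

# Made-up sample text (not corpus data)
text = "he wore thick glasses and saw thick smoke rise behind thick black smoke".split()

# Span 0/2: nothing to the left, two words to the right of "thick"
after = collocates(text, "thick", left=0, right=2)
print(after.most_common(1))  # [('smoke', 2)]
```

Setting `left=0` corresponds to a span such as 0/4 in the examples: only words after the node word are counted.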
1.2 Synonym
You can easily find the frequency and distribution of the synonyms of a given word, and see which
synonyms are used more in different registers or historical periods. You can also include a synonyms
list as part of a longer search string.
NEW: You can also now look at "synonym chains". When you do a single word search, all of the
matching synonyms will appear with a [S] after the word. Click on any of these to see the synonyms
for that word, and then do it again, and again... This will allow you to follow a particular concept
through chains of related meanings (for example, beautiful, then exquisite, then delicate,
then sensitive, then mild, etc).
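The idea of a synonym chain can be illustrated with a small Python sketch. The thesaurus below is a hypothetical toy dictionary that contains only the chain mentioned above; the name `synonym_chain` is an assumption, and the sketch simply follows the first listed synonym, whereas on the site you choose which [S] link to click.

```python
# Hypothetical toy thesaurus: only the links named in the text above
SYNONYMS = {
    "beautiful": ["exquisite"],
    "exquisite": ["delicate"],
    "delicate": ["sensitive"],
    "sensitive": ["mild"],
    "mild": [],
}

def synonym_chain(start, steps, thesaurus):
    """Follow the first unseen synonym up to `steps` times; stop early at a dead end."""
    chain = [start]
    for _ in range(steps):
        options = [w for w in thesaurus.get(chain[-1], []) if w not in chain]
        if not options:
            break
        chain.append(options[0])
    return chain

print(synonym_chain("beautiful", 4, SYNONYMS))
# ['beautiful', 'exquisite', 'delicate', 'sensitive', 'mild']
```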
Examples, with explanation and sample words:
[=beautiful]
    Explanation: Synonyms of beautiful
    Sample words: attractive, charming
[=clean].[v*]
    Explanation: Synonyms of clean as a verb (base form of the verb)
    Sample words: clean, wipe, dust
[[=clean]].[v*]
    Explanation: Synonyms of clean as a verb (all forms of the verb)
    Sample words: cleaning, wiped, mopping
[[=clean]].[v*] the [n*]
    Explanation: All forms of synonyms of CLEAN as a verb + the + a NOUN
    Sample words: wiped the seat, mopping the floor
[=smart] with SEC1 = [ACAD], SEC2 = [SPOK]
    Explanation: Comparison of the frequency of synonyms of smart in ACADEMIC vs SPOKEN
    Sample words: (SPOK) ritzy, brainy; (ACAD) vigorous, energetic
[=strong] with SEC1 = [ACAD], SEC2 = [FIC]
    Explanation: Comparison of the frequency of synonyms of strong in ACADEMIC vs FICTION
    Sample words: (ACAD) effective, persuasive; (FIC) burly, sturdy
These are the results for [web] (in the web interface, click on a bar to see [web] in context):
SECTION      PER MIL   SIZE (MW)    FREQ
SPOKEN          81.9        73.1    5992
FICTION         18.2        72.3    1315
MAGAZINE       121.4        74.0    8987
NEWSPAPER      106.7        72.7    7763
ACADEMIC        75.6        71.4    5399
1990-1994        8.6       101.9     877
1995-1999       78.8       100.7    7943
2000-2004      131.6       101.7   13391
2005-2009      122.4        59.2    7245
1. Bars: Each bar indicates the relative frequency of the word, group of words, phrase, or grammatical
construction in each of the sections of the corpus -- the registers (SPOKEN, FICTION,
MAGAZINES, NEWSPAPERS, and ACADEMIC) and the time periods (1990-1994, 1995-1999, 2000-2004, and 2005-2009).
If the search was for an exact word or string, then click on this bar to see the Keyword in
Context display. If not, then you can click on this bar to see a list of matching words or strings,
sorted by the section on which you click.
2. PER MIL: The frequency of the matching strings per million words (normalized, to permit comparison
across registers).
3. SIZE (MW): The size of the section in millions of words.
4. FREQ: The raw frequency of the matching strings in each register.
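The relation between the three figures (raw frequency, section size, and frequency per million words) can be checked with a short Python sketch. The function name `per_million` is an assumption; the numbers are taken from the [web] results in this handout. Because the displayed section sizes are presumably rounded, a recomputed value can differ from the displayed one in the last digit.

```python
def per_million(raw_freq, size_millions):
    """Normalized frequency: tokens per million words of a section."""
    return raw_freq / size_millions

# FREQ 877 in a section of 101.9 million words
print(round(per_million(877, 101.9), 1))   # 8.6
# FREQ 1315 in a section of 72.3 million words
print(round(per_million(1315, 72.3), 1))   # 18.2

# Why normalize: 7245 hits in 59.2 million words is a higher rate
# than 7943 hits in 100.7 million words, despite the lower raw count.
print(per_million(7245, 59.2) > per_million(7943, 100.7))  # True
```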
2. Syntax
Consider the following three examples.
• [like] for [p*] to [v*] (I’d really like for you to stay)
There are 5 tokens in the BNC, but 330 tokens in COCA. With the BNC there aren't enough examples
to see if this is a feature of informal or formal English, but the data from COCA show that it is clearly a
feature of spoken English. The data also show that it is slowly increasing over time, when compared
as a ratio to the construction [ like -- him to V ].
• Is it excel in V-ing or excel at V-ing? (she excels in/at playing the piano)
Granted, this is a very narrow issue, but it is precisely the kind of thing that translators and non-native
speakers are interested in. With the BNC there are 5 tokens with at and 6 with in -- probably not
enough to say which is more common. In COCA, however, there are 122 with at and 42 with in. This
is enough to begin to see which genres prefer one or the other, as well as which subordinate clause
verbs occur with each. Such granularity is not possible with the BNC.
• [have] been being [vvn] (she had been being watched)
There are 2 tokens in the BNC (1 spoken, 1 fiction), and this is not enough data to see any possible
genre variation. In COCA, on the other hand, there are 13 tokens (10 spoken, 2 fiction, 1 news). This
is enough to show that this is a feature of spoken English, and the data also show that it has been
increasing since 1990. (By the way, most native speakers of both dialects will cringe at sentences like
this, but they are in the corpora.)
SORTING
Selects how the results will be sorted. The default is sorting by raw FREQUENCY, where the most
frequent results appear first. You can also sort by RELEVANCE. Generally, relevance is defined by
the Mutual Information score, which is a measure of how "tightly" linked two words are. This takes into
account the overall frequency of collocates, and filters out high-frequency "noise" words. Compare the
regular listings for hard * or * himself with the relevance-sorted listings of the same queries.
If you are doing word comparisons, then relevance shows which collocates occur with Word 1 but not
Word 2 and vice versa. If you are comparing two sections of the corpus, then relevance shows what
words are in Section 1 but not Section 2 and vice versa.
Note: There is a special page for [SORT BY]
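To see why a Mutual Information score pushes high-frequency "noise" words down the list, here is a minimal Python sketch using the textbook pointwise MI formula. The exact formula used by the corpus site (for example, any adjustment for span size) is not given in this handout, so treat this as an assumption. All counts below are invented for illustration and are not COCA data.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI in bits: observed co-occurrence relative to chance expectation."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

# Invented counts (not taken from COCA)
CORPUS_SIZE = 1_000_000
NODE_FREQ = 100  # occurrences of the node word
# collocate: (frequency near the node, overall frequency in the corpus)
candidates = {"hearty": (10, 50), "the": (50, 60_000)}

by_frequency = sorted(candidates, key=lambda w: candidates[w][0], reverse=True)
by_relevance = sorted(
    candidates,
    key=lambda w: mutual_information(candidates[w][0], NODE_FREQ, candidates[w][1], CORPUS_SIZE),
    reverse=True,
)
print(by_frequency)  # ['the', 'hearty']
print(by_relevance)  # ['hearty', 'the']
```

In this toy data "the" co-occurs with the node more often in absolute terms, but it is so frequent overall that its co-occurrence is barely above chance, so sorting by the score ranks "hearty" first.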
GROUP BY
    WORD (default): Entries are not separated by part of speech (e.g. there is only one
    entry for clean, combining its use as an adjective and a verb).
    LEMMA: All results are grouped by lemma (e.g. groups together swim, swimming,
    swam, etc). Useful when comparing results in two sections, where the specific
    conjugation of a verb (for example) doesn't matter.
    NONE: This will show separate entries when a word has two or more part of speech
    tags (e.g. clean = [vvi], [j], etc). Usually you would not use this setting, but you do
    need to if you want the Keyword in Context entries to be limited to a particular
    part of speech.
    BOTH WORDS / BOTH LEMMAS: This is the best option for COLLOCATES
    searches where both the WORD(S) and COLLOCATES fields have several
    possibilities. For example, if you are looking for synonyms of flower near synonyms
    of pretty, then it will list all of the pairs: pretty flower, beautiful roses, etc. Otherwise, it
    would list only the COLLOCATE (pretty, beautiful, etc).
DISPLAY
    RAW FREQ (default): The number of tokens in each section of the corpus.
    PER/MIL: Tokens per million words; allows better comparison across sections of
    different sizes.
    RAW FREQ+: Raw frequency + per million.
    PER/MIL+: Per million + raw frequency.
    In any case, the coloring of the cells in the results table is a function of the normalized
    frequency (tokens per million words).
SAVE LISTS
    NO (default)
    YES: This allows you to save entries from a results set into a customized, user-defined
    list, which you can then retrieve later on and use as part of subsequent
    queries. For example, you could search for synonyms of beautiful, select certain
    entries, add more forms that you feel are missing, and then save this as your own
    [beautiful] list.
# HITS
    Number of "hits" to see in the results set. Default = 100.
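The difference between grouping by WORD and grouping by LEMMA can be sketched as follows. The hit list is invented for illustration, and the function name `group_hits` is an assumption; the point is only that the same hits produce different result tables depending on the grouping key.

```python
from collections import Counter

def group_hits(hits, group_by="word"):
    """Count hits by surface form ("word") or by lemma ("lemma")."""
    index = {"word": 0, "lemma": 1}[group_by]
    return Counter(hit[index] for hit in hits)

# Invented hits as (word form, lemma) pairs
hits = [
    ("swim", "swim"), ("swimming", "swim"), ("swam", "swim"), ("swim", "swim"),
    ("cleaned", "clean"), ("clean", "clean"),
]

by_word = group_hits(hits, "word")    # one entry per distinct form
by_lemma = group_hits(hits, "lemma")  # forms of the same lemma are merged

print(by_word["swim"], by_lemma["swim"])  # 2 4
```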
Q1
Analyze the following sentence, and judge whether we can extract a general grammatical pattern from it.
a. Chef Christian Hermsdorf cooks his way through the menu for $30 a person. (COCA)
More…
[WORDS]
fascis* : fascist, fascists, fascisti, fascism, fascismo,
calm* down:
[screw] up: screwed, screwing,
frustrating.[j*]: frustrating, a frustrating mixture of brilliance and stupidity
[GRAMMATICAL CONSTRUCTION]
[end] up [vvg*]: end up v-ing
going|gon to|na [v*]: going to V
need [x*] [v*]: needn't V
[get] [vvn*]: get passive (get tired)
. hopefully: sentence-initial hopefully
was|be|is|are|am|is|were being [*vvn*]: progressive passive (was being considered)
[n*] [be] that|those of: Noun be that of
[ROOTS, PREFIXES, SUFFIXES]
*heart*: heart, hearts, hearty, heartily, hearth, sweetheart, heartless, heartbeat, kind-hearted, brokenhearted...
home*: home, homes, homer, homeward, homely, homeless, homestead, homeland, homework,
homesick, hometown, homespun, homecoming...
*able.[j*]: -able adjectives: able, considerable, available, remarkable, unable, valuable, reasonable,
inevitable, probable, favorable, desirable...
*ware.[nn*]: -ware nouns: software, hardware, ware, silverware, earthenware, glassware, tinware,
chinaware, spyware...
*-free: tax-free, duty-free, care-free, scot-free, fat-free, rent-free, dairy-free, interest-free, drug-free,
ice-free, risk-free, trouble-free, nuclear-free, snow-free, worry-free...
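The wildcard searches listed above can be imitated on any word list with a short Python sketch that translates the pattern into a regular expression. The function names and the small vocabulary are assumptions for illustration. Note that a plain word list carries no part-of-speech information, so a filter like .[j*] cannot be reproduced this way.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern: * = any run of characters, ? = exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matching_words(pattern, vocabulary):
    """Return the words that match the whole pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in vocabulary if regex.fullmatch(w)]

# Small made-up vocabulary
vocabulary = ["home", "homeless", "hearth", "sweetheart", "kind-hearted",
              "tax-free", "freedom", "valuable", "table"]

print(matching_words("home*", vocabulary))    # ['home', 'homeless']
print(matching_words("*heart*", vocabulary))  # ['hearth', 'sweetheart', 'kind-hearted']
print(matching_words("*-free", vocabulary))   # ['tax-free']
print(matching_words("*able", vocabulary))    # ['valuable', 'table']
```

The last query shows why the part-of-speech restriction matters: without it, the noun table matches *able alongside the adjective valuable.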
You can also have the corpus generate a list of words that were used more in one period than another,
even when you don't know what the specified words might be. For example, you can compare verbs
in the 1970s-2000s (left) to the 1930s-1960s (right), adjectives in the 1970s-2000s (left) and the
1930s-1960s (right), or -ly adverbs in the 1900s (left) to the 1800s (right).
The corpus can also help to show how the meaning or usage of words has changed over time, by
looking at changes in collocates (co-occurring words). For example, the collocates of sexual, gay, chip,
engine, or web have changed over time. Notice also how this can signal cultural changes over time,
such as nouns used with woman in the 1930s-50s compared to the 1960s-80s, or nouns used with
problem 1920-present (left) compared to 1810-1920 (right).
Note on advanced queries involving variable length between words
Syntax, meaning, examples, and sample matches
One "slot": Make sure there is no space, or it will be interpreted as two consecutive words.
word
    Meaning: One exact word
    Example: lad (matches: lad)
[pos]
    Meaning: Part of speech (exact)
    Example: [vvg] (matches: going, using)
[pos*]
    Meaning: Part of speech (wildcard)
    Example: [v*] (matches: find, does, keeping, started)
[lemma]
    Meaning: Lemmas (all forms of a word)
    Example: [sing] (matches: sing, singing, sang)
    Example: [tall] (matches: tall, taller, tallest)
[=word]
    Meaning: Synonyms (new: synonym chains)
    Example: [=strong] (matches: formidable, muscular, fervent)
[user:list]
    Meaning: Customized lists
    Example: [[email protected]:clothes] (matches: tie, shirt, blouse)
word|word
    Meaning: Any of these words
    Example: stunning|gorgeous|charming (matches: stunning, charming, gorgeous)
*xx, x?xx, x?xx*
    Meaning: Wildcards: * = any number of letters, ? = one letter
    Example: un*ly (matches: unlikely, unusually)
    Example: s?ng (matches: sing, sang, song)
    Example: s?ng* (matches: song, singer, songbirds)
-word
    Meaning: NOT (followed by PoS, lemma, word, etc. Most useful for "multiple slot" queries; see below)
    Example: -[nn*] (matches: the, in, is)
Combinations of preceding (samples)
You can limit to a particular part of speech by adding a period (full stop) and then the part of speech tag in brackets. This is
always optional. Make sure there is no space before or after the period (full stop), or it will be interpreted as two consecutive
words.
word.[pos]
    Meaning: Exact word and part of speech
    Example: strike.[v*] (matches: strike, only as a verb)
word*.[pos]
    Meaning: Substring and part of speech
    Example: dis*.[vvd] (matches: discovered, disappeared, discussed)
[lemma].[pos]
    Meaning: Lemma and part of speech
    Example: [strike].[v*] (matches: strike, struck, striking)
[=word].[pos]
    Meaning: Synonym and part of speech
    Example: [=beat].[v*] (matches: hit, strike, defeat, but not nouns like rhythm or drumming)
You can add "lemma" to any other type of search, such as synonym or customized list, to see all forms of the matching
words. Just use an extra set of brackets.
[[=word]]
    Meaning: Synonym and lemma
    Example: [[=publish]] (matches: announced, circulating, publishes, issue; no part of speech specified, so some noun uses)
[[user:list]]
    Meaning: Customized list and lemma
    Example: [[[email protected]:clothes]] (matches: tie, tying, socks, socked, shirt, blouses; no part of speech specified, hence tying)
You can also choose lemma and part of speech by combining the preceding symbols.
[[=word]].[pos]
    Meaning: Synonym and lemma and part of speech
    Example: [[=clean]].[v*] (matches: mop, scrubs, polishing)
[[user:list]].[pos]
    Meaning: Customized list and lemma and part of speech
    Example: [[[email protected]:clothes]].[n*] (matches: tie, ties, sock, socks, i.e. just nouns)
Multiple "slots": Create sequences of words, using any of the preceding query types. Note that in each case, there is a
space between the word "slots" in the query. These are just a few examples, from an unlimited number of
combinations.
of no little
    Matches: of no little
fast|quick|rapid [nn*]
    Matches: fast food, rapid transit
pretty -[nn*]
    Matches: pretty smart, pretty as (but not pretty girl, pretty picture, etc)
[v] [p*] into [vvg*]
    Matches: talk him into staying, coerced them into buying
.|,|; nevertheless [p*] [v*]
    (Notice that punctuation can be used like any "word"; just make sure that it is separated from words by a space)
    Matches: ". Nevertheless it is" and "; nevertheless he said"
[break] the [nn*]
    Matches: break the law, broke the story
[[beat]].[v*] * [nn*]
    Matches: beat the Yankees, beaten to death
[=gorgeous] [nn*]
    Matches: beautiful woman, attractive wife
[put] on [ap*] [[email protected]:clothes].[n*]
    Matches: put on her hat, putting on my pants
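To make the "slot" idea concrete, the following Python sketch matches a small subset of the query syntax against a hand-tagged sentence. It is a toy illustration, not the corpus site's implementation. The names `Token`, `slot_matches`, and `find` are assumptions, the tags on the sample tokens are assigned by hand, and the sketch simplifies by treating any bracketed item containing * as a part-of-speech pattern and any other bracketed item as a lemma. Synonyms and customized lists are not handled.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    lemma: str
    pos: str  # part-of-speech tag, assigned by hand in this example

def wildcard(pattern: str) -> re.Pattern:
    """* = any run of characters, ? = exactly one character."""
    body = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern)
    return re.compile(body, re.IGNORECASE)

def slot_matches(slot: str, tok: Token) -> bool:
    if slot.startswith("-") and len(slot) > 1:      # -x : NOT x
        return not slot_matches(slot[1:], tok)
    if ".[" in slot:                                # x.[pos] : x limited to a part of speech
        head, tag = slot.split(".[", 1)
        return slot_matches(head, tok) and slot_matches("[" + tag, tok)
    if "|" in slot:                                 # a|b : any of these
        return any(slot_matches(alt, tok) for alt in slot.split("|"))
    if slot.startswith("[") and slot.endswith("]"):
        inner = slot[1:-1]
        if "*" in inner:                            # simplification: [..*..] is a POS pattern
            return wildcard(inner).fullmatch(tok.pos) is not None
        return tok.lemma.lower() == inner.lower()   # otherwise: a lemma
    return wildcard(slot).fullmatch(tok.word) is not None

def find(query: str, tokens: list) -> list:
    """Slide over the tokens; each space-separated slot must match one token."""
    slots = query.split()
    hits = []
    for i in range(len(tokens) - len(slots) + 1):
        window = tokens[i:i + len(slots)]
        if all(slot_matches(s, t) for s, t in zip(slots, window)):
            hits.append(" ".join(t.word for t in window))
    return hits

# Hand-tagged sample: "She broke the story . He is pretty smart ."
tokens = [
    Token("She", "she", "pphs1"), Token("broke", "break", "vvd"),
    Token("the", "the", "at"), Token("story", "story", "nn1"), Token(".", ".", "y"),
    Token("He", "he", "pphs1"), Token("is", "be", "vbz"),
    Token("pretty", "pretty", "rg"), Token("smart", "smart", "jj"), Token(".", ".", "y"),
]

print(find("[break] the [nn*]", tokens))  # ['broke the story']
print(find("pretty -[nn*]", tokens))      # ['pretty smart']
```

Because the query is split on spaces, a stray space inside a slot (for example before the period in word.[pos]) turns one slot into two, which is the behavior the notes above warn about.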