WBC project

- The final project for College English II
I. Introduction
At the beginning of this semester, I started to work with Sketch Engine, which is a new way for
me to study English. I think it is a good way to learn because it is useless to memorize
vocabulary without knowing how to use it. Sketch Engine gives you the related context from a
Google search, and you can quickly figure out how a word is used in most situations. However,
the WBC is even more useful.
It is a tool for building corpora for specialist areas. It is interesting that you can create your
own corpus just by typing some seed words into the box, and it is also convenient that you can
get closely related information about a specific word without searching lots of websites, just
as with Sketch Engine. I would like to do more research on this topic because I will have to do
much of the work for my major in English, and it will surely be a great help if I can master the
specific words related to my major. I hope I can learn to handle the WBC thoroughly by doing
this project.
II. Questions
A. To make a corpus, WBC uses Yahoo! It does an Internet search for
groups of words (usually 3 words) called tuples. Try to describe in more
detail how the program makes a corpus.
The WBC searches the Internet for sites that include the seed words we typed in at the first
step. Tuples are the combinations of your seed words that the WBC searches for on the net.
For example, if the default number of terms in a tuple is three, it will pick three seed words
at random each time and do the search. If you have 10 seed words,
then there are 120 possible combinations. It is a math problem that I learned in high school:
C(10, 3) = 10! / (3! × 7!) = 120. But searching for all 120 combinations on the web is too
much, so the second box in the advanced settings is provided. It is the maximum number of
tuples to use for searching the Internet. The default number is 10, so 10 queries are sent to
Yahoo! rather than 120. Setting a small maximum number of tuples limits the websites related
to your seed words and therefore makes your corpus relatively small. But I suggest that you
don’t set the maximum number too high, because it will return too many URLs and may add
useless contexts. Next, you can set the number of URLs you want for each tuple. The default
number is 10, so you will get 10 URLs for each query, which means you will get 100 (10 × 10)
URLs in total.
Actually, the idea behind this maximum is similar to the previous one, but it is more
important, because the 10 URLs are chosen from the top of the 100 hits that each query
returns (I saw this in “Help”), and of course they are the most relevant to our seed words.
So if you set a small maximum number of URLs, you will get more precise websites related to
your major. There is also a restriction on how long the pages are; the WBC filters out pages
that are too short or too long to keep our corpus useful. After setting these options, you
will have a corpus that fits your needs.
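The counting in this answer can be sketched in a few lines of Python. This is only my own illustration under stated assumptions, not how the WBC is actually implemented: the function name sample_tuples is made up, every seed word except “bank” is a hypothetical example, and the real tool sends each tuple to a search engine instead of just counting.

```python
import itertools
import math
import random


def sample_tuples(seed_words, tuple_size=3, max_tuples=10, rng=None):
    """Pick up to max_tuples distinct combinations of the seed words."""
    rng = rng or random.Random()
    # Every possible group of tuple_size seed words, order ignored.
    all_tuples = list(itertools.combinations(seed_words, tuple_size))
    # Keep only a random subset, like the "maximum number of tuples" box.
    return rng.sample(all_tuples, min(max_tuples, len(all_tuples)))


# Hypothetical seed words for an economics corpus (only "bank" is from my project).
seed_words = ["bank", "supply", "demand", "market", "price",
              "inflation", "bond", "tax", "trade", "interest"]

total = math.comb(len(seed_words), 3)      # C(10, 3)
queries = sample_tuples(seed_words, rng=random.Random(0))
urls_per_query = 10                        # default number of URLs per tuple

print(total)                               # 120
print(len(queries))                        # 10
print(len(queries) * urls_per_query)       # 100 URLs at most
```

Changing max_tuples or urls_per_query in this sketch shows how the two settings multiply together to decide how many pages can go into the corpus.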
B. To make a corpus, is it better to just use the default WBC options
(for example, 10 URLs per query)? Or do you recommend using
different options? Explain how you decided, giving examples.
Whether to use the default numbers or set them yourself depends on your needs. I think the
default WBC options are suitable for most corpora, giving plenty of useful data. The maximum
number of URLs per query and the number of tuples are the important settings that make your
corpus large or small, general or specific. More details of the advanced settings are in the
answer to question A. To tell which is better, I did an experiment.
First I built a corpus named Economics1 with all the default WBC options, and then I built
another corpus named Economics3, which kept everything the same except that the maximum
number of URLs was decreased to 5. Economics3 is much smaller than Economics1 because of the
restriction on the URLs, and this affects the word sketch results. I typed the word “supply”
into the word sketch in both corpora, and Economics1 gave more sketch results than Economics3;
the frequencies also differed a lot. Then I typed the word “bread”, which is irrelevant to the
corpus, and there was no word sketch available in either corpus. So I think the quality of both
is good, and both can really give me word sketches related to my major. The number of search
results on the Internet is huge, and the accuracy is almost the same for the top 5 and the top
10 URLs. With the same quality but different amounts of data, I will choose the bigger one,
Economics1, without a doubt. It will give me more concordances and samples of a specific word,
which is more helpful for me.
C. Using screen shots and other kinds of examples to help you, describe
how well your corpus represents the topic you chose.
I extracted keywords from Economics1, and I found that the second corpus, named Economics2,
had more useful keywords than Economics1. By using the keyword extraction form, you can
filter out words that are very unlikely to be keywords, such as “where”, “that”, and “you”,
and get a better corpus. Therefore, I will compare my corpus Economics2 with ukWaC to see the
differences between a specialized corpus and a general corpus. There are three tasks below to
show how well my corpus represents the topic I chose.
First, I made word sketches in both corpora for the word “bank”, which is one of my seed
words.
We can see that the most frequently used word in ukWaC is “account”.
In Economics2, the highest score is for “central”. The second highest score in Economics2 is
for “chartered”, which is a more specialized word. The reason for the difference is that
people usually use the word “account” after “bank” in daily life, while in school we use the
phrase “central bank” to describe how banking operates in economics. With specialist-domain
seed words we can learn more word combinations from a professional area, and that is the
function of the WBC.
Second, I searched for the word “cake”, which has absolutely no relation to economics.
We can see there are lots of search results in ukWaC.
But there is no result in Economics2, which means the WBC has filtered out information
irrelevant to my major.
Third, I searched for the word “bond”, which has many different meanings.
In ukWaC, the top collocation is “hydrogen bond”. “Bond” is used here in the sense of a state
of being joined, which is usual in chemistry. Here is a context for “hydrogen bonds”.
“There are many possible hydrogen-bonding arrangements . The one shown has been
chosen for comparison to the modeled minimum-energy structure [ 72 ]. [ Back ] 2 In
the top structure above , a hydrogen bond is shown donating to the hydronium ion .”
Let’s see the result in Economics2.
The top collocation is “be”. Looking at the concordances, I found that “bond” is used in the
sense of a certificate issued by a government or a company acknowledging that money has been
lent to it and will be paid back with interest. Let’s see a context for it.
“ The bonds were exchanged for commercial bank loans in default and are
collateralized by U . S . Treasury zero - coupon bonds .”
The result shows that there are no collocations with the chemical meaning, just as we do not
use the chemical meaning in economics class.
To sum up, my corpus represents my chosen topic precisely, and it is more helpful for me to
learn specialized words in it than in a general corpus like ukWaC.
III. Conclusion
The WBC is a useful tool and an efficient way to learn English. By creating your own corpus,
you can develop the language of your own professional area with the support of the WBC. Doing
this project, learning about the WBC and the details of how to set up a corpus that belongs
entirely to you, has been an arduous journey, but I think it was worthwhile. I discovered a new
way to learn English and finished the longest English project I have ever done. I think Sketch
Engine and the WBC will become good partners in my learning of English.