The Final Project for College English II

I. Introduction

At the beginning of this semester, I started to work with Sketch Engine, which is a new way for me to study English. I think it is a good way to learn English, because it is useless to memorize vocabulary without knowing how to use it. Sketch Engine gives you related contexts from Google searches, so you can quickly figure out how a word is used in most situations. However, the WBC is even more useful. It is a tool for building corpora for specialist areas. It is interesting that you can create your own corpus just by keying some seed words into the box, and it is also convenient that you can get closely related information about a specific word without searching lots of websites, just as in Sketch Engine. I would like to do more research on this topic because I will have to do a lot of work on my major in English, and it will surely be a great help if I can master the specific words related to my major. I hope I can learn to handle the WBC thoroughly by doing this project.

II. Questions

A. To make a corpus, WBC uses Yahoo! It does an Internet search for groups of words (usually 3 words) called tuples. Try to describe in more detail how the program makes a corpus.

The WBC searches the Internet and chooses sites that include the seed words we typed in at the first step. Tuples are the combinations of your seed words that the WBC searches for on the net. For example, if the default number of terms in a tuple is three, it will randomly pick three seed words at a time and run the search. If you have 10 seed words, there are 120 possible combinations. It is a mathematics problem (I learned it in high school; it is C(10, 3) = 120). But 120 combinations are too many to search on the web, so the second box in the advanced search is provided. It sets the maximal number of tuples to get from the Internet. The default number is 10, so 10 queries are sent to Yahoo! rather than 120.
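The count of 120 can be checked with a short Python sketch. This is only an illustration: the seed words below are hypothetical, and drawing the 10 queries by simple random sampling is an assumption about how the WBC picks its tuples, not something confirmed by the tool's help pages.

```python
import math
import random
from itertools import combinations

# Hypothetical seed words, for illustration only.
seed_words = ["bank", "supply", "demand", "inflation", "market",
              "interest", "trade", "price", "tax", "bond"]
TUPLE_SIZE = 3    # terms per tuple (default described in the text)
MAX_TUPLES = 10   # maximal number of tuples to query (default described in the text)

# Every unordered group of three seed words.
all_tuples = list(combinations(seed_words, TUPLE_SIZE))
total = math.comb(len(seed_words), TUPLE_SIZE)  # C(10, 3)

# Assumed sampling step: pick 10 distinct tuples at random to use as queries.
rng = random.Random(0)
queries = rng.sample(all_tuples, MAX_TUPLES)

print(total)         # 120
print(len(queries))  # 10
```

Choosing 3 words out of 10 without regard to order gives 120 groups, and only 10 of them are actually sent as queries.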
Setting a small maximal number of tuples will limit the websites related to your seed words and therefore make your corpus relatively small. But I suggest not setting the maximal number too high, because it will return too many URLs and may add useless contexts. Next, you can set the number of URLs you want for each tuple. The default number is 10, so you will get 10 URLs for each query, which means you will get 100 (10 x 10) URLs in total. The main idea of this maximal number is similar to the previous one, but it is more important, because the 10 URLs are chosen from the top of the 100 hits that each query returns (I saw this in “Help”), and of course they are more relevant to our seed words. So if you set a small maximal number of URLs, you will get more precise websites related to your major. There is also a restriction on how long the pages are; the WBC filters out pages that are too short or too long to keep the corpus useful. After setting up these steps, you can have a corpus that conforms to your needs.

B. To make a corpus, is it better to just use the default WBC options (for example, 10 URLs per query)? Or do you recommend using different options? Explain how you decided, giving examples.

Whether to use the default numbers or set them yourself depends on your needs. I think the default WBC options are suitable for creating corpora that are both large and useful. The maximal URLs per query and the number of tuples are the important settings that make your corpus large or small, general or particular. More details of the advanced settings are in the answer to question A. To tell which one is better, I ran an experiment. First I built a corpus named Economics1 with all the default WBC options, and then I built another corpus named Economics3, which keeps everything the same except that the maximal number of URLs is decreased to 5.
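As a rough check on these two settings, the upper bound on the number of URLs can be worked out in a few lines of Python. The function name max_urls is hypothetical, and the figures are only upper bounds: the real corpora will be smaller, because the WBC filters out pages that are too short or too long.

```python
def max_urls(max_tuples: int, urls_per_query: int) -> int:
    """Upper bound on collected URLs: one query per tuple, a fixed number of URLs per query."""
    return max_tuples * urls_per_query

# Economics1 uses the defaults; Economics3 lowers the URLs per query to 5.
economics1 = max_urls(max_tuples=10, urls_per_query=10)
economics3 = max_urls(max_tuples=10, urls_per_query=5)

print(economics1)  # 100
print(economics3)  # 50
```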
Economics3 is much smaller than Economics1 because of the restriction on the URLs, and this affects the Word Sketch results. I typed the word “supply” into the Word Sketch in both corpora, and Economics1 gave more sketch results than Economics3; the frequencies also differ a lot. Then I typed the word “bread”, which is irrelevant to the corpus, and there was no Word Sketch available in either corpus. So I think the quality of both is good, and both can really give me word sketches related to my major. The number of search results on the Internet is huge, and the accuracy is almost the same in the top 5 and the top 10. With the same quality but different amounts of data, I will choose the bigger one, Economics1, without a doubt. It will give me more concordances and samples of a specific word, which is more helpful for me.

C. Using screen shots and other kinds of examples to help you, describe how well your corpus represents the topic you chose.

I extracted keywords from Economics1, and I found that the second corpus, named Economics2, has more useful keywords than Economics1. By using the keyword extraction form, you can filter out words that are very unlikely to be keywords, such as “where”, “that”, and “you”, and get a better corpus. As a result, I will compare my corpus Economics2 with ukWaC to see the differences between a specialist corpus and a general corpus. The three tasks below show how well my corpus represents the topic I chose.

First, I ran the Word Sketch in both corpora with the word “bank”, which is one of my seed words. We can see that the most frequently used word in ukWaC is “account”, while in Economics2 the highest score is “central”. The second highest score in Economics2 is “chartered”, which is a more specialized word. The reason for the difference is that people usually use the word “account” after “bank” in daily life, while in economics at school we use the phrase “central bank” to describe how it operates.
With specialist-domain seed words we can learn more word combinations in a professional area, and that is the function of the WBC.

Second, I searched for the word “cake”, which has absolutely no relation to economics. We can see there are lots of search results in ukWaC, but there is no result in Economics2, which means the WBC has filtered out information irrelevant to my major.

Third, I searched for the word “bond”, which has many different meanings. In ukWaC, the top collocation is “hydrogen bond”. Here “bond” is used in the sense of a state of being joined, which is usual in chemistry. Here is a context for “hydrogen bonds”: “There are many possible hydrogen-bonding arrangements . The one shown has been chosen for comparison to the modeled minimum-energy structure [ 72 ]. [ Back ] 2 In the top structure above , a hydrogen bond is shown donating to the hydronium ion .” Let’s see the result in Economics2. The top collocation is “be”. Looking through the concordances, I found that the word is used in the sense of a certificate issued by a government or a company acknowledging that money has been lent to it and will be paid back with interest. Let’s see a context for it: “ The bonds were exchanged for commercial bank loans in default and are collateralized by U . S . Treasury zero - coupon bonds .” The result shows that there are no collocations with the chemical meaning, just as we do not use the chemical meaning in economics class. To sum up, my corpus represents my chosen topic precisely, and it is more helpful for me to learn specialist words in it than in a general corpus like ukWaC.

III. Conclusion

The WBC is a useful tool and an efficient way to learn English. By creating your own corpus, you can develop the language of your own professional area with the support of the WBC. Doing this project, learning about the WBC and the details of how to set up a corpus that belongs entirely to you, has been an arduous journey, but I think it was worthwhile.
I discovered a new way to learn English and finished the longest English project I have ever done. I think Sketch Engine and the WBC will become my good partners in learning English.