Statistics corpus linguistics pdf

2022.01.16 00:55

What corpora are there?

There are many types of corpora, which can be used for different kinds of analyses (cf. Kennedy). Of these, 36 million are British written texts, 10 million American written texts and 10 million American spoken texts. The user can select one, two or all three subcorpora for the analysis. Much more complex searches are possible with the CD-ROM version of the corpus, which is also available at the Department. A few smaller and more specialised corpora have also been made available on the internet.

The corpus contains different academic speech events, and both native and non-native speaker language. The corpus computers are located in the second room, in the right-hand corner, just before the linguistic section of the library. Other corpora are available at the Lehrstuhl Prof. If you intend to carry out research with one of the corpora, please contact Nadja Nesselhauf: Nadja.

Nesselhauf urz. Many manuals are also stored on the corpus computer, in the same directory as the corpus in question. Various text types Contact Prof. Busse 18th Century Fiction BrE 18th c. Two types of software for corpus analysis can be distinguished in principle: software that is tailored to one specific corpus, and software that can be used with almost any kind of corpus.

A subgroup of this type is software which allows searches in one specific corpus over the internet (such as the online search facilities provided for the BNC and the Collins Wordbanks Online English mentioned above).

WordSmith is available for corpus research on the right corpus computer in the library; its use will be explained in detail below.

What can the software do?

While there are many differences between the software packages designed for corpus analysis, certain basic functions can be performed by practically all the available software. For most kinds of linguistic analyses, the most important of these is the possibility of searching the corpus in question for the occurrence of certain strings i. In this example, instance number 8 is irrelevant if the researcher is interested in the adjective stark; this line can then be removed from the concordance (see below for a step-by-step introduction to WordSmith).

Corpus software can usually also search for words or phrases occurring within a certain distance of each other, and usually allows the use of wildcards in searches, i. The advantage of analysing texts with corpus software is apparent: in a few seconds the researcher receives information about the occurrence of linguistic items in a large amount of text, information that would take hours or even days to retrieve manually.
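The two search types mentioned above, wildcard queries and searches for words within a certain distance of each other, can be sketched in a few lines of Python. This is only an illustration of the idea, not how WordSmith or any other package works internally; the function names `wildcard_to_regex` and `within_span` and the sample sentence are assumptions made for this example.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a query such as 'mak*' into a whole-word regular expression."""
    # Escape the literal parts and let each asterisk stand for any word characters.
    parts = [re.escape(p) for p in pattern.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)

def within_span(tokens, first, second, span=3):
    """Return index pairs where `second` occurs at most `span` tokens after `first`.

    `first` and `second` are expected in lower case.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == first:
            for j in range(i + 1, min(i + 1 + span, len(tokens))):
                if tokens[j].lower() == second:
                    hits.append((i, j))
    return hits

# Hypothetical sample text, not taken from any of the corpora discussed here.
text = "She makes a decision; making a quick decision is hard."
tokens = re.findall(r"\w+", text)

print(wildcard_to_regex("mak*").findall(text))
print(within_span(tokens, "a", "decision", span=2))
```

A wildcard query deliberately over-generates, so the hits still have to be checked by hand for irrelevant forms.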

The concordance-line output allows the researcher to see the occurrences in context, so that the use of the linguistic item in question, in particular frequent patterns, can often be investigated with little effort.
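To make the idea of concordance lines concrete, here is a minimal keyword-in-context sketch in Python. It is an assumption-laden illustration (the function name `kwic`, the bracket marking and the sample sentence are invented for this example) and does not reproduce the output format of any particular program.

```python
import re

def kwic(text, term, width=30):
    """Return concordance lines: left context, search term, right context."""
    lines = []
    # \b restricts the match to whole words; IGNORECASE finds "Stark" and "stark".
    for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the search terms line up in one column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Hypothetical sample text (single line, so the alignment is not disturbed).
sample = "The contrast was stark. In stark terms, the plan failed."
for line in kwic(sample, "stark", width=20):
    print(line)
```

Because the search term is aligned, recurrent patterns to its left and right become visible at a glance, which is the main point of the concordance format.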

Retrieving more context than shown in the concordance lines, searching only a part of a given corpus, and saving the results are also regular features of almost all corpus software. Most programs can also find words frequently occurring in the vicinity of the search term and provide a list of all the words occurring in the corpus and their frequencies. Please enter your query: In the space provided, you can enter either words or phrases.
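The two list-type functions described above, a frequency list of all words and a list of words occurring near the search term, can be sketched as follows. This is a simplified illustration under stated assumptions: the tokenisation is deliberately crude, and the names `tokenize`, `word_list` and `collocates` as well as the sample sentence are invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()

def collocates(tokens, node, span=2):
    """Count the words occurring within `span` tokens to either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical sample text.
tokens = tokenize("The stark contrast and the stark reality of the plan")
print(word_list(tokens)[:3])
print(collocates(tokens, "stark", span=1).most_common())
```

Raw co-occurrence counts like these are dominated by frequent function words; real corpus software usually offers additional association measures, which are not shown here.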

A few more complex types of searches, such as for two words with another, unspecified word intervening, can also be performed (a concise explanation of how to perform such more complex queries is provided on the site).

Of the results, 50 at most are displayed; if there are more occurrences in the corpus, the overall number of occurrences is stated, and 50 randomly selected instances are displayed. The results are given in the context of the sentence in which they occur. The disadvantage of this kind of presentation is that, as the search terms are neither highlighted nor aligned, patterns are not easily visible. Unlike the software that can be used with a local version of the BNC, basic functions such as sorting or thinning the results or getting more context cannot be performed, and if there are more than 50 hits not all of them can be inspected.
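The display behaviour described above (report the total, show at most 50 randomly selected hits) is easy to imitate when working with a local list of hits. The following is a sketch of that behaviour only; the function name `thin_hits` and its parameters are assumptions for this example, and nothing is implied about how the online facility is actually implemented.

```python
import random

def thin_hits(hits, limit=50, seed=None):
    """Return the total number of hits and at most `limit` randomly chosen ones."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    if len(hits) <= limit:
        return len(hits), list(hits)
    return len(hits), rng.sample(hits, limit)

# Hypothetical hit list: 120 numbered concordance lines.
total, shown = thin_hits([f"hit {n}" for n in range(1, 121)], limit=50, seed=1)
print(total, len(shown))
```

A random sample is useful for a first impression, but frequencies of patterns within the sample are only estimates for the full set of hits.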

The facility (like most other online corpus search facilities available) is therefore not suited to most kinds of more comprehensive linguistic research. For corpus linguistic studies, this facility and others of its kind are useful for a first exploration of whether a certain linguistic feature is worth investigating and which questions about the use of this feature might yield interesting results. The two most important options, the WordList function and the Concord function, will be introduced here.

Now you have to select the corpus you want to work with. Another window will pop up, displaying the structure of the files on the computer. Choose the appropriate corpus file.

The selection of the corpus files you want to work with works exactly as described for the WordList function above. The searches can be more complex than just for simple words or phrases, however. You can, for example, investigate the occurrence of two words within a certain span or distance. Rather than query the corpus for all combinations of all these forms individually, you can use an asterisk i. First of all, some words beginning with the string ma are not instances of the verb make e.

You would also probably like to exclude the noun or adjective decision-making and the noun decision-makers. The thinning of instances i. To obtain the password, please write an email to Nadja. Is the singular or the plural form more frequent? Otherwise, the new selection of texts is added to the old selection. Please also state what x is. If so, what are the words occurring to the right of gamut? Give the full sentence in which it occurs The sentence starts with the word In; ignore the 6-digit combinations of letters and numbers interspersed in the text.

If not, give one example of a result that is not an instance of this collocation. Remove this instance i. First, you formulate some more precise questions or hypotheses that you would like to investigate:
- Does Australian English exclusively use vocabulary items from either British or American English or does it use items from both varieties?

Then you choose your corpus. This corpus contains written English of different text types. You need to keep in mind, therefore, that your study has two major caveats: first, your investigation is limited to only a very small part of the Australian English vocabulary, and secondly, your investigation is based on written language only.

In the case of flashlights, the co-occurrence of the word with pitch darkness indicates that some kind of lamp is indeed referred to. In the case of torch, the first two occurrences appear to be part of a proper name that refers to some kind of operation and therefore need to be disregarded. The third instance that has been thrown up by the search is neither an instance of torch nor of torches; the fourth is an instance of the verb to torch. These two instances need to be disregarded.

The final two instances actually are instances that seem to refer to a portable lamp. A common notation for this is 2 1 , i. There are no results for elevator. Glancing through them, you notice that many of these are instances of the verb lift rather than of the noun.

This leaves you with 40 instances. As you glance through the remaining instances, you notice that many instances of the verb lift are preceded by to. You perform another sort, this time according to the word to the left of the search term, and remove the instances of lift preceded by to, which leaves you with 24 instances. You go through these one by one and decide whether they are instances of the relevant concept.
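The sort-and-remove step just described can be sketched in Python as follows. The sentences are invented for illustration and do not come from the Australian Corpus of English; the names `concordance` and `drop_preceded_by` are likewise assumptions for this example.

```python
import re

def concordance(tokens, node_forms):
    """Return (left_word, node, right_word) for every token in `node_forms`."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() in node_forms:
            left = tokens[i - 1].lower() if i > 0 else ""
            right = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            rows.append((left, tok, right))
    return rows

def drop_preceded_by(rows, left_word):
    """Remove all rows whose left neighbour is `left_word`."""
    return [row for row in rows if row[0] != left_word]

# Hypothetical sample text.
text = "He tried to lift the box. The lift was broken. Take the lift to level two."
tokens = re.findall(r"\w+", text)

rows = sorted(concordance(tokens, {"lift", "lifts"}))  # sorts by the left word first
kept = drop_preceded_by(rows, "to")
for row in kept:
    print(row)
```

Removing everything preceded by to is only a first filter: the remaining lines still have to be read one by one, because other verbal uses and other senses of the noun can survive it.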

While doing that, you notice that it is not only the verb to lift that has to be excluded from the count, but also other meanings of the noun lift, in particular the meaning as in to give sb. To decide whether certain instances refer to a ski lift or a lift in a building, you will need to look at more context in some cases.

In the end, you are left with 6 instances numbers 1, 2, 4, 18, 19, All 5 instances of nappy appear to be instances of the concept in question. To get an overview of your results, you enter them in a table. As all the texts of one category are stored in one file in this corpus, this is difficult to find out in some cases, however. If the instances occur in different categories, they also appear in different texts. If they appear in the same category, a good indication is the word number which is given next to the file name in the concordance line; a word number such as 5,, for example, means that in the category in question the search term is the 5,th word.

As the texts in the Australian Corpus of English are only around 2,000 words long (see the corpus manual), two instances that have word numbers more than 2,000 words apart are unlikely to occur in the same text.
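The word-number check can be expressed as a small helper that flags pairs of hits which might come from the same text and therefore need a closer look. This is a sketch under assumptions: the default text length of 2,000 words is an approximation to be checked against the corpus manual, and the category labels and word numbers below are invented.

```python
from itertools import combinations

def possible_same_text_pairs(hits, text_length=2000):
    """Return pairs of hits that could lie in the same text.

    `hits` is a list of (category, word_number) tuples. Two hits are flagged
    when they are in the same category file and fewer than `text_length`
    words apart; `text_length` is an assumed approximate text size.
    """
    pairs = []
    for (cat_a, no_a), (cat_b, no_b) in combinations(hits, 2):
        if cat_a == cat_b and abs(no_a - no_b) < text_length:
            pairs.append(((cat_a, no_a), (cat_b, no_b)))
    return pairs

# Hypothetical hits: (category file, word number within that file).
hits = [("A", 5100), ("A", 5900), ("A", 12400), ("B", 5200)]
print(possible_same_text_pairs(hits))
```

A flagged pair is not proof that both hits share a text, since two neighbouring texts can meet within that distance; it only marks the cases worth checking in the file itself.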

If they have word numbers that are less than 2,000 words apart, one way to find out is to open the file with a word processor and use the search function of that program. For the purposes of this investigation, you judge this as unnecessary, however. What do these results tell you with respect to your research questions then? First, Australian English does not exclusively use either British or American vocabulary.

Secondly, the results indicate that in some cases Australian English uses either the British or the American word exclusively (although the results are no conclusive proof: a larger corpus might throw up instances of the other variety). In the case of freeway vs. motorway, only the American word occurs in the corpus; in the case of nappy vs. diaper and lift vs. elevator, only the British word occurs.

This might lead to the hypothesis (which in turn might be the basis for further research) that words relating to traffic and cars might be predominantly American in Australian English.

In some cases, however, Australian English uses both the American and the British term. In the case of torch and flashlight, the words might be used with similar frequencies (though the numbers are too small for any firm conclusions). In the case of pavement and sidewalk, pavement appears to be the dominant variant.

Moreover, it appears from the results that sidewalk is only used in modifying function (sidewalk throng, sidewalk stalls, sidewalk cafes and bazaars). None of the instances of pavement occurs in this function. So this might be a case where the words from the two varieties have become functionally distinct in Australian English. Although the words investigated are of course too few for any definite conclusions, the results (see the totals in the table) indicate that while British vocabulary is dominant in Australian English, there has also been some influence of American English.
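A results table with totals can be kept in a small data structure. The sketch below includes only the two word pairs whose counts are stated in the text above (6 relevant instances of lift, none of elevator; 5 of nappy, none of diaper); the counts for the other pairs are not given here and are therefore left out, so the totals printed are partial and do not reflect the American items found for those pairs. The names `counts` and `totals` are assumptions for this example.

```python
# (British word, American word) -> (British count, American count)
# Only the pairs with counts stated in the text are included.
counts = {
    ("lift", "elevator"): (6, 0),
    ("nappy", "diaper"): (5, 0),
}

def totals(table):
    """Sum the British and the American column of the table."""
    british = sum(b for b, _ in table.values())
    american = sum(a for _, a in table.values())
    return british, american

# Print the table row by row, followed by the column totals.
for (bre, ame), (n_bre, n_ame) in counts.items():
    print(f"{bre:<10}{n_bre:>4}    {ame:<10}{n_ame:>4}")
print("totals:", totals(counts))
```

Further pairs can be added to `counts` once the relevant instances have been identified by hand, as described above.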

In a larger study, you would of course not only include more items, but also contrast your results with those recorded in relevant literature on Australian English.

Example 2: Present perfect and simple past in British and American English

You are interested in differences in the use of the present perfect and the simple past in British and American English.

First, you formulate some questions or hypotheses on what precisely you want to find out:
- Does the simple past not occur at all with these adverbs in British English?

Then you choose corpora on which to base your analyses. Since you are interested in Present Day English, and in American and British English, two comparable corpora of the two varieties in question would be ideal. From the list provided on this website you glean that the two most suitable corpora for your investigation are FLOB and Frown.

This is one important caveat of your study, then, which should be kept in mind: you are only investigating written English and not spoken English. The software you choose for your investigation is WordSmith. You decide to start your investigation with the adverb just. In order to answer the first three questions, you first look at the relative occurrence of the simple past and the present perfect with just in both varieties.
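A first, rough count of the two constructions with just could be automated along the following lines. This is a heuristic sketch, not the procedure of the study described here: it treats have/has (and their contracted forms) directly before just as present perfect, and a following word ending in -ed as a candidate simple past, so irregular verbs, adjectives in -ed and the contraction 's for is are misclassified and must be checked by hand. All names and the sample sentences are invented for the example.

```python
import re
from collections import Counter

PERFECT_AUX = {"have", "has", "'ve", "'s"}   # 's may also stand for "is"
PAST_PERFECT_AUX = {"had", "'d"}             # past perfect is not counted here

def classify_just(text):
    """Count rough construction types for each occurrence of 'just'."""
    tokens = re.findall(r"[a-z]+|'[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        # Skip 'just' at the very start or end, where a neighbour is missing.
        if tok != "just" or i == 0 or i + 1 >= len(tokens):
            continue
        if tokens[i - 1] in PERFECT_AUX:
            counts["present perfect"] += 1
        elif tokens[i - 1] in PAST_PERFECT_AUX:
            counts["other"] += 1
        elif tokens[i + 1].endswith("ed"):
            counts["simple past (candidate)"] += 1
        else:
            counts["other"] += 1
    return counts

# Hypothetical sample sentences, not taken from FLOB or Frown.
sample = "I've just finished. She just called. He has just left. It is just fine."
print(classify_just(sample))
```

Running the same function over each corpus and comparing the shares of the two constructions gives a first impression of the relative occurrence; the "other" and "candidate" lines still need manual inspection.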

Too many to look at individually. What can you do?

