Key word analysis
Concept: Key words are words that characterize texts in a particular way and play a central part within cultural discourses (Williams, 1983). Statistically, key words are
words that occur more often within a given text corpus, or part of it, than they do in other text corpora or in overall usage (Scott, 2004, 92 pp.). To identify key words in this statistical sense,
the frequencies of a word within two corpora need to be compared. One corpus, the so-called test corpus, is used as a starting point for analysis. This corpus is compared with a reference corpus,
which, as a rule, should be as large as possible. The closer the reference corpus comes to the ideal of a "balanced" corpus, the better it represents usage "as such", although a perfectly balanced
corpus is impossible to attain. If the reference corpus has a specific composition, the key word analysis will point to differences in word usage between the two corpora.
The analysis compares the frequency of the words in the test corpus with that in the reference corpus, using statistical measures for evaluation. The result will show all words that
are markedly more frequent in the test corpus than they are in the reference corpus. The statistical measures used are the Chi-square test with Yates's correction for continuity (SAS 1990: 865, Oakes
1998: 25) and Dunning's log-likelihood ratio (SAS 1990: 865, Oakes 1998: 172).
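The two measures named above can be illustrated with a minimal sketch. It assumes a 2x2 contingency table (the word versus all other words, test corpus versus a disjoint reference sample); the exact table construction used by the implementation is not documented here, and all function names are hypothetical.

```python
import math

def contingency(freq_test, size_test, freq_ref, size_ref):
    """Observed 2x2 table: rows = word / all other words,
    columns = test corpus / reference sample (assumed disjoint)."""
    return [
        [freq_test, freq_ref],
        [size_test - freq_test, size_ref - freq_ref],
    ]

def expected(table):
    """Expected cell counts if the word were equally likely in both columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_yates(table):
    """Chi-square with Yates's continuity correction: each |O - E| is
    reduced by 0.5 (clamped at zero, an assumption of this sketch)."""
    exp = expected(table)
    return sum(
        max(abs(o - e) - 0.5, 0.0) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )

def log_likelihood(table):
    """Log-likelihood ratio statistic: 2 * sum(O * ln(O / E));
    cells with O = 0 contribute nothing."""
    exp = expected(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )

# Hypothetical counts: a word occurring 50 times in a 100-word test corpus
# and 100 times in a 1000-word reference sample.
table = contingency(50, 100, 100, 1000)
print(chi_square_yates(table), log_likelihood(table))
```

Both statistics are zero when the word has the same relative frequency in both columns and grow as the frequencies diverge. Note that neither value by itself says in which corpus the word is overrepresented.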
Experience has shown that two types of key words are retrieved in this way. The first type consists of meaningful (lexical) words that are not extremely frequent. These characterize the subject of a text
corpus (its "aboutness"). The second type consists of closed-class lexemes (function words). These characterize the linguistic habitus ("style") of the texts within the text corpus.
When calling up key word analysis, users will find the test corpus pre-set. In the current implementation, the entire corpus serves as the reference corpus throughout. A number of parameters allow users to
fine-tune the analysis:
- The drop-down list "Run analysis for word category" restricts the procedure to a specific word class.
- Using the drop-down list "Compute statistics with reference to", users may opt for statistics to be computed relative to all words of the corpora or to words of a selected word category.
- The field "Minimum frequency" limits the analysis to words that occur at least the given number of times in the test corpus. Normally, it does not make much sense to include extremely rare lexemes in an
analysis. The higher the value set, the more words drop out of selection. The fourth column in the results table shows "Deviation from expected value".
- The drop-down list "Sort results by" permits users to choose whether words are to be sorted (in descending order) based on Chi-square or log-likelihood statistics.
- The field "Maximum number of words displayed" specifies the maximum number of words to be listed. In general, the most significant key words will be shown first.
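How these parameters interact can be sketched as a simple filter, sort, and truncate step. This is only an illustration: the function name, the dictionary keys, and the sample rows are hypothetical and are not taken from the actual implementation.

```python
def select_key_words(rows, category=None, min_frequency=1,
                     sort_by="log_likelihood", max_words=100):
    """Apply the parameters described above to precomputed result rows.

    rows: list of dicts with the (hypothetical) keys "word", "category",
    "freq_test", "chi_square" and "log_likelihood".
    """
    kept = [
        r for r in rows
        if r["freq_test"] >= min_frequency                  # "Minimum frequency"
        and (category is None or r["category"] == category)  # word category
    ]
    # "Sort results by": descending order of the chosen statistic.
    kept.sort(key=lambda r: r[sort_by], reverse=True)
    # "Maximum number of words displayed"
    return kept[:max_words]

# Made-up sample rows for illustration only.
rows = [
    {"word": "corpus", "category": "noun", "freq_test": 40,
     "chi_square": 120.0, "log_likelihood": 88.0},
    {"word": "the", "category": "article", "freq_test": 900,
     "chi_square": 15.0, "log_likelihood": 14.0},
    {"word": "hapax", "category": "noun", "freq_test": 1,
     "chi_square": 30.0, "log_likelihood": 9.0},
]
print([r["word"] for r in select_key_words(rows, min_frequency=5)])
```

Raising `min_frequency` removes rare words such as the third row before sorting, so they cannot occupy places in the truncated list.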
The results of the analysis are shown in a table that sorts the key words by descending significance and provides for each word: its frequency in the test corpus, its relative
frequency in the entire corpus (reference corpus and test corpus), its deviation from the expected value assuming uniform distribution, and the Chi-square and log-likelihood statistics.
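The expected value under a uniform distribution can be illustrated as follows. The sketch assumes that the deviation is reported as the difference between the observed and the expected frequency in the test corpus; the implementation may define it differently (for example as a ratio), and the function name and counts are hypothetical.

```python
def deviation_from_expected(freq_test, size_test, freq_total, size_total):
    """Return (relative frequency in the entire corpus, expected frequency in
    the test corpus, observed minus expected).

    Assumption of this sketch: under a uniform distribution the word is
    expected in the test corpus in proportion to the test corpus size.
    """
    relative_frequency = freq_total / size_total
    expected = freq_total * size_test / size_total
    return relative_frequency, expected, freq_test - expected

# Hypothetical counts: 30 occurrences in a 1,000-word test corpus,
# 200 occurrences in an entire corpus of 100,000 words.
rel, exp, dev = deviation_from_expected(30, 1000, 200, 100000)
print(rel, exp, dev)
```

With these counts, the word would be expected twice in the test corpus, so 30 observed occurrences represent a clear positive deviation.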
References
Oakes, Michael P.: Statistics for Corpus Linguistics, Edinburgh 1998
SAS: SAS/STAT User's Guide, Version 6, Fourth Edition, Vol. I, Cary 1990
Scott, Mike: Oxford WordSmith Tools, Oxford 2004
Williams, Raymond: Keywords. A vocabulary of culture and society, New York 1983