Collocation analysis of a lemma

Within linguistics, the term collocation denotes the co-occurrence of two or more words on a regular basis. Collocations may be lexical ("New York") or quasi-lexical ("at first go") in status. In a broader sense, collocations reflect idiomatic expressions (e.g., "to kick the bucket") or stereotyped expressions (e.g., "bitterly cold").

Statistical collocation analysis identifies collocations by statistical means. The approach taken here is as follows: starting from a specific lexeme under investigation (the node word), the analysis determines which other words occur in the environment of this node word more frequently than their overall frequency in the examined text corpus would lead one to expect.

To specify the environment of a node word, a window span is defined around it, and collocates are searched for within this window. No objective rules can be laid down for the position and size of the search window. Lexical and quasi-lexical collocations are found in the immediate vicinity of the node word. The larger the search window, the more collocations will be included.
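The effect of the window span can be illustrated with a minimal Python sketch. It is not the implementation used by the Thesaurus Linguae Aegyptiae; the function name `window_counts`, its parameters, and the toy token list are assumptions made purely for illustration.

```python
from collections import Counter


def window_counts(tokens, node, left=2, right=2):
    """Count the words found within `left` tokens before and `right`
    tokens after each occurrence of `node` (hypothetical helper)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Toy corpus: placeholder tokens, not real data.
tokens = ["a", "b", "node", "c", "d", "e", "node", "c"]

narrow = window_counts(tokens, "node", left=1, right=1)
wide = window_counts(tokens, "node", left=2, right=2)
```

With a span of one word on each side, only `b`, `c` and `e` are found; widening the span to two words also brings in `a` and `d`. Note that in this simple sketch a token lying in two overlapping windows (here `d`) is counted once per window.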

To evaluate the word combinations found, test statistics are calculated for each pair of node word and collocate, providing a measure of how conspicuous their co-occurrence is. MI scores and T scores are the most frequently applied measures. MI scores tend to bring out rare but close terminological contexts, whereas T scores prioritize frequent collocations. The two measures are thus complementary. Simply put, MI scores show what is "conspicuous", T scores what is "safe". Hence, it is useful to compare results using both measures.
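The complementary behaviour of the two measures can be seen in a small Python sketch. The text does not state which formulas are used here, so the sketch assumes common textbook definitions: an expected frequency E = f(node) × f(collocate) × span / N, MI = log2(O / E), and T = (O − E) / √O, where O is the observed co-occurrence frequency and N the corpus size. All function names and figures are invented for illustration.

```python
import math


def expected_frequency(f_node, f_collocate, corpus_size, span):
    """Co-occurrences expected by chance (assumed textbook formula)."""
    return f_node * f_collocate * span / corpus_size


def mi_score(observed, expected):
    """Mutual information: log2 of observed over expected frequency."""
    return math.log2(observed / expected)


def t_score(observed, expected):
    """T score: difference between observed and expected frequency,
    scaled by the square root of the observed frequency."""
    return (observed - expected) / math.sqrt(observed)


# Invented figures: node word occurs 100 times in a corpus of
# 1,000,000 tokens, with a window of 4 positions in total.
N, F_NODE, SPAN = 1_000_000, 100, 4

# A rare collocate (5 occurrences overall, 4 of them near the node).
e_rare = expected_frequency(F_NODE, 5, N, SPAN)
mi_rare, t_rare = mi_score(4, e_rare), t_score(4, e_rare)

# A frequent collocate (20,000 occurrences overall, 60 near the node).
e_freq = expected_frequency(F_NODE, 20_000, N, SPAN)
mi_freq, t_freq = mi_score(60, e_freq), t_score(60, e_freq)
```

Under these assumptions the rare collocate receives the higher MI score, while the frequent collocate receives the higher T score, which is why the two rankings are worth comparing.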

Note: Both measures serve only to assess the relative conspicuousness of word combinations. The linguistic data available here do not meet the prerequisites for tests of statistical significance in the strict sense. For this reason, measures of significance are not calculated here.

This function, made available within the Thesaurus Linguae Aegyptiae, is still experimental. Users should bear in mind that a fully-fledged collocation analysis is possible only for clearly defined text corpora or a balanced overall corpus. At the present stage of development, the text database meets these prerequisites only to a limited extent.
For the time being, the results of such an analysis are best regarded as pointers to potentially interesting word combinations, whose meaning remains to be verified against the specific texts cited and within a fully valid philological and historico-cultural argument.

Requesting a collocation analysis for a lemma: The lemma from whose display the collocation analysis is requested is pre-set as the node word in the query form. The parameters of the analysis can be set in the relevant fields of this form. To define the search window, minimum and maximum word spans to the left and to the right of the node word need to be chosen, with '10' being the highest permitted value. An analysis may be run for individual lemmata or, in the case of hierarchically defined groups, for entire groups. As a rule, the latter option, which is the default setting, should be preferred. The results may be listed according to MI scores, T scores, or both. In addition, a maximum number of collocations may be set. Results are listed from the statistically most conspicuous collocations down to the less conspicuous ones, and the output ends when the maximum number is reached. Once all parameters have been selected, clicking on the button "Run analysis" starts the procedure.
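The ordering and truncation of the output described above can be sketched in a few lines of Python. This is an illustration only: the function `rank_collocates`, the record layout, and the scores are hypothetical and do not reflect the actual query form or its internals.

```python
def rank_collocates(rows, sort_by="mi", limit=20):
    """Return at most `limit` rows, most conspicuous first.

    `rows` is a list of dicts with the hypothetical keys
    "lemma", "mi" and "t"; `sort_by` selects the measure.
    """
    if sort_by not in ("mi", "t"):
        raise ValueError("sort_by must be 'mi' or 't'")
    ranked = sorted(rows, key=lambda row: row[sort_by], reverse=True)
    return ranked[:limit]


# Invented scores for three placeholder collocates.
rows = [
    {"lemma": "A", "mi": 10.9, "t": 2.0},
    {"lemma": "B", "mi": 2.9, "t": 6.7},
    {"lemma": "C", "mi": 5.0, "t": 4.0},
]

by_mi = [row["lemma"] for row in rank_collocates(rows, "mi", limit=2)]
by_t = [row["lemma"] for row in rank_collocates(rows, "t", limit=2)]
```

Listing the results by both measures amounts to calling the function once per measure; with the invented figures above, the two lists begin with different collocates.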

Results Display: So that the values can be verified, the results window first displays all parameters of the analysis run. Below this is a table with information on the collocates retrieved, sorted as requested. For each collocate, basic information on the lemma is provided. Clicking on the transliteration of a lemma takes users to that lemma in the lemma list. The subsequent columns give the overall frequency of the collocate within the corpus and its frequency within the window span around the node word. Clicking on these frequency data calls up all citations for the collocation in question. The last two columns contain the test statistics.

