Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
```r
textstat_collocationsdev(x, method = "all", size = 2, min_count = 2,
                         smoothing = 0.5, tolower = TRUE,
                         show_counts = FALSE, ...)

is.collocationsdev(x)
```
| `x` | a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with `tokens_remove(x, pattern, padding = TRUE)` (see Details) |
|---|---|
| `method` | association measure for detecting collocations; `"all"` computes all available measures (lambda, z, G2, chi2, pmi, and LFMD) |
| `size` | integer; the length of the collocations to be scored |
| `min_count` | numeric; minimum frequency of collocations that will be scored |
| `smoothing` | numeric; a smoothing parameter added to the observed counts (default is 0.5) |
| `tolower` | logical; if `TRUE`, form collocations as lower-cased combinations |
| `show_counts` | logical; if `TRUE`, include the observed counts in the returned data.frame |
| `...` | additional arguments passed to `tokens()` |
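For orientation, here is a minimal sketch of a call that spells out these arguments; the argument names come from the usage line above, while the values are illustrative rather than defaults:

```r
library("quanteda")

colls <- textstat_collocationsdev(
    data_corpus_inaugural[1:2],
    method    = "all",  # score with all available association measures
    size      = 3,      # score trigrams rather than the default bigrams
    min_count = 2,      # ignore patterns occurring fewer than 2 times
    smoothing = 0.5,    # added to the observed counts before scoring
    tolower   = TRUE    # form collocations as lower-cased combinations
)
```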
`textstat_collocationsdev` returns a data.frame of collocations and their
scores and statistics.
`is.collocationsdev` returns `TRUE` if the object is of class
`collocationsdev`, `FALSE` otherwise.
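For example, a quick sketch using the inaugural corpus shipped with quanteda:

```r
colls <- textstat_collocationsdev(data_corpus_inaugural[1:2], size = 2)
is.data.frame(colls)       # TRUE: scores come back as a data.frame
is.collocationsdev(colls)  # TRUE: the result carries the collocationsdev class
```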
Documents are grouped for the purposes of scoring, but collocations will not span sentences.
If `x` is a tokens object and some tokens have been removed, this should be done
using `tokens_remove(x, pattern, padding = TRUE)` so that counts will still be
accurate, but the pads will prevent those collocations from being scored.
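A short sketch of that workflow:

```r
toks <- tokens(data_corpus_inaugural[1:2])
# padding = TRUE leaves an empty placeholder at each removed position, so
# the remaining counts stay accurate and no collocation is scored across
# a removed token
toks <- tokens_remove(toks, stopwords("english"), padding = TRUE)
head(textstat_collocationsdev(toks, size = 2, min_count = 2))
```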
The lambda computed for a size = \(K\)-word target multi-word
expression is the coefficient of the \(K\)-way interaction term in the
saturated log-linear model fitted to the counts of the terms forming the set
of eligible multi-word expressions. This is the same as the "lambda" computed
in Blaheta and Johnson (2001), except that here all multi-word expressions are
considered (rather than just verbs, as in that paper). The \(z\) is the
Wald \(z\)-statistic, computed as the quotient of lambda and its estimated
standard error, as described below.
In detail:
Consider a \(K\)-word target expression \(x\), and let \(z\) be any
\(K\)-word expression. Define a comparison function \(c(x,z)=(j_{1},
\dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the
\(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0
otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots,
2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1,
\dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions
\(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\),
denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the
smoothing constant `smoothing`. The \(n_{i}\) are the counts in a
\(2^{K}\)-cell contingency table whose dimensions are defined by the
\(c_{i}\).
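To make the comparison counts concrete, here is a toy sketch of how the \(n_{i}\) arise for a bigram target; the token vector and target are invented for illustration, and none of this uses quanteda internals:

```r
# target bigram x = ("of", "the"); slide over all adjacent pairs z_r and
# tally matches in each slot
words  <- c("of", "the", "people", "of", "the", "union", "the", "state")
target <- c("of", "the")
pairs  <- cbind(words[-length(words)], words[-1])  # all adjacent bigrams z_r

match1 <- pairs[, 1] == target[1]  # slot 1 of c(x, z_r)
match2 <- pairs[, 2] == target[2]  # slot 2 of c(x, z_r)

# counts n_i for c_i = (0,0), (0,1), (1,0), (1,1), plus smoothing of 0.5
n <- table(factor(match1, c(FALSE, TRUE)),
           factor(match2, c(FALSE, TRUE))) + 0.5
```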
\(\lambda\): The \(K\)-way interaction parameter in the saturated
log-linear model fitted to the \(n_{i}\). It can be calculated as
$$\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}$$
where \(b_{i}\) is the number of elements of \(c_{i}\) equal to 1.
The Wald \(z\)-statistic is calculated as:
$$z = \frac{\lambda}{\left(\sum_{i=1}^{M} n_{i}^{-1}\right)^{1/2}}$$
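As an arithmetic check of the two formulas, here is a hand computation for a bigram (\(K = 2\), so \(M = 4\)) using hypothetical smoothed counts; for a bigram, \(\lambda\) reduces to the log odds ratio of the \(2 \times 2\) table:

```r
# hypothetical smoothed cell counts n_i, ordered by c_i = (0,0), (0,1),
# (1,0), (1,1); in practice these come from tallies over the corpus
n <- c(100.5, 10.5, 12.5, 6.5)
b <- c(0, 1, 1, 2)  # b_i = number of 1s in each c_i
K <- 2

lambda <- sum((-1)^(K - b) * log(n))  # K-way interaction coefficient
z      <- lambda / sqrt(sum(1 / n))   # Wald z-statistic
```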
This function is under active development, with more measures to be added in the next release of quanteda.
Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
```r
txts <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocationsdev(txts, size = 2, min_count = 2), 10)
#>    collocation count length   lambda        z       G2      chi2      pmi
#> 1        , and    17      2 2.643957 8.170237 49.47743 108.46212 2.927463
#> 2    have been     5      2 5.731000 7.487958 43.20136 399.03760 6.200685
#> 3       of the    24      2 1.781820 6.830093 37.22476  58.28699 1.935835
#> 4     has been     3      2 5.717327 6.584944 28.52046 323.74321 6.548608
#> 5       i have     5      2 3.772416 6.461199 26.86011 113.55789 4.463719
#> 6          , i    10      2 2.570085 6.377237 29.25016  65.92607 2.956032
#> 7      will be     4      2 3.974267 6.109305 23.64307 112.94349 4.728587
#> 8    less than     2      2 6.431212 5.663496 23.15338 373.56773 7.233106
#> 9  public good     2      2 6.431212 5.663496 23.15338 373.56773 7.233106
#> 10     which i     6      2 2.657154 5.555529 19.98871  52.21109 3.264154
#>         LFMD
#> 1  11.186029
#> 2  11.119548
#> 3  11.165255
#> 4  10.163318
#> 5   9.382582
#> 6   9.740667
#> 7   9.068437
#> 8   9.876962
#> 9   9.876962
#> 10  8.665034

head(cols <- textstat_collocationsdev(txts, size = 3, min_count = 2), 10)
#>     collocation count length   lambda         z         G2       chi2       pmi
#> 1  of which the     2      3 6.179554 2.8579715 13.4611112 23.7539935 3.2516278
#> 2      , and of     2      3 3.066282 1.7161287  4.0540624  3.9852233 1.2377281
#> 3    in which i     3      3 2.907704 1.5893955  3.4809716  3.0412877 0.7012360
#> 4       , or by     2      3 3.086502 1.3263061  2.2762489  1.9886129 0.5716844
#> 5     i have in     2      3 2.484260 1.1250830  1.6346876  1.4070556 0.4984132
#> 6     me by the     2      3 2.362269 1.0839184  1.5158738  1.3075711 0.4867261
#> 7     , and the     3      3 1.017118 1.0243655  1.0678760  1.0779195 0.5158313
#> 8    and of the     2      3 1.057485 0.8988065  0.8416763  0.8445156 0.5606277
#> 9     , i shall     3      3 1.661358 0.7605286  0.6951628  0.6084811 0.1996503
#> 10     . on the     2      3 1.014510 0.5884358  0.3960617  0.3629160 0.2626685
#>        LFMD
#> 1  5.895484
#> 2  3.881584
#> 3  4.315946
#> 4  3.215541
#> 5  3.142269
#> 6  3.130582
#> 7  4.130541
#> 8  3.204484
#> 9  3.814360
#> 10 2.906525

# extracting multi-part proper nouns (capitalized terms)
toks2 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks2, stopwords("english"), padding = TRUE)
toks2 <- tokens_select(toks2, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
seqs <- textstat_collocationsdev(toks2, size = 3, tolower = FALSE)
head(seqs, 10)
#>              collocation count length     lambda         z        G2
#> 1 United States Congress     2      3  -2.152404 -1.014623 0.7972545
#> 2    Vice President Bush     2      3 -11.582818 -4.471125 9.6364697
#>          chi2        pmi     LFMD
#> 1    1.182867 -0.1873977 2.456458
#> 2 9474.743454 -0.2634959 2.380360
```