Construct a sparse document-feature matrix from a tokens object or another dfm object.
dfm(
  x,
  tolower = TRUE,
  remove_padding = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)
x: a tokens or dfm object
tolower: convert all features to lowercase
remove_padding: logical; if TRUE, remove the "pads" left as empty tokens after calling tokens() or tokens_remove() with padding = TRUE
verbose: display messages if TRUE
...: not used directly
Returns a dfm object.
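The effect of remove_padding can be seen in a small sketch, assuming quanteda v3 or later is installed. The object names (toks_pad, dfmat_pad, dfmat_nopad) are chosen here for illustration only.

```r
library(quanteda)

# Removing "b" with padding = TRUE leaves an empty-string token ("pad")
# in each position where a token was removed.
toks <- tokens(c("a b c", "A B C D"))
toks_pad <- tokens_remove(toks, "b", padding = TRUE)

# By default, the pad is counted as a feature named "".
dfmat_pad <- dfm(toks_pad)

# With remove_padding = TRUE, the "" feature is dropped when the dfm is built.
dfmat_nopad <- dfm(toks_pad, remove_padding = TRUE)

featnames(dfmat_pad)
featnames(dfmat_nopad)
```

Setting remove_padding = TRUE should give the same features as building the dfm first and then calling dfm_remove(pattern = "").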
In quanteda v3, many convenience functions formerly available in
dfm() were deprecated. Formerly, dfm() could be called directly on a
character or corpus object, but we now steer users to tokenise their
inputs first using tokens(). Other convenience arguments to dfm() were
also removed, such as select, dictionary, thesaurus, and groups. All
of these functions are available elsewhere, e.g. through dfm_group().
See news(Version >= "2.9", package = "quanteda") for details.
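A minimal sketch of the tokens-first workflow described above, assuming quanteda v3 or later is installed. The texts, the party docvar, and the fiscal dictionary key are invented for illustration. dfm_select() and dfm_lookup() are used here as plausible stand-ins for the removed select and dictionary arguments; the text above names only dfm_group() explicitly.

```r
library(quanteda)

# Hypothetical example data: three short texts with a made-up docvar.
txt <- c(d1 = "Taxes and spending rose.",
         d2 = "Spending fell.",
         d3 = "Taxes fell again.")
corp <- corpus(txt, docvars = data.frame(party = c("A", "B", "A")))

# Tokenise first, then build the dfm from the tokens object.
toks <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)

# Grouping documents (formerly the groups argument of dfm()).
dfmat_grouped <- dfm_group(dfmat, groups = docvars(dfmat, "party"))

# Keeping only some features (formerly the select argument).
dfmat_sel <- dfm_select(dfmat, pattern = c("taxes", "spending"))

# Applying a dictionary (formerly the dictionary argument).
dict <- dictionary(list(fiscal = c("tax*", "spending")))
dfmat_dict <- dfm_lookup(dfmat, dictionary = dict)
```

Keeping each step as a separate call makes it explicit whether an operation acts on the tokens or on the finished matrix, which the single dfm() call with many arguments tended to hide.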
## for a corpus
toks <- data_corpus_inaugural %>%
  corpus_subset(Year > 1980) %>%
  tokens()
dfm(toks)
#> Document-feature matrix of: 11 documents, 3,426 features (78.47% sparse) and 4 docvars.
#>               features
#> docs           senator hatfield   , mr   . chief justice president vice bush
#>   1981-Reagan        2        1 174  3 130     1       1         5    2    1
#>   1985-Reagan        4        0 177  0 124     1       1         3    1    1
#>   1989-Bush          2        0 166  6 142     1       2         6    1    0
#>   1993-Clinton       0        0 139  0  81     0       0         2    0    1
#>   1997-Clinton       0        0 131  0 108     0       1         1    0    0
#>   2001-Bush          0        0 110  0  96     0       3         3    1    0
#> [ reached max_ndoc ... 5 more documents, reached max_nfeat ... 3,416 more features ]
# removal options
toks <- tokens(c("a b c", "A B C D")) %>%
  tokens_remove("b", padding = TRUE)
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "" "c"
#>
#> text2 :
#> [1] "A" "" "C" "D"
#>
dfm(toks)
#> Document-feature matrix of: 2 documents, 4 features (12.50% sparse) and 0 docvars.
#>        features
#> docs      a c d
#>   text1 1 1 1 0
#>   text2 1 1 1 1
dfm(toks) %>%
  dfm_remove(pattern = "") # remove "pads"
#> Document-feature matrix of: 2 documents, 3 features (16.67% sparse) and 0 docvars.
#>        features
#> docs    a c d
#>   text1 1 1 0
#>   text2 1 1 1
# preserving case
dfm(toks, tolower = FALSE)
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs      a c A C D
#>   text1 1 1 1 0 0 0
#>   text2 1 0 0 1 1 1