This function selects or removes features from a dfm or fcm,
based on feature name matches with pattern. The most common usages
are to eliminate features from a dfm already constructed, such as stopwords,
or to select only terms of interest from a dictionary.
dfm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
min_nchar = NULL,
max_nchar = NULL,
padding = FALSE,
verbose = quanteda_options("verbose")
)
dfm_remove(x, ...)
dfm_keep(x, ...)
fcm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
verbose = quanteda_options("verbose"),
...
)
fcm_remove(x, ...)
fcm_keep(x, ...)
x: the dfm or fcm object whose features will be selected
pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
selection: whether to keep or remove the features
valuetype: the type of pattern matching: "glob" for "glob"-style
wildcard expressions; "regex" for regular expressions; or "fixed" for
exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a
pattern or dictionary values
min_nchar, max_nchar: optional numerics specifying the minimum and
maximum length in characters for tokens to be removed or kept; defaults are
NULL for no limits. These are applied after (and hence, in addition
to) any selection based on pattern matches.
padding: if TRUE, record the number of removed tokens in the first column.
verbose: if TRUE, print a message about how many features were
removed
...: used only for passing arguments from dfm_remove or
dfm_keep to dfm_select. Cannot include
selection.
A dfm or fcm object, after the feature selection has been applied.
For compatibility with earlier versions, when pattern is a
dfm object and selection = "keep", then this will be
equivalent to calling dfm_match(). In this case, the following
settings are always used: case_insensitive = FALSE, and
valuetype = "fixed". This functionality is deprecated, however, and
you should use dfm_match() instead.
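As a minimal sketch of the recommended replacement, the following makes one dfm conform to the feature set of another with dfm_match(). The two toy texts and the object names (dfmat1, dfmat2, dfmat_matched) are illustrative assumptions, not part of the original examples.

```r
library("quanteda")

# Two small dfms with different feature sets (illustrative data)
dfmat1 <- dfm(tokens("a b c d"))
dfmat2 <- dfm(tokens("b c e"))

# Make dfmat1 conform to the features of dfmat2: features missing from
# dfmat1 ("e") are added with zero counts, and features not found in
# dfmat2 ("a", "d") are dropped
dfmat_matched <- dfm_match(dfmat1, features = featnames(dfmat2))
featnames(dfmat_matched)
```

Unlike selecting with a pattern, this pads absent features, so the result always has the same columns as the target feature set.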
dfm_remove and fcm_remove are simply convenience
wrappers for calling dfm_select and fcm_select with
selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for
calling dfm_select and fcm_select with selection = "keep".
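The equivalence between the wrappers and dfm_select can be checked directly. This is a small sketch on made-up text; the object names (pat, removed_wrapper, removed_select, kept_wrapper) are assumptions chosen for illustration.

```r
library("quanteda")

dfmat <- dfm(tokens("the quick brown fox jumps over the lazy dog"))
pat <- c("the", "over")

# The wrapper and the explicit selection argument drop the same features
removed_wrapper <- dfm_remove(dfmat, pattern = pat)
removed_select <- dfm_select(dfmat, pattern = pat, selection = "remove")
identical(featnames(removed_wrapper), featnames(removed_select))

# dfm_keep retains only the features that match the pattern
kept_wrapper <- dfm_keep(dfmat, pattern = pat)
featnames(kept_wrapper)
```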
This function selects features based on their labels. To select
features based on the values of the document-feature matrix, use
dfm_trim().
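To illustrate the distinction, here is a brief sketch contrasting selection by label with selection by value. The toy texts, the glob pattern "b*", and the threshold of 2 are assumptions made only for this example.

```r
library("quanteda")

dfmat <- dfm(tokens(c("apple apple banana", "apple cherry")))

# Selection by label: keep features whose names match a glob pattern
by_label <- dfm_select(dfmat, pattern = "b*")
featnames(by_label)

# Selection by value: keep features occurring at least twice overall
by_value <- dfm_trim(dfmat, min_termfreq = 2)
featnames(by_value)
```

Here the label-based call keeps "banana" regardless of how often it occurs, while the value-based call keeps "apple" because it is the only feature with a total count of 2 or more.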
library("quanteda")
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
                      "Does the United_States or Sweden have more progressive taxation?")),
             tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))
dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#>        features
#> docs    by
#>   text1  1
#>   text2  0
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have
#>   text1      1    1          1   1    1 1   0  0      0    0
#>   text2      0    0          0   0    0 0   1  1      1    1
#> [ reached max_nfeat ... 4 more features ]
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation
#>   text1        0
#>   text2        1
#> [ reached max_nfeat ... 1 more feature ]
# select based on character length
dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1
dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
                      "No if, and, or but about it: lots of stopwords.")))
dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#>        features
#> docs    this is a document with lots of stopwords . no
#>   text1    1  1 1        1    1    1  1         1 1  0
#>   text2    0  0 0        0    0    1  1         1 1  1
#> [ reached max_nfeat ... 8 more features ]
dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#>            features
#> features    this contains lots of stopwords no if and or but
#>   this         0        1    1  1         1  0  0   0  0   0
#>   contains     0        0    1  1         1  0  0   0  0   0
#>   lots         0        0    0  1         1  1  1   1  1   1
#>   of           0        0    0  0         1  0  0   0  0   0
#>   stopwords    0        0    0  0         0  0  0   0  0   0
#>   no           0        0    0  0         0  0  1   1  1   1
#>   if           0        0    0  0         0  0  0   1  1   1
#>   and          0        0    0  0         0  0  0   0  1   1
#>   or           0        0    0  0         0  0  0   0  0   1
#>   but          0        0    0  0         0  0  0   0  0   0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0