Linear or nonlinear penalized regression of any dependent variable on a large number of sentiment measures and, optionally, other explanatory variables. Either performs a single regression on all provided variables at once, or computes regressions sequentially for a given sample size over a longer time horizon, with associated prediction performance metrics.
sento_model(sento_measures, y, x = NULL, ctr)
| Argument | Description |
|---|---|
| sento_measures | a |
| y | a one-column |
| x | a named |
| ctr | output from a |
If ctr$do.iter = FALSE, a sento_model object, which is a list containing:
optimized regression, i.e., a model-specific glmnet object, including for example the estimated coefficients.
the input argument ctr$model, to indicate the type of model estimated.
calibrated alpha.
calibrated lambda.
output from the train call (if ctr$type = "cv"). There is no such output if the control parameters alphas and lambdas each specify a single value.
a list composed of two elements: under "criterion", the type of information criterion used in the calibration, and under "matrix", a matrix of all information criterion values, with the alphas as rows and the respective lambda values as columns (if ctr$type != "cv"). Any NA value in the latter element means the specific information criterion could not be computed.
sample reference dates as a two-element character vector, being the earliest and most recent dates from the sento_measures object accounted for in the estimation window.
a vector of size two, with respectively the number of sentiment measures, and the number of other explanatory variables inputted.
a named logical vector of length equal to the number of sentiment measures, in which TRUE indicates that the particular sentiment measure has not been considered in the regression process. A sentiment measure is not considered when it is a duplicate of another, or when at least 50% of the observations are equal to zero.
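The discarding rule described above can be illustrated with a minimal base R sketch. The helper `flag_discarded` and the toy matrix are hypothetical and not part of sentometrics; in particular, which copy of a duplicated measure the package keeps is an assumption here (the first one).

```r
# Hypothetical helper, not the package's implementation: flag columns that
# duplicate an earlier column or have at least 50% of observations equal to zero.
flag_discarded <- function(measures) {
  measures <- as.matrix(measures)
  # duplicated() works on rows, so transpose to compare columns;
  # the first occurrence is kept (assumption), later copies are flagged
  isDuplicate <- duplicated(t(measures))
  # share of observations exactly equal to zero, per column
  zeroShare <- colMeans(measures == 0)
  discarded <- isDuplicate | zeroShare >= 0.5
  names(discarded) <- colnames(measures)
  discarded
}

toy <- cbind(m1 = c(0.2, -0.1, 0.4, 0.3),
             m2 = c(0.2, -0.1, 0.4, 0.3),  # duplicate of m1
             m3 = c(0, 0, 0.5, 0),         # 75% zeros
             m4 = c(0.1, 0, -0.2, 0.6))    # 25% zeros
flag_discarded(toy)
```

As in the output element described above, TRUE marks a measure that would be left out of the regression.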
If ctr$do.iter = TRUE, a list containing:
all sparse regressions, i.e., separate sento_model objects as above, as a list named by the dates, from the perspective of the sentiment measures, at which the out-of-sample predictions are carried out.
calibrated alphas.
calibrated lambdas.
a data.frame with performance-related measures, being "RMSFE" (root mean squared forecasting error), "MAD" (mean absolute deviation), "MDA" (mean directional accuracy, in the calculation of which zero is considered positive; in p.p.), and "accuracy" (proportion of correctly predicted classes in case of a logistic regression; in p.p.), together with their respective individual values in the sample. Directional accuracy is measured by comparing the change in the realized response with the change in the prediction between two consecutive time points (omitting the very first prediction, which is NA). Only the performance statistics relevant to the type of regression are given. Dates are as in the "models" output element, i.e., from the perspective of the sentiment measures.
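For intuition, the regression-type performance measures listed above can be computed by hand as in the following base R sketch. The helper `forecast_metrics` and the toy vectors are hypothetical; the sketch reads MAD as the mean absolute forecast error and reports MDA as a percentage, which are assumptions about the package's exact conventions.

```r
# Hypothetical helper, not the package's code.
forecast_metrics <- function(realized, predicted) {
  errors <- predicted - realized
  rmsfe <- sqrt(mean(errors^2))  # root mean squared forecasting error
  mad <- mean(abs(errors))       # mean absolute deviation of the forecasts
  # direction of change between consecutive time points; a zero change counts
  # as positive, and diff() drops the first point, which has no previous value
  upRealized <- diff(realized) >= 0
  upPredicted <- diff(predicted) >= 0
  mda <- 100 * mean(upRealized == upPredicted)
  c(RMSFE = rmsfe, MAD = mad, MDA = mda)
}

realized  <- c(1.0, 1.5, 1.2, 1.2, 2.0)
predicted <- c(0.8, 1.6, 1.4, 1.3, 1.7)
forecast_metrics(realized, predicted)
```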
Models are computed using the elastic net regularization as implemented in the glmnet package, to account for
the multidimensionality of the sentiment measures. Independent variables are normalized in the regression process, but
coefficients are returned in their original space. For a helpful introduction to glmnet, we refer to their
vignette. The optimal elastic net parameters
lambda and alpha are calibrated either through a specified information criterion or through
cross-validation (based on the "rolling forecasting origin" principle, using the train function).
In the latter case, the training metric is automatically set to "RMSE" for a linear model and to "Accuracy"
for a logistic model. We suppress many of the details that can be supplied to the glmnet and
train functions we rely on, for the sake of user-friendliness.
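The "rolling forecasting origin" principle mentioned above can be sketched in base R as follows. The helper `rolling_origin_slices` is hypothetical; its argument names mirror the trainWindow and testWindow control parameters used in the examples, and the one-observation step between consecutive origins is an assumption rather than a statement of what the package does internally.

```r
# Hypothetical helper: a fixed-size training window followed directly by a
# test window, with the origin rolled forward one observation at a time.
rolling_origin_slices <- function(n, trainWindow, testWindow) {
  stopifnot(trainWindow + testWindow <= n)
  starts <- seq_len(n - trainWindow - testWindow + 1)
  lapply(starts, function(s) {
    list(train = s:(s + trainWindow - 1),
         test = (s + trainWindow):(s + trainWindow + testWindow - 1))
  })
}

slices <- rolling_origin_slices(n = 10, trainWindow = 6, testWindow = 2)
length(slices)     # number of train/test splits
slices[[1]]$train  # indices used for fitting in the first split
slices[[1]]$test   # indices used for evaluation in the first split
```

Each candidate (alpha, lambda) pair would then be fitted on every training slice and scored on the matching test slice, so that no model is ever evaluated on observations that precede its training data.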
Samuel Borms, Keven Bluteau
if (FALSE) {
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
data("epu", package = "sentometrics")

set.seed(505)

# construct a sento_measures object to start with
corpusAll <- sento_corpus(corpusdf = usnews)
corpus <- quanteda::corpus_subset(corpusAll, date >= "2004-01-01")
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = c("equal_weight", "linear"),
               by = "month", lag = 3)
sento_measures <- sento_measures(corpus, l, ctr)

# prepare y and other x variables
y <- epu[epu$date %in% get_dates(sento_measures), "index"]
length(y) == nobs(sento_measures) # TRUE
x <- data.frame(runif(length(y)), rnorm(length(y))) # two other (random) x variables
colnames(x) <- c("x1", "x2")

# a linear model based on the Akaike information criterion
ctrIC <- ctr_model(model = "gaussian", type = "AIC", do.iter = FALSE, h = 4,
                   do.difference = TRUE)
out1 <- sento_model(sento_measures, y, x = x, ctr = ctrIC)

# attribution and prediction as post-analysis
attributions1 <- attributions(out1, sento_measures,
                              refDates = get_dates(sento_measures)[20:25])
plot(attributions1, "features")

nx <- nmeasures(sento_measures) + ncol(x)
newx <- runif(nx) * cbind(data.table::as.data.table(sento_measures)[, -1], x)[30:40, ]
preds <- predict(out1, newx = as.matrix(newx), type = "link")

# an iterative out-of-sample analysis, parallelized
ctrIter <- ctr_model(model = "gaussian", type = "BIC", do.iter = TRUE, h = 3,
                     oos = 2, alphas = c(0.25, 0.75), nSample = 75, nCore = 2)
out2 <- sento_model(sento_measures, y, x = x, ctr = ctrIter)
summary(out2)

# plot predicted vs. realized values
p <- plot(out2)
p

# a cross-validation based model, parallelized
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
ctrCV <- ctr_model(model = "gaussian", type = "cv", do.iter = FALSE,
                   h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
                   testWindow = 10, oos = 0, do.progress = TRUE)
out3 <- sento_model(sento_measures, y, x = x, ctr = ctrCV)
parallel::stopCluster(cl)
foreach::registerDoSEQ()
summary(out3)

# a cross-validation based model for a binomial target
yb <- epu[epu$date %in% get_dates(sento_measures), "above"]
ctrCVb <- ctr_model(model = "binomial", type = "cv", do.iter = FALSE,
                    h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
                    testWindow = 10, oos = 0, do.progress = TRUE)
out4 <- sento_model(sento_measures, yb, x = x, ctr = ctrCVb)
summary(out4)
}