--- title: "`ssp.glm.rF`: Balanced Subsampling for Preserving Rare Features in Generalized Linear Models" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{`ssp.glm.rF`: Balanced Subsampling for Preserving Rare Features in Generalized Linear Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette illustrates how to use `ssp.glm.rF()` for generalized linear models with rare binary features. A rare feature is a binary covariate that takes the value 1 in only a small fraction of observations. Ordinary subsampling can miss many of these expressed rare-feature observations, making estimation of the corresponding coefficients unstable. `ssp.glm.rF()` uses rarity-aware sampling probabilities for one-step balance-score sampling, two-step optimal subsampling, optional response balancing for binary outcomes, automatic rare-feature detection, and a combined estimator fitted on the union of the pilot and second-step subsamples. ## Setup ```{r setup} library(subsampling) ``` ## Simulated Logistic Regression Example We first simulate a logistic regression dataset with two rare binary features and two continuous covariates. In the formula `Y ~ .`, the model matrix contains an intercept column. Numeric `rareFeature.index` values follow the original covariate order supplied by the user; the function internally shifts the indices to account for the intercept column. ```{r} set.seed(2) N <- 5000 Z1 <- rbinom(N, 1, 0.04) Z2 <- rbinom(N, 1, 0.07) X1 <- rnorm(N) X2 <- rnorm(N) eta <- 0.5 + 0.5 * Z1 + 0.5 * Z2 + 0.5 * X1 + 0.5 * X2 Y <- rbinom(N, 1, plogis(eta)) data <- data.frame(Y, Z1, Z2, X1, X2) formula <- Y ~ . rareFeature.index <- 1:2 n.plt <- 300 n.ssp <- 700 ``` ## One-Step Balanced Sampling The default criterion is `"BL-Uni"`. It draws one Poisson subsample with probabilities proportional to the rare-feature balance score. For `"BL-Uni"` and `"Uni"`, the expected sample size is `n.plt + n.ssp`. For observation $i$, the balance score is $$ b(Z_i) = \sum_{j=1}^{d_r} \frac{|Z_{ij} - \bar{Z}_j|}{\bar{Z}_j}, $$ where $d_r$ is the number of rare features and $\bar{Z}_j$ is the prevalence of the $j$th rare feature in the full data. Observations containing expressed rare features receive larger scores and therefore larger sampling probabilities. ```{r} fit_bl <- ssp.glm.rF( formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, family = "quasibinomial", criterion = "BL-Uni", rareFeature.index = rareFeature.index ) summary(fit_bl) ``` The summary reports the realized sample size, response composition, rare-feature coverage, and coefficient estimates. Because this is a one-step method, the pilot, subsample, and combined estimators are the same. ## Two-Step Rareness-Aware Optimal Subsampling Two-step criteria such as `"Lopt"`, `"Aopt"`, `"R-Lopt"`, and `"BL-Lopt"` first draw a pilot sample, fit a pilot estimator, compute second-step sampling probabilities, and then refit the model on the second-step subsample. The final combined estimator is fitted on the union of the pilot and second-step samples. For two-step methods, `balance.X.plt = TRUE` draws the pilot sample using the balance score. ```{r} fit_rlopt <- ssp.glm.rF( formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, family = "quasibinomial", criterion = "R-Lopt", balance.X.plt = TRUE, rareFeature.index = c("Z1", "Z2") ) summary(fit_rlopt) ``` ## Automatically Account for Rarity if `rareFeature.index = NULL` If `rareFeature.index = NULL`, the function searches for binary columns whose prevalence is below `rareThreshold`. ```{r} fit_auto <- ssp.glm.rF( formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, family = "quasibinomial", criterion = "BL-Uni", rareFeature.index = NULL, rareThreshold = 0.09 ) fit_auto$rareFeature.index ``` ## Balancing the Outcome for Logistic Regression For binary outcomes, `balance.Y.ssp = TRUE` applies a case-control style allocation for one-step `"Uni"` and `"BL-Uni"` methods. The option `balance.Y.all = TRUE` includes all observations with `Y = 1` and subsamples from observations with `Y = 0`. ```{r} fit_y_balanced <- ssp.glm.rF( formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, family = "quasibinomial", criterion = "BL-Uni", balance.Y.ssp = TRUE, rareFeature.index = c("Z1", "Z2") ) c( full_Y_rate = mean(data$Y), subsample_Y_rate = fit_y_balanced$Y.proportion.ssp ) ``` ## Objective Weights By default, sampled observations are fitted with inverse-probability weights. For one-step methods whose sampling probabilities do not depend on the response, `objective.weight = "unweighted"` can be used. Two-step methods currently use a weighted second-step objective. ## Control Options The `control` argument accepts `alpha`, `b`, and `poi.method`. The default `poi.method = "exact"` computes Poisson probabilities using full-data normalization. The alternative `poi.method = "estimated"` uses the pilot sample to estimate the normalizing quantity. ```{r} fit_estimated <- ssp.glm.rF( formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, family = "quasibinomial", criterion = "R-Lopt", balance.X.plt = TRUE, rareFeature.index = c("Z1", "Z2"), control = list(poi.method = "estimated", b = 2), record.stage.time = TRUE ) fit_estimated$stage.time ``` ## Non-Binomial Families The rare-feature machinery can also be used with non-binomial GLMs. Response balancing options are ignored for non-binary outcomes. ```{r} set.seed(3) N_g <- 3000 Z <- rbinom(N_g, 1, 0.05) X <- rnorm(N_g) y <- 1 + 0.6 * Z + 0.2 * X + rnorm(N_g) gaussian_data <- data.frame(y, Z, X) fit_gaussian <- ssp.glm.rF( y ~ ., data = gaussian_data, n.plt = 200, n.ssp = 500, family = "gaussian", criterion = "BL-Uni", rareFeature.index = "Z" ) summary(fit_gaussian) ```