---
title: "`ssp.glm.rF`: Balanced Subsampling for Preserving Rare Features in Generalized Linear Models"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{`ssp.glm.rF`: Balanced Subsampling for Preserving Rare Features in Generalized Linear Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette illustrates how to use `ssp.glm.rF()` for generalized linear
models with rare binary features. A rare feature is a binary covariate that takes
the value 1 in only a small fraction of observations. Ordinary subsampling can
miss many of these expressed rare-feature observations, making estimation of the
corresponding coefficients unstable.

`ssp.glm.rF()` uses rarity-aware sampling probabilities for one-step
balance-score sampling, two-step optimal subsampling, optional response
balancing for binary outcomes, automatic rare-feature detection, and a combined
estimator fitted on the union of the pilot and second-step subsamples.

## Setup

```{r setup}
library(subsampling)
```

## Simulated Logistic Regression Example

We first simulate a logistic regression dataset with two rare binary features and
two continuous covariates. In the formula `Y ~ .`, the model matrix contains an
intercept column. Numeric `rareFeature.index` values follow the original
covariate order supplied by the user; the function internally shifts the indices
to account for the intercept column.

```{r}
set.seed(2)
N <- 5000
Z1 <- rbinom(N, 1, 0.04)
Z2 <- rbinom(N, 1, 0.07)
X1 <- rnorm(N)
X2 <- rnorm(N)
eta <- 0.5 + 0.5 * Z1 + 0.5 * Z2 + 0.5 * X1 + 0.5 * X2
Y <- rbinom(N, 1, plogis(eta))

data <- data.frame(Y, Z1, Z2, X1, X2)
formula <- Y ~ .
rareFeature.index <- 1:2
n.plt <- 300
n.ssp <- 700
```

## One-Step Balanced Sampling

The default criterion is `"BL-Uni"`. It draws one Poisson subsample with
probabilities proportional to the rare-feature balance score. For `"BL-Uni"` and
`"Uni"`, the expected sample size is `n.plt + n.ssp`.

For observation $i$, the balance score is
$$
b(Z_i) = \sum_{j=1}^{d_r}
\frac{|Z_{ij} - \bar{Z}_j|}{\bar{Z}_j},
$$
where $d_r$ is the number of rare features and $\bar{Z}_j$ is the prevalence of
the $j$th rare feature in the full data. Observations containing expressed rare
features receive larger scores and therefore larger sampling probabilities.

```{r}
fit_bl <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  rareFeature.index = rareFeature.index
)

summary(fit_bl)
```

The summary reports the realized sample size, response composition, rare-feature
coverage, and coefficient estimates. Because this is a one-step method, the
pilot, subsample, and combined estimators are the same.

## Two-Step Rareness-Aware Optimal Subsampling

Two-step criteria such as `"Lopt"`, `"Aopt"`, `"R-Lopt"`, and `"BL-Lopt"` first
draw a pilot sample, fit a pilot estimator, compute second-step sampling
probabilities, and then refit the model on the second-step subsample. The final
combined estimator is fitted on the union of the pilot and second-step samples.

For two-step methods, `balance.X.plt = TRUE` draws the pilot sample using the
balance score.

```{r}
fit_rlopt <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "R-Lopt",
  balance.X.plt = TRUE,
  rareFeature.index = c("Z1", "Z2")
)

summary(fit_rlopt)
```

## Automatically Account for Rarity if `rareFeature.index = NULL`

If `rareFeature.index = NULL`, the function searches for binary columns whose
prevalence is below `rareThreshold`.

```{r}
fit_auto <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  rareFeature.index = NULL,
  rareThreshold = 0.09
)

fit_auto$rareFeature.index
```

## Balancing the Outcome for Logistic Regression

For binary outcomes, `balance.Y.ssp = TRUE` applies a case-control style
allocation for one-step `"Uni"` and `"BL-Uni"` methods. The option
`balance.Y.all = TRUE` includes all observations with `Y = 1` and subsamples
from observations with `Y = 0`.

```{r}
fit_y_balanced <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  balance.Y.ssp = TRUE,
  rareFeature.index = c("Z1", "Z2")
)

c(
  full_Y_rate = mean(data$Y),
  subsample_Y_rate = fit_y_balanced$Y.proportion.ssp
)
```

## Objective Weights

By default, sampled observations are fitted with inverse-probability weights.
For one-step methods whose sampling probabilities do not depend on the response,
`objective.weight = "unweighted"` can be used. Two-step methods currently use a
weighted second-step objective.

## Control Options

The `control` argument accepts `alpha`, `b`, and `poi.method`. The default
`poi.method = "exact"` computes Poisson probabilities using full-data
normalization. The alternative `poi.method = "estimated"` uses the pilot sample
to estimate the normalizing quantity.

```{r}
fit_estimated <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "R-Lopt",
  balance.X.plt = TRUE,
  rareFeature.index = c("Z1", "Z2"),
  control = list(poi.method = "estimated", b = 2),
  record.stage.time = TRUE
)

fit_estimated$stage.time
```

## Non-Binomial Families

The rare-feature machinery can also be used with non-binomial GLMs. Response
balancing options are ignored for non-binary outcomes.

```{r}
set.seed(3)
N_g <- 3000
Z <- rbinom(N_g, 1, 0.05)
X <- rnorm(N_g)
y <- 1 + 0.6 * Z + 0.2 * X + rnorm(N_g)
gaussian_data <- data.frame(y, Z, X)

fit_gaussian <- ssp.glm.rF(
  y ~ .,
  data = gaussian_data,
  n.plt = 200,
  n.ssp = 500,
  family = "gaussian",
  criterion = "BL-Uni",
  rareFeature.index = "Z"
)

summary(fit_gaussian)
```