---
title: "ggchangepoint: A Unified Tidy Interface for Changepoint Analysis in R"
author: "Youzhi Yu
  University of Chicago"
date: "`r Sys.Date()`"
bibliography: vignette_reference.bib
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ggchangepoint: A Unified Tidy Interface for Changepoint Analysis in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 8,
  fig.height = 5,
  message = FALSE,
  warning = FALSE,
  comment = "#>",
  fig.alt = "ggchangepoint plot of a time series with detected changepoints"
)
# The optional detection engines are listed in Suggests; gate the chunks that
# need them so the vignette still builds when an engine is not installed.
has_fpop <- requireNamespace("fpop", quietly = TRUE)
has_wbs <- requireNamespace("wbs", quietly = TRUE)
has_not <- requireNamespace("not", quietly = TRUE)
```
# Abstract
**ggchangepoint** is an R package that provides a unified, tidy interface to
changepoint detection across multiple algorithmic backends. It introduces the
`ggcpt` S3 result class with `broom`-style methods (`tidy()`, `glance()`,
`augment()`), a central `cpt_detect()` dispatcher supporting over a dozen
detection algorithms, native `ggplot2` visualization via `autoplot()` and
specialised geoms, method comparison and accuracy evaluation modules, and a
data simulation framework with canonical test signals. By harmonising the
disparate APIs of existing R changepoint packages behind a single convention,
ggchangepoint lowers the barrier to exploratory changepoint analysis and
reproducible method comparison.
# Introduction
Changepoint detection---the problem of identifying points in a sequence at
which the underlying statistical properties change---is a fundamental task in
time series analysis [@truong2020selective; @aminikhanghahi2017survey]. It has
applications across virtually every domain that involves sequential data,
including genomics [@picard2005statistical], finance [@athey2022detecting],
climate science [@haslett1989space], and signal processing [@lavielle2005using].

The R ecosystem offers a rich set of changepoint packages, each implementing one
or more detection algorithms with its own conventions for input, output, and
parameterisation. The **changepoint** package [@killick2014changepoint] provides
PELT [@killick2012pelt], Binary Segmentation [@scott1974cluster;
@vostrikova1981detecting], Segmented Neighbourhood, and AMOC. The **wbs**
[@fryzlewicz2014wild] and **breakfast** packages implement Wild Binary
Segmentation and its variants, while **not** [@baranowski2019narrowest],
**mosum** [@eichinger2018mosum], **fpop** [@maidstone2017optimal],
**IDetect** [@anastasiou2022idetect], and others
offer further specialised algorithms. On the nonparametric side,
**changepoint.np** [@haynes2017computationally] and **ecp**
[@james2014ecp; @matteson2014nonparametric] handle distributional changes.

While this diversity is a strength of the R community, it creates practical
difficulties for the analyst. Each package uses a different result class,
different naming conventions for parameters, different plot methods, and
different changepoint indexing conventions. Comparing the output of several
detectors on the same data---a standard practice for robust analysis---requires
the user to write manual conversion code. Furthermore, none of the packages
listed above natively produces `ggplot2` [@wickham2016ggplot2] graphics or
supports the `broom` [@robinson2017broom] convention for tidy data extraction.

ggchangepoint addresses these problems by providing a single, consistent
interface that wraps the most widely used detection packages. Its design goals
are:

1. **Uniformity**: a single `ggcpt` result class regardless of the underlying
detection engine, with `broom`-style methods for tidy data access.
2. **Discoverability**: a central `cpt_detect()` dispatcher whose documentation
lists all supported methods and their capabilities.
3. **Visualisation**: first-class `ggplot2` integration through `autoplot()`
and specialised geoms.
4. **Comparison and evaluation**: built-in tools for running multiple detectors,
tabulating their results, computing accuracy metrics, and visualising
discrepancies.
5. **Reproducibility**: a simulation framework that generates data with known
changepoints, enabling rigorous benchmarking.
# The ggcpt Result Class
The `ggcpt` class is an S3 class that stores the complete output of a
changepoint detection in a structured format. Every detection function in
ggchangepoint---whether called through `cpt_detect()` or directly---returns a
`ggcpt` object, ensuring a uniform interface for downstream processing.
```{r}
library(ggchangepoint)
library(ggplot2)
library(generics)
theme_set(theme_light())
set.seed(2022)
x <- c(rnorm(100, 0, 1), rnorm(100, 10, 1))
res <- cpt_detect(x, method = "pelt", change_in = "mean")
class(res)
```
A `ggcpt` object is a named list with the following components:
- **`changepoints`**: a tibble of detected changepoint locations (`cp`) and
their corresponding data values (`cp_value`).
- **`segments`**: a tibble describing the fitted segments (segment ID, start,
end, length, and the segment-level parameter estimate).
- **`data`**: the original data series as a tidy tibble of `index` and `value`.
- **metadata fields**: `method`, `change_in`, `penalty`, `fit` (the raw
upstream object), `call`, and `cp_convention`.
```{r}
print(res)
```
The `cp_convention` component records which indexing convention the changepoint
locations follow: the *left-segment* convention, in which a changepoint is the
last observation before the change (as in the **changepoint** package), or the
*right-segment* convention, in which it is the first observation after the
change. All ggchangepoint methods report locations under the left-segment
convention; results from packages that use the alternative convention (e.g.,
**ecp**) are normalised automatically so that methods can be compared on a
common footing.
```{r}
res$cp_convention
```
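As a small illustration of the two indexing conventions (base R only; the
helper names `left_to_right()` and `right_to_left()` are hypothetical and not
part of ggchangepoint), consider a toy series whose mean shifts between
observations 5 and 6. The two conventions differ by exactly one index:

```{r}
# Toy series: the mean shifts between observations 5 and 6.
y_demo <- c(rep(0, 5), rep(10, 5))

cp_left  <- 5L            # left-segment: last observation before the change
cp_right <- cp_left + 1L  # right-segment: first observation after the change

# Hypothetical converters between the two conventions.
right_to_left <- function(cp) cp - 1L
left_to_right <- function(cp) cp + 1L

stopifnot(right_to_left(cp_right) == cp_left,
          left_to_right(cp_left) == cp_right)

# Values on either side of the change.
y_demo[c(cp_left, cp_right)]
```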
## Broom-Style Methods
Following the `broom` convention [@robinson2017broom] for standardised data
access, `ggcpt` objects support `tidy()`, `glance()`, and `augment()`.
**`tidy()`** returns the changepoint locations as a tibble, one row per
changepoint:
```{r}
generics::tidy(res)
```
**`glance()`** returns a one-row summary with the series length, number of
detected changepoints, method, change type, and penalty information:
```{r}
generics::glance(res)
```
**`augment()`** returns the original data augmented with segment identifiers,
fitted segment-level parameter estimates, residuals, and a logical flag
indicating changepoint positions:
```{r}
generics::augment(res)
```
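To make the columns described above concrete, the following base R sketch
builds a comparable table by hand for a toy series with changepoints at indices
4 and 8 (left-segment convention). The object and column names are illustrative
assumptions and need not match those returned by `augment()`:

```{r}
y_demo2  <- c(1, 2, 1, 2, 9, 10, 9, 10, 5, 5)
cp_demo2 <- c(4L, 8L)  # last index of the first and second regime

# Segment id = 1 + number of changepoints strictly before the observation.
seg_id <- findInterval(seq_along(y_demo2), cp_demo2, left.open = TRUE) + 1L

# Fitted value = mean of the segment the observation belongs to.
fitted_demo <- ave(y_demo2, seg_id)

augmented_demo <- data.frame(
  index          = seq_along(y_demo2),
  value          = y_demo2,
  segment        = seg_id,
  fitted         = fitted_demo,
  resid          = y_demo2 - fitted_demo,
  is_changepoint = seq_along(y_demo2) %in% cp_demo2
)
augmented_demo
```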
These methods make it straightforward to pipe ggchangepoint results into
further analysis or custom visualisation.
# Unified Detection Dispatcher
The `cpt_detect()` function serves as the primary entry point for changepoint
detection. It accepts a data series, a method name, a change type, and optional
penalty parameters, and dispatches to the appropriate backend wrapper:
```{r}
cpt_detect(x, method = "pelt", change_in = "mean")
```
```{r, eval = has_fpop}
cpt_detect(x, method = "fpop", change_in = "mean")
```
The following detection methods are currently supported:

| Method     | Package(s)     | Change types                |
|------------|----------------|-----------------------------|
| `pelt`     | changepoint    | mean, var, meanvar          |
| `binseg`   | changepoint    | mean, var, meanvar          |
| `segneigh` | changepoint    | mean, var, meanvar          |
| `amoc`     | changepoint    | mean, var, meanvar          |
| `fpop`     | fpop           | mean                        |
| `wbs`      | wbs            | mean                        |
| `wbs2`     | breakfast      | mean                        |
| `not`      | not            | mean, var, meanvar          |
| `mosum`    | mosum          | mean, var                   |
| `idetect`  | IDetect        | mean                        |
| `tguh`     | breakfast      | mean                        |
| `np`       | changepoint.np | distribution                |
| `ecp`      | ecp            | distribution (multivariate) |

## Penalty Specification
For methods that support it (PELT, Binary Segmentation, Segmented Neighbourhood,
AMOC, FPOP), the penalty parameter can be controlled via `cpt_penalty()`
or the `penalty` argument. Penalties can be information criteria---`"BIC"`
[@yao1988estimating], `"AIC"`---or user-specified numeric values:
```{r}
cpt_detect(x, method = "pelt", change_in = "mean", penalty = "BIC")
```
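As general background rather than a description of ggchangepoint's internals,
penalised-cost methods such as PELT choose the number of changepoints $m$ and
their locations $\tau_1 < \dots < \tau_m$ by minimising a segment-wise cost
$\mathcal{C}$ plus a penalty $\beta$ per changepoint (the notation is an
assumption introduced here for illustration):

$$
\min_{m,\;\tau_1 < \dots < \tau_m} \; \sum_{i=1}^{m+1}
\mathcal{C}\bigl(y_{(\tau_{i-1}+1):\tau_i}\bigr) + \beta\, m,
\qquad \tau_0 = 0,\quad \tau_{m+1} = n .
$$

A larger $\beta$ yields fewer changepoints. BIC corresponds to a $\beta$
proportional to $\log n$, AIC to a constant $\beta$, and a numeric penalty
fixes $\beta$ directly, for example $\beta = 2 \log n$ for a series of length
$n$.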
```{r, eval = has_fpop}
cpt_detect(x, method = "fpop", change_in = "mean", penalty = 2 * log(200))
```
# Visualisation
ggchangepoint provides several layers of `ggplot2` integration, from
one-function plotting to fully customisable geoms.
## The autoplot() Method
The recommended way to visualise a `ggcpt` result is through `autoplot()`, which
produces a `ggplot2` object showing the data series with changepoint locations
marked by vertical lines:
```{r}
ggplot2::autoplot(res)
```
Alternating shaded segments help delineate regimes:
```{r}
ggplot2::autoplot(res, show_segments = TRUE)
```
## The Original Plotting Functions
The `ggcptplot()` and `ggecpplot()` functions from version 0.1.0 are retained
for backward compatibility:
```{r}
ggcptplot(x)
ggecpplot(x, min_size = 10)
```
## ggplot2 Geoms
For users who wish to build custom visualisations, ggchangepoint provides three
new geoms and one stat.
**`geom_changepoint()`** adds vertical lines at changepoint positions:
```{r}
ggplot(data.frame(t = seq_along(x), y = x), aes(t, y)) +
  geom_line() +
  geom_changepoint(data = generics::tidy(res), aes(xintercept = cp))
```
**`geom_cpt_segment()`** draws the fitted segment-level means between
changepoints:
```{r}
seg <- res$segments
ggplot(data.frame(t = seq_along(x), y = x), aes(t, y)) +
  geom_line() +
  geom_cpt_segment(data = seg,
                   aes(x = start, xend = end, y = param_estimate,
                       yend = param_estimate),
                   colour = "steelblue", linewidth = 1.2)
```
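For readers who want to see what these layers amount to in plain `ggplot2`,
the sketch below draws comparable marks with `geom_vline()` and
`geom_segment()` on a toy series. It is an approximation for illustration only:
the data frames `df_demo` and `seg_demo` are made up, and the ggchangepoint
geoms may differ in defaults and styling:

```{r}
library(ggplot2)

# Toy series with changes after t = 4 and t = 8 (left-segment convention).
df_demo <- data.frame(t = 1:10, y = c(1, 2, 1, 2, 9, 10, 9, 10, 5, 5))

# One row per segment: start, end, and the segment mean.
seg_demo <- data.frame(start = c(1, 5, 9),
                       end   = c(4, 8, 10),
                       est   = c(1.5, 9.5, 5))

p_demo <- ggplot(df_demo, aes(t, y)) +
  geom_line() +
  geom_vline(xintercept = c(4, 8), linetype = "dashed") +
  geom_segment(data = seg_demo,
               aes(x = start, xend = end, y = est, yend = est),
               inherit.aes = FALSE, colour = "steelblue", linewidth = 1.2)
p_demo
```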
**`stat_changepoint()`** runs `cpt_detect()` inline within the ggplot pipeline:
```{r}
ggplot(data.frame(t = seq_along(x), y = x), aes(t, y)) +
  geom_line() +
  stat_changepoint(method = "pelt", change_in = "mean")
```
**`geom_cpt_ci()`** adds confidence intervals around segment estimates.
# Method Comparison
Robust changepoint analysis typically involves running multiple detectors on
the same data and comparing their outputs. ggchangepoint provides dedicated
comparison functions for this purpose.
## Side-by-Side and Overlay Plots
`ggcpt_compare()` runs several methods and arranges the results either as
facetted panels (one per method) or as an overlay with colour-coded
changepoint markers:
```{r}
x3 <- c(rnorm(100, 0, 1), rnorm(100, 10, 1), rnorm(100, 5, 2))
cmp_methods <- if (has_fpop) c("pelt", "binseg", "fpop") else c("pelt", "binseg", "amoc")
ggcpt_compare(x3, methods = cmp_methods, layout = "facet")
```
```{r}
ggcpt_compare(x3, methods = cmp_methods, layout = "overlay")
```
For a numeric summary, `ggcpt_compare_table()` returns a tidy tibble of all
detected changepoints across methods:
```{r}
ggcpt_compare_table(x3, methods = cmp_methods)
```
## Parallel Execution
When many methods are being compared, `ggcpt_compare()` respects the
`future::plan()` parallelisation strategy if the `future` and `future.apply`
packages are available. Detection is fanned out over the requested methods;
supplying a `seed` makes the parallel run reproducible via parallel-safe
L'Ecuyer-CMRG streams:
```{r, eval = FALSE}
future::plan(future::multisession, workers = 2)
ggcpt_compare(x, methods = c("pelt", "binseg", "fpop", "wbs", "not"),
              seed = 1)
future::plan(future::sequential)  # restore sequential execution afterwards
```
# Accuracy Evaluation
When ground-truth changepoint locations are known---either from synthetic data
or from a labelled data set---ggchangepoint provides a comprehensive suite of
accuracy metrics through `cpt_metrics()`:
- **Precision and recall** based on a margin of tolerance (default 5).
- **F1 score**, the harmonic mean of precision and recall.
- **Covering metric**, the length-weighted average, over true segments, of the
best Jaccard overlap with any predicted segment [@van2020evaluation].
- **Hausdorff distance** between the sets of predicted and true changepoint
locations.
- **Adjusted Rand index**, measuring agreement between the induced segment
labellings [@hariz2007classification].
- **Annotation error**, the absolute difference between the predicted and true
number of changepoints [@van2020evaluation].
- **Mean absolute error (MAE) and root mean squared error (RMSE)** of
changepoint locations.
```{r}
cpt_metrics(pred = c(100, 200), truth = c(100, 200), n = 300)
```
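The covering metric can be reproduced by hand. The base R sketch below uses the
hypothetical helpers `segments_from_cps()` and `covering_demo()` and assumes
that changepoints are given in the left-segment convention and lie strictly
inside the series; `cpt_metrics()` may handle edge cases differently. For each
true segment $A$ it takes the best Jaccard overlap with any predicted segment
and weights it by $|A| / n$, so identical segmentations score 1 and a shifted
prediction scores slightly less:

```{r}
# Hypothetical helper: turn changepoints (last index of each segment) into a
# list of index vectors, one per segment.
segments_from_cps <- function(cps, n) {
  cps    <- sort(as.integer(cps))
  starts <- c(1L, cps + 1L)
  ends   <- c(cps, as.integer(n))
  Map(seq.int, starts, ends)
}

# Hypothetical helper: length-weighted best Jaccard overlap per true segment.
covering_demo <- function(truth, pred, n) {
  true_segs <- segments_from_cps(truth, n)
  pred_segs <- segments_from_cps(pred, n)
  jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
  best <- vapply(true_segs, function(a) {
    max(vapply(pred_segs, function(b) jaccard(a, b), numeric(1)))
  }, numeric(1))
  sum(lengths(true_segs) * best) / n
}

covering_demo(truth = c(100, 200), pred = c(100, 200), n = 300)
covering_demo(truth = c(100, 200), pred = c(105, 205), n = 300)
```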
With a tolerance margin, detections within the margin are considered correct:
```{r}
cpt_metrics(pred = c(105, 205), truth = c(100, 200), n = 300, margin = 10)
```
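The following is a minimal sketch of margin-based scoring, assuming a greedy
one-to-one matching in which each prediction can satisfy at most one true
changepoint. The helper `prf_demo()` is hypothetical, and the matching rule
used by `cpt_metrics()` may differ in detail. The last call shows why
one-to-one matching matters: two detections near the same true changepoint
yield one true positive and one false positive.

```{r}
# Hypothetical helper: precision, recall, and F1 under a tolerance margin.
prf_demo <- function(pred, truth, margin = 5) {
  unused <- rep(TRUE, length(pred))  # predictions not yet matched
  tp <- 0L
  for (tr in truth) {
    d <- abs(pred - tr)
    d[!unused] <- Inf                # a prediction can be matched only once
    if (length(d) > 0 && min(d) <= margin) {
      unused[which.min(d)] <- FALSE
      tp <- tp + 1L
    }
  }
  precision <- if (length(pred) > 0) tp / length(pred) else NA_real_
  recall    <- if (length(truth) > 0) tp / length(truth) else NA_real_
  f1 <- if (isTRUE(precision + recall > 0)) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }
  c(precision = precision, recall = recall, f1 = f1)
}

prf_demo(pred = c(105, 205), truth = c(100, 200), margin = 10)
prf_demo(pred = c(105, 205), truth = c(100, 200), margin = 2)
prf_demo(pred = c(98, 102, 205), truth = c(100, 200), margin = 5)
```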
For scenarios with multiple ground-truth annotations, `cpt_metrics_annotated()`
computes metrics against each annotator and averages:
```{r}
cpt_metrics_annotated(pred = c(100, 200),
                      annotations = list(c(100, 200), c(105, 198)),
                      n = 300)
```
The `ggcpt_eval()` function provides a visual evaluation plot, showing true
and predicted changepoints colour-coded by match status:
```{r}
ggcpt_eval(pred = c(100, 200), truth = c(100, 200), data_vec = x3)
```
# Data Simulation
Reproducible synthetic data is essential for benchmarking detection algorithms.
ggchangepoint provides `cpt_simulate()` (and its shorthand `rcpt()`) for
generating time series with known changepoint locations across a range of
scenarios.
## Flexible Simulation
The simulator supports changes in mean, variance, both, or slope, with four
noise models---Gaussian, Student-t, AR(1), and random walk:
```{r}
seg_params <- list(
  list(mean = 0, sd = 1),
  list(mean = 10, sd = 1),
  list(mean = 5, sd = 0.5),
  list(mean = -2, sd = 1)
)
dat <- cpt_simulate(200, changepoints = c(50, 100, 150),
                    change_in = "meanvar",
                    params = seg_params)
```
The true changepoint locations are stored as an attribute:
```{r}
attr(dat, "true_changepoints")
```
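The relationship between changepoints and segments can be sketched in base R.
Assuming the left-segment convention (each changepoint is the last index of its
segment), the noise-free mean signal for the configuration above is obtained by
repeating each segment mean for the length of its segment. The `*_demo` names
are hypothetical, and this is not the implementation of `cpt_simulate()`:

```{r}
n_demo     <- 200L
cps_demo   <- c(50L, 100L, 150L)
means_demo <- c(0, 10, 5, -2)
sds_demo   <- c(1, 1, 0.5, 1)

# Segment lengths follow from consecutive differences of the boundaries.
seg_lengths <- diff(c(0L, cps_demo, n_demo))

# Per-observation mean and standard deviation.
mean_signal <- rep(means_demo, times = seg_lengths)
sd_signal   <- rep(sds_demo, times = seg_lengths)
stopifnot(length(mean_signal) == n_demo)

# A Gaussian realisation could then be drawn with
# rnorm(n_demo, mean_signal, sd_signal).

# The mean jumps immediately after each changepoint index.
which(diff(mean_signal) != 0)
```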
## Canonical Test Signals
The package also includes five canonical test signals adapted from the wavelet
and changepoint literature [@donoho1994ideal]:
```{r}
blocks <- signal_blocks(512)
fms <- signal_fms(512)
mix <- signal_mix(512)
teeth <- signal_teeth(512)
stairs <- signal_stairs(512)
```
Each signal has known changepoint locations and is suitable for benchmarking
detection accuracy across different signal structures.
# Case Study: Comparative Evaluation on a Block Signal
We illustrate a complete workflow---simulation, detection, comparison, and
evaluation---using the Blocks test signal with added Gaussian noise:
```{r}
set.seed(1)
sig <- signal_blocks(512)
truth <- attr(sig, "true_changepoints")
x_noisy <- sig$value + rnorm(512, 0, 0.5)
```
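Detect changepoints with the core **changepoint** methods plus whichever
optional engines are installed in this build, score each against the known
truth with a tolerance margin of 5, and collect the results into a single
table. AMOC reports at most one changepoint, so it acts as a low-recall
baseline on this multi-change signal: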
```{r}
methods_cs <- c("pelt", "binseg", "amoc")
if (has_fpop) methods_cs <- c(methods_cs, "fpop")
if (has_wbs) methods_cs <- c(methods_cs, "wbs")
if (has_not) methods_cs <- c(methods_cs, "not")
metrics <- do.call(rbind, lapply(methods_cs, function(m) {
  res <- cpt_detect(x_noisy, method = m, change_in = "mean")
  pred <- generics::tidy(res)$cp
  data.frame(method = m, cpt_metrics(pred, truth, n = 512, margin = 5))
}))
metrics[, c("method", "n_pred", "precision", "recall", "f1", "covering")]
```
Visual evaluation of the PELT result, with the $\pm 5$ tolerance windows shaded
and predictions coloured by match status:
```{r}
pred_pelt <- generics::tidy(cpt_detect(x_noisy, method = "pelt", change_in = "mean"))$cp
ggcpt_eval(pred = pred_pelt, truth = truth, data_vec = x_noisy)
```
# Summary and Future Work
ggchangepoint provides a unified, tidy interface to the diverse changepoint
detection ecosystem in R. By standardising on a single result class, adopting
`broom` conventions, and integrating natively with `ggplot2`, the package
reduces the friction of exploratory changepoint analysis and facilitates
reproducible method comparison.
Planned directions for future development include:
1. **Additional wrapper coverage**: integration of further detection packages
such as **strucchange**, **segmented**, and **bcp**.
2. **Online detection**: support for streaming and sequential changepoint
detection.
3. **Model selection helpers**: visual tools for penalty selection, including
elbow plots and cross-validated loss curves.
4. **Multivariate and high-dimensional methods**: improved handling of
multivariate changepoint detection, leveraging ecp's existing
multi-dimensional support.
5. **Interactive visualisation**: integration with interactive plotting
frameworks for exploratory data analysis.
Contributions and bug reports are welcome at the package's GitHub repository
(https://github.com/PursuitOfDataScience/ggchangepoint).
# References