---
title: "Get started with pipeflow"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 4
description: >
  Start here if this is your first time using pipeflow.
vignette: >
  %\VignetteIndexEntry{Get started with pipeflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r knitr-setup, include = FALSE}
knitr::opts_chunk$set(
  comment = "#",
  prompt = FALSE,
  tidy = FALSE,
  cache = FALSE,
  collapse = TRUE
)

old <- options(width = 100L)
```

## A simple example to get started

In this example, we'll use base R's `airquality` dataset.

```{r show-airquality}
head(airquality)
```

Our goal is to create an analysis pipeline that performs the following steps:

* add a new data column `Temp.Celsius` containing the temperature in degrees Celsius
* fit a linear model to the data
* plot the data and the model fit.

In the following, we'll show how to define and run the pipeline, how to inspect the output of specific steps, and finally how to re-run the pipeline with different parameter settings, which is one of the selling points of using such a pipeline.

### Pipeline building

For easier understanding, we go step by step. First, we create a new pipeline with the name "my-pip" and add a `data` step that provides the input dataset.

```{r define-pipeline}
library(pipeflow)

pip <- pip_new("my-pip")

pip <- pip_add(pip,
  step = "data",
  fun = function(data = airquality) data
)
```

For each step we add, we specify at minimum the name of the step and a function that defines what is computed in that step. Let's take a first look at the pipeline.

```{r show-initial-pipeline}
pip
```

Here, each step is represented by one row in the table, with the step name given in the first column. The `depends` column lists the dependencies of a step, which is empty for the `data` step since it does not depend on any other step (more on dependencies later).
The `out` column will eventually contain the output of the step, which is currently `NULL` since we haven't run the pipeline yet, and the `state` column shows the current state of the step, which initially is `new` for all steps.

Next, we add a step called `data_prep`, which consists of a function that takes the output of the `data` step as its first argument, adds a new column and returns the modified data as its output. To refer to the output of an earlier pipeline step, we just write the name of the step preceded by the tilde (`~`) operator, that is, `~data` in this case. Since `pip_add` works "by reference", that is, it modifies `pip` in place, we can add the step without reassigning the result:

```{r define-data-prep-step}
pip |>
  pip_add(
    "data_prep",
    function(x = ~data) {
      # Convert Fahrenheit to Celsius and store it in a new column
      replace(x, "Temp.Celsius", (x[, "Temp"] - 32) * 5 / 9)
    }
  )
```

So, a second step called `data_prep` was added, and it depends on the `data` step, as is now visible in the `depends` column.

Next, we want to add a step called `model_fit` that fits a linear model to the data. The function takes the output of the `data_prep` step and defines a parameter `xVar`, which specifies the variable used as the predictor in the linear model.

```{r}
pip |>
  pip_add(
    "model_fit",
    function(
      data = ~data_prep,
      xVar = "Temp.Celsius"
    ) {
      lm(paste("Ozone ~", xVar), data = data)
    }
  )

pip
```

Lastly, we add a step called `model_plot`, which plots the data and the linear model fit. The function uses the output from both the `model_fit` and `data_prep` steps. It also defines the parameters `xVar`, `xLab` (the x-axis label), and `title` (the title of the plot).

```{r}
pip |>
  pip_add(
    "model_plot",
    function(
      model = ~model_fit,
      data = ~data_prep,
      xVar = "Temp.Celsius",
      xLab = "Temperature in degrees Celsius",
      title = "Linear model fit"
    ) {
      require(ggplot2, quietly = TRUE)
      coeffs <- coefficients(model)
      ggplot(data) +
        geom_point(aes(.data[[xVar]], .data[["Ozone"]])) +
        geom_abline(intercept = coeffs[1], slope = coeffs[2]) +
        labs(title = title, x = xLab)
    }
  )

pip
```

In the last row of the table, we see that the `model_plot` step depends on both the `model_fit` and `data_prep` steps. In addition to the tabular output, {pipeflow} also provides a graphical representation that is compatible with the `visNetwork` package. In particular, the `pip_get_graph()` function returns a list of arguments that can be fed directly to `visNetwork::visNetwork()`.

```{r, eval = FALSE}
library(visNetwork)
do.call(visNetwork, args = pip_get_graph(pip)) |>
  visHierarchicalLayout(direction = "LR")
```

```{r, echo = FALSE}
library(visNetwork)
do.call(
  visNetwork,
  args = c(pip_get_graph(pip), list(height = 100, width = 600))
) |>
  visHierarchicalLayout(direction = "LR")
```

Here, the pipeline is visualized as a directed acyclic graph (DAG) where the nodes represent the steps and the edges represent the dependencies.

### Pipeline integrity

A key feature of {pipeflow} is that the integrity of a pipeline is verified at definition time. To see this, let's try to add another step that refers to a non-existent step `foo` as its input.

```{r try-add-bad-step, error = TRUE}
pip |>
  pip_add(
    "another_step",
    function(data = ~foo) {
      data
    }
  )
```

{pipeflow} immediately signals an error and the pipeline remains unchanged.

```{r}
pip
```

### Pipeline run and output

To run the pipeline, we simply call `pip_run()`, which produces the following output:

```{r run-pipeline}
pip_run(pip)
```

Let's inspect the pipeline again.

```{r pipeline-after-run}
pip
```

```{r, echo = FALSE}
do.call(
  visNetwork,
  args = c(pip_get_graph(pip), list(height = 100, width = 600))
) |>
  visHierarchicalLayout(direction = "LR")
```

We can see that the `state` of all steps has changed from `new` to `done`, which is represented graphically by the color change from blue to green. In addition, the output was added in the `out` column.

To access a specific entry of the pipeline, we select the row (that is, the step) and the column of the pipeline table via the `[[` operator. For example, to inspect the output (`out`) of the `model_fit` and `model_plot` steps, we do:

```{r inspect-lm, message = FALSE}
pip[["model_fit", "out"]]
```

```{r inspect-plot, message = FALSE, warning = FALSE, fig.alt = "model-plot"}
pip[["model_plot", "out"]]
```

### Pipeline parameters

Even for a moderately complex analysis consisting of, say, 15 to 20 different functions, keeping track of all the different analysis parameters can quickly get out of hand. As we will see, with {pipeflow} this becomes much easier, since the pipeline itself keeps track of all parameters and their values. Let's first inspect the parameters of the pipeline defined above using the `pip_get_params()` function.

```{r inspect-params}
pip_get_params(pip) |> str()
```

It returns a list of all *independent* parameters (here `data`, `xVar`, `xLab`, and `title`). By *independent* we mean that these parameters don't depend on other steps (i.e., they are not defined with the `~` operator). This is important, as you never want to mess with parameters defined in terms of other steps. Furthermore, each parameter is only listed once, even if it is used in multiple steps^[For example, the `xVar` parameter is used in both the `model_fit` and `model_plot` steps.].
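
To make the distinction between step dependencies and independent parameters more concrete, here is a minimal base R sketch. It is only an illustration and not how {pipeflow} is implemented: the helper `split_args()` is hypothetical, and it assumes that every argument of the step function has a default value.

```{r sketch-independent-params}
# Hypothetical helper (not part of pipeflow): split the arguments of a step
# function into dependencies (defaults written as `~step`) and independent
# parameters (all other defaults). Assumes every argument has a default.
split_args <- function(fun) {
  args <- as.list(formals(fun))
  is_dep <- vapply(
    args,
    function(a) is.call(a) && identical(a[[1]], as.name("~")),
    logical(1)
  )
  list(
    depends = vapply(args[is_dep], function(a) as.character(a[[2]]), character(1)),
    params = names(args)[!is_dep]
  )
}

# Stand-ins that mirror the argument lists of the steps defined above
steps <- list(
  model_fit = function(data = ~data_prep, xVar = "Temp.Celsius") NULL,
  model_plot = function(
    model = ~model_fit,
    data = ~data_prep,
    xVar = "Temp.Celsius",
    xLab = "Temperature in degrees Celsius",
    title = "Linear model fit"
  ) NULL
)

split_args(steps$model_fit)

# Each independent parameter is listed once, even if several steps use it
unique(unlist(lapply(steps, function(f) split_args(f)$params)))
```

For `model_fit`, the sketch separates the `data` argument, which is a dependency on `data_prep`, from the independent parameter `xVar`. The last call collects the independent parameters over both steps and removes duplicates, so `xVar` appears only once.
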

To change any independent parameter, we simply call `pip_set_params()`:

```{r set-xVar}
pip |>
  pip_set_params(list(xVar = "Solar.R", xLab = "Solar radiation in Langleys"))

pip_get_params(pip) |> str()
```

{pipeflow} automatically propagates the parameter change to all steps that use the respective parameter. In addition, it recognizes which steps are affected by the parameter change and marks them as `outdated`.

```{r show-pipeline-with-outdated-step}
pip
```

```{r, echo = FALSE}
library(visNetwork)
do.call(
  visNetwork,
  args = c(pip_get_graph(pip), list(height = 100, width = 600))
) |>
  visHierarchicalLayout(direction = "LR")
```

We can see that the `model_fit` and `model_plot` steps are now in state `outdated` (graphically indicated by the orange color). To update the results, we just run the pipeline again.

```{r run-pipeline-again}
pip_run(pip)
```

The outdated steps were re-run as expected, and the output was updated accordingly, now showing the new x-variable `Solar.R`.

```{r inspect-plot-again, message = FALSE, warning = FALSE, fig.alt = "model-plot"}
pip[["model_plot", "out"]]
```

A closer look at the run log shows that the pipeline skipped the first two steps and ran only the steps that were outdated, which can be thought of as caching, or as mimicking the behavior of `make` in software development. That is, {pipeflow} always keeps track of which steps are outdated and only re-runs those steps and their downstream dependencies, which can be a huge time saver for larger pipelines^[Another use case is backend computation in interactive shiny applications, where users change parameters dynamically and want quick updates.].

Let's visit some more examples of parameter changes and their effects on the pipeline. To just change the title of the plot, only the `model_plot` step needs to be rerun.

```{r set-title}
pip |>
  pip_set_params(list(title = "Some new title"))

pip
```

```{r inspect-plot-after-title-change, message = FALSE, warning = FALSE, fig.alt = "model-plot"}
pip_run(pip)
pip[["model_plot", "out"]]
```

If we change the input data parameter of the `data` step, we expect all steps to be rerun, since all other steps depend on it.

```{r}
small_airquality <- airquality[1:10, ]

pip |>
  pip_set_params(list(data = small_airquality))

pip
```

```{r inspect-plot-after-data-change, message = FALSE, warning = FALSE, fig.alt = "model-plot"}
pip_run(pip)
pip[["model_plot", "out"]]
```

Last but not least, let's try to set parameters that don't exist in the pipeline, which usually happens because of accidental misspellings.

```{r set-unknown-parameters, warning = TRUE}
pip |>
  pip_set_params(list(titel = "misspelled variable name", foo = "my foo"))
```

As you can see, a warning is given to the user hinting at the respective parameter names, which makes fixing any misspellings straightforward.

Next, let's see how to [modify the pipeline](v02-modify-pipeline.html).

```{r, include = FALSE}
options(old)
```