---
title: "A summarised result"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{summarised_result}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, message = FALSE, warning = FALSE,
  comment = "#>"
)
```

# Introduction

A summarised result is a table that contains aggregated summary statistics (a result set with no patient-level data). The summarised result object consists of two objects: a **results table** and a **settings table**.

### Results table

This table has 13 columns:

- `result_id` (1), used to identify a group of results with common settings (see settings below).
- `cdm_name` (2), used to identify the name of the cdm object used to obtain those results.
- `group_name` (3) - `group_level` (4), these columns work together as a *name-level* pair. A *name-level* pair is two columns that work together to summarise information from multiple other columns. The *name* column contains the column names separated by `&&&` and the *level* column contains the column values separated by `&&&`. Elements in the *name* column must be snake_case. Usually, group aggregation is used to show high-level aggregations: e.g. cohort name or codelist name.
- `strata_name` (5) - `strata_level` (6), these columns work together as a *name-level* pair. Usually strata aggregation is used to show stratifications of the results: e.g. age groups or sex.
- `variable_name` (7), name of the variable of interest.
- `variable_level` (8), level of the variable of interest, it is usually a subclass of the variable_name.
- `estimate_name` (9), name of the estimate.
- `estimate_type` (10), type of the value displayed, the supported types are: `r omopgenerics::estimateTypeChoices()`. 
- `estimate_value` (11), value of interest.
- `additional_name` (12) - `additional_level` (13), these columns work together as a *name-level* pair. Usually additional aggregation is used to include the aggregations that did not fit in the group/strata definition.

The following table summarises the requirements of each column in the summarised_result format:

```{r, echo=FALSE}
dplyr::tibble(
  `Column name` = omopgenerics::resultColumns(),
  `Column type` = c("integer", rep("character", 12)),
  `is NA allowed?` = c(rep("No", 7), "Yes", rep("No", 5)),
  `Requirements` = c(NA, NA, "name1", "level1", "name2", "level2", NA, NA, "snake_case", "estimateTypeChoices()", NA, "name3", "level3")
) |>
  gt::gt()
```

### Settings

The settings table provides one row per `result_id` with the settings used to generate those results. There is no limit on the number of columns or parameters that can be provided per result_id, but at least three values should be provided:

- `result_type` (1): identifies the type of result provided. We would usually use the name of the function that generated that set of results in snake_case. For example, if the function that generates the summarised result is named *summariseMyCustomData*, then the result_type would be: *summarise_my_custom_data*.
- `package_name` (2): name of the package that generated the result type.
- `package_version` (3): version of the package that generated the result type.

These columns must be character vectors, but this restriction does not apply to other extra columns.

### newSummarisedResult

The `newSummarisedResult()` function can be used to create *<summarised_result>* objects. The inputs to this function are the summarised_result table, which must satisfy the conditions specified above, and the settings argument. The settings argument can be NULL or omit some required columns; missing columns will be populated by default and a warning will appear. Let's see a very simple example:

```{r}
library(omopgenerics)
library(dplyr)

x <- tibble(
  result_id = 1L,
  cdm_name = "my_cdm",
  group_name = "cohort_name",
  group_level = "cohort1",
  strata_name = "sex",
  strata_level = "male",
  variable_name = "Age group",
  variable_level = "10 to 50",
  estimate_name = "count",
  estimate_type = "numeric",
  estimate_value = "5",
  additional_name = "overall",
  additional_level = "overall"
)

result <- newSummarisedResult(x)
result |>
  glimpse()
settings(result)
```

We can also associate settings with our results. These will typically be used to explain how the result was created.

```{r}
result <- newSummarisedResult(
  x = x,
  settings = tibble(
    result_id = 1L,
    package_name = "PatientProfiles",
    study = "my_characterisation_study"
  )
)

result |> glimpse()
settings(result)
```


# Combining summarised results

Multiple summarised result objects can be combined using the bind function. Result IDs will be assigned for each set of results with the same settings. If two groups of results have the same settings, although they are in different objects, they will be merged into a single one.

```{r}
result1 <- newSummarisedResult(
  x = tibble(
    result_id = 1L,
    cdm_name = "my_cdm",
    group_name = "cohort_name",
    group_level = "cohort1",
    strata_name = "sex",
    strata_level = "male",
    variable_name = "Age group",
    variable_level = "10 to 50",
    estimate_name = "count",
    estimate_type = "numeric",
    estimate_value = "5",
    additional_name = "overall",
    additional_level = "overall"
  ),
  settings = tibble(
    result_id = 1L,
    package_name = "PatientProfiles",
    package_version = "1.0.0",
    study = "my_characterisation_study",
    result_type = "stratified_by_age_group"
  )
)

result2 <- newSummarisedResult(
  x = tibble(
    result_id = 1L,
    cdm_name = "my_cdm",
    group_name = "overall",
    group_level = "overall",
    strata_name = "overall",
    strata_level = "overall",
    variable_name = "overall",
    variable_level = "overall",
    estimate_name = "count",
    estimate_type = "numeric",
    estimate_value = "55",
    additional_name = "overall",
    additional_level = "overall"
  ),
  settings = tibble(
    result_id = 1L,
    package_name = "PatientProfiles",
    package_version = "1.0.0",
    study = "my_characterisation_study",
    result_type = "overall_analysis"
  )
)
```

Now that we have our results, we can combine them using bind. Because the two sets of results contain the same result ID, this will be automatically updated when the results are combined.

```{r}
result <- bind(result1, result2)
result |>
  dplyr::glimpse()
settings(result)
```

# Minimum cell count suppression

We have an entire vignette explaining how the summarised_result object is suppressed: [`vignette("suppression", "omopgenerics")`](https://darwin-eu.github.io/omopgenerics/articles/suppression.html).

# Export and import summarised results

The summarised_result object can be exported and imported as a CSV file with the following functions:

- **importSummarisedResult()**

- **exportSummarisedResult()**

Note that `exportSummarisedResult()` also suppresses the results.

```{r}
x <- tempdir()
files <- list.files(x)

exportSummarisedResult(result, path = x, fileName = "result.csv")
setdiff(list.files(x), files)
```

Note that the settings are included in the CSV file:

```{r, echo=FALSE}
fil <- file.path(x, "result.csv")
readLines(fil) |>
  cat()
```

You can later import the results back with `importSummarisedResult()`:

```{r}
res <- importSummarisedResult(path = file.path(x, "result.csv"))
class(res)
res |>
  glimpse()
res |>
  settings()
```

# Tidy a `<summarised_result>`

## Tidy method
`omopgenerics` defines a `tidy()` method for `<summarised_result>` objects. This function:

### 1. Split *group*, *strata*, and *additional* pairs into separate columns:

The `<summarised_result>` object has the following pair columns: group_name-group_level, strata_name-strata_level, and additional_name-additional_level. These pairs use the `&&&` separator to combine multiple fields. For example, if you want to combine cohort_name and age_group in the group_name-group_level pair: `group_name = "cohort_name &&& age_group"` and `group_level = "my_cohort &&& <40"`. By default, if no aggregation is produced in the group_name-group_level pair: `group_name = "overall"` and `group_level = "overall"`.

**ORIGINAL FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  group_name = c("cohort_name", c("cohort_name &&& sex"), c("sex &&& age_group")),
  group_level = c("acetaminophen", c("acetaminophen &&& Female"), c("Male &&& <40"))
) |>
  gt::gt()
```

The tidy format puts each value into its own column. This makes it easier to manipulate, but the output is no longer standardised, as each `<summarised_result>` object will have a different number and set of column names. Missing values will be filled with the "overall" label.

**TIDY FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  group_name = c("cohort_name", c("cohort_name &&& sex"), c("sex &&& age_group")),
  group_level = c("acetaminophen", c("acetaminophen &&& Female"), c("Male &&& <40"))
) |>
  splitGroup() |>
  gt::gt()
```

### 2. Add settings of the `<summarised_result>` object as columns:

Each `<summarised_result>` object has a settings attribute that relates the 'result_id' column to each different set of settings. The columns 'result_type', 'package_name' and 'package_version' are always present in settings, but we may also have extra parameters depending on how the object was created. In the `<summarised_result>` format, we need to use the `settings()` function to see those variables:

**ORIGINAL FORMAT:**

`settings`:
```{r, echo = FALSE}
dplyr::tibble(
  result_id = c(1L, 2L),
  my_setting = c(TRUE, FALSE),
  package_name = "omopgenerics"
) |>
  gt::gt()
```

`<summarised_result>`:
```{r, echo = FALSE}
dplyr::tibble(
  result_id = c("1", "...", "2", "..."),
  cdm_name = c("omop", "...", "omop", "..."),
  " " = c("..."),
  additional_name = c("overall", "...", "overall", "...")
) |>
  gt::gt()
```

In the tidy format, we add the settings as columns, so their values are repeated multiple times (there is only one row per result_id in settings, whereas there can be multiple rows in the `<summarised_result>` object). The column 'result_id' is removed because it no longer provides information. Again, we lose standardisation (multiple different settings), but we gain flexibility:

**TIDY FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  cdm_name = c("omop", "...", "omop", "..."),
  " " = c("..."),
  additional_name = c("overall", "...", "overall", "..."),
  my_setting = c("TRUE", "...", "FALSE", "..."),
  package_name = c("omopgenerics", "...", "omopgenerics", "...")
) |>
  gt::gt()
```

### 3. Pivot estimates as columns:

In the `<summarised_result>` format estimates are displayed in 3 columns: 

- 'estimate_name' indicates the name of the estimate.
- 'estimate_type' indicates the type of the estimate (as all of them will be cast to character). Possible values are: *`r omopgenerics::estimateTypeChoices()`*.
- 'estimate_value' value of the estimate as `<character>`.

**ORIGINAL FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  variable_name = c("number individuals", "age", "age"),
  estimate_name = c("count", "mean", "sd"),
  estimate_type = c("integer", "numeric", "numeric"),
  estimate_value = c("100", "50.3", "20.7")
) |>
  gt::gt()
```

In the tidy format, we pivot the estimates, creating a new column for each 'estimate_name' value. The columns will be cast to 'estimate_type'. If there are multiple estimate_type values for the same estimate_name, they will not be cast and will be displayed as character values (a warning will be thrown). Missing data are populated with NAs.

**TIDY FORMAT:**
```{r, echo = FALSE}
dplyr::tibble(
  variable_name = c("number individuals", "age"),
  count = c(100L, NA),
  mean = c(NA, 50.3),
  sd = c(NA, 20.7)
) |>
  gt::gt()
```

### Example

Let's see a simple example with some toy data:

```{r}
result |>
  tidy()
```

## Split 

The split functions are provided independently:

- `splitGroup()` only splits the pair group_name-group_level columns.
- `splitStrata()` only splits the pair strata_name-strata_level columns.
- `splitAdditional()` only splits the pair additional_name-additional_level columns.

There is also the function:
- `splitAll()` that splits any x_name-x_level pair found in the data.

```{r}
splitAll(result)
```

## Pivot estimates

`pivotEstimates()` can be used to pivot the variables that we are interested in.

The argument `pivotEstimatesBy` specifies which variables we want to use to pivot by. There are four options:

- `NULL/character()` to not pivot anything.
- `c("estimate_name")` to pivot only estimate_name.
- `c("variable_level", "estimate_name")` to pivot estimate_name and variable_level.
- `c("variable_name", "variable_level", "estimate_name")` to pivot estimate_name, variable_level and variable_name.

Note that `variable_level` can contain NA values, these will be ignored on the naming part.

```{r}
pivotEstimates(
  result,
  pivotEstimatesBy = c("variable_name", "variable_level", "estimate_name")
)
```

## Add settings

`addSettings()` is used to add the settings that we want as new columns to our `<summarised_result>` object. 

The `settingsColumn` argument is used to choose which settings we want to add.


```{r}
addSettings(
  result,
  settingsColumn = "result_type"
)
```

## Filter

Dealing with an `<summarised_result>` object can be difficult, especially when we are trying to filter. For example, it can be difficult to filter to a certain result_type or, when many strata are joined together, to filter only one of the variables. The `tidy` format makes filtering easier, but using it means losing the `<summarised_result>` object.

The **omopgenerics** package contains functions that help with this process:

- `filterSettings` to filter the `<summarised_result>` object using the `settings()` attribute.
- `filterGroup` to filter the `<summarised_result>` object using the group_name-group_level tidy columns.
- `filterStrata` to filter the `<summarised_result>` object using the strata_name-strata_level tidy columns.
- `filterAdditional` to filter the `<summarised_result>` object using the additional_name-additional_level tidy columns.

For instance, let's filter `result` so it only has results for males:

```{r}
result |>
  filterStrata(sex == "male")
```

Now let's see an example using the information in settings to filter the result. In this case, we only want results from the "overall_analysis". Since this information is in the result_type column in settings, we proceed as follows:

```{r}
result |>
  filterSettings(result_type == "overall_analysis")
```

# Utility functions for `<summarised_result>`

## Column retrieval functions

Working with `<summarised_result>` objects often involves managing columns for **settings**, **grouping**, **strata**, and **additional** levels. These retrieval functions help you identify and manage columns:

- `settingsColumns()` gives you the setting names that are available in a `<summarised_result>` object.
- `groupColumns()` gives you the new columns that will be generated when splitting group_name-group_level pair into different columns.
- `strataColumns()` gives you the new columns that will be generated when splitting strata_name-strata_level pair into different columns.
- `additionalColumns()` gives you the new columns that will be generated when splitting additional_name-additional_level pair into different columns.
- `tidyColumns()` gives you the columns that the object will have if you tidy it (`tidy(result)`). This function is very useful for knowing which columns can be included in **plot** and **table** functions.

Let's see the different values with our example result data:

```{r}
settingsColumns(result)
groupColumns(result)
strataColumns(result)
additionalColumns(result)
tidyColumns(result)
```


## Unite functions

The unite functions serve as the complementary tools to the split functions, allowing you to generate name-level pair columns from targeted columns within a `<dataframe>`.

There are three `unite` functions that allow you to create group, strata, and additional name-level columns from specified sets of columns:

- `uniteAdditional()`

- `uniteGroup()`

- `uniteStrata()`

### Example

For example, to create group_name and group_level columns from a tibble, you can use:

```{r}
# Create and show mock data
data <- tibble(
  denominator_cohort_name = c("general_population", "older_than_60", "younger_than_60"),
  outcome_cohort_name = c("stroke", "stroke", "stroke")
)
head(data)

# Unite into group name-level columns
data |>
  uniteGroup(cols = c("denominator_cohort_name", "outcome_cohort_name"))
```

These functions can be helpful when creating your own `<summarised_result>`.