---
title: "Tree Species Codings in ForestElementsR"
author: "Peter Biber"
output: 
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
    toc_float: true
bibliography: REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{Tree Species Codings in ForestElementsR}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## 1. Introduction

The way tree species are coded in forest data varies widely among
research institutions, forest administrations, and similar
organizations. To make the package *ForestElementsR* broadly
applicable, it requires a generic coding system that can cover any
specific species coding system and allows translating from one into the
other. Contrary to what one might expect, this is not a trivial task,
as most existing codings include codes not only for single species, but
also for species groups. These groups are rarely the same across
different codings, which a useful generic coding system must be able to
handle. In addition, such a generic approach must remain open to any
desired additional species and specific codings.

## 2. Where to find things and what they are good for

Before I show how to actually work with species codings in
*ForestElementsR*, I will talk about where to find all implemented
codings in the package. For the code examples below to work, you will
need to attach *ForestElementsR* itself, and the packages *tibble*,
*dplyr*, and *ggplot2* from the
[*tidyverse*](https://tidyverse.org/) which make handling and output
more convenient.

```{r setup, message = FALSE}
library(ForestElementsR)
library(tibble)
library(dplyr)
library(ggplot2)
```

### 2.1 The species master table {#species_master_table}

The *data.frame* (actually a *tibble*) *species_master_table* is the
most important part of the generic species coding system. Any single
species to be included in any specific coding must be listed
here, as the species master table serves as the common reference for all
implemented codings. Conversely, specific species codings do not need to
comprise all species provided in the species master table. In order to
view this table, it is only necessary to type its name:

```{r view_species_master_table}
species_master_table

# Also show the tail of the table
species_master_table |> tail(10)
```

In contrast to specific codings (see below) the species master table
must contain single species only, i.e. each row represents a species,
never a group of species. Currently, it comprises
`r nrow(species_master_table)` tree species. Let us have a look at the
table's anatomy:

The key fields of the species master table are *genus* and *species_no*.
Together, they must be unique. Both are of type character. *genus*
represents a species' genus name, *always in lower case letters*, and
*species_no* is always a three-digit number with leading zeroes. This
approach was chosen because of a few advantages: While genus names are
usually stable, species names may change more often. Therefore, the
species inside a genus are identified with a number instead of a name.
New species can be easily added without the danger of running out of
numbers and being thus forced to break the coding concept. For
convenience, the table also contains the column *deciduous_conifer*
which allows only for the two values *conif* and *decid*. This column is
not part of the actual species key, but it is intended for filtering
purposes, and for all relevant forest tree species, the distinction
between both groups should be biologically correct or at least
practical. The three remaining fields, *name_sci*, *name_eng*, and
*name_ger* contain the scientific, colloquial English, and colloquial
German names of all species.

### 2.2 Specific species codings {#species_specific_codings}

#### 2.2.1 General setup

All specific species codings implemented in *ForestElementsR* are stored
in the tibble *species_codings*:

```{r view_species_codings}
species_codings
```

Each row in this tibble represents a specific coding: the column
`species_coding` provides the coding's name, and the column `code_table`
holds a tibble of its own that defines the coding and links it to the
species master table. Currently, there are six codings implemented
(*master*, *tum_wwk_short*, *tum_wwk_long*, *ger_nfi_2012*,
*bavrn_state*, *bavrn_state_short*). We use the coding *tum_wwk_short* for 
explaining the implementation. This species coding is used for many purposes at 
the Chair of Forest Growth and Yield Science at the Technical University of
Munich. It comprises a small set of the most important tree species in
Central Europe only, while all other species are attributed to three
larger container groups. The coding table could be accessed by the
usual indexing of the tibble *species_codings*, but it is more
convenient to use the function *fe_species_get_coding_table*, which
needs to be called with the name of the desired coding:

```{r get_coding_table}
fe_species_get_coding_table("tum_wwk_short")
```

Clearly, this table closely resembles the species master table, as they
have in common the columns *genus*, *species_no*, *deciduous_conifer*,
*name_sci*, *name_eng*, and *name_ger*. Most importantly, however, there
is the additional column *species_id*. This column contains the actual
coding, and it is always of type character, even if the coding
consists exclusively of numbers. Such a coding table is not required to
comprise all species available in the species master table, but it must
not contain any species which is not included there. In other words, a
coding table is not allowed to contain any combination of *genus* and
*species_no* which is not contained in the species master table. The
species names, however, may differ from those in the master table in
order to allow e.g. for regional colloquial naming preferences or, more
importantly, for naming species groups which, by definition, do not
exist in the master table.
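
The rule that a coding table may only contain combinations of *genus*
and *species_no* known to the species master table can be verified with
an anti-join. This is only a sketch using *dplyr*; the name `unmatched`
is chosen for illustration, and for a valid coding such as
*tum_wwk_short* the result should be empty:

```{r check_coding_against_master}
library(ForestElementsR)
library(dplyr)

# Rows of the coding table without a partner in the species master table
unmatched <- fe_species_get_coding_table("tum_wwk_short") |>
  anti_join(species_master_table, by = c("genus", "species_no"))
unmatched
```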

Besides *species_id*, every coding table carries two more columns,
*level* and *is_tree*. The column *level* marks how fine or coarse a code
is: *0* is the finest level (a single species, or a group that is not
contained in any other group of the coding), and higher numbers denote
ever coarser groups that nest the finer ones. For a "flat" coding that
only distinguishes single species and non-overlapping groups, *level* is
*0* throughout. Codings that additionally provide nesting group codes are
called *hierarchical*; they are explained in
[Section 2.2.2](#hierarchical_codings). The column *is_tree* is *TRUE*
for all ordinary tree-species codes and *FALSE* for the rare codes that
denote a non-tree category such as a shrub; see
[Section 2.2.3](#non_tree_codes).

Let us view the coding in compact form:

```{r view_coding_compact}
fe_species_get_coding_table("tum_wwk_short") |>
  select(species_id, name_eng) |> # English names only for clarity
  distinct()
```

As this display shows, the coding distinguishes only ten codes (single
species or species groups). The names already suggest which
species_ids refer to single species and which to groups, but we should
use R to find this out unambiguously:

```{r view_species_numbers_in_coding}
fe_species_get_coding_table("tum_wwk_short") |>
  group_by(species_id, name_eng) |>
  summarise(n_species = n(), .groups = "drop") |> # drop grouping explicitly
  arrange(as.numeric(species_id)) # not required, but output is nicely sorted
```

Clearly, every species_id with n_species \> 1 actually represents a
group of tree species. Let us look at the smallest group (species_id 6),
which comprises only two species:

```{r view_quercus_group}
fe_species_get_coding_table("tum_wwk_short") |>
  select(species_id, genus, species_no, name_eng) |>
  filter(species_id == "6")
```

We see that the two species in this group are *quercus* *001* and
*quercus* *002*, but the colloquial species name in the coding table is
the group name only. In order to find out the species names, we can
obtain them from the species master table with the help of *genus* and
*species_no*:

```{r get_quercus_group_names}
species_master_table |>
  filter(genus == "quercus" & species_no %in% c("001", "002")) |>
  select(-deciduous_conifer)
```
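
The manual lookup above generalizes to a join on the two key columns,
which lists the master-table names of the members of every group code at
once. This is a sketch only; the object name `group_members` and the
renamed column `group_name` are introduced here for illustration:

```{r join_group_members}
library(ForestElementsR)
library(dplyr)

# Attach the master table's species names to every member of each code
group_members <- fe_species_get_coding_table("tum_wwk_short") |>
  select(species_id, group_name = name_eng, genus, species_no) |>
  left_join(
    species_master_table |> select(genus, species_no, name_sci, name_eng),
    by = c("genus", "species_no")
  )

# The oak group from above, now with the names of its member species
group_members |> filter(species_id == "6")
```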

#### 2.2.2 Species groups and hierarchical codings {#hierarchical_codings}

Almost every real-world coding distinguishes not only single species but
also *species groups*. In the simplest case, a coding is a *partition*:
each species belongs to exactly one code, and the codes never overlap.
The *tum_wwk_short* coding used above is such a partition - the group
codes 8, 9, and 10 are disjoint, and so are the single-species codes.

Some codings, however, need a species to appear **both** as its own code
**and** inside a coarser group of the *same* coding. The Bavarian state
coding *bavrn_state*, for instance, codes the pedunculate oak singly as
*54* and the sessile oak singly as *55*, but it also keeps the older
group code *70* ("oak") that comprises both. Such codings are called
*hierarchical*. To keep casting between codings well defined, the codes
of a coding must form a *laminar family*: the species sets of any two
codes are either disjoint or fully nested - partial overlaps are
forbidden. The column *level* records the nesting depth (0 = finest leaf,
higher = coarser group).

We can see this directly in the coding table of *bavrn_state*. The leaf
codes 54 and 55 sit at level 0, while the group code 70 that contains
them sits at level 1:

```{r view_hierarchy_levels}
fe_species_get_coding_table("bavrn_state") |>
  filter(species_id %in% c("54", "55", "70")) |>
  select(species_id, genus, species_no, name_eng, level)
```

When a species is *cast into* a hierarchical coding, it is always
resolved to the **finest** code that represents it. The pedunculate oak
(*quercus_001* in the master coding) therefore becomes the leaf code 54,
not the group code 70:

```{r finest_node_cast}
as_fe_species_bavrn_state(fe_species_master("quercus_001")) |> unclass()
```
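
The laminar-family condition described above can be illustrated with a
few lines of base R. The following is a toy sketch, not the package's
internal validation: the function `is_laminar()` and the code `g1` are
hypothetical, and the species sets are simplified stand-ins for a real
coding table.

```{r laminar_family_sketch}
# Toy species sets per code (hypothetical, loosely following the oak example)
code_sets <- list(
  "54" = c("quercus_001"),
  "55" = c("quercus_002"),
  "70" = c("quercus_001", "quercus_002"),
  "g1" = c("some_genus_001")
)

# TRUE if any two sets are either disjoint or one fully contains the other
is_laminar <- function(sets) {
  n <- length(sets)
  if (n < 2) {
    return(TRUE)
  }
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      common <- intersect(sets[[i]], sets[[j]])
      disjoint <- length(common) == 0
      nested <- setequal(common, sets[[i]]) || setequal(common, sets[[j]])
      if (!disjoint && !nested) {
        return(FALSE) # partial overlap found
      }
    }
  }
  TRUE
}

is_laminar(code_sets)

# A code that overlaps group 70 only partially breaks the laminar property
bad_sets <- c(code_sets, list("g2" = c("quercus_002", "some_genus_001")))
is_laminar(bad_sets)
```

In the second call, the hypothetical code `g2` shares one oak with group
70 without containing it or being contained in it, which is exactly the
kind of partial overlap that is forbidden.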



#### 2.2.3 Non-tree codes {#non_tree_codes}

A few codings contain legal codes that do not stand for a tree species
and that one cannot compute with - for example the *bavrn_state* code
*99* ("Strauch", shrub). Such codes are flagged with *is_tree = FALSE* in
the coding table. Whether a code is a tree code is *derived* from its
link to the species master table (a code that resolves to at least one
master species is a tree code), so there is no separate flag that could
fall out of sync.

Two exported helpers report this information. `fe_species_non_tree_codes()`
lists the non-tree codes of a coding, and `fe_species_is_tree()` tests,
element by element, whether the codes of an **fe_species** vector denote
tree species:

```{r non_tree_helpers}
fe_species_non_tree_codes("bavrn_state")

spec_ids <- fe_species_bavrn_state(c("10", "60", "99"))
fe_species_is_tree(spec_ids)
```

Constructing a species vector that contains a non-tree code is allowed
(the code is part of the coding), but objects that are meant to hold
computable trees - such as `fe_stand()` and its relatives - reject them
with a clear error. When a non-tree code is *cast* into another coding,
it resolves to `NA` (with a message), because it has no tree-species
equivalent.
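
If non-tree codes have to be removed before the data go into an object
that rejects them, the element-wise test can serve as a filter. This is
a sketch that assumes `fe_species_is_tree()` returns a plain logical
vector of the same length as its input, as the description above
suggests; the name `tree_only` is chosen for illustration:

```{r drop_non_tree_codes}
library(ForestElementsR)

spec_ids <- fe_species_bavrn_state(c("10", "60", "99"))

# Keep only the elements that denote tree species
tree_only <- spec_ids[fe_species_is_tree(spec_ids)]
tree_only |> unclass()
```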



#### 2.2.4 Implemented codings {#implemented_codings}

Six species codings are currently implemented. Their documentation is
available in the package and can be accessed with `?species_codings`,
but I also list them here:

-   **master:** This is the original species coding used by the package
    *ForestElementsR*. It contains each species from the
    *species_master_table* and no species groups. This coding
    corresponds directly to the species_master_table. Its species_ids
    are the master table's columns *genus* and *species_no* combined
    into one character string, separated by an underscore.

-   **tum_wwk_short:** This is one of two codings in use at the Chair of
    Forest Growth and Yield Science at the Technical University of
    Munich. It defines only a small set of single species explicitly
    (the most important ones in Central Europe), while all other species
    are attributed to a few large container groups.

-   **tum_wwk_long:** This is one of two codings in use at the Chair of
    Forest Growth and Yield Science at the Technical University of
    Munich. It defines a larger set of single species than the
    *tum_wwk_short* coding. This coding is *hierarchical* (see
    [Section 2.2.2](#hierarchical_codings)): besides the single-species
    (leaf) codes it also provides the coarser species-group codes that
    contain them, and it covers every species of the master table.

-   **bavrn_state:** This species coding is the coding used by the
    Bavarian State Forest Service. It is *hierarchical* (e.g. the single
    oak codes 54 and 55 nest in the group code 70), and it contains one
    non-tree code (99, "Strauch"/shrub; see
    [Section 2.2.3](#non_tree_codes)).
    
-   **bavrn_state_short:** This coding combines the species of 
    *bavrn_state* into larger groups. These groups are typically used
    by the Bavarian State Forest Service in aggregated evaluations.    

-   **ger_nfi_2012:** The ger_nfi_2012 species coding is the species
    coding used by the German National Forest Inventory of 2012
    [@bwi3_methods_2017].


#### 2.2.5 A field-ready coding table {#field_table}

The full coding table returned by `fe_species_get_coding_table()` carries
*one row per elementary species*. For a hierarchical coding this means a
species can occur several times (once as a leaf code and once in each
group that contains it), and a group code spreads over as many rows as it
has member species. That is exactly what the casting machinery needs, but
it is awkward as a printed lookup key for field work.

For that purpose there is `fe_species_get_field_table()`. It returns the
*code-level* view: **each code exactly once**, together with the name of
the species or group it stands for. The names are taken from the coding
itself (not from the master table), so group names appear as such, and
all three name columns are always included regardless of the
`fe_spec_lang` option. The rows are in the coding's canonical order
(leaf codes first, then the coarser groups), and the *level* and
*is_tree* columns are kept, as both matter in the field:

```{r field_table_demo}
fe_species_get_field_table("tum_wwk_short")
```

Rendering such a table into a nicely formatted, printable document (e.g.
a PDF) is deliberately left to the downstream packages that already carry
a document-rendering toolchain; *ForestElementsR* itself only provides
the data.


## 3. Usage

Species codes as implemented in this package are vectors with a few
special properties. Most users of the package will work with species
codes as columns in a *data.frame* (or *tibble*), where they are
provided in parallel with other columns (i.e. vectors) that contain
other tree information, e.g. tree diameters, heights, or spatial
coordinates. For the sake of clarity, however, we demonstrate most
applications for isolated vectors of species codes.


### 3.1 Creating a species code vector

For each implemented species coding there exists a user-friendly
function for constructing a vector of species. The naming convention for
this function is *fe_species_coding_name*, whereby *coding_name* is the
name of the desired coding as in the column *species_coding* in the
tibble *species_codings* (see above). Thus, e.g. for creating a vector
of *tum_wwk_short* or *ger_nfi_2012* codes, one would use the functions
*fe_species_tum_wwk_short* or *fe_species_ger_nfi_2012*, respectively.
As their input, these functions require a vector of codes either in
numeric or character format:

```{r create_species_code_vectors}
spec_ids_1 <- fe_species_tum_wwk_short(c(1, 1, 1, 5, 5, 5, 5, 3, 3, 8, 9, 8))
spec_ids_2 <- fe_species_ger_nfi_2012(
  c(10, 10, 10, 100, 100, 100, 100, 20, 20, 190, 290, 190)
)

spec_ids_1
spec_ids_2
```

If the input vector contains codes which are not supported by the chosen
coding, the attempt terminates with an error:

```{r bad_code_input, error = TRUE}
fe_species_tum_wwk_short(c(1, 321, 1, 9999))
fe_species_ger_nfi_2012(c("100", "290", "Peter", "Paul", "Mary"))
```

For each implemented coding there exists a function
*is_fe_species_coding_name* for checking whether an object is a vector
of species codes of the requested class:

```{r check_object_if_species_codes}
spec_ids <- c(1:10)
is_fe_species_tum_wwk_short(spec_ids)
spec_ids <- fe_species_tum_wwk_short(c(1:10))
is_fe_species_tum_wwk_short(spec_ids)
is_fe_species_bavrn_state(spec_ids)
```

NA values are in principle allowed in species code vectors. There may
be, however, objects (like *fe_stand*, covered in a separate vignette) which
enforce species code vectors without NAs.
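
As noted at the beginning of this section, species codes usually live in
a column of a data frame rather than in an isolated vector. A minimal
sketch of such a tibble follows; the object `trees` and all tree ids,
diameters, and heights in it are made up for illustration:

```{r species_codes_in_tibble}
library(ForestElementsR)
library(tibble)

# Hypothetical tree data; ids, diameters and heights are invented
trees <- tibble(
  tree_id = c("t1", "t2", "t3", "t4"),
  species_id = fe_species_tum_wwk_short(c(1, 1, 5, 3)),
  dbh_cm = c(32.5, 41.0, 27.8, 36.2),
  height_m = c(26.1, 29.4, 24.0, 27.3)
)
trees

# The column keeps its species coding class inside the tibble
is_fe_species_tum_wwk_short(trees$species_id)
```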


### 3.2 Display options {#display_options}

By default, species code vectors are displayed "as they are", i.e. what
we see are the original codes as in the column *species_id* in the
corresponding coding's table (see above). Sometimes, e.g. for creating
output for third parties, the actual species names are preferable. The
most convenient way to achieve that is to set the option *fe_spec_lang*
which can take the values *sci*, *eng*, *ger*, and *code*. Let's create
four species code vectors:

```{r option_spec_lang}
spec_ids_1 <- fe_species_tum_wwk_short(c(1, 1, 5, 5, 5, 5, 3, 3))
spec_ids_2 <- fe_species_ger_nfi_2012(c(100, 100, 20, 20, 30, 110))
spec_ids_3 <- fe_species_bavrn_state(c(60, 60, 30, 30, 84, 86))
spec_ids_4 <- fe_species_master(c("abies_001", "tilia_002", "ulmus_001"))
```

The default display is:

```{r reset_spec_lang_option, echo = FALSE}
op_help <- options(fe_spec_lang = NULL) # catch user's actual setting
```

```{r display_default_codes, echo = FALSE}
spec_ids_1
spec_ids_2
spec_ids_3
spec_ids_4
```

With the option *fe_spec_lang* set to "sci", the scientific species
names are displayed:

```{r display_scientific_names}
options(fe_spec_lang = "sci") # Display scientific species names

spec_ids_1
spec_ids_2
spec_ids_3
spec_ids_4
```

For printing the colloquial English species names, the option "eng" is
the choice:

```{r display_english_names}
options(fe_spec_lang = "eng") # Display English species names

spec_ids_1
spec_ids_2
spec_ids_3
spec_ids_4
```

```{r set_spec_lang_option_to_original, echo = FALSE}
options(op_help) # op_help is the list returned by options(); restore it
```

In the same way, you can use `options(fe_spec_lang = "ger")` for having
the German species names displayed. With
`options(fe_spec_lang = "code")` or `options(fe_spec_lang = NULL)`, you
return to the default display of the original codes. If you do not want
to work with such options and just want a quick check of the species
names corresponding to given codes, you can use the function *format*.
It takes the species code vector to be displayed and the argument
*spec_lang*, which can be "sci", "eng", "ger", or "code" with exactly
the same meanings as explained above. The output of *format* is never an
fe_species coding object, but always a character vector (which is useful
for some purposes):

```{r format_example}
format(spec_ids_1, spec_lang = "eng")
format(spec_ids_2, spec_lang = "sci")
format(spec_ids_3, spec_lang = "code")
format(spec_ids_4, spec_lang = "ger")
```
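
If the option *fe_spec_lang* should be changed only temporarily, base
R's `options()` helps: it returns the previous values as a list, which
can be passed back to `options()` for restoring them. The following
sketch wraps this idiom in a small helper; `with_spec_lang()` is a
hypothetical function written for this example, not part of
*ForestElementsR*:

```{r temporary_spec_lang}
# Hypothetical helper, not part of ForestElementsR
with_spec_lang <- function(lang, expr) {
  old <- options(fe_spec_lang = lang) # returns the previous setting as a list
  on.exit(options(old), add = TRUE) # restore it when the function exits
  force(expr) # expr is evaluated lazily, i.e. under the new setting
}

# Inside the call the option is "ger" ...
with_spec_lang("ger", getOption("fe_spec_lang"))
# ... and afterwards the previous setting is back in place
getOption("fe_spec_lang")
```

Any expression that prints or formats a species code vector could be
passed as `expr` in the same way, so that the surrounding session keeps
its own display setting.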

Note that the names for display are always taken from the specific
coding's table, not from the species master table. Be also aware that
such species names are not the codes themselves. This means, you cannot
generate a species code vector from a vector of species names:

```{r do_not_try_to_generate_from_names, error = TRUE}
spec_names <- c("Abies alba", "Picea abies")
fe_species_ger_nfi_2012(spec_names)
```

When assigning new values to elements of a species coding vector, the
safest way to do so is to provide the new values as an instance of the
same class. For all other values, an attempt is made to convert them
into an instance of the goal class. If this is not possible, the
assignment does not take place, and an error is thrown:

```{r assigning_species_codes, error = TRUE}
spec_vec <- fe_species_bavrn_state(c("10", "10", "10", "50", "50", "50"))
format(spec_vec, "eng")

# Safest way, same class on both sides of the '<-'
spec_vec[3] <- fe_species_bavrn_state("40")
is_fe_species_bavrn_state(spec_vec)
format(spec_vec, "eng")

# Character vector is converted
spec_vec[3:4] <- c("40", "70")
is_fe_species_bavrn_state(spec_vec)
format(spec_vec, "eng")

# Numerical vector is converted
spec_vec[3:4] <- c(60, 87)
is_fe_species_bavrn_state(spec_vec)
format(spec_vec, "eng")

# Species code not supported by goal coding - no assignment and error
spec_vec[1:2] <- c("3333", "12")
is_fe_species_bavrn_state(spec_vec)
format(spec_vec, "eng")

# Vectors of other species codings are converted, if possible
spec_vec[5:6] <- fe_species_tum_wwk_short(c("3", "3")) # "3" Scots pine in rhs
# coding
is_fe_species_bavrn_state(spec_vec)
format(spec_vec, "code") # "3" becomes "20" ...
format(spec_vec, "eng") # ... which is Scots pine in the goal coding
```


### 3.3 Species code conversions {#species_code_conversions}

For each implemented species coding there is a function
*as_fe_species_coding_name* which tries to convert an object of any
other given species coding implemented in *ForestElementsR* into an
instance of the goal object. You can use it also for converting numeric
or character vectors (as an alternative to *fe_species_coding_name*),
but the interesting feature is the conversion between different codings:

```{r unproblematic_conversion}
spec_ids <- as_fe_species_tum_wwk_short(c("1", "3", "5"))
as_fe_species_ger_nfi_2012(spec_ids) |> format("eng")
```

When the initial species code vector contains codes which belong to the
same species group in the goal coding, information is lost when doing
the conversion. This is a *backward ambiguous cast*. In such a case, the
conversion is executed, but a *message* is issued. (In earlier versions
of the package this was a *warning*; it was downgraded to a message
because such information loss is the normal, intended outcome of
aggregating into coarser groups, and a warning forced users to wrap every
deliberate aggregation in `suppressWarnings()`.)

```{r conversion_with_information_loss}
spec_ids_1 <- as_fe_species_ger_nfi_2012(c("170", "150", "140"))
spec_ids_1 |> format("eng")

# Backward ambiguous cast (possible, but with information loss)
spec_ids_2 <- as_fe_species_tum_wwk_short(spec_ids_1)
spec_ids_2 |> format("eng")
```

Conversely, when casting *into* a hierarchical coding (one that offers
both single-species and group codes, see
[Section 2.2.2](#hierarchical_codings)), each species is resolved to the
*finest* code available for it - the single-species code if there is one,
the smallest containing group otherwise. This happens automatically and
without information loss:

```{r finest_node_cast_usage}
# Pedunculate and sessile oak resolve to the single codes 54 and 55,
# not to the group code 70
as_fe_species_bavrn_state(fe_species_master(c("quercus_001", "quercus_002"))) |>
  format("code")
```

Conversions with no match in the goal coding terminate with an error:

```{r impossible_conversion_no_match, error = TRUE}
spec_ids <- as_fe_species_bavrn_state(c("11", "11", "11"))
spec_ids |> format("eng")

# No Serbian spruce in the tum_wwk_long coding
spec_ids |> as_fe_species_tum_wwk_long()
```

*Forward ambiguous casts* occur when one code in the initial code vector
has several matches in the goal coding. If this is the case, execution
terminates, and an error is thrown:

```{r forward_ambiguous_cast, error = TRUE}
# Each of these codes comprises many single species
spec_ids <- fe_species_tum_wwk_short(c("8", "9", "10"))
spec_ids |> format("eng")

# Conversion attempt terminates with error
spec_ids |> as_fe_species_ger_nfi_2012()

# Similar
as_fe_species_master(fe_species_ger_nfi_2012("90"))
```

There is one controlled exception to the forward-ambiguous error. A few
source group codes genuinely straddle two groups of a goal coding, so
there is no single matching target node, yet a sensible aggregate exists.
For these cases the package ships a small, documented table,
`species_cast_overrides`, that declares the deliberate target code. When
such an override applies, the cast is carried out (lossily, with a
message) instead of raising an error:

```{r cast_override}
species_cast_overrides

# ger_nfi_2012 code 290 has no single match in tum_wwk_short, but the
# override resolves it to code 8
as_fe_species_tum_wwk_short(fe_species_ger_nfi_2012("290")) |> format("code")
```

Finally, a *non-tree* code (see [Section 2.2.3](#non_tree_codes)) has no
tree-species equivalent in any goal coding, so it is resolved to `NA`
(again with a message) rather than treated as a failed match:

```{r non_tree_cast}
as_fe_species_tum_wwk_short(fe_species_bavrn_state(c("10", "99"))) |>
  unclass()
```

Note that the feasibility of a species coding cast is checked for each
single conversion attempt, because it depends on the specific species
codes to be converted. That is, some conversions between the same two
codings will work well while others fail:

```{r good_and_bad_conversions_between_same_types, error = TRUE}
# Conversion from tum_wwk_short to ger_nfi_2012 - works
spec_ids_1 <- fe_species_tum_wwk_short(c("1", "3", "5"))
spec_ids_1 |> format("eng")

spec_ids_2 <- as_fe_species_ger_nfi_2012(spec_ids_1)
spec_ids_2 |> format("eng")

# Conversion from tum_wwk_short to ger_nfi_2012 - fails
spec_ids_1 <- fe_species_tum_wwk_short(c("8", "9", "10"))
spec_ids_1 |> format("eng")

spec_ids_2 <- as_fe_species_ger_nfi_2012(spec_ids_1)
```
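
Because the success of a cast depends on the actual codes, scripts that
process many data sets may want to guard a conversion instead of letting
it stop the whole run. The following is a sketch based on base R's
`tryCatch()`; the helper `try_cast()` is hypothetical and not part of
*ForestElementsR*. It returns `NULL` and reports the error message when
a cast fails:

```{r guarded_cast}
library(ForestElementsR)

# Hypothetical helper: apply a cast function, return NULL if it fails
try_cast <- function(x, cast_fun) {
  tryCatch(
    cast_fun(x),
    error = function(e) {
      message("Cast failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Same codings, different codes: the first cast works, the second does not
ok <- try_cast(
  fe_species_tum_wwk_short(c("1", "3", "5")), as_fe_species_ger_nfi_2012
)
bad <- try_cast(
  fe_species_tum_wwk_short(c("8", "9", "10")), as_fe_species_ger_nfi_2012
)

is.null(ok)
is.null(bad)
```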

In some cases one might want to extract the character vector of species
codes out of an *fe_species_coding_name* vector. This is possible either
with *unclass* or with *vctrs::vec_data* (the species codings are
implemented based on the package
[*vctrs*](https://vctrs.r-lib.org/index.html)).

```{r option_2a, echo = FALSE}
opt_help <- getOption("fe_spec_lang")
options(fe_spec_lang = "code")
```

```{r get_the_char_vector}
spec_ids <- fe_species_ger_nfi_2012(c("10", "10", "100", "170"))
spec_ids

chars_1 <- unclass(spec_ids)
chars_1
chars_2 <- vctrs::vec_data(spec_ids)
chars_2

is_fe_species_ger_nfi_2012(chars_1)
is_fe_species_ger_nfi_2012(chars_2)

is.character(chars_1)
is.character(chars_2)
```

```{r option_2b, echo = FALSE}
options(fe_spec_lang = opt_help)
```


### 3.4 Practical examples

As mentioned above, species codes typically do not come as isolated
vectors, but as columns in a data frame (tibble). We isolate one such
data frame from the *fe_stand* object *selection_forest_1\_fe_stand*
which is among the example data that come with the package
*ForestElementsR*:

```{r option_3a, echo = FALSE}
opt_help <- getOption("fe_spec_lang")
options(fe_spec_lang = "code")
```

```{r demo_with_selection_forest_1}
dat <- selection_forest_1_fe_stand$trees |> select(
  tree_id, species_id, time_yr, dbh_cm, height_m
)
dat
```

```{r option_3b, echo = FALSE}
options(fe_spec_lang = opt_help)
```

Here, each row represents one tree, the column *species_id* represents
species codes, and the other columns represent additional key fields
(*tree_id*, *time_yr*) and tree data (*dbh_cm*, *height_m*). When the
package *tidyverse* or *tibble* is attached, the tibble is displayed as
shown above, and the abbreviation *tm_wwk_shrt* indicates that the
coding is *tum_wwk_short*. As only the first ten rows are shown by
default, we see only the species code "1". To find out whether there
are more species, we can use the function *summary*:

```{r option_4a, echo = FALSE}
opt_help <- getOption("fe_spec_lang")
options(fe_spec_lang = "code")
```

```{r demo_with_selection_forest_2}
dat |> summary()
```

```{r option_4b, echo = FALSE}
options(fe_spec_lang = opt_help)
```

Very similar to a summary for a *factor*, the summary for the column
*species_id* provides the row counts for each of the four coded species.
In order to display species names instead of the codes, we have to set
the option *fe_spec_lang* (see also above):

```{r option_5a, echo = FALSE}
opt_help <- getOption("fe_spec_lang")
options(fe_spec_lang = "code")
```

```{r demo_with_selection_forest_3}
# Set option to display colloquial English species names, and store the
# previous setting in opt_prev
opt_prev <- getOption("fe_spec_lang")
options(fe_spec_lang = "eng")

# Display dat
dat

# Display a summary of dat
dat |> summary()

# Reset option to previous value
options(fe_spec_lang = opt_prev)
```

```{r option_5b, echo = FALSE}
options(fe_spec_lang = opt_help)
```

Let's assume we want to know the mean stem volume per species (group)
and its standard deviation. In order to achieve that, we require each
tree's volume first. This can be done with the function *v_gri* which
requires the three inputs *species_id*, *dbh_cm*, and *height_m*. The
function *v_gri* is originally designed to work with the species coding
*tum_wwk_short* (as available in the example data), but it can process
any input for *species_id* that can be converted into the former.

```{r option_6a, echo = FALSE}
opt_help <- getOption("fe_spec_lang")
options(fe_spec_lang = "code")
```

```{r single_tree_volumes}
opt_prev <- getOption("fe_spec_lang")
options(fe_spec_lang = "eng")

dat <- dat |>
  mutate(v_cbm = v_gri(species_id, dbh_cm, height_m))

# Note that the summary of species_id does not preserve the original order of
# the codes (species are alphabetically sorted, dependent on language setting)
dat |> summary()

options(fe_spec_lang = opt_prev)
```

```{r option_6b, echo = FALSE}
options(fe_spec_lang = opt_help)
```

The summary reveals a wide range of volumes which is plausible, given
the range of dbh and height values. For obtaining the mean volumes per
species (group), we can use the *dplyr* functions *group_by* and
*summarise* which work also with our species codings. We see from the
summary below that e.g. Abies alba has the smallest mean stem volume
which comes, however, with the highest standard deviation.

```{r option_7a, echo = FALSE}
opt_help <- getOption("fe_spec_lang")
options(fe_spec_lang = "code")
```

```{r mean_volumes}
# Set option for displaying scientific species names
opt_prev <- getOption("fe_spec_lang")
options(fe_spec_lang = "sci")

dat |>
  group_by(species_id) |>
  summarise(
    mean_stem_volume_cbm = mean(v_cbm),
    sd_stem_volume_cbm = sd(v_cbm)
  )
# In contrast to summary, summarise keeps the original order of the species
# codes, no matter the language setting

options(fe_spec_lang = opt_prev)
```

```{r option_7b, echo = FALSE}
options(fe_spec_lang = opt_help)
```

Note that plotting functions currently do not work with the species
codings. Use the *format* function for such purposes:

```{r plot_1, fig.dim = c(4.9, 3.5), fig.align = 'center', fig.cap = 'Stem volume over diameter by species in log-log display'}
# Note: Using simply 'format(species_id)' below would use the current setting
# of the option fe_spec_lang
dat |>
  ggplot() +
  geom_point(aes(x = dbh_cm, y = v_cbm, col = format(species_id, "eng"))) +
  scale_color_discrete("Species") +
  scale_x_log10() +
  scale_y_log10()
```


## 4. Information for developers

There are two rather different developer tasks around species codings,
and it helps to keep them apart:

1. **Maintaining the *data* of the codings** - adding a species to the
   master table, adding a code to an existing coding, fixing a name, or
   building a "short" aggregation coding. This is now entirely
   *CSV-driven*: a set of exported builder functions turns editable CSV
   files into the validated package data, steered by two workbench
   scripts in `data-raw/`. You edit CSV, you do not edit R code.
   [Section 4.1](#dev_data) describes the layout, and
   [Section 4.2](#dev_update) is a step-by-step recipe.

2. **Adding a genuinely *new* coding** - this additionally needs a new
   S3 (vctrs) class and the cast functions that connect it to all the
   other codings. That part still lives in R source files and is
   described in [Section 4.3](#dev_new_coding).

Finally, [Section 4.4](#dev_never) repeats the standing warning never to
touch `fe_species_helper_functions.R` without knowing *exactly* what you
are doing.

Before we get into the details, note that all species codings inherit from the
*vctrs_vctr* class, which is provided by the package [*vctrs*](https://vctrs.r-lib.org/index.html):

```{r all_inherit_from_vctrs}
fe_species_bavrn_state("30") |> class()
fe_species_ger_nfi_2012("20") |> class()
fe_species_tum_wwk_long("87") |> class()
fe_species_tum_wwk_short("7") |> class()
fe_species_master("abies_004") |> class()
```

While *vctrs* does not allow for building super- and subclasses of species
codings, which would be an obvious feature for a system of species codings, it
supports casts between different classes very conveniently. As casting is a key
requirement of our implementation, we decided on a *vctrs*-based solution.
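
To make the idea of a cast concrete, here is a minimal sketch with two toy
classes built on *vctrs*. The classes *toy_long* and *toy_short* and the lookup
table are invented for this illustration and are not part of
*ForestElementsR*; the cast functions in the package itself are more elaborate.

```r
library(vctrs)

# Constructors for two hypothetical toy codings (not part of ForestElementsR)
new_toy_long <- function(x = character()) new_vctr(x, class = "toy_long")
new_toy_short <- function(x = character()) new_vctr(x, class = "toy_short")

# Assumed lookup table: which long code belongs to which short code
toy_long_to_short <- c("10" = "1", "11" = "1", "20" = "2")

# Cast method following the vctrs naming scheme vec_cast.<to>.<from>
vec_cast.toy_short.toy_long <- function(x, to, ...) {
  new_toy_short(unname(toy_long_to_short[vec_data(x)]))
}

long_codes <- new_toy_long(c("10", "11", "20"))
short_codes <- vec_cast(long_codes, to = new_toy_short())

# Underlying codes after the cast
vec_data(short_codes)
```

The point of the sketch is the method name: one small function per pair of
classes is all that *vctrs* needs to route a cast.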

### 4.1 The data behind the codings {#dev_data}

All coding data is generated from editable CSV files by exported builder
functions; the package data objects (`species_master_table`,
`species_codings`, `species_cast_overrides`) are the *output*, never
edited directly.

- **The master table** lives in `data-raw/species_master_table.csv`
  (exactly six columns: *genus*, *species_no*, *deciduous_conifer*,
  *name_sci*, *name_eng*, *name_ger*; one row per single species).
  `master_template_csv()` writes a fresh snapshot, and
  `master_table_from_csv()` reads it back with strict validation
  (unique keys, lower-case genus, three-digit *species_no*, no NAs).

- **Each coding** has its own CSV in `data-raw/codings/<coding>.csv`,
  in the *species-indexed + `parent_code`* format. There is one row per
  master species, plus extra declaration rows for group names. The key
  columns are:
    - *species_id* - the code assigned to that species. Leaving it
      empty means "this species is not covered by the coding". Several
      species sharing one *species_id* form a group.
    - *parent_code* - the *species_id* of the coarser group that
      contains this code. This is what makes a coding *hierarchical*; a
      flat coding leaves *parent_code* empty throughout. The builder
      derives the *level* column from the `parent_code` chains.
    - *name_sci*, *name_eng*, *name_ger* - optional for a single-species
      code (then inherited from the master), but required for a group
      code (which has no master row).
    - *master_name_\** and, for "short" codings, *agg_from_\** are
      read-only reference columns to help while editing; the builder
      ignores them.

  Note there is **no** *is_tree* column to maintain: whether a code is a
  tree code is *derived* from its master link (a code with no master
  link, e.g. a "shrub" category, becomes a non-tree code automatically,
  with a message).

- `coding_template_from_master()` produces such a CSV (blank, or
  prefilled from an existing coding), and `coding_table_from_template()`
  turns the edited CSV into a validated coding table. The latter checks the
  *laminarity* invariant (codes are nested or disjoint, never partially
  overlapping), derives *level* and *is_tree*, and stores the rows in
  canonical order. For a "short" aggregation coding it additionally
  verifies, against the freshly built parent coding, that the coding is a
  valid coarsening (every parent group maps to exactly one short code).
  The parent/child pairs (`bavrn_state_short` &larr; `bavrn_state`,
  `tum_wwk_short` &larr; `tum_wwk_long`) are registered internally.

- **Cast overrides** are their own little CSV
  (`data-raw/codings/cast_overrides.csv`), built and validated by
  `cast_overrides_from_csv()` into the package object
  `species_cast_overrides` (see [Section 3.3](#species_code_conversions)).
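
The two ideas behind the builder's checks, deriving *level* from the
`parent_code` chains and the laminarity invariant, can be pictured with a small
base R sketch. This is not the package's implementation: the codes, the species
keys, the helper names, and the assumption that a top-level code has level 1
are all made up for illustration.

```r
# Hypothetical toy coding: code -> parent code (NA marks a top-level code)
parent <- c("1" = NA, "2" = NA, "11" = "1", "12" = "1", "21" = "2")

# Follow the parent chain of one code; a top-level code is assumed to be level 1
code_level <- function(code, parent) {
  level <- 1L
  while (!is.na(parent[[code]])) {
    code <- parent[[code]]
    level <- level + 1L
  }
  level
}
toy_levels <- vapply(names(parent), code_level, integer(1), parent = parent)

# Invented species keys covered by each code
covers <- list(
  "1"  = c("sp_a", "sp_b"),
  "11" = "sp_a",
  "12" = "sp_b",
  "2"  = c("sp_c", "sp_d"),
  "21" = "sp_c"
)

# Laminar: any two codes cover species sets that are nested or disjoint
is_laminar <- function(sets) {
  pairs <- utils::combn(seq_along(sets), 2, simplify = FALSE)
  all(vapply(pairs, function(p) {
    a <- sets[[p[1]]]
    b <- sets[[p[2]]]
    common <- intersect(a, b)
    length(common) == 0 || setequal(common, a) || setequal(common, b)
  }, logical(1)))
}

is_laminar(covers)

# A code that overlaps two groups only partially breaks the invariant
is_laminar(c(covers, list("x" = c("sp_b", "sp_c"))))
```

In the toy data, the first call returns `TRUE`, while the added code *x* shares
only one species with each of the groups *1* and *2*, so the second call
returns `FALSE`.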


### 4.2 How to update the master table or an existing coding {#dev_update}

These are the day-to-day data tasks, both driven by a workbench script in
`data-raw/` that is meant to be run **manually, block by block** (not
`source()`d in one go - the first block can overwrite a CSV).

**To add or change a species in the master table**
(`data-raw/species_master_table.R`):

1. (Optional) run the snapshot step to refresh
   `data-raw/species_master_table.csv` from the installed table.
2. Edit the CSV: add or change rows (six columns, one row per species).
   A new species in an existing genus gets the next free *species_no*; a
   new genus starts at *001*.
3. Run `master_table_from_csv()`, inspect, then `usethis::use_data()` and
   reinstall the package so the coding builder sees the new master.
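
To get a feeling for what a valid master CSV looks like, the following base R
sketch writes a tiny stand-in file and applies the rules listed in
[Section 4.1](#dev_data). It does not call `master_table_from_csv()`; the rows,
the value used in *deciduous_conifer*, and the reading of "next free" as
"highest number plus one" are assumptions made for illustration only.

```r
# Write a tiny stand-in for the master CSV; all rows are invented
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "genus,species_no,deciduous_conifer,name_sci,name_eng,name_ger",
  "exemplum,001,conif,Exemplum primum,first example,Erstes Beispiel",
  "exemplum,002,conif,Exemplum secundum,second example,Zweites Beispiel"
), csv_path)

# Read everything as character so that "001" keeps its leading zeros
master <- read.csv(csv_path, colClasses = "character", na.strings = "")

# Checks mirroring the rules named in the text (not the package's own code)
key <- paste(master$genus, master$species_no, sep = "_")
stopifnot(
  ncol(master) == 6,
  !anyNA(master),
  !anyDuplicated(key),
  all(master$genus == tolower(master$genus)),
  all(grepl("^[0-9]{3}$", master$species_no))
)

# Candidate species_no for a new species in an existing genus
used <- as.integer(master$species_no[master$genus == "exemplum"])
next_no <- sprintf("%03d", max(used) + 1L)
next_no
```

Reading the file with `colClasses = "character"` matters here: with the default
type guessing, *species_no* would become an integer and lose its three-digit
form.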

**To add a code to a coding, fix a name, or build a short coding**
(`data-raw/species_codings.R`):

1. In the config block, set `coding_name` and `mode` (`"new"` starts a
   blank CSV from the master; `"edit"` prefills from the installed
   coding). The parent coding, if any, is resolved automatically.
2. Run the generate block to (re)write `data-raw/codings/<coding>.csv`.
3. Edit the CSV by hand following the rules in
   [Section 4.1](#dev_data). For a short coding, use the *agg_from_\**
   hint columns to keep all species of one parent group on the same code.
4. Run the build-and-validate block. It builds the coding from the CSV
   and reports its codes and any non-tree codes; an invalid laminarity
   or coarsening fails here with a clear message.
5. Run the full-rebuild block to regenerate the whole `species_codings`
   object (parents before children), then `usethis::use_data()` and
   reinstall. If the change affects a cast override, rerun
   `data-raw/cast_overrides.R` as well.
6. Run the tests with `devtools::test()` and a full `R CMD check`. Note
   that the `data-raw` CSVs are excluded from the built package
   (`.Rbuildignore`), so the test suite is deliberately written to work
   *without* them - run the real check, not just `test_dir()` against a
   loaded session.
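
The coarsening rule checked in step 4 (every parent group maps to exactly one
short code) can be sketched in a few lines of base R. The data frame, the codes,
and the function name `is_valid_coarsening` are hypothetical and only serve to
show the rule, not how the builder implements it.

```r
# Invented example: one row per species with its code in the parent ("long")
# coding and in the short coding
toy <- data.frame(
  species    = c("sp_a", "sp_b", "sp_c", "sp_d"),
  long_code  = c("10", "10", "20", "30"),
  short_code = c("1", "1", "2", "2")
)

# A coarsening is valid if no long code is split across several short codes
is_valid_coarsening <- function(long_code, short_code) {
  n_short <- tapply(short_code, long_code, function(s) length(unique(s)))
  all(n_short == 1)
}

is_valid_coarsening(toy$long_code, toy$short_code)

# Moving sp_b to another short code splits long code "10"
broken <- toy
broken$short_code[broken$species == "sp_b"] <- "2"
is_valid_coarsening(broken$long_code, broken$short_code)
```

This is the situation the *agg_from_\** hint columns help to avoid while
editing: all species of one parent group must end up on the same short code.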


### 4.3 Adding a brand-new coding {#dev_new_coding}

A genuinely new coding needs two things. Its *data* is built exactly as
in [Sections 4.1](#dev_data) and [4.2](#dev_update): add a new
`data-raw/codings/<coding>.csv`, register the coding name in
`data-raw/species_codings.R` (and, if the new coding is a "short"
aggregation of an existing one, add a parent/child row to the internal
aggregation registry). The builder takes care of *level*, *is_tree*, and
non-tree codes automatically. What remains is the *code*: a vctrs S3
class plus the cast functions that connect the new coding to every other
coding. The remaining steps cover that code side.

#### 4.3.1 Copy the R-source file of an existing coding and adapt it {#adapt-source-file}

Next, you must provide the functions that make your new coding work.
This sounds difficult, but it is actually quite easy. Before we explain how
to do that, be aware of the following naming convention:

*The S3 class covering your species coding must be named "fe_species_" followed
by the name of your coding.*

In other words, if your new coding is named *john_doe_coding* (and that is also
*exactly* what you called it in the 
[tibble species codings](#species_specific_codings)), then your S3 class name 
must be *fe_species_john_doe_coding*.

First, copy the R source file of one of the implemented codings, and 
give it the name of your S3 class (in our example 
*fe_species_john_doe_coding.R*). Note that the files with the existing 
implementations follow this naming convention. For this explanation, I assume
you have copied and renamed the file *fe_species_tum_wwk_short.R*. You could now
get an almost working implementation by automatically searching for the term
*fe_species_tum_wwk_short* and replacing it with *fe_species_john_doe_coding*.
If you do so, however, proceed function by function, not for the whole file in
one go. Note that you must also replace the terms in the documentation above each
function, not only in the R code itself. Important: you will also have to 
adjust the examples by using species codes which are actually covered by your 
coding. Otherwise, the examples will not work, and the package will not pass
*R CMD check*.

From top to bottom of the file *fe_species_tum_wwk_short.R*, the functions to 
update are:

- the constructor *new_fe_species_tum_wwk_short*
- *is_fe_species_tum_wwk_short*
- the formatter *format.fe_species_tum_wwk_short*
- *summary.fe_species_tum_wwk_short*
- *vec_ptype_abbr.fe_species_tum_wwk_short*; here you should also replace the
  provided abbreviation for the coding name by one of your own (this 
  abbreviation is printed e.g. as type information below the column head if
  your coding is a column of a tibble)
- *validate_fe_species_tum_wwk_short*
- *fe_species_tum_wwk_short*, the function users should use for constructing
  an instance of a species coding object
  
- *vec_proxy_order.fe_species_tum_wwk_short*; guarantees that species ids are
  always sorted in the same order. The order does not change, even if the
  option *fe_spec_lang* is changed
  
- Now comes a block of species type casting functions. Their names are built
  like, e.g., *vec_cast.fe_species_tum_wwk_short.fe_species_ger_nfi_2012*, which
  means *vec_cast.fe_species_GOAL_CODING_NAME.fe_species_FROM_CODING_NAME*.
  These functions are very short, and some of them use the coding names 
  internally. If you are qualified to work on this R package, you will understand
  immediately what to adapt. In general, in the function names, you must
  replace *tum_wwk_short* as the goal coding with *john_doe_coding*. In 
  addition, you must copy one of the functions which casts between two species
  codings, and adapt it so that it casts from *tum_wwk_short* to 
  *john_doe_coding*, i.e. name
  it *vec_cast.fe_species_john_doe_coding.fe_species_tum_wwk_short*, and make
  the obvious adaptations in the function's body.
- *as_fe_species_tum_wwk_short*, which is the actual function users call for
  casts between codings
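
Why a *vec_proxy_order* method keeps the sort order independent of the display
language can be seen in the following toy sketch. The class *toy_coding*, the
option name *toy_lang*, and the name tables are invented; the sketch only
assumes that sorting is tied to a fixed code order, which is how I read the
description above, and it is not copied from the package source.

```r
library(vctrs)

# Hypothetical toy coding with two display languages (not part of the package)
toy_names <- list(
  eng = c("9" = "oak", "10" = "beech"),
  ger = c("9" = "Eiche", "10" = "Buche")
)
toy_canonical_order <- c("9", "10")

new_toy_coding <- function(x = character()) new_vctr(x, class = "toy_coding")

# Display depends on an option (toy_lang is an invented option name)
format.toy_coding <- function(x, ...) {
  lang <- getOption("toy_lang", "eng")
  unname(toy_names[[lang]][vec_data(x)])
}

# Sorting depends only on the position in the canonical code order
vec_proxy_order.toy_coding <- function(x, ...) {
  match(vec_data(x), toy_canonical_order)
}

x <- new_toy_coding(c("10", "9", "10"))

options(toy_lang = "eng")
sorted_eng <- vec_data(vec_sort(x))

options(toy_lang = "ger")
sorted_ger <- vec_data(vec_sort(x))

identical(sorted_eng, sorted_ger)
```

Both calls sort the codes as "9", "10", "10". Neither an alphabetical sort of
the displayed names nor a plain character sort of the codes would give this
order in both languages.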
  
  
#### 4.3.2 Add a species coding cast function to each other coding

In the previous step, you have placed *vec_cast* functions that cast 
other codings into your new coding in the implementation of the new coding. Now,
you have to add a function that casts from your coding into another coding
to the implementation of each other coding. In other words, the implementation
of *fe_species_tum_wwk_short* requires a function called
*vec_cast.fe_species_tum_wwk_short.fe_species_john_doe_coding*, the 
implementation of *fe_species_ger_nfi_2012* requires a function
*vec_cast.fe_species_ger_nfi_2012.fe_species_john_doe_coding*, and so on.
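
As a checklist, the required function names can be generated with a few lines
of base R. The coding names below are those that appear in this vignette; the
list may be incomplete, so compare it with the names in `species_codings`
before relying on it. The helper `cast_name` is made up for this sketch.

```r
# Coding names as they appear in this vignette; the list may be incomplete
existing_codings <- c(
  "master", "bavrn_state", "bavrn_state_short", "ger_nfi_2012",
  "tum_wwk_long", "tum_wwk_short"
)
new_coding <- "john_doe_coding"

# Naming scheme: vec_cast.fe_species_<goal coding>.fe_species_<from coding>
cast_name <- function(to, from) {
  paste0("vec_cast.fe_species_", to, ".fe_species_", from)
}

# Casts into the new coding live in the source file of the new coding
casts_into_new <- cast_name(new_coding, existing_codings)

# Casts out of the new coding go into the source file of each other coding
casts_from_new <- cast_name(existing_codings, new_coding)

casts_from_new
```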


#### 4.3.3 Document the new coding

Clearly, when implementing your new species coding by 
[editing an existing source file](#adapt-source-file), you must adapt the
existing documentation you find there to the new requirements. In addition, do
not forget to add your coding to the general documentation of species codings
of the package. You find this in the file *data_species_codings.R*, which
contains roxygen2 documentation. Add a short description and examples in the
same style as for the other codings.


#### 4.3.4 Add your new coding to the automated tests for species codings

The package *ForestElementsR* comprises a suite of automated tests. You must add
your new coding there as well. You find the implementations of the tests in the
subdirectory */tests/testthat/*; the files you need are called
*test_species_coding_consistency* and *test_species_coding_casts*. Several
tests also iterate over *all* codings (e.g. for canonical row order,
completeness, name uniqueness, and non-tree handling); a new coding is
picked up there automatically once it is part of `species_codings`.
See how the tests for the other codings are implemented, and follow these
examples.


### 4.4 **Never** touch the source file *fe_species_helper_functions.R* {#dev_never}

The functions in the source file *fe_species_helper_functions.R* were very 
carefully crafted, and they provide the common technical background for existing
and future species codings implemented in the package *ForestElementsR*. If you
fiddle around there without knowing *500% exactly* what you are doing, you will
almost certainly goof it up.


## References