---
title: "dUtility data-frame manipulations"
author: Klaus Holst & Thomas Scheike
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes 
    fig_width: 7.15
    fig_height: 5.5   
vignette: >
  %\VignetteIndexEntry{dUtility data-frame manipulations} 
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(mets)
```


Simple data manipulation for data-frames 
==========================================

   *  Renaming variables, Deleting variables
   *  Looking at the data
   *  Making new variables for the analysis
   *  Making factors (groupings)
   *  Working with factors
   *  Making a factor from existing numeric variable and vice versa


Here are some key data-manipulation steps on a data frame, which is how we typically
organise our data in R.
After reading the data into R it will typically be a data frame; if not, we
can force it to be one. The basic idea of the utility functions is
to provide a simple and easy-to-type way of performing common data manipulations on a
data frame, much like what is possible in SAS or Stata.

The functions `dcut`, `dfactor`, and so on do essentially what
the base R `cut` and `factor` functions do, but are easier to use in the context of data frames
and have additional functionality.

```{r}
library(mets)
data(melanoma)
```

```{r}
is.data.frame(melanoma)
```

Here we work on the melanoma data, which has already been read into R and is
a data frame.


dUtility functions 
==================

The structure for all functions is

*  `dfunction(dataframe, y~x|ifcond, ...)`

to apply the function to `y` in a data frame, grouped by `x`, and only when
condition `ifcond` is satisfied. The basic functions are:

Data processing
    * dsort
    * dreshape
    * dcut
    * drm, drename, ddrop, dkeep, dsubset
    * drelevel
    * dlag
    * dfactor, dnumeric

Data aggregation
    * dby, dby2
    * dscalar, deval, daggregate
    * dmean, dsd, dsum, dquantile, dcor
    * dtable, dcount

Data summaries
    * dhead, dtail, 
    * dsummary, 
    * dprint, dlist, dlevels, dunique


A generic function daggregate, daggr,  can be called with a function as the 
argument:

*  `daggregate(dataframe, y~x|ifcond, fun=function, ...)`

or without the grouping variable (`x`):

*  `daggregate(dataframe, ~y|ifcond, fun=function, ...)`

A useful feature is that `y` and `x` as well as the subset condition
can be specified using regular expressions or wildcards (default).
Here we compute the means of certain variables.

First, overall:
```{r}
dmean(melanoma,~thick+I(log(thick)))
```

now only when days>500
```{r}
dmean(melanoma,~thick+I(log(thick))|I(days>500))
```

and now after sex but only when days>500

```{r}
dmean(melanoma,thick+I(log(thick))~sex|I(days>500))
```

and finally after quartiles of days (via the dcut function)

```{r}
dmean(melanoma,thick+I(log(thick))~I(dcut(days)))
```

or summary of all variables starting with "s" and that contains "a"

```{r}
dmean(melanoma,"s*"+"*a*"~sex|I(days>500))
```


Renaming, deleting, keeping, dropping  variables
===============================================

```{r}
melanoma=drename(melanoma,tykkelse~thick)
names(melanoma)
```

Deleting variables

```{r}
data(melanoma)
melanoma=drm(melanoma,~thick+sex)
names(melanoma)
```

or SAS style:

```{r}
data(melanoma)
melanoma=ddrop(melanoma,~thick+sex)
names(melanoma)
```

alternatively, we can keep certain variables:

```{r}
data(melanoma)
melanoma=dkeep(melanoma,~thick+sex+status+days)
names(melanoma)
```

This can also be done with direct assignment 

```{r}
data(melanoma)
ddrop(melanoma) <- ~thick+sex
names(melanoma)
```

The dkeep function can also be used to re-ordering the variables in the data-frame

```{r}
data(melanoma)
names(melanoma)
melanoma=dkeep(melanoma,~days+status+.)
names(melanoma)
```


Looking at the data 
===================

```{r}
data(melanoma)
dstr(melanoma)
```


The data can be viewed as a data table in RStudio, but to list certain parts of
the data in the output window:

```{r}
dlist(melanoma)
```

```{r}
dlist(melanoma, ~.|sex==1)
```

```{r}
dlist(melanoma, ~ulc+days+thick+sex|sex==1)
```

Getting summaries

```{r}
dsummary(melanoma)
```

or for specific variables

```{r}
dsummary(melanoma,~thick+status+sex)
```


Summaries in different groups (sex)

```{r}
dsummary(melanoma,thick+days+status~sex)
```


and only among those with thin-tumours or only females (sex==1)

```{r}
dsummary(melanoma,thick+days+status~sex|thick<97)
```

```{r}
dsummary(melanoma,thick+status~+1|sex==1)
```

or 

```{r}
dsummary(melanoma,~thick+status|sex==1)
```


For more complex conditions, use `I()`:

```{r}
dsummary(melanoma,thick+days+status~sex|I(thick<97 & sex==1))
```

Tables between variables

```{r}
dtable(melanoma,~status+sex)
```

All bivariate tables

```{r}
dtable(melanoma,~status+sex+ulc,level=2)
```

All univariate tables

```{r}
dtable(melanoma,~status+sex+ulc,level=1)
```

and with new variables 

```{r}
dtable(melanoma,~status+sex+ulc+dcut(days)+I(days>300),level=1)
```


Sorting the data
===============

To sort the data 

```{r}
data(melanoma)
mel= dsort(melanoma,~days)
dsort(melanoma) <- ~days
head(mel)
```


and to sort after multiple variables increasing and decreasing 

```{r}
dsort(melanoma) <- ~days-status
head(melanoma)
```


Making new variables for the analysis
=====================================

To define a bunch of new covariates within a data-frame

```{r}
data(melanoma)
melanoma= transform(melanoma, thick2=thick^2, lthick=log(thick) ) 
dhead(melanoma)
```

When the above definitions are done using a condition this can be achieved 
using the dtransform function that extends transform with a possible condition  

```{r}
 melanoma=dtransform(melanoma,ll=thick*1.05^ulc,sex==1)  
 melanoma=dtransform(melanoma,ll=thick,sex!=1)  
 dmean(melanoma,ll~sex+ulc)
```


Making factors (groupings)
=============================

On the melanoma data, the variable `thick` gives the thickness of the melanoma tumour. For some analyses we would like to create a factor depending on the thickness. This can be done in several different ways:

```{r}
melanoma=dcut(melanoma,~thick,breaks=c(0,200,500,800,2000))
```


New variable is named `thickcat.0` by default.

To see levels of factors in data-frame

```{r}
dlevels(melanoma)
```


Checking group sizes 

```{r}
dtable(melanoma,~thickcat.0)
```


With adding to the data-frame directly 

```{r}
dcut(melanoma,breaks=c(0,200,500,800,2000)) <- gr.thick1~thick
dlevels(melanoma)
```

new variable is named thickcat.0 (after first cut-point), 
or to get quartiles with default names thick.cat.4 

```{r}
dcut(melanoma) <- ~ thick  # new variable is thickcat.4
dlevels(melanoma)
```

or median groups, here starting again with the original data, 

```{r}
data(melanoma)
dcut(melanoma,breaks=2) <- ~ thick  # new variable is thick.2
dlevels(melanoma)
```


to control new names

```{r}
data(melanoma)
mela= dcut(melanoma,thickcat4+dayscat4~thick+days,breaks=4)
dlevels(mela)
```

or

```{r}
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4+dayscat4~thick+days
dlevels(melanoma)
```

This can also be typed out more specifically

```{r}
melanoma$gthick = cut(melanoma$thick,breaks=c(0,200,500,800,2000))
melanoma$gthick = cut(melanoma$thick,breaks=quantile(melanoma$thick),include.lowest=TRUE)
```


Working with factors
====================

To see levels of covariates in data-frame

```{r}
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4~thick
dlevels(melanoma) 
```

To relevel the factor

```{r}
dtable(melanoma,~thickcat4)
melanoma = drelevel(melanoma,~thickcat4,ref="(194,356]")
dlevels(melanoma)
```


or to take the third level in the list of levels, same as above, 

```{r}
melanoma = drelevel(melanoma,~thickcat4,ref=2)
dlevels(melanoma)
```

To combine levels of a factor (first combining the first 3 groups into one):

```{r}
melanoma = drelevel(melanoma,~thickcat4,newlevels=1:3)
dlevels(melanoma)
```

or to combine groups 1 and 2 into one group and 3 and 4 into another

```{r}
dkeep(melanoma) <- ~thick+thickcat4
melanoma = drelevel(melanoma,gthick2~thickcat4,newlevels=list(1:2,3:4))
dlevels(melanoma)
```


Changing order of factor levels 

```{r}
dfactor(melanoma,levels=c(3,1,2,4)) <-  thickcat4.2~thickcat4
dlevel(melanoma,~ "thickcat4*")
dtable(melanoma,~thickcat4+thickcat4.2)
```

Combine levels but now control factor-level names 

```{r}
melanoma=drelevel(melanoma,gthick3~thickcat4,newlevels=list(group1.2=1:2,group3.4=3:4))
dlevels(melanoma)
```

Making a factor from existing numeric variable and vice versa
==============================================================

A numeric variable `status` with values 1, 2, 3 can be converted to a factor by:

```{r}
data(melanoma)
melanoma = dfactor(melanoma,~status, labels=c("malignant-melanoma","censoring","dead-other"))
melanoma = dfactor(melanoma,sexl~sex,labels=c("females","males"))
dtable(melanoma,~sexl+status.f)
```

A gender factor with values `"M"`, `"F"` can be converted to numerics by:

```{r}
melanoma = dnumeric(melanoma,~sexl)
dstr(melanoma,"sex*")
dtable(melanoma,~'sex*',level=2)
```

SessionInfo
============

```{r}
sessionInfo()
```