---
title: "dUtility data-frame manipulations"
author: Klaus Holst & Thomas Scheike
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
    fig_width: 7.15
    fig_height: 5.5
vignette: >
  %\VignetteIndexEntry{dUtility data-frame manipulations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(mets)
```

Simple data manipulation for data-frames
==========================================

* Renaming variables, Deleting variables
* Looking at the data
* Making new variables for the analysis
* Making factors (groupings)
* Working with factors
* Making a factor from existing numeric variable and vice versa

Here are some key data-manipulation steps on a data frame, which is how we typically organise our data in R. After reading the data into R it will typically be a data frame; if not, we can force it to be one. The basic idea of the utility functions is to provide a simple and easy-to-type way of performing common data manipulations on a data frame, much like what is possible in SAS or Stata.

The functions `dcut`, `dfactor`, and so on do essentially what the base R `cut` and `factor` functions do, but are easier to use in the context of data frames and have additional functionality.

```{r}
library(mets)
data(melanoma)
```

```{r}
is.data.frame(melanoma)
```

Here we work on the melanoma data, which has already been read into R and is a data frame.

dUtility functions
==================

The structure for all functions is

* `dfunction(dataframe, y~x|ifcond, ...)` to apply the function to `y` in a data frame, grouped by `x`, and only when condition `ifcond` is satisfied.

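
To make the `y~x|ifcond` pattern concrete, the following sketch spells out in base R what such a call roughly expresses: keep the rows where the condition holds, then apply a function to `y` within groups of `x`. The data frame `toy` and its column names are made up for illustration, and this is only an analogy, not how the mets functions are implemented.

```{r}
# Hypothetical toy data; the column names are illustrative only
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  x = c("a", "a", "b", "b", "a", "b"),
  z = c(10, 20, 30, 40, 50, 60)
)

# Roughly what a formula of the form y ~ x | I(z > 15) expresses for a mean:
# keep rows where the condition holds, then average y within groups of x
kept <- subset(toy, z > 15)
group_means <- aggregate(y ~ x, data = kept, FUN = mean)
group_means
```

The formula interface packs the same three ingredients (response, grouping, condition) into one argument, which is what keeps the calls short.
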
The basic functions are:

Data processing

* dsort
* dreshape
* dcut
* drm, drename, ddrop, dkeep, dsubset
* drelevel
* dlag
* dfactor, dnumeric

Data aggregation

* dby, dby2
* dscalar, deval, daggregate
* dmean, dsd, dsum, dquantile, dcor
* dtable, dcount

Data summaries

* dhead, dtail
* dsummary
* dprint, dlist, dlevels, dunique

A generic function, `daggregate` (or `daggr`), can be called with a function as the argument:

* `daggregate(dataframe, y~x|ifcond, fun=function, ...)`

or without the grouping variable (`x`):

* `daggregate(dataframe, ~y|ifcond, fun=function, ...)`

A useful feature is that `y` and `x`, as well as the subset condition, can be specified using regular expressions or wildcards (the default).

Here we compute the means of certain variables. First, overall:

```{r}
dmean(melanoma,~thick+I(log(thick)))
```

now only when days>500

```{r}
dmean(melanoma,~thick+I(log(thick))|I(days>500))
```

and now by sex, but only when days>500

```{r}
dmean(melanoma,thick+I(log(thick))~sex|I(days>500))
```

and finally by quartiles of days (via the `dcut` function)

```{r}
dmean(melanoma,thick+I(log(thick))~I(dcut(days)))
```

or the means of all variables whose names start with "s" and all variables whose names contain "a"

```{r}
dmean(melanoma,"s*"+"*a*"~sex|I(days>500))
```

Renaming, deleting, keeping, dropping variables
===============================================

Renaming variables

```{r}
melanoma=drename(melanoma,tykkelse~thick)
names(melanoma)
```

Deleting variables

```{r}
data(melanoma)
melanoma=drm(melanoma,~thick+sex)
names(melanoma)
```

or SAS style:

```{r}
data(melanoma)
melanoma=ddrop(melanoma,~thick+sex)
names(melanoma)
```

Alternatively, we can keep certain variables:

```{r}
data(melanoma)
melanoma=dkeep(melanoma,~thick+sex+status+days)
names(melanoma)
```

This can also be done with direct assignment

```{r}
data(melanoma)
ddrop(melanoma) <- ~thick+sex
names(melanoma)
```

The `dkeep` function can also be used to re-order the variables in the data frame

```{r}
data(melanoma)
names(melanoma)
melanoma=dkeep(melanoma,~days+status+.)
names(melanoma)
```

Looking at the data
===================

```{r}
data(melanoma)
dstr(melanoma)
```

The data can be viewed as a data table in RStudio, but to list certain parts of the data in the output window:

```{r}
dlist(melanoma)
```

```{r}
dlist(melanoma, ~.|sex==1)
```

```{r}
dlist(melanoma, ~ulc+days+thick+sex|sex==1)
```

Getting summaries

```{r}
dsummary(melanoma)
```

or for specific variables

```{r}
dsummary(melanoma,~thick+status+sex)
```

Summaries in different groups (sex)

```{r}
dsummary(melanoma,thick+days+status~sex)
```

and only among those with thin tumours (`thick<97`), or only among those with `sex==1`

```{r}
dsummary(melanoma,thick+days+status~sex|thick<97)
```

```{r}
dsummary(melanoma,thick+status~+1|sex==1)
```

or

```{r}
dsummary(melanoma,~thick+status|sex==1)
```

For more complex conditions, use `I()`:

```{r}
dsummary(melanoma,thick+days+status~sex|I(thick<97 & sex==1))
```

Tables between variables

```{r}
dtable(melanoma,~status+sex)
```

All bivariate tables

```{r}
dtable(melanoma,~status+sex+ulc,level=2)
```

All univariate tables

```{r}
dtable(melanoma,~status+sex+ulc,level=1)
```

and with new variables

```{r}
dtable(melanoma,~status+sex+ulc+dcut(days)+I(days>300),level=1)
```

Sorting the data
================

To sort the data

```{r}
data(melanoma)
mel= dsort(melanoma,~days)
dsort(melanoma) <- ~days
head(mel)
```

and to sort by multiple variables, increasing and decreasing

```{r}
dsort(melanoma) <- ~days-status
head(melanoma)
```

Making new variables for the analysis
=====================================

To define several new covariates within a data frame

```{r}
data(melanoma)
melanoma= transform(melanoma, thick2=thick^2, lthick=log(thick) )
dhead(melanoma)
```

When such definitions should apply only under a condition, use the `dtransform` function, which extends `transform` with an optional condition

```{r}
melanoma=dtransform(melanoma,ll=thick*1.05^ulc,sex==1)
melanoma=dtransform(melanoma,ll=thick,sex!=1)
dmean(melanoma,ll~sex+ulc)
```

Making factors (groupings)
=============================

On the melanoma data, the variable `thick` gives the thickness of the melanoma tumour. For some analyses we would like to create a factor depending on the thickness. This can be done in several different ways:

```{r}
melanoma=dcut(melanoma,~thick,breaks=c(0,200,500,800,2000))
```

The new variable is named `thickcat.0` by default.

To see the levels of the factors in the data frame

```{r}
dlevels(melanoma)
```

Checking group sizes

```{r}
dtable(melanoma,~thickcat.0)
```

Adding to the data frame directly, here naming the new variable `gr.thick1`

```{r}
dcut(melanoma,breaks=c(0,200,500,800,2000)) <- gr.thick1~thick
dlevels(melanoma)
```

Without a name on the left-hand side, the new variable is named `thickcat.0` (after the first cut-point). To get quartiles with the default name `thickcat.4`

```{r}
dcut(melanoma) <- ~ thick # new variable is thickcat.4
dlevels(melanoma)
```

or median groups, here starting again with the original data,

```{r}
data(melanoma)
dcut(melanoma,breaks=2) <- ~ thick # new variable is thickcat.2
dlevels(melanoma)
```

To control the new names

```{r}
data(melanoma)
mela= dcut(melanoma,thickcat4+dayscat4~thick+days,breaks=4)
dlevels(mela)
```

or

```{r}
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4+dayscat4~thick+days
dlevels(melanoma)
```

The same groupings can also be written out explicitly with base R `cut`

```{r}
melanoma$gthick = cut(melanoma$thick,breaks=c(0,200,500,800,2000))
melanoma$gthick = cut(melanoma$thick,breaks=quantile(melanoma$thick),include.lowest=TRUE)
```

Working with factors
====================

To see the levels of covariates in the data frame

```{r}
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4~thick
dlevels(melanoma)
```

To relevel the factor

```{r}
dtable(melanoma,~thickcat4)
melanoma = drelevel(melanoma,~thickcat4,ref="(194,356]")
dlevels(melanoma)
```

or to select the reference level by its position in the list of levels (here position 2),

```{r}
melanoma = drelevel(melanoma,~thickcat4,ref=2)
dlevels(melanoma)
```

To combine levels of a factor (first combining the first 3 groups into one):
```{r}
melanoma = drelevel(melanoma,~thickcat4,newlevels=1:3)
dlevels(melanoma)
```

or to combine groups 1 and 2 into one group and 3 and 4 into another

```{r}
dkeep(melanoma) <- ~thick+thickcat4
melanoma = drelevel(melanoma,gthick2~thickcat4,newlevels=list(1:2,3:4))
dlevels(melanoma)
```

Changing the order of factor levels

```{r}
dfactor(melanoma,levels=c(3,1,2,4)) <- thickcat4.2~thickcat4
dlevel(melanoma,~ "thickcat4*")
dtable(melanoma,~thickcat4+thickcat4.2)
```

Combining levels, but now controlling the factor-level names

```{r}
melanoma=drelevel(melanoma,gthick3~thickcat4,newlevels=list(group1.2=1:2,group3.4=3:4))
dlevels(melanoma)
```

Making a factor from existing numeric variable and vice versa
==============================================================

A numeric variable `status` with values 1, 2, 3 can be converted to a factor by:

```{r}
data(melanoma)
melanoma = dfactor(melanoma,~status, labels=c("malignant-melanoma","censoring","dead-other"))
melanoma = dfactor(melanoma,sexl~sex,labels=c("females","males"))
dtable(melanoma,~sexl+status.f)
```

A factor such as `sexl`, with levels `"females"` and `"males"`, can be converted to a numeric variable by:

```{r}
melanoma = dnumeric(melanoma,~sexl)
dstr(melanoma,"sex*")
dtable(melanoma,~'sex*',level=2)
```

SessionInfo
============

```{r}
sessionInfo()
```