---
title: "Haplotype Discrete Survival Models"
author: Klaus Holst & Thomas Scheike
date: "2026-05-24"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Haplotype Discrete Survival Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
<!-- DO NOT EDIT: auto-generated from the .Rmd.orig source -->


Haplotype Analysis for discrete TTP
===================================

Cycle-specific logistic regression of haplotype effects with
known haplotype probabilities. Given observed genotype $G$ and
unobserved haplotypes $H$, we marginalise over the possible
haplotypes using $P(H|G)$ supplied as input.

\begin{align*}
   S(t|x,G) & = E( S(t|x,H) | G)  = \sum_{h \in G} P(h|G) S(t|z,h)
\end{align*}
so survival can be computed by marginalising over possible $h$ given $g$.

Survival is based on logistic regression for the discrete hazard
function of the form
\begin{align*}
      \mbox{logit}(P(T=t \mid T \geq t, x, h)) & = \alpha_t + x(h)^T \beta
\end{align*}
where $x(h)$ is a regression design of $x$ and haplotypes $h=(h_1,h_2)$.

Simple binomial data can also be fitted using this function.

Standard errors are computed assuming that the haplotype probabilities are known.

	
We are particularly interested in the types haplotypes: 

``` r
types <- c("DCGCGCTCACG","DTCCGCTGACG","ITCAGTTGACG","ITCCGCTGAGG")

## some haplotypes frequencies for simulations 
data(haplo)
hapfreqs <- haplo$hapfreqs 
print(hapfreqs)
#>             index   haplotype     freq
#> DCGAGCTCACG     1 DCGAGCTCACG 0.010681
#> DCGCGCTCACG     2 DCGCGCTCACG 0.138387
#> DTGAGCTCACG     3 DTGAGCTCACG 0.000310
#> DTGAGCTCACA     4 DTGAGCTCACA 0.006800
#> DTGAGCTCGCG     5 DTGAGCTCGCG 0.034517
#> DTGACCTCACG     6 DTGACCTCACG 0.001336
#> DTGCGCTCACG     7 DTGCGCTCACG 0.009969
#> DTGCGCTCACA     8 DTGCGCTCACA 0.011833
#> DTGCGCTCGCG     9 DTGCGCTCGCG 0.302389
#> DTGCGCCCGCG    10 DTGCGCCCGCG 0.001604
#> DTGCCCTCACG    11 DTGCCCTCACG 0.003912
#> DTCAGCTGACG    12 DTCAGCTGACG 0.001855
#> DTCCGCTGACG    13 DTCCGCTGACG 0.103394
#> DTCCCCTGACG    14 DTCCCCTGACG 0.000310
#> ITCAGTTGACG    15 ITCAGTTGACG 0.048124
#> ITCCGCTGAGG    16 ITCCGCTGAGG 0.291273
#> ITCCGTTGACG    17 ITCCGTTGACG 0.031089
#> ITCCGTCGACG    18 ITCCGTCGACG 0.001502
#> ITCCCCTGAGG    19 ITCCCCTGAGG 0.000653
```

Among the haplotypes of interest we look up the frequencies and choose a baseline:


``` r
www <-which(hapfreqs$haplotype %in% types)
hapfreqs$freq[www]
#> [1] 0.138387 0.103394 0.048124 0.291273

baseline=hapfreqs$haplotype[9]
baseline
#> [1] "DTGCGCTCGCG"
```


We have cycle-specific data with $id$ and outcome $y$:


``` r
haploX  <- haplo$haploX
dlist(haploX,.~id|id %in% c(1,4,7))
#> id: 1
#>   y X1 X2 X3 X4 times lbnr__id Count1
#> 1 0 0  0  0  0  1     1        0     
#> 2 0 0  0  0  0  2     2        0     
#> 3 0 0  0  0  0  3     3        0     
#> 4 0 0  0  0  0  4     4        0     
#> 5 0 0  0  0  0  5     5        0     
#> 6 0 0  0  0  0  6     6        0     
#> ------------------------------------------------------------ 
#> id: 4
#>    y X1 X2 X3 X4 times lbnr__id Count1
#> 19 1 0  0  0  0  1     1        0     
#> ------------------------------------------------------------ 
#> id: 7
#>    y X1 X2 X3 X4 times lbnr__id Count1
#> 37 0 1  0  0  0  1     1        0     
#> 38 0 1  0  0  0  2     2        0     
#> 39 1 1  0  0  0  3     3        0
```
and a list of possible haplotypes for each id and their probabilities $p$
(the probabilities within each id sum to 1):


``` r
ghaplos <- haplo$ghaplos
head(ghaplos)
#>    id      haplo1      haplo2          p
#> 1   1 DTGCGCTCGCG DTGAGCTCGCG 1.00000000
#> 19  2 ITCCGTTGACG DTGAGCTCGCG 0.06867716
#> 21  2 ITCAGTTGACG DTGCGCTCGCG 0.93132284
#> 51  3 ITCCGTTGACG DTGAGCTCGCG 0.06867716
#> 53  3 ITCAGTTGACG DTGCGCTCGCG 0.93132284
#> 66  4 DTGCGCTCGCG DTGCGCTCGCG 1.00000000
```

The first id=1 has the haplotype fully observed, but id=2 has two possible haplotypes consistent with the
observed genotype for this id, the probabilities are 7\% and 93\%, respectively. 

With the baseline given above we can specify a regression design that gives an effect if a haplotype of the specified type
is present (`sm=0`), or an additive effect of haplotypes (`sm=1`):


``` r
designftypes <- function(x,sm=0) {
hap1=x[1]
hap2=x[2]
if (sm==0) y <- 1*( (hap1==types) | (hap2==types))
if (sm==1) y <- 1*(hap1==types) + 1*(hap2==types)
return(y)
}

```

To fit the model we first construct a time design matrix (named `X`)
and supply the haplotype distributions for each id:


``` r
haploX$time <- haploX$times
Xdes <- model.matrix(~factor(time),haploX)
colnames(Xdes) <- paste("X",1:ncol(Xdes),sep="")
X <- dkeep(haploX,~id+y+time)
X <- cbind(X,Xdes)
Haplos <- dkeep(ghaplos,~id+"haplo*"+p)
desnames=paste("X",1:6,sep="")   # six X's related to 6 cycles 
head(X)
#>   id y time X1 X2 X3 X4 X5 X6
#> 1  1 0    1  1  0  0  0  0  0
#> 2  1 0    2  1  1  0  0  0  0
#> 3  1 0    3  1  0  1  0  0  0
#> 4  1 0    4  1  0  0  1  0  0
#> 5  1 0    5  1  0  0  0  1  0
#> 6  1 0    6  1  0  0  0  0  1
```

We can now fit the model with the design given by the design function:

``` r
out <- haplo_surv_discrete(X=X,y="y",time.name="time",
      Haplos=Haplos,desnames=desnames,designfunc=designftypes) 
names(out$coef) <- c(desnames,types)
out$coef
#>          X1          X2          X3          X4          X5          X6 
#> -1.82153345 -0.61608261 -0.17143057 -1.27152045 -0.28635976 -0.19349091 
#> DCGCGCTCACG DTCCGCTGACG ITCAGTTGACG ITCCGCTGAGG 
#>  0.79753613  0.65747412  0.06119231  0.31666905
summary(out)
#>             Estimate Std.Err     2.5%   97.5%   P-value
#> X1          -1.82153  0.1619 -2.13892 -1.5041 2.355e-29
#> X2          -0.61608  0.1895 -0.98748 -0.2447 1.149e-03
#> X3          -0.17143  0.1799 -0.52398  0.1811 3.406e-01
#> X4          -1.27152  0.2631 -1.78719 -0.7559 1.346e-06
#> X5          -0.28636  0.2030 -0.68425  0.1115 1.584e-01
#> X6          -0.19349  0.2134 -0.61184  0.2249 3.647e-01
#> DCGCGCTCACG  0.79754  0.1494  0.50465  1.0904 9.445e-08
#> DTCCGCTGACG  0.65747  0.1621  0.33971  0.9752 5.007e-05
#> ITCAGTTGACG  0.06119  0.2145 -0.35931  0.4817 7.755e-01
#> ITCCGCTGAGG  0.31667  0.1361  0.04989  0.5834 1.999e-02
```

Haplotypes `"DCGCGCTCACG"` and `"DTCCGCTGACG"` give an increased hazard of pregnancy.

The data were generated with these true coefficients:


``` r
tcoef=c(-1.93110204,-0.47531630,-0.04118204,-1.57872602,-0.22176426,-0.13836416,
0.88830288,0.60756224,0.39802821,0.32706859)

cbind(out$coef,tcoef)
#>                               tcoef
#> X1          -1.82153345 -1.93110204
#> X2          -0.61608261 -0.47531630
#> X3          -0.17143057 -0.04118204
#> X4          -1.27152045 -1.57872602
#> X5          -0.28635976 -0.22176426
#> X6          -0.19349091 -0.13836416
#> DCGCGCTCACG  0.79753613  0.88830288
#> DTCCGCTGACG  0.65747412  0.60756224
#> ITCAGTTGACG  0.06119231  0.39802821
#> ITCCGCTGAGG  0.31666905  0.32706859
```

The fitted design matrix can be found in the output:

``` r
head(out$X,10)
#>    X1 X2 X3 X4 X5 X6 haplo1 haplo2 haplo3 haplo4
#> 1   1  0  0  0  0  0      0      0      0      0
#> 2   1  1  0  0  0  0      0      0      0      0
#> 3   1  0  1  0  0  0      0      0      0      0
#> 4   1  0  0  1  0  0      0      0      0      0
#> 5   1  0  0  0  1  0      0      0      0      0
#> 6   1  0  0  0  0  1      0      0      0      0
#> 8   1  0  0  0  0  0      0      0      1      0
#> 10  1  1  0  0  0  0      0      0      1      0
#> 12  1  0  1  0  0  0      0      0      1      0
#> 14  1  0  0  1  0  0      0      0      1      0
```


SessionInfo
============


``` r
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/kkzh/.asdf/installs/r/4.6.0/lib/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Copenhagen
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] timereg_2.0.7  survival_3.8-6 mets_1.3.10   
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.6              knitr_1.51             rlang_1.2.0           
#>  [4] xfun_0.57              otel_0.2.0             future.apply_1.20.2   
#>  [7] listenv_0.10.1         lava_1.9.1             stats4_4.6.0          
#> [10] grid_4.6.0             evaluate_1.0.5         yaml_2.3.12           
#> [13] mvtnorm_1.3-7          numDeriv_2016.8-1.1    compiler_4.6.0        
#> [16] codetools_0.2-20       Rcpp_1.1.1-1.1         ucminf_1.2.3          
#> [19] future_1.70.0          lattice_0.22-9         digest_0.6.39         
#> [22] parallelly_1.47.0      parallel_4.6.0         splines_4.6.0         
#> [25] Matrix_1.7-5           tools_4.6.0            RcppArmadillo_15.2.6-1
#> [28] globals_0.19.1
```