--- title: "Haplotype Discrete Survival Models" author: Klaus Holst & Thomas Scheike date: "2026-05-24" output: rmarkdown::html_vignette: fig_caption: yes vignette: > %\VignetteIndexEntry{Haplotype Discrete Survival Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Haplotype Analysis for discrete TTP =================================== Cycle-specific logistic regression of haplotype effects with known haplotype probabilities. Given observed genotype $G$ and unobserved haplotypes $H$, we marginalise over the possible haplotypes using $P(H|G)$ supplied as input. \begin{align*} S(t|x,G) & = E( S(t|x,H) | G) = \sum_{h \in G} P(h|G) S(t|z,h) \end{align*} so survival can be computed by marginalising over possible $h$ given $g$. Survival is based on logistic regression for the discrete hazard function of the form \begin{align*} \mbox{logit}(P(T=t \mid T \geq t, x, h)) & = \alpha_t + x(h)^T \beta \end{align*} where $x(h)$ is a regression design of $x$ and haplotypes $h=(h_1,h_2)$. Simple binomial data can also be fitted using this function. Standard errors are computed assuming that the haplotype probabilities are known. We are particularly interested in the types haplotypes: ``` r types <- c("DCGCGCTCACG","DTCCGCTGACG","ITCAGTTGACG","ITCCGCTGAGG") ## some haplotypes frequencies for simulations data(haplo) hapfreqs <- haplo$hapfreqs print(hapfreqs) #> index haplotype freq #> DCGAGCTCACG 1 DCGAGCTCACG 0.010681 #> DCGCGCTCACG 2 DCGCGCTCACG 0.138387 #> DTGAGCTCACG 3 DTGAGCTCACG 0.000310 #> DTGAGCTCACA 4 DTGAGCTCACA 0.006800 #> DTGAGCTCGCG 5 DTGAGCTCGCG 0.034517 #> DTGACCTCACG 6 DTGACCTCACG 0.001336 #> DTGCGCTCACG 7 DTGCGCTCACG 0.009969 #> DTGCGCTCACA 8 DTGCGCTCACA 0.011833 #> DTGCGCTCGCG 9 DTGCGCTCGCG 0.302389 #> DTGCGCCCGCG 10 DTGCGCCCGCG 0.001604 #> DTGCCCTCACG 11 DTGCCCTCACG 0.003912 #> DTCAGCTGACG 12 DTCAGCTGACG 0.001855 #> DTCCGCTGACG 13 DTCCGCTGACG 0.103394 #> DTCCCCTGACG 14 DTCCCCTGACG 0.000310 #> ITCAGTTGACG 15 ITCAGTTGACG 0.048124 #> ITCCGCTGAGG 16 ITCCGCTGAGG 0.291273 #> ITCCGTTGACG 17 ITCCGTTGACG 0.031089 #> ITCCGTCGACG 18 ITCCGTCGACG 0.001502 #> ITCCCCTGAGG 19 ITCCCCTGAGG 0.000653 ``` Among the haplotypes of interest we look up the frequencies and choose a baseline: ``` r www <-which(hapfreqs$haplotype %in% types) hapfreqs$freq[www] #> [1] 0.138387 0.103394 0.048124 0.291273 baseline=hapfreqs$haplotype[9] baseline #> [1] "DTGCGCTCGCG" ``` We have cycle-specific data with $id$ and outcome $y$: ``` r haploX <- haplo$haploX dlist(haploX,.~id|id %in% c(1,4,7)) #> id: 1 #> y X1 X2 X3 X4 times lbnr__id Count1 #> 1 0 0 0 0 0 1 1 0 #> 2 0 0 0 0 0 2 2 0 #> 3 0 0 0 0 0 3 3 0 #> 4 0 0 0 0 0 4 4 0 #> 5 0 0 0 0 0 5 5 0 #> 6 0 0 0 0 0 6 6 0 #> ------------------------------------------------------------ #> id: 4 #> y X1 X2 X3 X4 times lbnr__id Count1 #> 19 1 0 0 0 0 1 1 0 #> ------------------------------------------------------------ #> id: 7 #> y X1 X2 X3 X4 times lbnr__id Count1 #> 37 0 1 0 0 0 1 1 0 #> 38 0 1 0 0 0 2 2 0 #> 39 1 1 0 0 0 3 3 0 ``` and a list of possible haplotypes for each id and their probabilities $p$ (the probabilities within each id sum to 1): ``` r ghaplos <- haplo$ghaplos head(ghaplos) #> id haplo1 haplo2 p #> 1 1 DTGCGCTCGCG DTGAGCTCGCG 1.00000000 #> 19 2 ITCCGTTGACG DTGAGCTCGCG 0.06867716 #> 21 2 ITCAGTTGACG DTGCGCTCGCG 0.93132284 #> 51 3 ITCCGTTGACG DTGAGCTCGCG 0.06867716 #> 53 3 ITCAGTTGACG DTGCGCTCGCG 0.93132284 #> 66 4 DTGCGCTCGCG DTGCGCTCGCG 1.00000000 ``` The first id=1 has the haplotype fully observed, but id=2 has two possible haplotypes consistent with the observed genotype for this id, the probabilities are 7\% and 93\%, respectively. With the baseline given above we can specify a regression design that gives an effect if a haplotype of the specified type is present (`sm=0`), or an additive effect of haplotypes (`sm=1`): ``` r designftypes <- function(x,sm=0) { hap1=x[1] hap2=x[2] if (sm==0) y <- 1*( (hap1==types) | (hap2==types)) if (sm==1) y <- 1*(hap1==types) + 1*(hap2==types) return(y) } ``` To fit the model we first construct a time design matrix (named `X`) and supply the haplotype distributions for each id: ``` r haploX$time <- haploX$times Xdes <- model.matrix(~factor(time),haploX) colnames(Xdes) <- paste("X",1:ncol(Xdes),sep="") X <- dkeep(haploX,~id+y+time) X <- cbind(X,Xdes) Haplos <- dkeep(ghaplos,~id+"haplo*"+p) desnames=paste("X",1:6,sep="") # six X's related to 6 cycles head(X) #> id y time X1 X2 X3 X4 X5 X6 #> 1 1 0 1 1 0 0 0 0 0 #> 2 1 0 2 1 1 0 0 0 0 #> 3 1 0 3 1 0 1 0 0 0 #> 4 1 0 4 1 0 0 1 0 0 #> 5 1 0 5 1 0 0 0 1 0 #> 6 1 0 6 1 0 0 0 0 1 ``` We can now fit the model with the design given by the design function: ``` r out <- haplo_surv_discrete(X=X,y="y",time.name="time", Haplos=Haplos,desnames=desnames,designfunc=designftypes) names(out$coef) <- c(desnames,types) out$coef #> X1 X2 X3 X4 X5 X6 #> -1.82153345 -0.61608261 -0.17143057 -1.27152045 -0.28635976 -0.19349091 #> DCGCGCTCACG DTCCGCTGACG ITCAGTTGACG ITCCGCTGAGG #> 0.79753613 0.65747412 0.06119231 0.31666905 summary(out) #> Estimate Std.Err 2.5% 97.5% P-value #> X1 -1.82153 0.1619 -2.13892 -1.5041 2.355e-29 #> X2 -0.61608 0.1895 -0.98748 -0.2447 1.149e-03 #> X3 -0.17143 0.1799 -0.52398 0.1811 3.406e-01 #> X4 -1.27152 0.2631 -1.78719 -0.7559 1.346e-06 #> X5 -0.28636 0.2030 -0.68425 0.1115 1.584e-01 #> X6 -0.19349 0.2134 -0.61184 0.2249 3.647e-01 #> DCGCGCTCACG 0.79754 0.1494 0.50465 1.0904 9.445e-08 #> DTCCGCTGACG 0.65747 0.1621 0.33971 0.9752 5.007e-05 #> ITCAGTTGACG 0.06119 0.2145 -0.35931 0.4817 7.755e-01 #> ITCCGCTGAGG 0.31667 0.1361 0.04989 0.5834 1.999e-02 ``` Haplotypes `"DCGCGCTCACG"` and `"DTCCGCTGACG"` give an increased hazard of pregnancy. The data were generated with these true coefficients: ``` r tcoef=c(-1.93110204,-0.47531630,-0.04118204,-1.57872602,-0.22176426,-0.13836416, 0.88830288,0.60756224,0.39802821,0.32706859) cbind(out$coef,tcoef) #> tcoef #> X1 -1.82153345 -1.93110204 #> X2 -0.61608261 -0.47531630 #> X3 -0.17143057 -0.04118204 #> X4 -1.27152045 -1.57872602 #> X5 -0.28635976 -0.22176426 #> X6 -0.19349091 -0.13836416 #> DCGCGCTCACG 0.79753613 0.88830288 #> DTCCGCTGACG 0.65747412 0.60756224 #> ITCAGTTGACG 0.06119231 0.39802821 #> ITCCGCTGAGG 0.31666905 0.32706859 ``` The fitted design matrix can be found in the output: ``` r head(out$X,10) #> X1 X2 X3 X4 X5 X6 haplo1 haplo2 haplo3 haplo4 #> 1 1 0 0 0 0 0 0 0 0 0 #> 2 1 1 0 0 0 0 0 0 0 0 #> 3 1 0 1 0 0 0 0 0 0 0 #> 4 1 0 0 1 0 0 0 0 0 0 #> 5 1 0 0 0 1 0 0 0 0 0 #> 6 1 0 0 0 0 1 0 0 0 0 #> 8 1 0 0 0 0 0 0 0 1 0 #> 10 1 1 0 0 0 0 0 0 1 0 #> 12 1 0 1 0 0 0 0 0 1 0 #> 14 1 0 0 1 0 0 0 0 1 0 ``` SessionInfo ============ ``` r sessionInfo() #> R version 4.6.0 (2026-04-24) #> Platform: x86_64-pc-linux-gnu #> Running under: Ubuntu 24.04.4 LTS #> #> Matrix products: default #> BLAS: /home/kkzh/.asdf/installs/r/4.6.0/lib/R/lib/libRblas.so #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0 #> #> locale: #> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 #> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> time zone: Europe/Copenhagen #> tzcode source: system (glibc) #> #> attached base packages: #> [1] stats graphics grDevices utils datasets methods base #> #> other attached packages: #> [1] timereg_2.0.7 survival_3.8-6 mets_1.3.10 #> #> loaded via a namespace (and not attached): #> [1] cli_3.6.6 knitr_1.51 rlang_1.2.0 #> [4] xfun_0.57 otel_0.2.0 future.apply_1.20.2 #> [7] listenv_0.10.1 lava_1.9.1 stats4_4.6.0 #> [10] grid_4.6.0 evaluate_1.0.5 yaml_2.3.12 #> [13] mvtnorm_1.3-7 numDeriv_2016.8-1.1 compiler_4.6.0 #> [16] codetools_0.2-20 Rcpp_1.1.1-1.1 ucminf_1.2.3 #> [19] future_1.70.0 lattice_0.22-9 digest_0.6.39 #> [22] parallelly_1.47.0 parallel_4.6.0 splines_4.6.0 #> [25] Matrix_1.7-5 tools_4.6.0 RcppArmadillo_15.2.6-1 #> [28] globals_0.19.1 ```