Version: 2.5.1
Title: Discrete Time Survival Analysis
Date: 2026-04-28
Description: Provides data transformations, estimation utilities, predictive evaluation measures and simulation functions for discrete time survival analysis.
Depends: R (≥ 3.5.0), treeClust
Imports: mvtnorm, mgcv, data.table, Rdpack, VGAM, gee, rpart, ranger, mvnfast, utils, rpart.plot, moments, lmtest, lme4, car, gbutils
RdMacros: Rdpack
Suggests: Matrix, matrixcalc, numDeriv, caret, Ecdat, pec, survival, nnet, Hmisc
Encoding: UTF-8
License: GPL-3
LazyLoad: yes
RoxygenNote: 7.3.2
NeedsCompilation: no
Packaged: 2026-04-28 16:39:43 UTC; peter
Author: Thomas Welchowski [aut, cre], Moritz Berger [aut], David Koehler [aut], Matthias Schmid [aut]
Maintainer: Thomas Welchowski <t.welchowski@psychologie.uzh.ch>
Repository: CRAN
Date/Publication: 2026-04-29 09:00:09 UTC

Discrete Survival Analysis

Description

Includes functions for data transformations, estimation, evaluation and simulation of discrete survival analysis. Some important functions are listed below:

Data preprocessing

Model fitting

Model evaluation

Post-processing

Details

"DataShort" format is defined as data without repeated measurements. "DataSemiLong" format consists of repeated measurements, but there are gaps between the discrete time intervals. "DataLong" format is expanded to include all time intervals up to the last observation per individual.

Package: discSurv
Type: Package
Version: 2.5.1
Date: 2026-04-28
License: GPL-3

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

Moritz Berger moritz.berger@zi-mannheim.de

David Koehler koehler@imbie.uni-bonn.de

Matthias Schmid matthias.schmid@imbie.uni-bonn.de

References

Berger M, Schmid M (2018). “Semiparametric regression for discrete time-to-event data.” Statistical Modelling, 18(3), 322-345.

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.


Adjusted Deviance Residuals

Description

Calculates the adjusted deviance residuals in short format for arbitrary prediction models. The adjusted deviance residuals should be approximately normal distributed, in the case of a well fitting model.

Usage

adjDevResid(dataLong, hazards)

## S3 method for class 'discSurvAdjDevResid'
plot(x, ...)

Arguments

dataLong

Data set in long format (class "data.frame").

hazards

Estimated discrete hazards of the data in long format (class "numeric"). Hazard rates are probabilities and therefore restricted to the interval [0, 1].

x

Object of class "discSurvAdjDevResid"

Details

Is called implicitly by using function qqnorm on an object of class "discSurvAdjDevResid". It plots a qqplot against the normal distribution. If the model fits the data well, it should be approximately normal distributed.

Value

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G (2012). Regression for Categorical Data. Cambridge University Press.

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

intPredErr, predErrCurve

Examples


library(survival)

# Transform data to long format
heart[, "stop"] <- ceiling(heart[, "stop"])
set.seed(0)
Indizes <- sample(unique(heart$id), 25)
randSample <- heart[unlist(sapply(1:length(Indizes), 
function(x) which(heart$id == Indizes[x]))),]
heartLong <- dataLongTimeDep(dataSemiLong = randSample, 
timeColumn = "stop", eventColumn = "event", idColumn = "id", timeAsFactor = FALSE)

# Fit a generalized, additive model and predict discrete hazards on data in long format
library(mgcv)
gamFit <- gam(y ~ timeInt + surgery + transplant + s(age), data = heartLong, family = "binomial")
hazPreds <- predict(gamFit, type = "response")

# Calculate adjusted deviance residuals
devResiduals <- adjDevResid(dataLong = heartLong, hazards = hazPreds)$Output$AdjDevResid
devResiduals


Concordance Index

Description

Calculates the concordance index for discrete survival models, which does not depend on time. This is the probability that, for a pair of randomly chosen comparable samples, the sample with the higher risk prediction will experience an event before the other sample or belongs to a higher binary class. Estimation of the concordance index requires estimation of the time dependent area under the curve (AUC) with time specific sensitivities and specificities. Additional investigation of these measures allows a more fine grained view on the performance of the discrete survival model.

Usage

cIndex(marker, testTime, testEvent, trainTime, trainEvent)

## S3 method for class 'discSurv_cIndex'
plot(x, selectPlot = c("tpr", "fpr", "auc", "margProb", "margSurv"), ...)

## S3 method for class 'discSurvTprUno'
plot(x, ...)

## S3 method for class 'discSurvFprUno'
plot(x, ...)

## S3 method for class 'discSurvAucUno'
plot(x, ...)

Arguments

marker

Gives the predicted values of the linear predictor of a regression model (class "numeric"). May also be on the response scale.

testTime

New time intervals in the test data (class "integer").

testEvent

Event indicators in the test data (binary vector).

trainTime

Time intervals in the training data (class "integer").

trainEvent

Event indicators in the training data (binary vector).

x

Object of class "discSurvAucUno"

...

Specification of additional arguments in function plot.

Details

Additional measures are available in the attribute "addMeasures" as list with structure

For each list element tprAllTime, fprAllTime, aucAllTime, MargProb, MargSurv are plot methods available.

Value

Value of discrete concordance index between zero and one (class "numeric").

Note

It is assumed that all time points up to the last observed interval \left[ a_{q-1}, a_q \right) are available.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Heagerty PJ, Zheng Y (2005). “Survival Model Predictive Accuracy and ROC Curves.” Biometrics, 61(1), 92-105.

Schmid M, Tutz G, Welchowski T (2018). “Discrimination Measures for Discrete Time-to-Event Predictions.” Econometrics and Statistics, 7(0), 153-164.

Uno H, Cai T, Tian L, Wei LJ (2012). “Evaluating Prediction Rules fort-Year Survivors With Censored Regression Models.” Journal of the American Statistical Association, 102(478), 527-537.

Zadeh SG, Schmid M (2021). “Bias in Cross-Entropy-Based Training of Deep Survival Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9), 3126-3137.

See Also

cIndex

Examples


##################################################
# Example with unemployment data and prior fitting

library(Ecdat)
library(caret)
library(mgcv)
data(UnempDur)
summary(UnempDur$spell)
# Extract subset of data
set.seed(635)
IDsample <- sample(1:dim(UnempDur)[1], 100)
UnempDurSubset <- UnempDur [IDsample, ]
set.seed(-570)
TrainingSample <- sample(1:100, 75)
UnempDurSubsetTrain <- UnempDurSubset [TrainingSample, ]
UnempDurSubsetTest <- UnempDurSubset [-TrainingSample, ]

# Convert to long format
UnempDurSubsetTrainLong <- dataLong(dataShort = UnempDurSubsetTrain, 
timeColumn = "spell", eventColumn = "censor1")

# Estimate gam with smooth baseline
gamFit <- gam(formula = y ~ s(I(as.numeric(as.character(timeInt)))) + 
s(age) + s(logwage), data = UnempDurSubsetTrainLong, family = binomial())
gamFitPreds <- predict(gamFit, newdata = cbind(UnempDurSubsetTest, 
timeInt = UnempDurSubsetTest$spell))

# Evaluate C-Index based on short data format
cIndex1 <- cIndex(marker = gamFitPreds, 
testTime = UnempDurSubsetTest$spell, 
testEvent = UnempDurSubsetTest$censor1, 
trainTime = UnempDurSubsetTrain$spell, 
trainEvent = UnempDurSubsetTrain$censor1)
cIndex1[1]

# Plot time-dependent AUC
plot(cIndex1, selectPlot="auc")

# Plot time-dependent sensitivities
plot(cIndex1, selectPlot="tpr")

# Plot time-dependent specificities
plot(cIndex1, selectPlot="fpr")

# Plot marginal probabilities
plot(cIndex1, selectPlot="margProb")

# Plot survival curve
plot(cIndex1, selectPlot="margSurv")

#####################################
# Example National Wilm's Tumor Study

library(survival)
head(nwtco)
summary(nwtco$rel)
# Select subset
set.seed(-375)
Indices <- sample(1:dim(nwtco)[1], 500)
nwtcoSub <- nwtco [Indices, ]

# Convert time range to 30 intervals
intLim <- quantile(nwtcoSub$edrel, prob = seq(0, 1, length.out = 30))
intLim [length(intLim)] <- intLim [length(intLim)] + 1
nwtcoSubTemp <- contToDisc(dataShort = nwtcoSub, timeColumn = "edrel", intervalLimits = intLim)
nwtcoSubTemp$instit <- factor(nwtcoSubTemp$instit)
nwtcoSubTemp$histol <- factor(nwtcoSubTemp$histol)
nwtcoSubTemp$stage <- factor(nwtcoSubTemp$stage)

# Split in training and test sample
set.seed(-570)
TrainingSample <- sample(1:dim(nwtcoSubTemp)[1], round(dim(nwtcoSubTemp)[1]*0.75))
nwtcoSubTempTrain <- nwtcoSubTemp [TrainingSample, ]
nwtcoSubTempTest <- nwtcoSubTemp [-TrainingSample, ]
# Convert to long format
nwtcoSubTempTrainLong <- dataLong(dataShort = nwtcoSubTempTrain,
                                 timeColumn = "timeDisc", eventColumn = "rel", timeAsFactor=TRUE)

# Estimate continuation ratio model
inputFormula <- y ~ timeInt + histol + instit + stage
glmFit <- glm(formula = inputFormula, data = nwtcoSubTempTrainLong, family = binomial())

# Event and time points not available for new patients
# -> Average out predictions with marginal distribution of discrete time intervals
# based on training data
margDiscT <- prop.table(table(factor(nwtcoSubTempTrain$timeDisc,
                                    levels=1:max(nwtcoSubTempTrain$timeDisc))))
# -> First interval was not observed as endpoint in training data

# Integrate out time from predictions
# Convert to long format
nwtcoSubTempTestLong <- dataLong(dataShort = nwtcoSubTempTest,
                                timeColumn = "timeDisc", eventColumn = "rel", timeAsFactor=TRUE,
                                aggTimeFormat = TRUE, lastTheoInt = max(nwtcoSubTempTrain$timeDisc))
linPreds_time <- predict(glmFit, newdata = nwtcoSubTempTestLong)
linPreds <- aggregate(time ~ ID,
                     data=data.frame(time=linPreds_time, ID=nwtcoSubTempTestLong$obj),
                     FUN = function(x) sum(x * margDiscT))[,2]

# Evaluate C-Index based on short data format
cIndex(marker = linPreds,
      testTime = as.numeric(as.character(nwtcoSubTempTest$timeDisc)),
      testEvent = nwtcoSubTempTest$rel,
      trainTime = as.numeric(as.character(nwtcoSubTempTrain$timeDisc)),
      trainEvent = nwtcoSubTempTrain$rel)
# 0.6397157

# Comparison to model without covariates
inputFormula2 <- y ~ timeInt
glmFit2 <- glm(formula = inputFormula2, data = nwtcoSubTempTrainLong, family = binomial())

# Integrate out time from predictions
linPreds_time2 <- predict(glmFit2, newdata = nwtcoSubTempTestLong)
linPreds2 <- aggregate(time ~ ID,
                      data=data.frame(time=linPreds_time2, ID=nwtcoSubTempTestLong$obj),
                      FUN = function(x) sum(x * margDiscT))[,2]

# Evaluate C-Index based on short data format
cIndex(marker = linPreds2,
      testTime = as.numeric(as.character(nwtcoSubTempTest$timeDisc)),
      testEvent = nwtcoSubTempTest$rel,
      trainTime = as.numeric(as.character(nwtcoSubTempTrain$timeDisc)),
      trainEvent = nwtcoSubTempTrain$rel)
# 0.5 -> Model not informative



Discrete Concordance Index For Competing Risks

Description

Estimates the discrete concordance index in the case of competing risks.

Usage

cIndexCompRisks(markers, testTime, testEvents, trainTime, trainEvents)

Arguments

markers

Predictions on the test data with model fitted on training data ("numeric matrix"). Predictions are stored in the rows and the number of columns equal to the number of events.

testTime

New time intervals in the test data (class "integer").

testEvents

New event indicators (0 or 1) in the test data ("binary matrix"). Number of columns are equal to the number of events.

trainTime

Time intervals in the training data (class "integer").

trainEvents

Event indicators (0 or 1) in the training data ("binary matrix"). Number of columns are equal to the number of events.

Value

Value of discrete concordance index between zero and one (class "numeric").

Note

It is assumed that all time points up to the last observed interval \left[ a_{q-1}, a_q \right) are available.

Author(s)

Moritz Berger moritz.berger@zi-mannheim.de

References

Heyard R, Timsit J, Held L, consortium C (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62(3), 643-657.

See Also

cIndex

Examples


##################################################
# Example with unemployment data and prior fitting

library(Ecdat)
data(UnempDur)
summary(UnempDur$spell)
# Extract subset of data
set.seed(635)
IDsample <- sample(1:dim(UnempDur)[1], 100)
UnempDurSubset <- UnempDur [IDsample, ]
set.seed(-570)
TrainingSample <- sample(1:100, 75)
UnempDurSubsetTrain <- UnempDurSubset [TrainingSample, ]
UnempDurSubsetTest <- UnempDurSubset [-TrainingSample, ]

# Convert to long format
UnempDurSubsetTrainLong <- dataLongCompRisks(dataShort = UnempDurSubsetTrain, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"), timeAsFactor = TRUE)

# Estimate continuation ratio model with logit link
vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, 
data = UnempDurSubsetTrainLong, family=VGAM::multinomial(refLevel = "e0"))

gamFitPreds <- VGAM::predictvglm(vglmFit , newdata = cbind(UnempDurSubsetTest, 
timeInt = as.factor(UnempDurSubsetTest$spell)))

# Evaluate C-Index based on short data format
cIndexCompRisks(markers = gamFitPreds, 
testTime = UnempDurSubsetTest$spell, 
testEvents = UnempDurSubsetTest[, c("censor1", "censor4")], 
trainTime = UnempDurSubsetTrain$spell, 
trainEvents = UnempDurSubsetTrain[, c("censor1", "censor4")])


Calibration Plot

Description

Generates a plot to assess calibration of discrete survival models visually. Estimated hazard rates are grouped based on empirical quantiles and compared to observed hazard rates. For a well calibrated model the weighted mean of observed events should match the weighted mean of observed hazards.

Usage

calPlot(
  estHazards,
  testDataLong,
  nGroups = "auto",
  weights = NULL,
  laplaceSmoothPrior = 0,
  ...
)

Arguments

estHazards

Estimated hazards for all observations of data set in longe format (class "numeric"). In case of cause-specific competings risks each column corresponds to the estimated hazards for one cause (class c("matrix" "array")).

testDataLong

Test data in long format to asses calibration (class "data.frame").

nGroups

Number of groups for partitioning of estimated hazards (class "character" or class "numeric"). Default value "auto" uses a heuristic to determine the number of groups.

weights

Optional weights for each observation in long format (class "integer"). Default value corresponds to equal weights of one for each observation.

laplaceSmoothPrior

Specifies the prior assumption to apply additive Laplace smoothing or estimated hazards (class "numeric). It assumes a prior Bernoulli distribution with probability 1/2 to smooth estimated hazards in argument estHazards and observed events obsEvents. The smoothed hazard rates are between estimated and theoretical values. Default value of zero corresponds to no smoothing. Higher values give more weight to the prior distribution.

...

Specification of further arguments passed to function plot.

Details

The calibration plot can be calculated for training or test data. The number of groups of the compared hazard rates are determined by the heuristic Sturges rule, modified to include skewness, based on theory of information coding.

Note

The calibration plot assumes that the data was preprocessed to long data format. In case of subdistribution models the mean is weighted by the supplied weights. In case of cause-specific competing risks each event is plotted separately.

Author(s)

Thomas Welchowski and Moritz Berger

References

Berger M, Schmid M (2022). “Assessing the calibration of subdistribution hazard models in discrete time.” The Canadian Journal of Statistics, 50(2).

Doane DP (1976). “Aesthetic Frequency Classifications.” The American Statistician, 30(4), 181-183.

See Also

dataLong, dataLongSubDist, estReg, estRegSubDist

Examples


###################################
# Data preprocessing

# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDurTrain <- UnempDur[1:250, ]
SubUnempDurTest <- UnempDur[251:500, ]

# Transformation to long format
SubUnempDurTest_Long_TimeFactor <- dataLong(
 dataShort=SubUnempDurTest, timeColumn="spell", 
 eventColumn="censor1", timeAsFactor=TRUE)
SubUnempDurTest_LongSubDist <- dataLongSubDist(
 dataShort=SubUnempDurTest, timeColumn="spell", 
 eventColumns=c("censor1", "censor4"), eventFocus="censor1",
 timeAsFactor=TRUE)
SubUnempDurTest_LongCompRisks <- dataLongCompRisks(
 dataShort=SubUnempDurTest, timeColumn="spell", 
 eventColumns=c("censor1", "censor2"))

###############################################
# Calibration of plot of basic regression model 

# Estimate discrete survival continuation ratio model
estRegModel <- estReg(dataShort = SubUnempDurTrain, 
                     dataTransform = "dataLong", 
                     formulaVariable =~ timeInt + age + ui + logwage * ui, 
                     eventColumn = "censor1", timeColumn = "spell", timeAsFactor=TRUE)

preds <- predict(estRegModel, 
                newdata = SubUnempDurTest_Long_TimeFactor, type="response")
calPlot(estHazards=preds, 
       testDataLong=SubUnempDurTest_Long_TimeFactor)

###################################################
# Calibration plot of subdistribution hazards model

# Estimation of subdistribution hazard model
estRegModel <- estRegSubDist(dataShort = SubUnempDurTrain, 
                            formulaVariable =~ timeInt + age + ui + logwage * ui, 
                            eventColumns = c("censor1", "censor2", "censor3", "censor4"), 
                            eventFocus="censor1", timeColumn = "spell", timeAsFactor=TRUE)

# Visualization of calibration with test data
preds <- predict(estRegModel, 
               newdata = SubUnempDurTest_LongSubDist, type="response")
subDistWtest <- dataLongSubDist(dataShort=SubUnempDurTest, 
                               timeColumn="spell", eventColumns=c("censor1", "censor4"), 
                              eventFocus="censor1")$subDistWeights
calPlot(estHazards=preds, 
               testDataLong=SubUnempDurTest_LongSubDist, 
               weights=subDistWtest)

#################################################
# Calibration plot cause-specific competing risks

# Estimation
estRegModel <- estRegSmoothCompRisks(dataShort=SubUnempDurTrain, 
                                    dataTransform = "dataLongCompRisks",
                                    formulaVariable =~ s(timeInt) + age + ui + logwage * ui, 
                                    timeColumn="spell", eventColumns=c("censor1", "censor4"))

# Visualization of calibration with test data
preds <- predict(estRegModel, 
                newdata = SubUnempDurTest_LongCompRisks, type="response")
calPlot(estHazards=preds, 
       testDataLong=SubUnempDurTest_LongCompRisks)


Continuous To Discrete Transformation

Description

Discretizes continuous time variable into a specified grid of censored data for discrete survival analysis. It is a data preprocessing step, before the data can be extendend in long format and further analysed with discrete survival models.

Usage

contToDisc(
  dataShort,
  timeColumn,
  intervalLimits,
  equi = FALSE,
  timeAsFactor = FALSE
)

Arguments

dataShort

Original data in short format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

timeColumn

Name of the column of discrete survival times (class "character").

intervalLimits

Right interval borders (class "numeric"), e. g. if the intervals are [0, a_1), [a_1, a_2), [a_2, a_max), then intervalLimits = c(a_1, a_2, a_max)

equi

Specifies if argument intervalLimits should be interpreted as number of equidistant intervals (class "logical").

timeAsFactor

Specifies if the computed discrete time intervals should be converted to a categorical variable (class "logical"). Default is FALSE. In the default settings the discret time intervals are treated as quantitative (class "numeric").

Value

Gives the data set expanded with a first column "timeDisc". This column includes the discrete time intervals (class "factor").

Note

In discrete survival analysis the survival times have to be categorized in time intervals. Therefore this function is required, if there are observed continuous survival times. Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

dataLong, dataLongTimeDep, dataLongCompRisks

Examples


# Example copenhagen stroke study data
library(pec)
data(cost)
head(cost)

# Convert observed times to months
# Right borders of intervals [0, a_1), [a_1, a_2), ... , [a_{\max-1}, a_{\max})
IntBorders <- 1:ceiling(max(cost$time)/30)*30

# Select subsample
subCost <- cost [1:100, ]
CostMonths <- contToDisc(dataShort=subCost, timeColumn = "time", intervalLimits = IntBorders)
head(CostMonths)

# Select subsample giving number of equidistant intervals
CostMonths <- contToDisc(dataShort = subCost, timeColumn = "time", intervalLimits = 10, equi = TRUE)
head(CostMonths)


GEE Covariance Of All Events For Competing Risks

Description

Estimates covariance of estimated parameters of all competing events generalized estimation equation models using sandwich approach.

Usage

covarGEE(modelEst, printProgess = FALSE)

Arguments

modelEst

Discrete time competing risks GEE model prediction model (class "dCRGEE").

printProgess

Should progress status be printed during estimation (class "logical")? Default is FALSE.

Value

Returns symmetric matrix of rows and columns dimension "number of competing risks" * "number of regression parameters" ("numeric matrix").

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Lee M, Feuer EJ, Fine JP (2018). “On the analysis of discrete time competing risks data.” Biometrics, 74(4), 1468-1481.

See Also

estCompRisksGEE, dataLongCompRisks, dataLongCompRisksTimeDep, geeglm

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur [1:100, ]

# Estimate GEE models for all events
estGEE <- estCompRisksGEE(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2"), timeColumn = "spell")

## Not run: 
# Estimate covariance matrix of estimated parameters and competing events
estCovar <- covarGEE(modelEst=estGEE)
estCovar

# Covariances of estimated parameters of one event equal the diagonal blocks
lengthParameters <- length(estGEE[[1]]$coefficients)
noCompEvents <- length(estGEE)
meanAbsError <- rep(NA, noCompEvents)
for( k in 1:noCompEvents ){
  
  relInd <- (1 + (k-1) * lengthParameters) : (k * lengthParameters)
  meanAbsError[k] <- mean(abs(estCovar[relInd, relInd] - estGEE[[k]]$robust.variance))
  
}
mean(meanAbsError) 
# -> Covariance estimates within each event are equal to diagonal blocks in 
# complete covariance matrix with very small differences due to numerical accuracy.

## End(Not run)


Crash 2 Competing Risk Data

Description

Adapted version of the crash2 trial data as availlable in the package Hmisc. Both death or survival and main cause of death are included. Death times are discretized into days. Included covariates are sex and age of patient, elapsed time between injury and hospitalization, type of injury, systolic blood pressure, heart rate, respiratory rate, central capillary refill time and total glascow coma score.

Usage

data(crash2)

Author(s)

David Koehler koehler@imbie.uni-bonn.de

Source

getHdata


Data Censoring Transformation

Description

Function for transformation of discrete survival times in censoring encoding. The original data is expanded to include the censoring process. Alternatively the long data format can also be augmented. With the new generated variable "yCens", the discrete censoring process can be analyzed instead of the discrete survival process. In discrete survival analysis this information is used to constructs weights for predictive evaluation measures. It is applicable in single event survival analysis.

Usage

dataCensoring(
  dataShort,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  timeColumn,
  shortFormat = TRUE
)

Arguments

dataShort

Original data set in short format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

eventColumns

Name of event columns (class "character"). The event columns have to be in binary format. If the sum of all events equals zero in a row, then this observation is interpreted as censored.

eventColumnsAsFactor

Should the argument eventColumns be interpreted as column name of a factor variable (class "logical")? Default is FALSE.

timeColumn

Name of column with discrete time intervals (class "character").

shortFormat

Is the supplied data set dataShort not preprocessed with function dataLong (class "logical")? Default is TRUE. If shortFormat=FALSE then it is assumed that the data set was augmented with function dataLong.

Value

Original data set as argument dataShort, but with added censoring process as first variable in column "yCens".

Note

Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Fahrmeir L (2005). “Discrete Survival-Time Models.” In Encyclopedia of Biostatistics, chapter Survival Analysis. John Wiley & Sons.

Thompson Jr. WA (1977). “On the Treatment of Grouped Observations in Life Studies.” Biometrics, 33(3), 463-470.

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

contToDisc, dataLong, dataLongTimeDep, dataLongCompRisks

Examples


library(pec)
data(cost)
head(cost)
IntBorders <- 1:ceiling(max(cost$time)/30)*30
subCost <- cost [1:100, ]

# Convert from days to months
CostMonths <- contToDisc(dataShort=subCost, timeColumn="time", intervalLimits=IntBorders)
head(CostMonths)

# Generate censoring process variable in short format
CostMonthsCensorShort <- dataCensoring (dataShort = CostMonths, 
eventColumns = "status", timeColumn = "time", shortFormat = TRUE)
head(CostMonthsCensorShort)

################################
# Example with long data format
library(pec)
data(cost)
head(cost)
IntBorders <- 1:ceiling(max(cost$time)/30)*30
subCost <- cost [1:100, ]

# Convert from days to months
CostMonths <- contToDisc(dataShort = subCost, timeColumn = "time", intervalLimits = IntBorders)
head(CostMonths)

# Convert to long format based on months
CostMonthsLong <- dataLong(dataShort = CostMonths, timeColumn = "timeDisc", eventColumn = "status")
head(CostMonthsLong, 20)

# Generate censoring process variable
CostMonthsCensor <- dataCensoring (dataShort = CostMonthsLong, timeColumn = "timeInt", 
shortFormat = FALSE)
head(CostMonthsCensor)
tail(CostMonthsCensor [CostMonthsCensor$obj==1, ], 10)
tail(CostMonthsCensor [CostMonthsCensor$obj==3, ], 10)


Data Long Transformation

Description

Transform data from short format into long format for discrete survival analysis and right censoring. Data is assumed to include no time varying covariates, e. g. no follow up visits are allowed. It is assumed that the covariates stay constant over time, in which no information is available.

Usage

dataLong(
  dataShort,
  timeColumn,
  eventColumn,
  timeAsFactor = FALSE,
  remLastInt = FALSE,
  aggTimeFormat = FALSE,
  lastTheoInt = NULL
)

Arguments

dataShort

Original data in short format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumn

Column name of the event indicator (class "character"). It is required that this is a binary variable with 1=="event" and 0=="censored".

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

remLastInt

Should the last theoretical interval be removed in long format (class "logical")? Default setting (FALSE) is no deletion. This is only important, if the short format data includes the last theoretic interval \left[ a_q, \infty \right). There are only events in the last theoretic interval, so the discrete hazard is always one and these observations have to be excluded for estimation.

aggTimeFormat

Instead of the usual long format, should every observation have all time intervals (class "logical")? Default is standard long format (FALSE). In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index.

lastTheoInt

Gives the number of the last theoretic interval (class "integer"). Only used, if argument aggTimeFormat is set to TRUE.

Details

If the data has continuous survival times, the response may be transformed to discrete intervals using function contToDisc. If the data set has time varying covariates the function dataLongTimeDep should be used instead. In the case of competing risks and no time varying covariates see function dataLongCompRisks.

Value

Original data.frame with three additional columns:

Note

Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

Matthias Schmid matthias.schmid@imbie.uni-bonn.de

References

Fahrmeir L (2005). “Discrete Survival-Time Models.” In Encyclopedia of Biostatistics, chapter Survival Analysis. John Wiley & Sons.

Schmid M, Welchowski T, Wright MN, Berger M (2020). “Discrete-time survival forests with Hellinger distance decision trees.” Data Mining and Knowledge Discovery, 34(0), 812-832.

Spuck N, Schmid M, Heim N, Klarmann-Schulz U, Hoerauf A, Berger M (2023). “Flexible tree-structured regression models for discrete event times.” Statistics and Computing, 33(20), 1-21.

Thompson Jr. WA (1977). “On the Treatment of Grouped Observations in Life Studies.” Biometrics, 33(3), 463-470.

See Also

contToDisc, dataLongTimeDep, dataLongCompRisks

Examples


# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]
head(subUnempDur)

# Convert to long format
UnempLong <- dataLong (dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1")
head(UnempLong, 20)

# Is there exactly one observed event of y for each person?
splitUnempLong <- split(UnempLong, UnempLong$obj)
all(sapply(splitUnempLong, function (x) sum(x$y))==subUnempDur$censor1) # TRUE

# Second example: Acute Myelogenous Leukemia survival data
library(survival)
head(leukemia)
leukLong <- dataLong(dataShort = leukemia, timeColumn = "time", 
eventColumn = "status", timeAsFactor=TRUE)
head(leukLong, 30)

# Estimate discrete survival model
estGlm <- glm(formula = y ~ timeInt + x, data=leukLong, family = binomial())
summary(estGlm)

# Estimate survival curves for non-maintained chemotherapy
newDataNonMaintained <- data.frame(timeInt = factor(1:161), x = rep("Nonmaintained"))
predHazNonMain <- predict(estGlm, newdata = newDataNonMaintained, type = "response")
predSurvNonMain <- cumprod(1-predHazNonMain)

# Estimate survival curves for maintained chemotherapy
newDataMaintained <- data.frame(timeInt = factor(1:161), x = rep("Maintained"))
predHazMain <- predict(estGlm, newdata = newDataMaintained, type = "response")
predSurvMain <- cumprod(1-predHazMain)

# Compare survival curves
plot(x = 1:50, y = predSurvMain [1:50], xlab = "Time", ylab = "S(t)", las = 1, 
type = "l", main = "Effect of maintained chemotherapy on survival of leukemia patients")
lines(x = 1:161, y = predSurvNonMain, col = "red")
legend("topright", legend = c("Maintained chemotherapy", "Non-maintained chemotherapy"), 
col = c("black", "red"), lty = rep(1, 2))
# The maintained therapy has clearly a positive effect on survival over the time range

##############################################
# Simulation
# Single event in case of right-censoring

# Simulate multivariate normal distribution
library(discSurv)
library(mvnfast)
set.seed(-1980)
X <- mvnfast::rmvn(n = 1000, mu = rep(0, 10), sigma = diag(10))


# Specification of discrete hazards with 11 theoretical intervals
betaCoef <- seq(-1, 1, length.out = 11)[-6]
timeInt <- seq(-1, 1, length.out = 10)
linPred <- c(X %*% betaCoef)
hazTimeX <- cbind(sapply(1:length(timeInt), 
                        function(x) exp(linPred+timeInt[x]) / (1+exp(linPred+timeInt[x])) ), 1)


# Simulate discrete survival and censoring times in 10 observed intervals
discT <- rep(NA, dim(hazTimeX)[1])
discC <- rep(NA, dim(hazTimeX)[1])
for( i in 1:dim(hazTimeX)[1] ){
 
 discT[i] <- sample(1:11, size = 1, prob = estMargProb(haz=hazTimeX[i, ]))
 discC[i] <- sample(1:11, size = 1, prob = c(rep(1/11, 11)))
}


# Calculate observed times, event indicator and specify short data format
eventInd <- discT <= discC
obsT <- ifelse(eventInd, discT, discC)
eventInd[obsT == 11] <- 0
obsT[obsT == 11] <- 10
simDatShort <- data.frame(obsT = obsT, event = as.numeric(eventInd), X)


# Convert data to discrete data long format
simDatLong <- dataLong(dataShort = simDatShort, timeColumn = "obsT", eventColumn = "event",
timeAsFactor=TRUE)


# Estimate discrete-time continuation ratio model
formSpec <- as.formula(paste("y ~ timeInt + ", 
                            paste(paste("X", 1:10, sep=""), collapse = " + "), sep = ""))
modelFit <- glm(formula = formSpec, data = simDatLong, family = binomial(link = "logit"))
summary(modelFit)


# Compare estimated to true coefficients
coefModel <- coef(modelFit)
MSE_covariates <- mean((coefModel[11:20]-timeInt)^2)
MSE_covariates
# -> Estimated coefficients are near true coefficients


Data Long Transformation For Competing Risks

Description

Transforms short data format to long format for discrete survival modelling in the case of competing risks with right censoring. It is assumed that the covariates are not time varying.

Usage

dataLongCompRisks(
  dataShort,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  timeAsFactor = FALSE,
  aggTimeFormat = FALSE,
  lastTheoInt = NULL,
  responseAsFactor = FALSE
)

Arguments

dataShort

Original data in short format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

timeColumn

Character giving the column name of the observed times (class "character"). It is required that the observed times are discrete (class "integer").

eventColumns

Column names of the event indicators without censoring column (class "character"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.

eventColumnsAsFactor

Should the argument eventColumns be interpreted as column name of a factor variable (class "logical")? Default is FALSE.

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the discrete time variable are treated as quantitative.

aggTimeFormat

Instead of the usual long format, should every observation have all time intervals (class "logical")? Default is standard long format. In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index.

lastTheoInt

Gives the number of the last theoretic interval (class "integer"). Only used, if aggTimeFormat is set to TRUE.

responseAsFactor

Should the response columns be given as factor (class "logical")? Default is FALSE.

Details

It is assumed, that only one event happens at a specific time point (competing risks). Either the observation is censored or one of the possible events takes place.

In contrast to continuous survival (see e. g. Surv) the start and stop time notation is not used here. In discrete time survival analysis the only relevant information is to use the stop time. Start time does not matter, because all discrete intervals need to be included in the long data set format to ensure consistent estimation. It is assumed that the supplied data set dataShort contains all repeated measurements of each cluster (e. g. persons). For further information see example Start-stop notation.

Value

Original data set in long format with additional columns

Note

Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Berger M, Kowark A, Rossaint R, Coburn M, Schmid M, POSE-study (2023). “Modeling Postoperative Mortality in Older Patients by Boosting Discrete-Time Competing Risks Models.” Journal of the American Statistical Association, 0(0), 1-11.

Berger M, Welchowski T, Schmitz-Valckenberg S, Schmid M (2019). “A classification tree approach for the modeling of competing risks in discrete time.” Advances in Data Analysis and Classification, 13(0), 965-990.

Narendranathan W, Stewart MB (1993). “Modelling the Probability of Leaving Unemployment: Competing Risks Models with Flexible Base-Line Hazards.” Journal of the Royal Statistical Society Series C, 42(1), 63-83.

Reinke C, Doblhammer G, Schmid M, Welchowski T (2023). “Dementia risk predictions from German claims data using methods of machine learning.” Alzheimers & Dementia, 19(2), 477-486.

Schmid M, Berger M (2020). “Competing risks analysis for discrete time-to-event data.” Computational Statistics, 13(5), 1-17.

Schmid M, Kuechenhoff H, Hoerauf A, Tutz G (2016). “A survival tree method for the analysis of discrete event times in clinical and epidemiological studies.” Statistics in Medicine, 35(5), 734-751.

Steele F, Goldstein H, Browne W (2004). “A general multilevel multistate competing risks model for event history data, with an application to a study of contraceptive use dynamics.” Statistical Modelling, 4(2), 145-159.

See Also

contToDisc, dataLongTimeDep, dataLongCompRisksTimeDep

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur [1:100, ]

# Convert competing risk data to long format
SubUnempDurLong <- dataLongCompRisks (dataShort = SubUnempDur, timeColumn = "spell", 
eventColumns = c("censor1", "censor2", "censor3", "censor4"))
head(SubUnempDurLong, 20)

# Fit multinomial logit model with VGAM package
# with one coefficient per response
library(VGAM)
multLogitVGM <- vgam(cbind(e0, e1, e2, e3, e4) ~ timeInt + ui + age + logwage,
                    family = multinomial(refLevel = 1), 
                    data = SubUnempDurLong)
coef(multLogitVGM)

# Alternative: Use nnet
# Convert response to factor
rawResponseMat <- SubUnempDurLong[, c("e0", "e1", "e2", "e3", "e4")]
NewFactor <- factor(unname(apply(rawResponseMat, 1, function(x) which(x == 1))), 
                    labels = colnames(rawResponseMat))

# Include recoded response in data
SubUnempDurLong <- cbind(SubUnempDurLong, NewResp = NewFactor)

# Construct formula of mlogit model
mlogitFormula <- formula(NewResp ~ timeInt + ui + age + logwage)

# Fit multinomial logit model
# with one coefficient per response
library(nnet)
multLogitNNET <- multinom(formula = mlogitFormula, data = SubUnempDurLong)
coef(multLogitNNET)

###########################################################
# Simulation
# Cause specific competing risks in case of right-censoring
# Discrete subdistribution hazards model

# Simulate covariates as multivariate normal distribution
library(mvnfast)
set.seed(1980)
X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4))

# Specification of two discrete cause specific hazards with four intervals
# Event 1
theoInterval <- 4
betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3]
timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1)
linPred_event1 <- c(X %*% betaCoef_event1)
# Event 2
betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3]
timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1)
linPred_event2 <- c(X %*% betaCoef_event2)
# Discrete cause specific hazards in last theoretical interval
theoHaz_event1 <- 0.5
theoHaz_event2 <- 0.5

haz_event1_X <- cbind(sapply(1:length(timeInt_event1), 
               function(x) exp(linPred_event1 + timeInt_event1[x]) / 
               (1 + exp(linPred_event1 + timeInt_event1[x]) + 
               exp(linPred_event2 + timeInt_event2[x])) ), theoHaz_event1)

haz_event2_X <- cbind(sapply(1:length(timeInt_event2), 
               function(x) exp(linPred_event2 + timeInt_event2[x]) / 
               (1 + exp(linPred_event1 + timeInt_event1[x]) + 
               exp(linPred_event2 + timeInt_event2[x]) ) ), theoHaz_event2)
allCauseHaz_X <- haz_event1_X + haz_event2_X


pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) ))


pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X)


survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1) + 1), 
                                                  size = 1, prob = pT_X[i, ]) )
censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], 
               prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), 
               replace = TRUE)


obsT <- ifelse(survT <= censT, survT, censT)
obsEvent <- rep(0, length(obsT))
obsEvent <- sapply(1:length(obsT), 
                  function(i) if(survT[i] <= censT[i]){
                    return(sample(x = c(1, 2), size=1, 
                    prob = c(pR_T_X_event1[i, obsT[i]  ], 
                    1 - pR_T_X_event1[i, obsT[i]  ]) ))
                  } else{
                    
                    return(0)
                  }
)


# Recode last interval to censored
lastInterval <- obsT == theoInterval
obsT[lastInterval] <- theoInterval - 1
obsEvent[lastInterval] <- 0
obsT <- factor(obsT)
obsEvent <- factor(obsEvent)

datShort <- data.frame(event = factor(obsEvent), time = obsT, X)
datLong <- dataLongCompRisks(dataShort = datShort, timeColumn = "time", 
                            eventColumns = "event", responseAsFactor = TRUE, 
                            eventColumnsAsFactor = TRUE, timeAsFactor = TRUE)


# Estimate discrete cause specific hazard model
library(VGAM)
estModel <- vglm(formula=responses ~ timeInt + X1 + X2 + X3 + X4, data=datLong, 
                family = multinomial(refLevel = 1))


# Mean squared errors per event
coefModels <- coef(estModel)
mean((coefModels[seq(7, length(coefModels), 2)] - betaCoef_event1)^2) # Event 1
mean((coefModels[seq(8, length(coefModels), 2)] - betaCoef_event2)^2) # Event 2
# -> Estimated coefficients are near true coefficients for each event type


Data Long Time Dependent Covariates Transformation For Competing Risks

Description

Transforms short data format to long format for discrete survival modelling in the case of competing risks with right censoring. Covariates may vary over time.

Usage

dataLongCompRisksTimeDep(
  dataSemiLong,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  idColumn,
  timeAsFactor = FALSE,
  responseAsFactor = FALSE
)

Arguments

dataSemiLong

Original data in semi-long format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

timeColumn

Character giving the column name of the observed times (class "logical"). It is required that the observed times are discrete (class "integer").

eventColumns

Character vector giving the column names of the event indicators without censoring column (class "character"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.

eventColumnsAsFactor

Should the argument eventColumns be intepreted as column name of a factor variable (class "logical")? Default is FALSE.

idColumn

Name of column of identification number of persons as character (class "character").

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the discrete time intervals are treated as quantitative (class "numeric").

responseAsFactor

Should the response columns be given as factor (class "logical")? Default is FALSE.

Details

There may be some intervals, where no additional information on the covariates is observed (e. g. observed values in interval one and three but two is missing). In this case it is assumed, that the values from the last observation stay constant over time until a new measurement was done.

In contrast to continuous survival (see e. g. Surv) the start and stop time notation is not used here. In discrete time survival analysis the only relevant information is to use the stop time. Start time does not matter, because all discrete intervals need to be included in the long data set format to ensure consistent estimation. It is assumed that the supplied data set dataSemiLong contains all repeated measurements of each cluster in semi-long format (e. g. persons). For further information see example Start-stop notation.

Value

Original data set in long format with additional columns

Note

Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Fahrmeir L (2005). “Discrete Survival-Time Models.” In Encyclopedia of Biostatistics, chapter Survival Analysis. John Wiley & Sons.

Thompson Jr. WA (1977). “On the Treatment of Grouped Observations in Life Studies.” Biometrics, 33(3), 463-470.

See Also

contToDisc, dataLong, dataLongCompRisks

Examples


# Example Primary Biliary Cirrhosis data
library(survival)
pbcseq_example <- pbcseq

# Convert to months
pbcseq_example$day <- ceiling(pbcseq_example$day/30) + 1
names(pbcseq_example)[7] <- "month"
pbcseq_example$status <- factor(pbcseq_example$status)

# Convert to long format for time varying effects
pbcseq_exampleLong <- dataLongCompRisksTimeDep(dataSemiLong = pbcseq_example, timeColumn = "month", 
eventColumns = "status", eventColumnsAsFactor = TRUE, idColumn = "id", 
timeAsFactor = TRUE)
head(pbcseq_exampleLong)

#####################
# Start-stop notation

library(survival)
?pbcseq

# Choose subset of patients
subsetID <- unique(pbcseq$id)[1:100]
pbcseq_mod <- pbcseq[pbcseq$id %in% subsetID, ]

# Convert to start stop notation
pbcseq_mod_split <- split(pbcseq_mod, pbcseq_mod$id)
pbcseq_mod_split <- lapply(1:length(pbcseq_mod_split), function(x) {

 cbind(pbcseq_mod_split[[x]], 
 start_time=c(0, pbcseq_mod_split[[x]][ - dim(pbcseq_mod_split[[x]])[1], "day"]),
 stop_time=pbcseq_mod_split[[x]][, "day"])
 
})
pbcseq_mod <- do.call(rbind, pbcseq_mod_split)

# Convert stop time to months
intervalDef <- c(quantile(pbcseq_mod$stop_time, probs = seq(0.1, 0.9, by=0.1)), Inf)
names(pbcseq_mod)
pbcseq_mod <- contToDisc(dataShort = pbcseq_mod, timeColumn = "stop_time", 
                         intervalLimits = intervalDef, equi = FALSE)
pbcseq_mod$status <- factor(pbcseq_mod$status)

# Conversion to data long format
pbcseq_mod_long <- dataLongCompRisksTimeDep(dataSemiLong = pbcseq_mod, timeColumn = "timeDisc", 
                                           eventColumns = "status",
                                          idColumn = "id", 
                                           eventColumnsAsFactor = TRUE, 
                                          responseAsFactor = TRUE,
                                          timeAsFactor = TRUE)
head(pbcseq_mod_long)


Data Long Transformation For Multi Spell Analysis

Description

Transform data from short format into long format for discrete multi spell survival analysis and right censoring.

Usage

dataLongMultiSpell(
  dataSemiLong,
  timeColumn,
  eventColumn,
  idColumn,
  timeAsFactor = FALSE,
  spellAsFactor = FALSE
)

Arguments

dataSemiLong

Original data in semi-long format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "character").

eventColumn

Column name of the event status (class "character"). The events can take multiple values on a discrete scale (0, 1, 2, ...) and repetition of events is allowed (class "integer" or "factor"). It is assumed that the number zero corresponds to censoring and all number > 0 represent the observed states between transitions.

idColumn

Name of column of identification number of persons as character (class "character").

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the discrete time intervals are treated as quantitative (class "numeric").

spellAsFactor

Should the spells be coded as factor (class "logical")? Default is not to use factor. If the argument is false, the column is coded as numeric.

Details

If the data has continuous survival times, the response may be transformed to discrete intervals using function contToDisc. The discrete time variable needs to be strictly increasing for each person, because otherwise the order of the events is not distinguishable. Here is an example data structure in short format prior augmentation with three possible states: \ idColumn=1, 1, ... , 1, 2, 2, ... , n \ timeColumn= t_ID1_1 < t_ID1_1 < ... < t_ID1_k, t_ID2_1 < t_ID2_2 < ... < t_ID2_k, ... \ eventColumn = 0, 1, ... , 2, 1, 0, ... , 0

The starting state of each individual is assumed to given with time interval equals zero. For example in an illness-death model with three states ("healthy", "illness", "death") if an individual was healthy at the beginning of the study this has to be encoded with discrete time interval set to zero and event state "healthy".

Value

Original data.frame with three additional columns:

Note

Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

Fahrmeir L (2005). “Discrete Survival-Time Models.” In Encyclopedia of Biostatistics, chapter Survival Analysis. John Wiley & Sons.

Thompson Jr. WA (1977). “On the Treatment of Grouped Observations in Life Studies.” Biometrics, 33(3), 463-470.

See Also

contToDisc, dataLongTimeDep, dataLongCompRisks, dataLongCompRisks

Examples


################################
# Example with unemployment data
data(unempMultiSpell)

# Select subsample of first 500 persons
unempSub <- unempMultiSpell[unempMultiSpell$id %in% 1:250,]

# Expansion from semi-long to long format
unempLong <- dataLongMultiSpell(dataSemiLong=unempSub, timeColumn = "year",
                                eventColumn="spell", idColumn="id", 
                                spellAsFactor=TRUE, timeAsFactor=FALSE)

head(unempLong, 25)

# Fit discrete multi-state model regression model
library(VGAM)

model <- vgam(cbind(e0, e1, e2, e3, e4) ~ 0 + s(timeInt) + age:spell, 
data = unempLong, family = multinomial(refLevel="e0"))
             
############################
# Example with artificial data

# Seed specification
set.seed(-2578)

# Construction of data set
# Censoring and three possible states (0, 1, 2, 3)
# Discrete time intervals (1, 2, ... , 10)
# Noninfluential variable x ~ N(0, 1)
datFrame <- data.frame(
 ID = c(rep(1, 6), rep(2, 4), rep(3, 3), rep(4, 2), rep(5, 4), 
      rep(6, 5), rep(7, 7), rep(8, 8)),
 time = c(c(0, 2, 5, 6, 8, 10), c(0, 1, 6, 7), c(0, 9, 10), c(0, 6), c(0, 2, 3, 4), 
        c(0, 3, 4, 7, 9), c(0, 2, 3, 5, 7, 8, 10), c(0, 1, 3, 4, 6, 7, 8, 9) ),
 state = c(c(2, 1, 3, 2, 1, 0), c(3, 1, 2, 2), c(2, 2, 1), c(1, 2), c(3, 2, 2, 0), 
         c(1, 3, 2, 1, 3), c(1, 1, 2, 3, 2, 1, 3), c(3, 2, 3, 2, 1, 1, 2, 3) ),
 x = rnorm(n=6+4+3+2+4+5+7+8) )

# Transformation to long format
datFrameLong <- dataLongMultiSpell(dataSemiLong=datFrame, timeColumn="time",
                                   eventColumn="state", idColumn="ID", 
                                   spellAsFactor=TRUE)
head(datFrameLong, 25)
library(VGAM)
cRm <- vglm(cbind(e0, e1, e2, e3) ~ 0 + timeInt + x:spell, 
data = datFrameLong, family = "multinomial")
summary(cRm)


Data Matrix And Weights For Discrete Subdistribution Hazard Models

Description

Generates the augmented data matrix and the weights required for discrete subdistribution hazard modeling with right censoring.

Usage

dataLongSubDist(
  dataShort,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  eventFocus,
  timeAsFactor = FALSE,
  aggTimeFormat = FALSE,
  lastTheoInt = NULL
)

Arguments

dataShort

Original data in short format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

timeColumn

Character specifying the column name of the observed event times (class "logical"). It is required that the observed times are discrete (class "integer").

eventColumns

Character vector specifying the column names of the event indicators (excluding censoring events) (class "logical"). It is required that a 0-1 coding is used for all events. The algorithm treats row sums of zero of all event columns as censored.

eventColumnsAsFactor

Should the argument eventColumns be interpreted as column name of a factor variable (class "logical")? Default is FALSE.

eventFocus

Column name of the event of interest or type 1 event (class "character").

timeAsFactor

Logical indicating whether time should be coded as a factor in the augmented data matrix (class "logical"). If FALSE, a numeric coding will be used.

aggTimeFormat

Instead of the usual long format, should every observation have all time intervals? (class "logical") Default is standard long format. In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index.

lastTheoInt

Gives the number of the last theoretic interval (class "integer"). Only used, if aggTimeFormat==TRUE.

Details

This function sets up the augmented data matrix and the weights that are needed for weighted maximum likelihood (ML) estimation of the discrete subdistribution model proposed by Berger et al. (2018). The model is a discrete-time extension of the original subdistribution model proposed by Fine and Gray (1999).

Value

Data frame with additional column "subDistWeights". The latter column contains the weights that are needed for fitting a weighted binary regression model, as described in Berger et al. (2018). The weights are calculated by a life table estimator for the censoring event.

Note

Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Berger M, Schmid M, Welchowski T, Schmitz-Valckenberg S, Beyersmann J (2020). “Subdistribution Hazard Models for Competing Risks in Discrete Time.” Biostatistics, 21(3), 449-466.

Fine JP, Gray RJ (2012). “A Proportional Hazards Model for the Subdistribution of a Competing Risk.” Journal of the American Statistical Association, 94(446), 496-509.

See Also

dataLong

Examples


################################
# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Generate subsample, reduce number of intervals to k = 5
SubUnempDur <- UnempDur [1:500, ]
SubUnempDur$time <- as.numeric(cut(SubUnempDur$spell, c(0,4,8,16,28)))

# Convert competing risks data to long format
# The event of interest is re-employment at full job
SubUnempDurLong <- dataLongSubDist (dataShort=SubUnempDur, timeColumn = "time", 
eventColumns=c("censor1", "censor2", "censor3"), eventFocus="censor1")
head(SubUnempDurLong)

# Fit discrete subdistribution hazard model with logistic link function
logisticSubDistr <- glm(y ~ timeInt + ui + age + logwage,
                    family=stats::binomial, data = SubUnempDurLong, 
                    weights = SubUnempDurLong$subDistWeights)
summary(logisticSubDistr)

########################################
# Simulation 
# Discrete subdistribution hazards model

# Simulate covariates as multivariate normal distribution
library(mvnfast)
set.seed(1980)
X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4))

# Specification of two discrete cause specific hazards with four intervals
# Event 1
theoInterval <- 4
betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3]
timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1)
linPred_event1 <- c(X %*% betaCoef_event1)
# Event 2
betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3]
timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1)
linPred_event2 <- c(X %*% betaCoef_event2)
# Discrete cause specific hazards in last theoretical interval
theoHaz_event1 <- 0.5
theoHaz_event2 <- 0.5

# Derive discrete all cause hazard
haz_event1_X <- cbind(sapply(1:length(timeInt_event1), 
                            function(x) exp(linPred_event1 + timeInt_event1[x]) / 
                              (1 + exp(linPred_event1 + timeInt_event1[x]) + 
                              exp(linPred_event2 + timeInt_event2[x])) ), 
                     theoHaz_event1)

haz_event2_X <- cbind(sapply(1:length(timeInt_event2), 
                            function(x) exp(linPred_event2 + timeInt_event2[x]) / 
                              (1 + exp(linPred_event1 + timeInt_event1[x]) + 
                              exp(linPred_event2 + timeInt_event2[x]) ) ),
                     theoHaz_event2)
allCauseHaz_X <- haz_event1_X + haz_event2_X

# Derive discrete cumulative incidence function of event 1 given covariates
p_T_event1_X <- haz_event1_X * cbind(1, (1-allCauseHaz_X)[, -dim(allCauseHaz_X)[2]])
cumInc_event1_X <-  t(sapply(1:dim(p_T_event1_X)[1], function(x) cumsum(p_T_event1_X[x, ])))

# Calculate all cause probability P(T=t | X)
pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) ))

# Calculate event probability given time interval P(R=r | T=t, X)
pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X)

# Simulate discrete survival times
survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1)+1), 
                                                   size = 1, prob = pT_X[i, ]) )
censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], 
               prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), 
               replace = TRUE)

# Calculate observed times
obsT <- ifelse(survT <= censT, survT, censT)
obsEvent <- rep(0, length(obsT))
obsEvent <- sapply(1:length(obsT), 
                  function(i) if(survT[i] <= censT[i]){
                    return(sample(x = c(1, 2), size = 1, 
                    prob = c(pR_T_X_event1[i, obsT[i]  ], 
                    1 - pR_T_X_event1[i, obsT[i]  ]) ))
                  } else{
                    
                    return(0)
                  }
)

# Recode last interval to censored
lastInterval <- obsT == theoInterval
obsT[lastInterval] <- theoInterval-1
obsEvent[lastInterval] <- 0
obsT <- factor(obsT)
obsEvent <- factor(obsEvent)

# Data preparation
datShort <- data.frame(event = factor(obsEvent), time=obsT, X)

# Conversion to long data format
datLongSub <- dataLongSubDist(dataShort = datShort, timeColumn = "time",
                             eventColumns = "event", eventFocus = 1, eventColumnsAsFactor = TRUE)

# Estimate discrete subdistribution hazard model
estSubModel <- glm(formula = y ~ timeInt + X1 + X2 + X3 + X4, data = datLongSub,
                  family = binomial(link = "logit"), weights = datLongSub$subDistWeights)

# Predict cumulative incidence function of first event
predSubHaz1 <- predict(estSubModel, newdata = datLongSub[datLongSub$obj == 2, ], type = "response")
mean(((1 - estSurv(predSubHaz1)) - cumInc_event1_X[2, 1:3])^2)


Data Long Time Dependent Covariates

Description

Transforms short data format to long format for discrete survival modelling of single event analysis with right censoring. Covariates may vary over time.

Usage

dataLongTimeDep(
  dataSemiLong,
  timeColumn,
  eventColumn,
  idColumn,
  timeAsFactor = FALSE
)

Arguments

dataSemiLong

Original data in semi-long format (class "data.frame"). Descriptions of data formats are available in discSurv-package.

timeColumn

Character giving the column name of the observed times (class "character"). It is required that the observed times are discrete (class "integer").

eventColumn

Column name of the event indicator (class "character"). It is required that this is a binary variable with 1=="event" and 0=="censored".

idColumn

Name of column of identification number of persons (class "character").

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In case of default settings the discrete time intervals are treated as quantitative (class "numeric").

Details

There may be some intervals, where no additional information on the covariates is observed (e. g. observed values in interval one and three but two is missing). In this case it is assumed, that the values from the last observation stay constant over time until a new measurement was done.

In contrast to continuous survival (see e. g. Surv) the start and stop time notation is not used here. In discrete time survival analysis the only relevant information is to use the stop time. Start time does not matter, because all discrete intervals need to be included in the long data set format to ensure consistent estimation. It is assumed that the supplied data set "dataSemiLong" contains all repeated measurements of each cluster in semi-long format (e. g. persons). For further information see example Start-stop notation.

Value

Original data in long format with three additional columns:

Note

Arguments to this function have to be specified in the required formats. Other objects are not supported. For example a common mistake is the usage of tibble data formats, that are not of class "data.frame".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Fahrmeir L (2005). “Discrete Survival-Time Models.” In Encyclopedia of Biostatistics, chapter Survival Analysis. John Wiley & Sons.

Puth M, Tutz G, Heim N, Muenster E, Schmid M, Berger M (2020). “Tree-based modeling of time-varying coefficients in discrete time-to-event models.” Lifetime Data Analysis, 26(0), 545-572.

Thompson Jr. WA (1977). “On the Treatment of Grouped Observations in Life Studies.” Biometrics, 33(3), 463-470.

See Also

contToDisc, dataLong, dataLongCompRisks

Examples


# Example Primary Biliary Cirrhosis data
library(survival)
dataSet1 <- pbcseq

# Only event death is of interest
dataSet1$status [dataSet1$status == 1] <- 0
dataSet1$status [dataSet1$status == 2] <- 1
table(dataSet1$status)

# Convert to months
dataSet1$day <- ceiling(dataSet1$day/30) + 1
names(dataSet1) [7] <- "month"

# Convert to long format for time varying effects
pbcseqLong <- dataLongTimeDep (dataSemiLong = dataSet1, timeColumn = "month", 
eventColumn = "status", idColumn = "id")
pbcseqLong [pbcseqLong$obj == 1, ]

#####################
# Start-stop notation

library(survival)
?survival::heart

# Assume that time was measured on a discrete scale.
# Discrete interval lengths are assumed to vary.
intervalLimits <- quantile(heart$stop, probs = seq(0.1, 1, by=0.1))
intervalLimits[length(intervalLimits)] <- intervalLimits[length(intervalLimits)] + 1
heart_disc <- contToDisc(dataShort = heart, timeColumn = "stop", 
intervalLimits = intervalLimits, equi = FALSE)
table(heart_disc$timeDisc)

# Conversion to long format
heart_disc_long <- dataLongTimeDep(dataSemiLong = heart_disc, timeColumn = "timeDisc", 
eventColumn = "event", idColumn = "id")
head(heart_disc_long)


Deviance Residuals

Description

Computes the root of the deviance residuals for evaluation of performance in discrete survival analysis.

Usage

devResid(dataLong, hazards)

Arguments

dataLong

Original data in long format (class "data.frame"). The correct format can be specified with data preparation, see e. g. dataLong.

hazards

Estimated discrete hazards of the data in long format (class "numeric"). Discrete discrete hazards are probabilities and therefore restricted to the interval [0, 1].

Value

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G (2012). Regression for Categorical Data. Cambridge University Press.

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

adjDevResid, predErrCurve

Examples


library(survival)

# Transform data to long format
heart[, "stop"] <- ceiling(heart[, "stop"])
set.seed(0)
Indizes <- sample(unique(heart$id), 25)
randSample <- heart[unlist(sapply(1:length(Indizes), 
function(x) which(heart$id == Indizes[x]))),]
heartLong <- dataLongTimeDep(dataSemiLong = randSample, 
timeColumn = "stop", eventColumn = "event", idColumn = "id", timeAsFactor = FALSE)

# Fit a generalized, additive model and predict discrete hazards on data in long format
library(mgcv)
gamFit <- gam(y ~ timeInt + surgery + transplant + s(age), data = heartLong, family = "binomial")
hazPreds <- predict(gamFit, type = "response")

# Calculate the deviance residuals
devResiduals <- devResid (dataLong = heartLong, hazards = hazPreds)$Output$DevResid

# Compare with estimated normal distribution
plot(density(devResiduals), 
main = "Empirical density vs estimated normal distribution", 
las = 1, ylim = c(0, 0.5))
tempFunc <- function (x) dnorm(x, mean = mean(devResiduals), sd = sd(devResiduals))
curve(tempFunc, xlim = c(-10, 10), add = TRUE, col = "red")
# The empirical density seems like a mixture distribution, 
# but is not too far off in with values greater than 3 and less than 1


Cumulative Hazard Function

Description

Estimates the cumulative hazard function \Gamma(T = t|x) based on estimated hazard rates. The hazard rates may or may not depend on covariates. The covariates have to be equal across all estimated hazard rates. Therefore the given hazard rates should only vary over time.

Usage

estCumHaz(hazards)

## S3 method for class 'discSurvEstCumHaz'
plot(x, ...)

Arguments

hazards

Estimated hazard rates (class "numeric")

x

Estimated cumulative hazard function \Lambda(t|x) (class "discSurvEstCumHaz").

...

Further arguments passed to plot.

Details

The argument hazards must be given for all intervals \left[ a_0, a_1 \right), \left[ a_1, a_2 \right), ..., \left[ a_{q-1}, a_q \right), \left[ a_q, \infty \right).

Value

Estimated probabilities of survival (class "numeric")

Note

It is assumed that all time points up to the last theoretical interval \left[ a_q, \infty \right) are available. If not already present, these can be added manually.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estCumHaz

Examples


# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur[1:100, ]

# Convert to long format
UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1")

# Estimate binomial model with logit link
Fit <- glm(formula = y ~ timeInt + age + logwage, data=UnempLong, family = binomial())

# Estimate discrete survival function given age, logwage of first person
hazard <- predict(Fit, newdata = subset(UnempLong, obj == 1), type = "response")
cumHazCondX <- estCumHaz(c(hazard, 1))
cumHazCondX

# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]

# Convert to long format
UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1")
head(UnempLong)

# Estimate binomial model with logit link
Fit <- glm(formula = y ~ timeInt + age + logwage, data = UnempLong, family = binomial())

# Estimate discrete survival function given age, logwage of first person
Tmax   <- max(subUnempDur$spell)
UnempEval <- dataLong(dataShort = UnempDur[1,], timeColumn = "spell", eventColumn = "censor1", 
aggTimeFormat = TRUE, lastTheoInt = Tmax)
hazard <- predict(Fit, newdata = UnempEval, type = "response")
estCumHaz1 <- estCumHaz(hazard)
plot(estCumHaz1)



Cumulative Hazard Function For Competing Risks

Description

Estimates the overall cumulative hazard function \Gamma(T = t|x) based on the overall hazard rate in competing risks. The hazard rates may or may not depend on covariates. The covariates have to be equal across all estimated hazard rates. Therefore the given hazard rates should only vary over time.

Usage

estCumHazCompRisks(hazards)

Arguments

hazards

Estimated discrete hazards (numeric class c("matrix", "array")). Discrete hazards of each time interval are stored in the rows and the number of columns equal to the number of events.

Details

The argument hazards must be given for all intervals \left[ a_0, a_1 \right), \left[ a_1, a_2 \right), ..., \left[ a_{q-1}, a_q \right), \left[ a_q, \infty \right).

Value

Estimated cumulative all cause hazard (class "numeric")

Note

It is assumed that all time points up to the last theoretical interval \left[ a_q, \infty \right) are available. If not already present, these can be added manually.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

Examples



# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur[1:100, ]

# Estimate binomial model with logit link
estModel <- estRegSmoothCompRisks(dataShort = subUnempDur, 
dataTransform = "dataLongCompRisks", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")

# Estimate discrete survival function given age, logwage of first person
predFrame <- attr(estModel, "augData")
hazard <- predict(estModel, newdata=predFrame[predFrame$obj==1, ], type = "response")
cumHazCondX <- estCumHazCompRisks(cbind(hazard, 1))
cumHazCondX


Cumulative Incidence Function For Competing Risks

Description

Estimates the cumulative incidence function of a discrete time competing risks model given covariates P(T <= t, event = k | x).

Usage

estCumInc(hazards, eventFocus = 1, obj = NULL)

## S3 method for class 'discSurvEstCumInc'
plot(x, ...)

Arguments

hazards

Estimated discrete hazard rates of all events (class "matrix"). Each column represents one event. The first column is assumed to contain the censoring case and the discrete hazards should only vary over time in each row.

eventFocus

Column that represent the discrete hazards of the primary event (class "integer").

obj

Integer identification number of each individual. Usually this information is computed during data augmentation (class "numeric").

x

estimated cumulative incidence function P(T <= t, event=k | x) based on estimated hazards of a discrete competing risks model or a discrete subdistribution hazard model (class "discSurvEstCumInc").

...

Further arguments passed to plot.

Details

The covariates set is required to be constant across rows. If argument obj is given, then for each unique individual with specific covariates a cumulative incidence function is computed.

Value

Returns cumulative incidence function of the primary event. If argument obj is not empty a list of vectors is returned.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Lee M, Feuer EJ, Fine JP (2018). “On the analysis of discrete time competing risks data.” Biometrics, 74(4), 1468-1481.

See Also

estCompRisksGEE, dataLongCompRisks, dataLongCompRisksTimeDep, geeglm

estSurv, estCumInc, estCompRisksGEE

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur [1:100, ]

# Estimate GEE models for all events
estGEE <- estCompRisksGEE(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")

# Estimate hazards of all events given the covariates of third person
SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") 
preds <- predict(estGEE, subset(SubUnempDurLong, obj == 3))

# Estimate cumulative incidence function
cumIncGEE <- estCumInc(preds, eventFocus = 2)
plot(cumIncGEE)


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur [1:100, ]

################################
# Competing risks model 

# Estimate GEE models for all events
estGEE <- estCompRisksGEE(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")

# Estimate hazards of all events given the covariates of third person
SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell") 
preds <- predict(estGEE, subset(SubUnempDurLong, obj == 3))
estCumInc1 <- estCumInc(preds, eventFocus = 3)
plot(estCumInc1)

###############################
# Subdistribution hazards model

# Convert to long format
SubUnempDurLong <- dataLongSubDist(dataShort = SubUnempDur, timeColumn = "spell", 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), eventFocus = "censor1")

# Estimate continuation ratio model with logit link
glmFit <- glm(formula = y ~ timeInt + age + ui + logwage * ui, data = SubUnempDurLong, 
family = binomial(), weights = SubUnempDurLong$subDistWeights)

# Estimated subdistribution hazard given the covariates of the third person
preds <- predict(glmFit, type = "response", newdata = subset(SubUnempDurLong, obj == 3))
cumInc1 <- estCumInc(preds)
plot(cumInc1)


Discrete Survival Random Forest Fitting

Description

Wrapper for estimation of discrete survival random forest. Several preprocessing options such as time-dependent covariates are available.

Usage

estForest(
  dataShort,
  dataTransform = "dataLong",
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumn,
  idColumn = NULL,
  timeAsFactor = FALSE,
  storeAugData = TRUE,
  responseAsFactor = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLong") and with time dependent covariates ("dataLongTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumn

Column name of the event indicator (class "character"). It is required that this is a binary variable with 1=="event" and 0=="censored".

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

responseAsFactor

Should the response be converted to factor? Default is TRUE. Representation as factor is technically important for classification split rules such as Gini index. For details see ranger.

...

Specification of additional arguments in function ranger.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example, if the two variables timeInt and X1 should be included, the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "ranger".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Schmid M, Welchowski T, Wright MN, Berger M (2020). “Discrete-time survival forests with Hellinger distance decision trees.” Data Mining and Knowledge Discovery, 34(0), 812-832.

See Also

dataLong, dataLongTimeDep, ranger

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate discrete survival random forest
estRFModel <- estForest(dataShort = SubUnempDur, dataTransform = "dataLong", 
formulaVariable =~ timeInt + age + ui + logwage + ui, 
eventColumn = "censor1", timeColumn = "spell")
estRFModel


Discrete Survival Random Forest Fitting For Competing Risks

Description

Wrapper for estimation of discrete survival random forest for cause-specific competing risk models. Several preprocessing options such as time-dependent covariates are available.

Usage

estForestCompRisks(
  dataShort,
  dataTransform = "dataLongCompRisks",
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  timeAsFactor = FALSE,
  idColumn = NULL,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLongCompRisks") and with time dependent covariates ("dataLongCompRisksTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumns

Character vector giving the column names of the event indicators without censoring column (class "character"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.

eventColumnsAsFactor

Should the argument eventColumns be intepreted as column name of a factor variable (class "logical")? Default is FALSE.

timeAsFactor

Specifies if the computed discrete time intervals should be converted to a categorical variable (class "logical"). Default is FALSE. In the default settings the discret time intervals are treated as quantitative (class "numeric").

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function vgam.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "ranger".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

dataLongCompRisks, dataLongCompRisksTimeDep, rpart

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate GEE models for all events
estModel <- estRegSmoothCompRisks(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")
estModel


Marginal Probabilities

Description

Estimates the marginal probability P(T=t|x) based on estimated discrete hazards. The discrete hazards may or may not depend on covariates. The covariates have to be equal across all estimated hazard rates. Therefore the given discrete hazards should only vary over time.

Usage

estMargProb(hazards)

## S3 method for class 'discSurvEstMargProb'
plot(x, ...)

Arguments

hazards

Estimated discrete hazards (class "numeric")

x

Object of class "discSurvEstMargProb"

...

Specification of additional arguments in function plot.

Details

The argument hazards must be given for all intervals \left[ a_0, a_1 \right), \left[ a_1, a_2 \right), ..., \left[ a_{q-1}, a_q \right), \left[ a_q, \infty \right).

Value

Estimated marginal probabilities (class "numeric")

Note

It is assumed that all time points up to the last interval \left[ a_q, \infty \right) are available. If not already present, these can be added manually.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estSurv

Examples


# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]

# Convert to long format
UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1")
head(UnempLong)

# Estimate binomial model with logit link
Fit <- glm(formula = y ~ timeInt + age + logwage, data = UnempLong, family = binomial())

# Estimate discrete survival function given age, logwage of first person
hazard <- predict(Fit, newdata = subset(UnempLong, obj == 1), type = "response")

# Estimate marginal probabilities given age, logwage of first person
MarginalProbCondX <- estMargProb (c(hazard, 1))
MarginalProbCondX
sum(MarginalProbCondX)==1 # TRUE: Marginal probabilities must sum to 1!


Marginal Probabilities For Competing Risks

Description

Estimates the marginal probability P(T = t, R = r|x) based on estimated discrete hazards of a competing risks model. The discrete hazards may or may not depend on covariates. The covariates have to be equal across all estimated discrete hazards. Therefore the given discrete hazards should only vary over time.

Usage

estMargProbCompRisks(hazards)

Arguments

hazards

Estimated discrete hazards ("numeric matrix"). Discrete hazards of each time interval are stored in the rows and the number of columns equal to the number of events.

Details

The argument hazards must be given for all intervals \left[ a_0, a_1 \right), \left[ a_1, a_2 \right), ..., \left[ a_{q-1}, a_q \right), \left[ a_q, \infty \right).

Value

Estimated marginal probabilities ("numeric matrix")

Note

It is assumed that all time points up to the last interval \left[ a_q, \infty \right) are available. If not already present, these can be added manually. In competing risk settings the marginal probabilities of the last theoretical interval depend on the assumptions on the discrete hazards in the last theoretical interval. However the estimation of all previous discrete intervals is not affected by those assumptions.

Author(s)

Moritz Berger moritz.berger@zi-mannheim.de

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estMargProb

Examples


# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]

# Convert to long format
UnempLong <- dataLongCompRisks(dataShort = subUnempDur, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"))
head(UnempLong)

# Estimate continuation ratio model with logit link
vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempLong,
family = VGAM::multinomial(refLevel = "e0"))

# Estimate discrete survival function given age, logwage of first person
hazards <- VGAM::predictvglm(vglmFit, newdata = subset(UnempLong, obj == 1), type = "response")[,-1]

# Estimate marginal probabilities given age, logwage of first person
# Example 1
# Assumption: Discrete hazards in last theoretical interval are equal for both event types
MarginalProbCondX <- estMargProbCompRisks(rbind(hazards, 0.5))
MarginalProbCondX
all.equal(sum(MarginalProbCondX), 1) # TRUE: Marginal probabilities must sum to 1!

# Example 2
# Assumption: Discrete hazards in last theoretical interval are event1=, event2=
MarginalProbCondX2 <- estMargProbCompRisks(rbind(hazards, c(0.75, 0.25)))
MarginalProbCondX2
all.equal(sum(MarginalProbCondX2), 1) # TRUE: Marginal probabilities must sum to 1!

# Compare marginal probabilities given X
all.equal(MarginalProbCondX[1:5, ], MarginalProbCondX2[1:5, ])
all.equal(MarginalProbCondX[6, ], MarginalProbCondX2[6, ])


Fit Discrete Survival Measures Based On Hazards

Description

Wrapper to estimate survival functions, cumulative hazards and marginal probabilities of all individuals of a given data set.

Usage

estMeasures(hazards, obj)

Arguments

hazards

Estimated hazards of the event (class "numeric").

obj

Integer identification number of each individual. Usually this information is computed during data augmentation (class "numeric").

Value

List of estimated measures (class "list"). There are three list elements with following contents:

Note

It is assumed that the data set was preprocessed with functions such as dataLong or dataLongTimeDep.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estSurv, estCumHaz, estMargProb

Examples


# Load unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur[1:100, ]

#########################
# Estimate hazard rates

estRegModel <- estReg(dataShort = subUnempDur, dataTransform = "dataLong", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumn = "censor1", timeColumn = "spell")
estHaz <- predict(estRegModel, type="response")

######################
# Single event example

# Estimate all survival functions of a given data set
measures1 <- estMeasures(hazards=estHaz, 
obj=attr(estRegModel, "augData")$obj)

# Survival function of first individual
measures1$surv[[1]] 

# Cumulative hazard of first individual
measures1$cumHaz[[1]]

# Marginal probabilities of first individual
measures1$margProb[[1]]


Fit Discrete Survival Measures Based On Cause-Specific Hazards

Description

Wrapper to estimate competing risks all cause survival functions, cumulative hazards and marginal probabilities of all individuals of a given data set.

Usage

estMeasuresCompRisks(hazards, obj)

Arguments

hazards

Estimated discrete hazards of the events (numeric class c("matrix", "array")). Discrete hazards of each time interval are stored in the rows and the number of columns equal to the number of events.

obj

Integer identification number of each individual. Usually this information is computed during data augmentation (class "numeric").

Value

List of estimated measures (class "list"). There are three list elements with following contents:

Note

It is assumed that the data set was preprocessed with functions such as dataLong or dataLongTimeDep.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estSurv, estCumHaz, estMargProb

Examples


# Load unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur[1:100, ]

######################################
# Estimate cause-specific hazard rates

estModel <- estRegSmoothCompRisks(dataShort = subUnempDur, dataTransform = "dataLongCompRisks", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")

# Estimate cause-specific hazards (without censoring "e0")
estHaz <- predict(estModel, type="response")[, -1]

######################
# Single event example

# Estimate all survival functions of a given data set
measures1CompRisks <- estMeasuresCompRisks(hazards=estHaz, 
obj=attr(estModel, "augData")$obj)

# All cause survival function of first individual
measures1CompRisks$surv_a[[1]] 

# All cause cumulative hazard of first individual
measures1CompRisks$cumHaz_a[[1]]

# All cause marginal probabilities of first individual
measures1CompRisks$margProb_a[[1]]


Logistic Recalibration Based On Linear Predictors

Description

Fits a logistic recalibration model to independent test data. It updates the intercept and slope parameters. It is assumed that the factor levels of time are equal in both training and validation data. Time dependent covariates, discrete cause specific competing risks and subdistribution hazards are also supported.

Usage

estRecal(testLinPred, testDataLong, weights = NULL)

Arguments

testLinPred

Calculated linear predictor on the validation data with model fitted on training data (class "numeric").

testDataLong

Validation data set in long format (class "data.frame").

weights

Weights used in estimation of the logistic recalibration model (class "numeric"). Default is no weighting (NULL).

Details

Three statistical tests for recalibration are computed. The coefficient alpha is the intercept and beta is the slope parameter of the recalibration model.

In case of competing risks the test assume an extendend null hypothesis that includes all events for each test type.

Value

Continuation ratio model that calibrates estimated discrete hazards to new validation environment (class c("glm", "lm")).

Note

Updates estimated hazards of discrete survival models to better adapt to a different environment. If there are substantial environment changes the predicted probabilities will differ between two environments. Logistic recalibration may be used to improve the calibration of predicted probabilities by incorperating information from the existing model and data from the environment. This approach works for any survival prediction model with one event that provides linear predictors.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Heyard R, Timsit J, Held L, consortium C (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62(3), 643-657.

Miller ME, Langefeld CD, Tierney WM, Hui SL, McDonald CJ (1993). “Validation of Probabilistic Predictions.” Medical Decision Making, 13(1), 49-58.

See Also

dataLong, dataLongTimeDep, dataLongCompRisks, dataLongCompRisksTimeDep, dataLongSubDist, calPlot

Examples


####################
# Data preprocessing

# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
selectInd1 <- 1:100
selectInd2 <- 101:200
trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ]
valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ]

####################
# One event

# Convert to long format
trainSet_long <- dataLong(dataShort = trainSet, timeColumn = "spell", eventColumn = "censor1")
valSet_long <- dataLong(dataShort = valSet, timeColumn = "spell", eventColumn = "censor1")

# Estimate continuation ratio model with logit link
glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, family = binomial())

# Calculate linear predictors on validation set
linPred <- predict(glmFit, newdata = valSet_long, type = "link")

# Estimate logistic recalibration model
recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long)
summary(recalModel)

# Calibration plots
hazOrg <- predict(glmFit, newdata = valSet_long, type = "response")
hazRecal <- predict(recalModel, newdata = data.frame(linPred), type = "response")

# Before logistic recalibration
calPlot(hazOrg, testDataLong = valSet_long)
# After logistic recalibration
calPlot(hazRecal, testDataLong = valSet_long)

############################
# Two cause specific hazards 
library(VGAM)

# Convert to long format
trainSet_long <- dataLongCompRisks(dataShort = trainSet, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"))
valSet_long <- dataLongCompRisks(dataShort = valSet, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"))

# Estimate continuation ratio model with logit link
vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = trainSet_long, 
family = VGAM::multinomial(refLevel = "e0"))

# Calculate linear predictors on training and test set
linPred <- VGAM::predictvglm(vglmFit, newdata = valSet_long, type = "link")

# Estimate logistic recalibration model
recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long)
recalModel

# Calibration plots
hazOrg <- predict(vglmFit, newdata = valSet_long, type = "response")
predDat <- as.data.frame(linPred)
names(predDat) <- recalModel@misc$colnames.x[-1]
hazRecal <- predictvglm(recalModel, newdata = predDat, type = "response")

# Before logistic recalibration
calPlot(hazOrg, testDataLong = valSet_long, event = "e1")
calPlot(hazOrg, testDataLong = valSet_long, event = "e2")
# After logistic recalibration
calPlot(hazRecal, testDataLong = valSet_long, event = "e1")
calPlot(hazRecal, testDataLong = valSet_long, event = "e2")

###############################
# Subdistribution hazards model

# Convert to long format
trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"), eventFocus = "censor1")
valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"), eventFocus = "censor1")

# Estimate continuation ratio model with logit link
glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, 
family = binomial(), weights = trainSet_long$subDistWeights)

# Calculate linear predictors on training and test set
linPred <- predict(glmFit, newdata = valSet_long, type = "link")

# Estimate logistic recalibration model
recalModel <- estRecal(testLinPred = linPred, testDataLong = valSet_long, 
weights = valSet_long$subDistWeights)
recalModel

# Calibration plots
hazOrg <- predict(glmFit, newdata = valSet_long, type = "response",
weights = valSet_long$subDistWeights)
hazRecal <- predict(recalModel, newdata = data.frame(linPred), type = "response",
weights = valSet_long$subDistWeights)

# Before logistic recalibration
calPlot(hazOrg, testDataLong = valSet_long,
weights=valSet_long$subDistWeights)
# After logistic recalibration
calPlot(hazRecal, testDataLong = valSet_long,
weights=valSet_long$subDistWeights)


Discrete Survival Basic Regression Model Fitting

Description

Wrapper for estimation of discrete survival general linear model. Several preprocessing options such as time-dependent covariates are available.

Usage

estReg(
  dataShort,
  dataTransform = "dataLong",
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumn,
  idColumn = NULL,
  timeAsFactor = TRUE,
  family = stats::binomial,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLong") and with time dependent covariates ("dataLongTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumn

Column name of the event indicator (class "character"). It is required that this is a binary variable with 1=="event" and 0=="censored".

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

family

Specifies the assumption about the response distribution and the link function. Default value is the discrete survival continuation ratio model with logit-link (for comparison see family).

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function glm.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class c("glm", "lm").

Note

Model is rescaled to exclude intercepts.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

dataLong, dataLongTimeDep, glm

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur [1:100, ]

# Estimate discrete survival continuation ratio model
estRegModel <- estReg(dataShort = SubUnempDur, dataTransform = "dataLong", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumn = "censor1", timeColumn = "spell")
summary(estRegModel)

# Predictions
SubUnempDurLong <- dataLong(dataShort = SubUnempDur, 
eventColumn = "censor1", timeColumn = "spell", timeAsFactor=TRUE)
preds <- predict(estRegModel, newdata = SubUnempDurLong)
head(preds)


Discrete Survival Model Fitting For Competing Risks

Description

Wrapper for estimation of parametric discrete survival models for cause-specific competing risk models. Several preprocessing options such as time-dependent covariates are available and linear as well as smooth effects can be specified.

Usage

estRegCompRisks(
  dataShort,
  dataTransform = "dataLongCompRisks",
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  timeAsFactor = FALSE,
  idColumn = NULL,
  family = VGAM::multinomial,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLongCompRisks") and with time dependent covariates ("dataLongCompRisksTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumns

Character vector giving the column names of the event indicators without censoring column (class "character"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.

eventColumnsAsFactor

Should the argument eventColumns be intepreted as column name of a factor variable (class "logical")? Default is FALSE.

timeAsFactor

Specifies if the computed discrete time intervals should be converted to a categorical variable (class "logical"). Default is FALSE. In the default settings the discret time intervals are treated as quantitative (class "numeric").

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

family

Specifies the assumption about the response distribution and the link function. Default value is the discrete survival cause-specific competing risks model with multinomial logistic function (function multinomial).

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function vgam.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "vgam".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

Schmid M, Berger M (2020). “Competing risks analysis for discrete time-to-event data.” Computational Statistics, 13(5), 1-17.

See Also

dataLongCompRisks, dataLongCompRisksTimeDep, vgam

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate cause-specific hazard rates
estModel <- estRegCompRisks(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")
estModel


Discrete Survival Frailty Regression Model Fitting

Description

Wrapper for estimation of discrete survival frailty regression model. Smooth functions of covariates are supported. Several preprocessing options, such as time-dependent covariates, are available.

Usage

estRegFrailty(
  dataShort,
  dataTransform = "dataLong",
  formulaVariable = ~timeInt + (1 | obj),
  timeColumn,
  eventColumn,
  idColumn = NULL,
  timeAsFactor = FALSE,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLong") and with time dependent covariates ("dataLongTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumn

Column name of the event indicator (class "character"). It is required that this is a binary variable with 1=="event" and 0=="censored".

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function ranger.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example, if the two variables timeInt and X1 should be included, the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "merMod".

Note

Model is rescaled to exclude the intercept.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

dataLong, dataLongTimeDep, glmer, estRegSmooth

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate discrete survival frailty regression model
estFrailtyModel <- estRegFrailty(dataShort = SubUnempDur, 
dataTransform = "dataLong", formulaVariable =~timeInt + (1 | obj), 
eventColumn = "censor1", timeColumn = "spell")
estFrailtyModel


Discrete Survival Additive Regression Model Fitting

Description

Wrapper for estimation of discrete survival generalized additive models. Several preprocessing options such as time-dependent covariates are available.

Usage

estRegSmooth(
  dataShort,
  dataTransform = "dataLong",
  formulaVariable = ~s(timeInt),
  timeColumn,
  eventColumn,
  idColumn = NULL,
  timeAsFactor = FALSE,
  family = stats::binomial,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLong") and with time dependent covariates ("dataLongTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumn

Column name of the event indicator (class "character"). It is required that this is a binary variable with 1=="event" and 0=="censored".

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

family

Specifies the assumption about the response distribution and the link function. Default value is the discrete survival continuation ratio model with logit-link (for comparison see family).

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function gam.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included as smooth functions the formula would be "~ s(timeInt) + s(X1)". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "gam".

Note

Splines can also be used to include heterogeneity of mixed models with penalized fixed effects. An advantage of this approach is that one does not have to assume a specific distribution for the random effects.

Model is rescaled to exclude the intercept.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Berger M, Schmid M (2018). “Semiparametric regression for discrete time-to-event data.” Statistical Modelling, 18(3), 322-345.

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

dataLong, dataLongTimeDep, gam, estRegFrailty

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate discrete survival continuation ratio model
estRegSmoothModel <- estRegSmooth(dataShort = SubUnempDur, dataTransform = "dataLong", 
formulaVariable =~ s(timeInt) + s(age) + ui + logwage * ui, 
eventColumn = "censor1", timeColumn = "spell")
summary(estRegSmoothModel)

# Estimated smooth functions
plot(estRegSmoothModel)


Discrete Survival Additive Model Fitting For Competing Risks

Description

Wrapper for estimation of discrete survival generalized, additive models for cause-specific competing risk models. Several preprocessing options such as time-dependent covariates are available and linear as well as smooth effects can be specified.

Usage

estRegSmoothCompRisks(
  dataShort,
  dataTransform = "dataLongCompRisks",
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  timeAsFactor = FALSE,
  idColumn = NULL,
  family = VGAM::multinomial,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLongCompRisks") and with time dependent covariates ("dataLongCompRisksTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumns

Character vector giving the column names of the event indicators without censoring column (class "character"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.

eventColumnsAsFactor

Should the argument eventColumns be intepreted as column name of a factor variable (class "logical")? Default is FALSE.

timeAsFactor

Specifies if the computed discrete time intervals should be converted to a categorical variable (class "logical"). Default is FALSE. In the default settings the discret time intervals are treated as quantitative (class "numeric").

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

family

Specifies the assumption about the response distribution and the link function. Default value is the discrete survival cause-specific competing risks model with multinomial logistic function (function multinomial).

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function vgam.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "vgam".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

Schmid M, Berger M (2020). “Competing risks analysis for discrete time-to-event data.” Computational Statistics, 13(5), 1-17.

See Also

dataLongCompRisks, dataLongCompRisksTimeDep, vgam

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate cause-specific hazard rates
estModel <- estRegSmoothCompRisks(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")
estModel


Discrete Survival Subdistribution Fitting

Description

Wrapper for estimation of discrete survival regression subdistribution hazard models that includes data preprocessing.

Usage

estRegSubDist(
  dataShort,
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  eventFocus,
  timeAsFactor = FALSE,
  family = stats::binomial,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumns

Character vector giving the column names of the event indicators without censoring column (class "character"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.

eventColumnsAsFactor

Should the argument eventColumns be intepreted as column name of a factor variable (class "logical")? Default is FALSE.

eventFocus

Column name of the event of interest or type 1 event (class "character").

timeAsFactor

Specifies if the computed discrete time intervals should be converted to a categorical variable (class "logical"). Default is FALSE. In the default settings the discret time intervals are treated as quantitative (class "numeric").

family

Specifies the assumption about the response distribution and the link function. Default value is the discrete survival continuation ratio model with logit-link (for comparison see family).

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function glm.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class c("glm", "lm"). The attribute "subDistWeights" saves the subdistribution weights used in the estimation.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Berger M, Schmid M, Welchowski T, Schmitz-Valckenberg S, Beyersmann J (2020). “Subdistribution Hazard Models for Competing Risks in Discrete Time.” Biostatistics, 21(3), 449-466.

See Also

dataLongCompRisks, estCompRisksGEE, estRegSmoothCompRisks, glm

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate GEE models for all events
estModel <- estRegSubDist(dataShort = SubUnempDur,
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell",
eventFocus="censor1")
summary(estModel)


Survival Function

Description

Estimates the survival function S(T = t|x) based on estimated hazard rates. The hazard rates may or may not depend on covariates. The covariates have to be equal across all estimated hazard rates. Therefore the given hazard rates should only vary over time.

Usage

estSurv(hazards)

## S3 method for class 'discSurvEstSurv'
plot(x, ...)

Arguments

hazards

Estimated hazard rates (class "numeric")

x

Estimated discrete survival function S(T>t|x) (class "discSurvEstSurv").

...

Further arguments passed to plot.

Details

The argument hazards must be given for all intervals \left[ a_0, a_1 \right), \left[ a_1, a_2 \right), ..., \left[ a_{q-1}, a_q \right), \left[ a_q, \infty \right).

Value

Estimated probabilities of survival (class "numeric")

Note

It is assumed that all time points up to the last theoretical interval \left[ a_q, \infty \right) are available. If not already present, these can be added manually.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estMargProb

estSurv

Examples



# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]

# Convert to long format
UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1")
head(UnempLong)

# Estimate binomial model with logit link
Fit <- glm(formula = y ~ timeInt + age + logwage, data=UnempLong, family = binomial())

# Estimate discrete survival function given age, logwage of first person
hazard <- predict(Fit, newdata = subset(UnempLong, obj == 1), type = "response")
SurvivalFuncCondX <- estSurv(c(hazard, 1))
plot(SurvivalFuncCondX)

# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]

# Convert to long format
UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell", eventColumn = "censor1")
head(UnempLong)

# Estimate binomial model with logit link
Fit <- glm(formula = y ~ timeInt + age + logwage, data = UnempLong, family = binomial())

# Estimate discrete survival function given age, logwage of first person
Tmax   <- max(subUnempDur$spell)
UnempEval <- dataLong(dataShort = UnempDur[1,], timeColumn = "spell", eventColumn = "censor1", 
aggTimeFormat = TRUE, lastTheoInt = Tmax)
hazard <- predict(Fit, newdata = UnempEval, type = "response")
estSurv1 <- estSurv(hazard)
plot(estSurv1)



Survival Function Of Censoring Process

Description

Estimates the marginal survival function G(T=t) of the censoring process based on a life table estimator. Compatible with single event and competing risks data.

Usage

estSurvCens(dataShort, timeColumn, eventColumns, censInterval = "middle")

Arguments

dataShort

Data in original short format (class "data.frame").

timeColumn

Name of column with discrete time intervals (class "character").

eventColumns

Names of the event columns of dataShort (class "character"). In the competing risks case the event columns have to be in dummy encoding format (class "numeric").

censInterval

Assumption about when censoring takes places within an interval on the continuous time scale (class "character"). Possible values are "start", "middle", "end". Default is "middle".

Value

Named vector of estimated survival function of the censoring process for all time points except the last theoretical interval.

Note

In the censoring survival function the last time interval \left[ a_q, \infty \right) has the probability of zero.

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estSurv

Examples


# Load unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]

######################
# Single event example

# Estimate censoring survival function G(t)
estG <- estSurvCens(dataShort = subUnempDur, timeColumn = "spell", 
eventColumns = "censor1")
estG

#########################
# Competing risks example

# Estimate censoring survival function G(t)
estG <- estSurvCens(dataShort = subUnempDur, timeColumn = "spell", 
eventColumns = c("censor1", "censor2", "censor3", "censor4"))
estG



Survival Function For Competing Risks

Description

Computes the overall survival function S(T>t|x) for all causes based on estimated hazards of a competing risks model. The discrete hazards may or may not depend on covariates. The covariates have to be equal across all estimated hazards. Therefore the given discrete hazards should only vary over time.

Usage

estSurvCompRisks(hazards)

Arguments

hazards

Estimated discrete hazards (numeric class c("matrix", "array")). Discrete hazards of each time interval are stored in the rows and the number of columns equal to the number of events.

Details

The argument hazards must be given for all intervals \left[ a_0, a_1 \right), \left[ a_1, a_2 \right), ..., \left[ a_{q-1}, a_q \right), \left[ a_q, \infty \right).

Value

Estimated all cause survival probabilities (class "numeric")

Note

It is assumed that all time points up to the last interval \left[ a_q, \infty \right) are available. If not already present, these can be added manually.

Author(s)

Moritz Berger moritz.berger@zi-mannheim.de

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

estSurv

Examples


# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
subUnempDur <- UnempDur [1:100, ]

# Convert to long format
UnempLong <- dataLongCompRisks(dataShort = subUnempDur, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"))
head(UnempLong)

# Estimate continuation ratio model with logit link
vglmFit <- VGAM::vglm(formula = cbind(e0, e1, e2) ~ timeInt + age + logwage, data = UnempLong,
family = VGAM::multinomial(refLevel = "e0"))

# Estimate discrete survival function given age, logwage of first person
hazards <- VGAM::predictvglm(vglmFit, newdata = subset(UnempLong, obj == 1), type = "response")[,-1]
SurvivalFuncCondX <- estSurvCompRisks(rbind(hazards, 0.5))
SurvivalFuncCondX



Discrete Survival Tree Fitting

Description

Wrapper for estimation of discrete survival classification and regression tree. Several preprocessing options such as time-dependent covariates are available.

Usage

estTree(
  dataShort,
  dataTransform = "dataLong",
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumn,
  idColumn = NULL,
  timeAsFactor = FALSE,
  storeAugData = TRUE,
  responseAsFactor = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLong") and with time dependent covariates ("dataLongTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumn

Column name of the event indicator (class "character"). It is required that this is a binary variable with 1=="event" and 0=="censored".

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

responseAsFactor

Should the response be converted to factor? Default is TRUE. Representation as factor is technically important for classification split rules such as Gini index. For details see rpart.

...

Specification of additional arguments in function rpart.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example, if the two variables timeInt and X1 should be included, the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "rpart".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Schmid M, Kuechenhoff H, Hoerauf A, Tutz G (2016). “A survival tree method for the analysis of discrete event times in clinical and epidemiological studies.” Statistics in Medicine, 35(5), 734-751.

See Also

dataLong, dataLongTimeDep, rpart

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate discrete survival tree
estTreeModel <- estTree(dataShort = SubUnempDur, dataTransform = "dataLong", 
formulaVariable =~ timeInt + age + ui + logwage + ui, 
eventColumn = "censor1", timeColumn = "spell")

# Graphical illustration of fitted tree
rpart.plot::rpart.plot(estTreeModel)


Discrete Survival Tree Fitting For Competing Risks

Description

Wrapper for estimation of discrete survival tree for cause-specific competing risk models. Several preprocessing options such as time-dependent covariates are available.

Usage

estTreeCompRisks(
  dataShort,
  dataTransform = "dataLongCompRisks",
  formulaVariable = ~timeInt,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  timeAsFactor = FALSE,
  idColumn = NULL,
  storeAugData = TRUE,
  ...
)

Arguments

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLongCompRisks") and with time dependent covariates ("dataLongCompRisksTimeDep"). The default is set to the former.

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

timeColumn

Character giving the column name of the observed times. It is required that the observed times are discrete (class "integer").

eventColumns

Character vector giving the column names of the event indicators without censoring column (class "character"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.

eventColumnsAsFactor

Should the argument eventColumns be intepreted as column name of a factor variable (class "logical")? Default is FALSE.

timeAsFactor

Specifies if the computed discrete time intervals should be converted to a categorical variable (class "logical"). Default is FALSE. In the default settings the discret time intervals are treated as quantitative (class "numeric").

idColumn

Name of column of identification number of persons (class "character"). Default is set to use function dataLong, that does not need this argument.

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

...

Specification of additional arguments in function vgam.

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "rpart".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Berger M, Welchowski T, Schmitz-Valckenberg S, Schmid M (2019). “A classification tree approach for the modeling of competing risks in discrete time.” Advances in Data Analysis and Classification, 13(0), 965-990.

Schmid M, Berger M (2020). “Competing risks analysis for discrete time-to-event data.” Computational Statistics, 13(5), 1-17.

See Also

dataLongCompRisks, dataLongCompRisksTimeDep, rpart

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur[1:100, ]

# Estimate GEE models for all events
estModel <- estRegSmoothCompRisks(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")
estModel


Gumbel Link Function

Description

Constructs the link function with gumbel distribution in approriate format for use in generalized, linear models.

Usage

gumbel()

Details

Insert this function into a binary regression model

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

Matthias Schmid matthias.schmid@imbie.uni-bonn.de

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

glm

Examples


# Example with copenhagen stroke study
library(pec)
data(cost)
head(cost)

# Take subsample and convert time to months
costSub <- cost [1:50, ]
costSub$time <- ceiling(costSub$time/30)
costLong <- dataLong(dataShort = costSub, timeColumn = "time", eventColumn = "status",
timeAsFactor=TRUE)
gumbelModel <- glm(formula = y ~ timeInt + diabetes, data = costLong, 
family = binomial(link = gumbel()))

# Estimate hazard given prevStroke and no prevStroke
hazPrevStroke <- predict(gumbelModel, newdata=data.frame(timeInt = factor(1:143), 
diabetes = factor(rep("yes", 143), levels = c("no", "yes"))), type = "response")
hazWoPrevStroke <- predict(gumbelModel, newdata = data.frame(timeInt = factor(1:143), 
diabetes=factor(rep("no", 143), levels = c("no", "yes"))), type = "response")

# Estimate survival function
SurvPrevStroke <- cumprod(1 - hazPrevStroke)
SurvWoPrevStroke <- cumprod(1 - hazWoPrevStroke)

# Example graphics of survival curves with and without diabetes
plot(x = 1:143, y = SurvWoPrevStroke, type = "l", xlab = "Months", 
ylab = "S (t|x)", las = 1, lwd = 2, ylim = c(0,1))
lines(x = 1:143, y = SurvPrevStroke, col = "red", lwd = 2)
legend("topright", legend = c("Without diabetes", "Diabetes"), 
lty = 1, lwd =2, col = c("black", "red"))


Integrated Prediction Error

Description

Computes the integrated prediction error curve for discrete survival models.

Usage

intPredErr(
  hazards,
  testTime,
  testEvent,
  trainTime,
  trainEvent,
  testObjLong,
  tmax = NULL
)

Arguments

hazards

Predicted discrete hazards in the test data (class "numeric").

testTime

Discrete time intervals in short format of the test set (class "integer").

testEvent

Events in short format in the test set (binary vector).

trainTime

Discrete time intervals in short format of the training data set (class "integer").

trainEvent

Events in short format in the training set (binary vector).

testObjLong

Independent observation identification numbers of test data in long format (integer vector). For example in medicine, this would be patient IDs. Each patient should have a unique identifier.

tmax

Gives the maximum time interval for which prediction errors are calculated (class "integer"). It must be smaller than the maximum observed time in the training data of the object produced by function. The default setting NULL means, that all observed intervals are used.

Value

Integrated prediction error (class "numeric").

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Gneiting T, Raftery AE (2007). “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association, 102(477), 359-378.

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

See Also

predErrCurve, aggregate

Examples


##########################
# Example with cancer data

library(survival)
head(cancer)

# Data preparation and convertion to 30 intervals
cancerPrep <- cancer
cancerPrep$status <- cancerPrep$status-1
intLim <- quantile(cancerPrep$time, prob = seq(0, 1, length.out = 30))
intLim [length(intLim)] <- intLim [length(intLim)] + 1

# Cut discrete time in smaller number of intervals
cancerPrep <- contToDisc(dataShort = cancerPrep, timeColumn = "time", intervalLimits = intLim)

# Generate training and test data
set.seed(753)
TrainIndices <- sample (x = 1:dim(cancerPrep) [1], size = dim(cancerPrep) [1] * 0.75)
TrainCancer <- cancerPrep [TrainIndices, ]
TestCancer <- cancerPrep [-TrainIndices, ]
TrainCancer$timeDisc <- as.numeric(as.character(TrainCancer$timeDisc))
TestCancer$timeDisc <- as.numeric(as.character(TestCancer$timeDisc))

# Convert to long format
LongTrain <- dataLong(dataShort = TrainCancer, timeColumn = "timeDisc", eventColumn = "status",
timeAsFactor=FALSE)
LongTest <- dataLong(dataShort = TestCancer, timeColumn = "timeDisc", eventColumn = "status",
timeAsFactor=FALSE)
# Convert factors
LongTrain$timeInt <- as.numeric(as.character(LongTrain$timeInt))
LongTest$timeInt <- as.numeric(as.character(LongTest$timeInt))
LongTrain$sex <- factor(LongTrain$sex)
LongTest$sex <- factor(LongTest$sex)

# Estimate, for example, a generalized, additive model in discrete survival analysis
library(mgcv)
gamFit <- gam (formula = y ~ s(timeInt) + s(age) + sex + ph.ecog, data = LongTrain, 
family = binomial())
summary(gamFit)

# 1. Specification of predicted discrete hazards
# Estimate survival function of each person in the test data
testPredHaz <- predict(gamFit, newdata = LongTest, type = "response")

# 2. Calculate integrated prediction error
intPredErr(hazards = testPredHaz, 
testTime = TestCancer$timeDisc, testEvent = TestCancer$status, 
trainTime = TrainCancer$timeDisc, trainEvent = TrainCancer$status, 
testObjLong = LongTest$obj)


Integrated Prediction Error For Competing Risks

Description

Estimates integrated prediction errors of arbitrary prediction models in the case of competing risks.

Usage

intpredErrCurveCompRisks(
  hazards,
  testDataShort,
  trainDataShort,
  timeColumn,
  timeAsFactor = FALSE,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  tmax = NULL
)

Arguments

hazards

Predicted hazards on the test data with model fitted on training data. Must be a matrix, with the predictions in the rows and the number of columns equal to the number of events.

testDataShort

Test data in short format.

trainDataShort

Train data in short format.

timeColumn

Character giving the column name of the observed times.

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

eventColumns

Character vector giving the column names of the event indicators (excluding censoring column).

eventColumnsAsFactor

Should the argument eventColumns be interpreted as column name of a factor variable (class "logical")? Default is FALSE.

tmax

Gives the maximum time interval for which prediction errors are calculated. It must not be higher than the maximum observed time in the training data.

Value

Integrated prediction errors for each competing event. Matrix, with the integrated predictions in the rows and the number of columns equal to the number of events.

Author(s)

Moritz Berger moritz.berger@zi-mannheim.de

References

Heyard R, Timsit J, Held L, consortium C (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62(3), 643-657.

See Also

predErrCurveCompRisks, predErrCurve

Examples


###########################
# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
selectInd1 <- 1:200
selectInd2 <- 201:400
trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ]
testSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ]

# Convert to long format 
trainSet_long <- dataLongCompRisks(dataShort=trainSet, timeColumn="spell", 
eventColumns=c("censor1", "censor4"), timeAsFactor=TRUE)
tmax          <- max(trainSet$spell)
testSet_long <- dataLongCompRisks(dataShort=testSet, timeColumn="spell", 
eventColumns=c("censor1", "censor4"), aggTimeFormat = TRUE, lastTheoInt=tmax,
timeAsFactor=TRUE)

# Estimate continuation ratio model with logit link
vglmFit <- VGAM::vglm(formula=cbind(e0, e1, e2) ~ timeInt + age + logwage, 
data=trainSet_long, family=VGAM::multinomial(refLevel="e0"))

# Calculate predicted hazards
predHazards <- VGAM::predictvglm(vglmFit, newdata=testSet_long, type="response")

# Compute integrated prediction error 
intpredErrCurveCompRisks(hazards=predHazards[,-1], testDataShort=testSet, 
trainDataShort=trainSet, timeColumn="spell", timeAsFactor=FALSE,
eventColumns=c("censor1", "censor4"), tmax=tmax)



Life Table

Description

Constructs a life table and estimates discrete hazards, survival functions, discrete cumulative hazards and their standard errors without covariates.

Usage

lifeTable(
  dataShort,
  timeColumn,
  eventColumn,
  intervalLimits = NULL,
  censInterval = "middle"
)

## S3 method for class 'discSurvLifeTable'
print(x, firstRows = 5, ...)

Arguments

dataShort

Original data in short format (class "data.frame").

timeColumn

Name of the column with discrete survival times (class "character").

eventColumn

Gives the column name of the event indicator (1=observed, 0=censored) (class "character").

intervalLimits

Optional names of the intervals for each row, e. g. \left[ a_0, a_1 \right), \left[ a_1, a_2 \right), ..., \left[ a_{q-1}, a_q \right) (class "character")

censInterval

Assumption about when censoring takes places within an interval on the continuous time scale (class "character"). Possible values are "start", "middle", "end". Default is "middle".

x

estimated life table (class "discSurvLifeTable")

firstRows

Display the first number of specified rows (class "numeric").

...

Specification of additional arguments in function print.

Details

The assumption of censoring times within the given intervals of argument "censInterval" estimate the hazard rate with value "start" $d_t/(n_t-w_t)$, "middle" $d_t/(n_t-w_t/2)$ and "end" $d_t/n_t$

Value

List containing an object of class "data.frame" with following columns

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

Matthias Schmid matthias.schmid@imbie.uni-bonn.de

References

Lawless JF (2002). Statistical Models and Methods for Lifetime Data, 2nd edition. Wiley series in probability and statistics.

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Extract subset of all persons smaller or equal the median of age
UnempDurSubset <- subset(UnempDur, age <= median(UnempDur$age))
LifeTabUnempDur <- lifeTable(dataShort = UnempDurSubset, timeColumn = "spell", 
eventColumn = "censor1")
LifeTabUnempDur

# Example with monoclonal gammapothy data
library(survival)
head(mgus)

# Extract subset of mgus
subMgus <- mgus [mgus$futime<=median(mgus$futime), ]

# Transform time in days to intervals [0, 1), [1, 2), [2, 3), ... , [12460, 12461)
mgusInt <- subMgus
mgusInt$futime <- mgusInt$futime + 1
LifeTabGamma <- lifeTable(dataShort = mgusInt, timeColumn= "futime", eventColumn = "death")
head(LifeTabGamma$Output, 25)
plot(x = 1:dim(LifeTabGamma$Output)[1], y = LifeTabGamma$Output$hazard, type = "l", 
xlab = "Time interval", ylab = "Hazard", las = 1, 
main = "Life table estimated marginal discrete hazards")


Martingale Residuals

Description

Estimates the martingale residuals of discrete survival model.

Usage

martingaleResid(hazards, dataLong, storeAugData = TRUE)

## S3 method for class 'discSurvMartingaleResid'
plot(x, covariates, ...)

Arguments

hazards

Predicted hazards from a discrete survival model (class "numeric").

dataLong

Data set in long format (class "data.frame").

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

x

Object of class "discSurvMartingaleResid"

covariates

Names of covariates to plot (class "character").

...

Specification of additional arguments in function plot.

Details

Gives a different plot of each marginal covariate against the martingale residuals. Additionally a nonparametric loess estimation is done.

Value

Martingale residuals for each observation in long format (class "numeric").

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

Therneau TM, Grambsch PM, Fleming TR (1990). “Martingale-Based Residuals for Survival Models.” Biometrika, 77(1), 147-160.

See Also

glm

Examples


# Example with cross validation and unemployment data 
library(Ecdat)
data(UnempDur)
summary(UnempDur$spell)

# Extract subset of data
set.seed(635)
IDsample <- sample(1:dim(UnempDur)[1], 100)
UnempDurSubset <- UnempDur [IDsample, ]

# Conversion to long format
UnempDurSubsetLong <- dataLong(dataShort = UnempDurSubset,
timeColumn = "spell", eventColumn = "censor1")

# Estimate discrete survival continuation ratio model
contModel <- glm(y ~ timeInt + age + logwage, data = UnempDurSubsetLong,
family = binomial(link = "logit"))

# Fit hazards to the data set in long format
hazPreds <- predict(contModel, type = "response")

# Calculate martingale residuals for the unemployment data subset
MartResid <- martingaleResid (hazards = hazPreds, dataLong = UnempDurSubsetLong)
MartResid
sum(MartResid)

# Plot martingale residuals vs each covariate in the event interval
# Dotted line represents the loess estimate
plot(MartResid, covariates = c("age", "logwage"), dataLong = UnempDurSubsetLong)


Minimal Node Size Pruning For Competing Risks

Description

Computes optimal minimal node size of a discrete survival tree from a given vector of possible node sizes by cross-validation. Laplace-smoothing can be applied to the estimated hazards.

Usage

minNodePruningCompRisks(
  formulaVariable,
  data,
  treetype = "rpart",
  splitruleranger = "gini",
  sizes,
  indexList,
  timeColumn,
  eventColumns,
  alpha = 1,
  logOut = FALSE,
  eventColumnsAsFactor = FALSE,
  ...
)

Arguments

formulaVariable

Model formula for tree fitting (class "formula") of the form "~ x1 + x2 + ..." without response.

data

Discrete survival data in short format for which a survival tree is to be fitted (class "data.frame").

treetype

Type of tree to be fitted. Possible values are "rpart" or "ranger" (class "character"). The default is to fit an rpart tree; when "ranger" is chosen, a ranger forest with a single tree is fitted.

splitruleranger

String specifying the splitting rule of the ranger tree (class "character"). Possible values are either "gini" or "extratrees". Default is "gini".

sizes

Vector of different node sizes to try (class "integer"). Values need to be non-negative.

indexList

List of data partitioning indices for cross-validation (class "list"). Each element represents the test indices of one fold (class "integer").

timeColumn

Character giving the column name of the observed times in the "data"-argument (class "character").

eventColumns

Character vector giving the column names of the event indicators (excluding censoring column) in the "data"-argument (class "character").

alpha

Parameter for laplace-smoothing. A value of 0 corresponds to no laplace-smoothing (class "numeric").

logOut

Logical value (class "logical"). If True, computation progress will be written to console.

eventColumnsAsFactor

Should the argument eventColumns be intepreted as column name of a factor variable (class "logical")? Default is FALSE.

...

Additional arguments to the estimation function. It is either "rpart" or "ranger" (see argument treetype).

Details

Computes the out-of-sample log likelihood for all data partitionings for each node size in sizes and returns the node size for which the log likelihood was minimal. Also returns an rpart tree with the optimal minimal node size using the entire data set.

Value

A list containing the two items

Note

Note that depending on argument treetype some arguments are fixed and can not be changed:

Examples

# Example unemployment data
library(Ecdat)
library(caret)
data(UnempDur)

# Select training and testing subsample
subUnempDur <- UnempDur[which(UnempDur$spell < 10),]
subUnempDur <- subUnempDur[1:250,]

# Creating status variable for data partitioning
subUnempDur$status <- ifelse(subUnempDur$censor1, 1, 
ifelse(subUnempDur$censor2, 2, ifelse(
subUnempDur$censor3, 3, ifelse(subUnempDur$censor4, 4, 0))))

# Create cross validation sets
# Stratified by events and time distribution
set.seed(1972)
indexList <- createFolds(factor(paste(subUnempDur$status, 
subUnempDur$spell, sep="_")), k = 5)

# Perform minimal node size pruning
formula1 <- ~ timeInt + age + logwage
sizes <- 1:10
timeColumn <- "spell"
eventColumns <- c("censor1", "censor2", "censor3","censor4")
optiTree <- minNodePruningCompRisks(formula1, subUnempDur, treetype = "rpart", sizes = sizes, 
indexList = indexList, timeColumn = timeColumn, eventColumns = eventColumns, alpha = 1, 
logOut = TRUE)
plot(optiTree)


Minimal Node Size Pruning

Description

Computes optimal minimal node size of a discrete survival tree from a given vector of possible node sizes by cross-validation. Laplace-smoothing can be applied to the estimated hazards.

Usage

## S3 method for class 'discSurvMinNodeSizePrune'
plot(x, ...)

minNodePruning(
  formulaVariable,
  dataShort,
  treetype = "rpart",
  splitruleranger = "hellinger",
  sizes,
  indexList,
  timeColumn,
  eventColumn,
  alpha = 1,
  logOut = FALSE,
  ...
)

Arguments

x

Object of class "discSurvMinNodeSizePrune"

...

Additional arguments to the estimation function. It is either "rpart" or "ranger" (see argument treetype).

formulaVariable

Model formula for tree fitting (class "formula") of the form "~ x1 + x2 + ..." without response.

dataShort

Discrete survival data in short format for which a survival tree is to be fitted (class "data.frame").

treetype

Type of tree to be fitted (class "character"). Possible values are "rpart" or "ranger". The default is to fit an rpart tree; when "ranger" is chosen, a ranger forest with a single tree is fitted.

splitruleranger

String specifying the splitting rule of the ranger tree (class "character"). Possible values are either "gini", "extratrees" or "hellinger". Default is "hellinger".

sizes

Vector of different node sizes to try (class "integer"). Values should be non-negative.

indexList

List of data partitioning indices for cross-validation (class "list"). Each element represents the test indices of one fold (class "integer").

timeColumn

Character giving the column name of the observed times in the data argument (class "character").

eventColumn

Character giving the column name of the event indicator in the data argument (class "character").

alpha

Parameter for laplace-smoothing. A value of 0 corresponds to no laplace-smoothing (class "numeric").

logOut

Logical value (class "logical"). If the argument is set to TRUE, then computation progress will be written to console.

Details

Computes the out-of-sample log likelihood for all data partitionings for each node size in sizes and returns the node size for which the log likelihood was minimal. Also returns an rpart tree with the optimal minimal node size using the entire data set.

Value

A list containing the two items

Note

Note that depending on argument treetype some arguments are fixed and can not be changed:

Examples

library(pec)
library(caret)
data(cost)

# Take subsample and convert time to years
cost$time <- ceiling(cost$time / 365)
costSub <- cost[1:50, ]

# Specify column names for data augmentation
timeColumn <- "time"
eventColumn <- "status"

# Create cross validation sets
# Stratified by event and time distribution
indexList <- createFolds(factor(paste(costSub$status, 
costSub$time, sep="_")), k = 5)

# Perform minimal node size pruning
formula1 <- ~ timeInt + prevStroke + age + sex
sizes <- 1:10
optiTree <- minNodePruning(formula1, costSub, treetype = "rpart", sizes = sizes, 
indexList = indexList, timeColumn =  timeColumn, eventColumn = eventColumn, 
alpha = 1, logOut = TRUE)
plot(optiTree)


Prediction Error Curves For Competing Risks

Description

Estimates prediction error curves for discrete survival competing risks models

Usage

predErrCurveCompRisks(
  hazards,
  testDataShort,
  trainDataShort,
  timeColumn,
  timeAsFactor = FALSE,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  tmax = NULL
)

## S3 method for class 'discSurvPredErrDiscCompRisks'
plot(x, ...)

Arguments

hazards

Predicted hazard rates on the test data with model fitted on training data (class "matrix"). Predictions are stored in the rows and the number of columns equal the number of events.

testDataShort

Test data in short format (class "data.frame").

trainDataShort

Train data in short format (class "data.frame").

timeColumn

Character giving the column name of the observed times (class "character").

timeAsFactor

Should the time intervals be coded as factor (class "logical")? Default is FALSE. In the default settings the column is treated as quantitative variable (class "numeric").

eventColumns

Character vector giving the column names of the event indicators without censoring column (class "character").

eventColumnsAsFactor

Should the argument eventColumns be interpreted as column name of a factor variable (class "logical")? Default is FALSE.

tmax

Gives the maximum time interval for which prediction errors are calculated (class "integer"). It must not be higher than the maximum observed time in the training data.

x

Object of class "discSurvPredErrDiscCompRisks"

...

Specification of additional arguments in function plot.

Value

Calculated prediction errors for each competing event. Array with one matrix per competing event, with the predictions in the rows and the time points in the columns.

Author(s)

Moritz Berger moritz.berger@zi-mannheim.de

References

Heyard R, Timsit J, Held L, consortium C (2019). “Validation of discrete time-to-event prediction models in the presence of competing risks.” Biometrical Journal, 62(3), 643-657.

See Also

intpredErrCurveCompRisks, predErrCurve

Examples


###########################
# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
selectInd1 <- 1:200
selectInd2 <- 201:400
trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ]
testSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ]

# Convert to long format 
trainSet_long <- dataLongCompRisks(dataShort=trainSet, timeColumn="spell", 
eventColumns=c("censor1", "censor4"), timeAsFactor=TRUE)
tmax          <- max(trainSet$spell)
testSet_long <- dataLongCompRisks(dataShort=testSet, timeColumn="spell", 
eventColumns=c("censor1", "censor4"), aggTimeFormat = TRUE, lastTheoInt=tmax,
timeAsFactor=TRUE)

# Estimate continuation ratio model with logit link
vglmFit <- VGAM::vglm(formula=cbind(e0, e1, e2) ~ timeInt + age + logwage, 
data=trainSet_long, family=VGAM::multinomial(refLevel="e0"))

# Calculate predicted hazards
predHazards <- VGAM::predictvglm(vglmFit, newdata=testSet_long, type="response")

# Compute prediction error 
predErrCurveCompRisks0 <- predErrCurveCompRisks(
hazards=predHazards[,-1], 
testDataShort=testSet, trainDataShort=trainSet, 
timeColumn="spell", eventColumns=c("censor1", "censor4"), 
tmax=tmax)

# Graphics
plot(predErrCurveCompRisks0)


GEE Model Fitting For Competing Risks

Description

Estimates generalized estimation equation model for each competing event separately. Dependence within person IDs is accounted for by assuming a working covariance structure.

Usage

## S3 method for class 'dCRGEE'
predict(object, newdata, ...)

estCompRisksGEE(
  dataShort,
  dataTransform = "dataLongCompRisks",
  corstr = "independence",
  formulaVariable = ~timeInt,
  storeAugData = TRUE,
  ...
)

Arguments

object

Discrete time competing risks GEE model prediction model (class "dCRGEE").

newdata

New data set to be used for prediction (class "data.frame").

...

Additional arguments to data transformation functions dataLongCompRisks or dataLongCompRisksTimeDep. Preprocessing function argument responseAsFactor has to be set to FALSE (Default).

dataShort

Original data set in short format with each row corresponding to one independent observation (class "data.frame").

dataTransform

Specification of the data transformation function from short to long format (class "character"). There are two available options: Without time dependent covariates ("dataLongCompRisks") and with time dependent covariates ("dataLongCompRisksTimeDep"). The default is set to the former.

corstr

Assumption of correlation structure (class "character"). The following are permitted: '"independence"', '"exchangeable"', '"ar1"', '"unstructured"' and '"userdefined".

formulaVariable

Specifies the right hand side of the regression formula (class "formula"). The default is to use the discrete time variable, which corresponds to a covariate free hazard. It is recommended to always include the discrete time variable "timeInt".

storeAugData

Should the augmented data set be saved (class "logical")? Defaults is TRUE. The data set is available as attribute "augData".

Details

Variables in argument formulaVariable need to be separated by "+ ". For example if the two variables timeInt and X1 should be included the formula would be "~ timeInt + X1". The variable timeInt is constructed before estimation of the model.

Value

Returns an object of class "geeglm".

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Lee M, Feuer EJ, Fine JP (2018). “On the analysis of discrete time competing risks data.” Biometrics, 74(4), 1468-1481.

See Also

covarGEE, dataLongCompRisks, dataLongCompRisksTimeDep, geeglm

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur [1:100, ]

# Estimate GEE models for all events
estGEE <- estCompRisksGEE(dataShort = SubUnempDur, dataTransform = "dataLongCompRisks", 
corstr = "independence", formulaVariable =~ timeInt + age + ui + logwage * ui, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")
names(estGEE)
estGEE[[1]]

# Predictions
SubUnempDurLong <- dataLongCompRisks(dataShort = SubUnempDur, 
eventColumns = c("censor1", "censor2", "censor3", "censor4"), timeColumn = "spell")
preds <- predict(estGEE, newdata = SubUnempDurLong)
head(preds)


Prediction Error Curve

Description

Estimates prediction error curves of arbitrary discrete survival prediction models. In prediction error curves the estimated and observed survival functions are compared adjusted by weights at given timepoints.

Usage

## S3 method for class 'discSurvPredErrDisc'
print(x, ...)

predErrCurve(
  hazards,
  timepoints,
  testTime,
  testEvent,
  trainTime,
  trainEvent,
  testObjLong
)

## S3 method for class 'discSurvPredErrDisc'
plot(x, ...)

Arguments

x

Object of class "discSurvPredErrDisc"

...

Specification of additional arguments in function plot.

hazards

Predicted discrete hazards in the test data (class "numeric").

timepoints

Vector of discrete time intervals on which the prediction error curve is calculated (class "integer").

testTime

Discrete survival times in the test data (class "numeric").

testEvent

Univariate event indicator in the test data (binary vector).

trainTime

Numeric vector of discrete survival times in the training data (class "numeric").

trainEvent

Integer vector of univariate event indicator in the training data (class "integer").

testObjLong

Independent observation identification numbers of test data in long format (integer vector). For example in medicine, this would be patient IDs. Each patient should have a unique identifier.

Details

The prediction error curves should be smaller than 0.25 for all time points, because this is equivalent to a random assignment error.

Value

Author(s)

Thomas Welchowski t.welchowski@psychologie.uzh.ch

References

Gerds TA, Schumacher M (2006). “Consistent estimation of the expected Brier Score in general survival models with right-censored event times.” Biometrical Journal, 48(6), 1029-1040.

Van der Laan MJ, Robins JM (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics.

See Also

intPredErr, predErrCurveCompRisks

Examples


# Example with cross validation and unemployment data 
library(Ecdat)
library(mgcv)
data(UnempDur)
summary(UnempDur$spell)

# Extract subset of data
set.seed(635)
IDsample <- sample(1:dim(UnempDur)[1], 100)
UnempDurSubset <- UnempDur [IDsample, ]
head(UnempDurSubset)
range(UnempDurSubset$spell)

# Generate training and test data
set.seed(7550)
TrainIndices <- sample (x = 1:dim(UnempDurSubset) [1], size = 75)
TrainUnempDur <- UnempDurSubset [TrainIndices, ]
TestUnempDur <- UnempDurSubset [-TrainIndices, ]

# Convert to long format
LongTrain <- dataLong(dataShort = TrainUnempDur, timeColumn = "spell", eventColumn = "censor1")
LongTest <- dataLong(dataShort = TestUnempDur, timeColumn = "spell", eventColumn = "censor1")
# Convert factor to numeric for smoothing
LongTrain$timeInt <- as.numeric(as.character(LongTrain$timeInt))
LongTest$timeInt <- as.numeric(as.character(LongTest$timeInt))

######################################################################
# Estimate a generalized, additive model in discrete survival analysis

gamFit <- gam (formula = y ~ s(timeInt) + age + logwage, data = LongTrain, family = binomial())

# Predict hazard rates on test data
predHaz <- predict(gamFit, newdata = LongTest, type = "response")

# Prediction error in first interval
tryPredErrDisc1 <- predErrCurve(hazards=predHaz, timepoints = 1, 
testTime = TestUnempDur$spell,
testEvent=TestUnempDur$censor1, trainTime = TrainUnempDur$spell,
 trainEvent=TrainUnempDur$censor1, testObjLong=LongTest$obj)
tryPredErrDisc1

# Prediction error of the 2. to 10. interval
tryPredErrDisc2 <- predErrCurve(hazards=predHaz, timepoints = 2:10,
testTime = TestUnempDur$spell,
testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell,
trainEvent = TrainUnempDur$censor1, testObjLong=LongTest$obj)
tryPredErrDisc2
plot(tryPredErrDisc2)

########################################
# Fit a random discrete survival forest

library(ranger)
LongTrainRF <- LongTrain
LongTrainRF$y <- factor(LongTrainRF$y)
rfFit <- ranger(formula = y ~ timeInt + age + logwage, data = LongTrainRF,
probability = TRUE)

# Predict hazards on test data
predHaz <- predict(rfFit, data = LongTest)$predictions[, 2]

# Prediction error in first interval
tryPredErrDisc1 <- predErrCurve(hazards=predHaz, timepoints = 1, 
testTime = TestUnempDur$spell,
testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell,
 trainEvent = TrainUnempDur$censor1, testObjLong=LongTest$obj)
tryPredErrDisc1

# Prediction error of the 2. to 10. interval
tryPredErrDisc2 <- predErrCurve(hazards=predHaz, timepoints = 2:10,
testTime = TestUnempDur$spell,
testEvent = TestUnempDur$censor1, trainTime = TrainUnempDur$spell,
trainEvent = TrainUnempDur$censor1, testObjLong=LongTest$obj)
tryPredErrDisc2
plot(tryPredErrDisc2)


Laplace Hazards Of Survival Tree For Competing Risks

Description

Predicts the Laplace-smoothed hazards of discrete survival tree based on fitted objects of class "rpart" or "ranger". Can be used for single-risk or competing risk discrete survival data.

Usage

survTreeLaplaceHazard(treeModel, newdata, alpha, rangerData = NULL)

Arguments

treeModel

Fitted tree object as generated by function rpart (class "rpart").

newdata

Data in long format for which hazards are to be computed. Must contain the same columns that were used for tree fitting (class "data.frame").

alpha

Smoothing parameter for laplace-smoothing. Must be a non-negative number. A value of 0 corresponds to no smoothing (class "numeric").

rangerData

Original training data (class "data.frame") for fitting treeModel of class "ranger". Must be in long format.

Value

A m by k matrix with m being the length of newdata and k being the number of classes in treeModel. Each row corresponds to the smoothed hazard of the respective observation.

See Also

ranger

Examples

############################################
# Example with rpart discrete survival tree
library(pec)
library(caret)

# Example data
data(cost)

# Convert time to years and select training and testing subsample
cost$time <- ceiling(cost$time/365)
costTrain <- cost[1:100, ]
costTest  <- cost[101:120, ]

# Convert to long format
timeColumn <- "time"
eventColumn <- "status"
costTrainLong <- dataLong(dataShort=costTrain, timeColumn = "time", 
                          eventColumn = "status")
costTestLong  <- dataLong(dataShort=costTest, timeColumn = "time", 
                          eventColumn = "status")
head(costTrainLong)

# Fit a survival tree
costTree <- rpart(formula = y ~ timeInt + prevStroke + age + sex, data = costTrainLong, 
                  method = "class")

# Compute smoothed hazards for test data
predictedhazards <- survTreeLaplaceHazard(costTree, costTestLong, 1)
predictedhazards

############################################
# Example with ranger discrete survival tree
library(pec)
library(caret)
library(ranger)
data(cost)

# Take subsample and convert time to years
cost$time <- ceiling(cost$time/365)
costSubTrain <- cost[1:50,]
costSubTest <- cost[51:70,]

# Specify column names for data augmentation
timeColumn<-"time"
eventColumn<-"status"
costSubTrainLong <- dataLong(costSubTrain, timeColumn, eventColumn)
costSubTestLong <- dataLong(costSubTest, timeColumn, eventColumn)

# Estimate discrete survival tree
formula <- y ~ timeInt + diabetes + prevStroke + age + sex
rangerTree <- ranger(formula, costSubTrainLong, num.trees = 1, mtry = 5, 
classification = TRUE, splitrule = "hellinger", replace = FALSE, 
sample.fraction = 1, max.depth = 5)

# Compute laplace-smoothed hazards
laplHaz <- survTreeLaplaceHazard(rangerTree, 
costSubTestLong, alpha = 1, costSubTrainLong)
laplHaz

Multiple Spell Employment Data

Description

Subsample of 1000 persons from the national longitudinal survey of youth 1979 data. Included covariates are age, children, ethnicity, marital status and sex. The bivariate responses current state (spell) and discrete time interval (year) are the last two columns.

Usage

data(unempMultiSpell)

Details

Author(s)

David Koehler koehler@imbie.uni-bonn.de

Source

National Longitudinal Survey of Youth


Compute Subdistribution Weights

Description

Function to compute new subdistribution weights for a test data set based on the estimated censoring survival function from a learning data set.

Usage

weightsLtoT(
  dataShortTrain,
  dataShortTest,
  timeColumn,
  eventColumns,
  eventFocus,
  eventColumnsAsFactor = FALSE
)

Arguments

dataShortTrain

Learning data in short format (class "data.frame").

dataShortTest

Test data in short format (class "data.frame").

timeColumn

Character specifying the column name of the observed event times (class "character"). It is required that the observed times are discrete (class "integer").

eventColumns

Character vector specifying the column names of the event indicators (class "logical")(excluding censoring events). It is required that a 0-1 coding is used for all events. The algorithm treats row sums of zero of all event columns as censored.

eventFocus

Column name of the event of interest, which corresponds to the type 1 event (class "character").

eventColumnsAsFactor

Should the argument eventColumns be interpreted as column name of a factor variable (class "logical")? Default is FALSE.

Value

Subdstribution weights for the test data in long format using the estimated censoring survival function from the learning data (class "numeric"). The length of the vector is equal to the number of observations of the long test data.

Author(s)

Moritz Berger moritz.berger@zi-mannheim.de

References

Berger M, Schmid M, Welchowski T, Schmitz-Valckenberg S, Beyersmann J (2020). “Subdistribution Hazard Models for Competing Risks in Discrete Time.” Biostatistics, 21(3), 449-466.

Berger M, Schmid M (2022). “Assessing the calibration of subdistribution hazard models in discrete time.” The Canadian Journal of Statistics, 50(2).

Zadeh SG, Behning C, Schmid M (2022). “An imputation approach using subdistribution weights for deep survival analysis with competing events.” Scientific Reports, 12(3815).

See Also

dataLongSubDist, calPlot

Examples

####################
# Data preprocessing

# Example unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
selectInd1 <- 1:100
selectInd2 <- 101:200
trainSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd1], ]
valSet <- UnempDur[which(UnempDur$spell %in% (1:10))[selectInd2], ]  

# Convert to long format
trainSet_long <- dataLongSubDist(dataShort = trainSet, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"), eventFocus = "censor1")
valSet_long <- dataLongSubDist(dataShort = valSet, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"), eventFocus = "censor1")

# Compute new weights of the validation data set 
valSet_long$subDistWeights <- weightsLtoT(trainSet, valSet, timeColumn = "spell", 
eventColumns = c("censor1", "censor4"), eventFocus = "censor1")

# Estimate continuation ratio model with logit link
glmFit <- glm(formula = y ~ timeInt + age + logwage, data = trainSet_long, 
family = binomial(), weights = trainSet_long$subDistWeights)

# Calculate predicted discrete hazards 
predHazards <- predict(glmFit, newdata = valSet_long, type = "response")

# Calibration plot 
calPlot(predHazards, testDataLong = valSet_long, weights = valSet_long$subDistWeights)