The ridge function, matrices, and prediction

Terry Therneau

Dec 2020

The ridge function

The ridge() function was included in the survival package as a test of the code for penalized models. As the author of the package, I never actually use the function myself, and as a consequence it has never received any polish. But it does work. This neglect is simply due to the types of data that I analyse.

This vignette was created to address a common issue with using the function, one that arises with some frequency on internet message boards. Hopefully this note will provide context and a reference for a solution. I will create a dummy dataset with 20 random 0/1 variables as a running example.

set.seed(1954)  # force reproducability
library(survival)

n <- nrow(lung)
snp <- matrix(rbinom(20*n, 1, p=.1), nrow=n)
snpdata <- cbind(lung, data.frame(snp))
dim(snpdata)
#> [1] 228  30

The issue

The ridge function makes the most sense when there are a large number of variables, for instance if one had 100 SNPs. Say that you write the call in the obvious way:

cfit1 <- coxph(Surv(time, status) ~ age + sex + ridge(X1, X2, X3, X4, X5, X6,
                     X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, 
             X19, X20, theta=.1), data=snpdata)

The problem with this, of course, is that typing this statement is cumbersome for 20 SNPs and practically impossible if there were 500. One can use . in R formulas to stand for “all the other variables”, but this does not work inside a function. This leads to the first work-around, which is to create the formula programatically.

xlist <-  paste0("X", 1:20)
myform <- paste("Surv(time, status) ~ age + sex + ridge(",
                paste(xlist, collapse= ", "), ", theta=.1)")
cfit2 <- coxph(formula(myform), data=snpdata)

all.equal(cfit1$loglik, cfit2$loglik)
#> [1] TRUE

This approach works well up to a point and then fails, once the model statement grows too long for the internal buffer of the R parser. The solution is to call the ridge function with a single matrix argument.

cfit3 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1),
         data = snpdata)

This fit has completely ignored the variables X1, X2, …, X50 found in the dataframe. If the SNP data had come to us as a dataframe, then we would create a temporary matrix using the as.matrix command applied to the appropriate columns of said data.

The problem with this approach comes if we want to do predictions using a new set of data.

newsnp <- matrix(rbinom(20*4, 1, p=.12), nrow=4)
prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
                     newsnp)
predict(cfit3, newdata=prdata)
#> Error in model.frame.default(data = prdata, formula = ~age + sex + ridge(snp, : variable lengths differ (found for 'ridge(snp, theta = 0.1)')

We get a failure message that “variable lengths differ”. The reason is that R fitting functions look for their variables first in the data= argument, and then look in the primary working data for any that were not found. The call to cfit3 has 3 variables of age, sex, and snp; while the new dataframe prdata has variables of age, sex, X1, X2, …, X20. The predict function pulls 4 rows from prdata (for age and sex), and 50 rows from the global variable snp. This clearly does not work.

What if we put those variables in the the main data, instead of a dataframe?

prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
                     newsnp)
snpsave <- snp
snp <- newsnp
age <- prdata$age
sex <- prdata$sex

new <- predict(cfit3)
length(new)
#> [1] 228

snp <- snpsave  # restore the status quo
rm(age, sex)

This time we have been tripped up by the predict.coxph function. If there is no newdata argument, the function assumes the user is asking for predictions using the original data, and that is what is produced. There is also the need to restore datasets that we overwrote.

The most direct way around this is to do the prediction “by hand”. Since this dataset has no factor variables, splines, or other terms that lead to special coding, it is fairly easy to do the computation.

prmat <- cbind(age= c(50, 65, 48, 70), sex=c(1,1,2,1), newsnp)
drop(prmat %*% coef(cfit3)) # simplify results to vector
#> [1] 0.1428461 0.9497542 0.1203784 1.0034065
#alternate (center)
drop(prmat %*% coef(cfit3)) - sum(coef(cfit3)* cfit3$mean)
#> [1] -0.2148958  0.5920124 -0.2373634  0.6456646

The predict.coxph function, by default, gives predictions for centered covariates. This has its roots in the internals of the coxph code, where centering is used to avoid overflow of the exp function. This is entirely optional when doing the prediction ourselves, unless one wants to match predict.coxph. (If I could go back in time, centering the predictions would be undone; it’s effect has been mostly to cause confusion. Que sera sera.)

Suppose that one still wanted to use the standard predict function, for instance if the model did include a factor variable, or simply to fit in to the usual R pattern? What is needed is a way to store a matrix directly into a dataframe. This is done regularly by the model.frame function; the code below causes data.frame to use that behavior for our matrices of SNP values.

# new data type
ridgemat <- function(x) {
    class(x) <- c("ridgemat", class(x))
    x
}
as.data.frame.ridgemat <- function(x, ...) 
    as.data.frame.model.matrix(as.matrix(x), ...)

snpdata2 <- cbind(lung, snp= ridgemat(snp))
names(snpdata2)
#>  [1] "inst"      "time"      "status"    "age"       "sex"      
#>  [6] "ph.ecog"   "ph.karno"  "pat.karno" "meal.cal"  "wt.loss"  
#> [11] "snp"

cfit4 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1),
             data= snpdata2)
     

prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
                       snp= ridgemat(newsnp))
predict(cfit4, newdata=prdata)
#>          1          2          3          4 
#> -0.2148958  0.5920124 -0.2373634  0.6456646

Success!

The downside to this is that the modified dataframe object is known to work with modeling functions, but is not guaranteed to be comparable with standard manipulations for dataframes, such as merge, subset, etc., or tidyverse concepts such as a tibble.