Special Topics

Survival Analysis

Survival Analysis is concerned with the distributions of lifetimes, often human as in Medical Sciences but also of components and machines.

Example: Life Table USA

Here is a life table, for the USA in 2016. Numbers are per 100,000 population:

life.us <- read.csv("us.life.table.csv")
kable.nice(life.us)
Age Male Female
<1 year 100000.00 100000.00
1-4 years 99378.94 99454.61
5-9 years 99281.21 99372.41
10-14 years 99216.42 99317.51
15-19 years 99133.12 99255.05
20-24 years 98772.20 99105.56
25-29 years 98039.19 98849.90
30-34 years 97148.16 98491.30
35-39 years 96152.90 98007.48
40-44 years 95006.17 97365.09
45-49 years 93692.59 96542.95
50-54 years 91897.80 95333.41
55-59 years 89193.00 93486.77
60-64 years 85086.59 90760.06
65-69 years 79496.85 87075.37
70-74 years 72184.16 81854.51
75-79 years 62329.23 74259.41
80-84 years 50291.00 63544.39
85+ years 35033.57 48766.34

so this tells us that of any 100,000 men 97148.16 survived to age 30, or about $2850/100000*100% = 2.8% had died, compared to \(1.5\%\) of the women.

Tables like this are very important to insurance companies when trying to figure out what premiums to charge for a life insurance, or for a company to find out how much their employees need to pay into a retirement plan, etc.

Let’s draw a curve that for each gender and age shows the proportion of the population alive.

First we need to reorganize the data as a data frame with numeric columns:

dim(life.us)
## [1] 19  3
life.us.1 <-
  data.frame(Age=rep(c(0, 2.5+5*0:16, 87), 2),
             Alive=c(life.us$Male, life.us$Female)/1e5,
             Gender=rep(c("Male", "Female"), each=19))

and now

ggplot(data=life.us.1, aes(Age, Alive, color=Gender)) +
    geom_line()

What does this say how many newborn male babies we might have in a certain population? How may 1 year olds and so on? Let’s assume our population consists of 100000 individuals, half male and female. Then the numbers in the table tell us the relative probabilities. For example for any 200 babies there should be about 98.8 20-24 year old males and 63.5 80-84 year old females. So we can generate a simulated population with

sim.pop <- list(Male=sample(c(0, 2.5+5*0:16, 87), 
   size=50000, replace = TRUE, prob=life.us$Male),
                Female=sample(c(0, 2.5+5*0:16, 87), 
   size=50000, replace = TRUE, prob=life.us$Female))
df <- data.frame(x=c(sim.pop$Male, sim.pop$Female),
      Gender=rep(c("Male", "Female"), each=50000))       
ggplot(df, aes(x=x)) + 
    geom_histogram(data = subset(df, Gender == "Male"), 
        fill = "red", alpha = 0.2) +
    geom_histogram(data = subset(df, Gender == "Female"), 
        fill = "blue", alpha = 0.2)


Let T denote a random variable describing a lifetime. Then we know that T takes values x>0 and T has a continuous distribution F with density f. In survival analysis several functions are of common interest:

  • survivor function:
    \[ S(t)=1-F(t)=P(X>t) \] which is the probability to survive past time t.

  • hazard function:

\[ h(t)=\lim_{t \rightarrow \infty} \frac{P(t<T<t+h)}{h} \] which is the probability to survive until time t, and then die.

  • cumulative hazard function:

\[ H(t)=\int_0^{\infty} h(t) dt \]

it is easy to check that

  • \(h(t)=f(t)/S(t)\)
  • \(H(t)=-\log S(t)\)

Here are some standard survival distributions and their associated functions:

  • Exponential
    • \(f(t)=\lambda \exp (-\lambda t)\)
    • \(S(t) = \exp (-\lambda t)\)
    • \(h(t) = \lambda\)
    • \(H(t) = \lambda t\)

    \(\lambda\) is called rate parameter

  • Weibull
    • \(f(t)= \frac{\alpha}{\lambda} (\frac{t}{\lambda})^{\alpha-1} \exp \left( -(\frac{t}{\lambda})^\alpha \right)\)
    • \(S(t) = \exp\left( -(\frac{t}{\lambda})^\alpha \right)\)
    • \(h(t) = \frac{\alpha}{\lambda} (\frac{t}{\lambda})^{\alpha-1}\)
    • \(H(t) = (\frac{t}{\lambda})^\alpha\)

    \(\alpha\) is called the shape parameter and \(\lambda\) the scale parameter

  • Log-Logistic
    • \(f(t)=\frac{\lambda \tau (\lambda t)^{\tau-1}}{(1+(\lambda t)^\tau)^2}\)
    • \(S(t) = \frac1{1+(\lambda t)^\tau}\)
    • \(h(t) = \frac{\lambda \tau (\lambda t)^{\tau-1}}{1+(\lambda t)^\tau}\)
    • \(H(t) =\log (1+ (\lambda t)^\tau)\)

also often used are the log-normal and the gamma distributions.

Example: Life table USA

Let’s see whether we can fit a Weibull distribution to our data:

y <- life.us.1$Alive[life.us.1$Gender=="Male"]
x <- life.us.1$Age[life.us.1$Gender=="Male"]
fit.male <- coef(nls(y~exp(-(x/l)^a), 
                     start=list(l=100, a=4)))
y <- life.us.1$Alive[life.us.1$Gender=="Female"]
x <- life.us.1$Age[life.us.1$Gender=="Female"]
fit.female <- coef(nls(y~exp(-(x/l)^a), 
                       start=list(l=100, a=4)))
x <- seq(0, 85, length=250)
df.male <- data.frame(x=x, 
          y=exp(-(x/fit.male[1])^fit.male[2]))
df.female <- data.frame(x=x, 
          y=exp(-(x/fit.female[1])^fit.female[2]))
ggplot(data=life.us.1, aes(Age, Alive)) +
    geom_point() +
  geom_line(data=df.male, aes(x, y)) +
  geom_line(data=df.female, aes(x, y))

and these fits are ok but not great.

How about a gamma distribution?

y <- life.us.1$Alive[life.us.1$Gender=="Male"]
x <- life.us.1$Age[life.us.1$Gender=="Male"]
fit.male <- coef(nls(y~(1-pgamma(x, a, b)), 
                     start=list(a=10, b=0.1)))
y <- life.us.1$Alive[life.us.1$Gender=="Female"]
x <- life.us.1$Age[life.us.1$Gender=="Female"]
fit.female <- coef(nls(y~(1-pgamma(x, a, b)), 
                     start=list(a=10, b=0.1)))
x <- seq(0, 85, length=250)
df.male <- data.frame(x=x, 
          y=1-pgamma(x, fit.male[1], fit.male[2]))
df.female <- data.frame(x=x, 
          y=1-pgamma(x, fit.female[1], fit.female[2]))
ggplot(data=life.us.1, aes(Age, Alive)) +
    geom_point() +
  geom_line(data=df.male, aes(x, y)) +
  geom_line(data=df.female, aes(x, y))

and that looks a bit worse.

Example: Leukemia

This is survival times for leukemia, with covariates wbc, the white blood cell count and ag, a test result with values “present” or “absent”.

kable.nice(head(leukemia))
wbc ag time
2300 present 65
750 present 156
4300 present 100
2600 present 134
6000 present 16
10500 present 108
empdist <- function (data, npoints) 
{
    a <- hist(data, breaks = npoints, plot = F)
    x <- a$mids
    y <- cumsum(a$counts)/length(data)
    list(x = x, y = y)
}
zp <- empdist(leukemia$time[leukemia$ag == "present"], 10)
za <- empdist(leukemia$time[leukemia$ag == "absent"], 10)
df <- data.frame(Time=c(za$x, zp$x),
                 Survival=1-c(za$y, zp$y),
                 Status=rep(c("Present", "Absent"), 
                c(length(za$x), length(zp$x))))
df
##     Time  Survival  Status
## 1    2.5 0.5625000 Present
## 2    7.5 0.4375000 Present
## 3   12.5 0.4375000 Present
## 4   17.5 0.3125000 Present
## 5   22.5 0.2500000 Present
## 6   27.5 0.1875000 Present
## 7   32.5 0.1875000 Present
## 8   37.5 0.1875000 Present
## 9   42.5 0.1250000 Present
## 10  47.5 0.1250000 Present
## 11  52.5 0.1250000 Present
## 12  57.5 0.0625000 Present
## 13  62.5 0.0000000 Present
## 14  10.0 0.7058824  Absent
## 15  30.0 0.5294118  Absent
## 16  50.0 0.4705882  Absent
## 17  70.0 0.3529412  Absent
## 18  90.0 0.2941176  Absent
## 19 110.0 0.2352941  Absent
## 20 130.0 0.1176471  Absent
## 21 150.0 0.0000000  Absent
ggplot(data=df, aes(Time, Survival, color=Status)) +
    geom_line()

Censored Data

The most distinctive feature of survival analysis is censoring. Say we are testing a new treatment for late-stage stomach cancer. At the beginning of the study we select 100 cancer patients and begin treatment. After 1 year we wish to study the effectiveness of the new treatment. At that time 47 of the patients have died, and for them we know the number of days until death. For the other 53 patients we know that they have survived 365 days but we don’t know how much longer they will live. How do we use all the available information?

Censoring comes in a number of different forms:

  • right censoring: here “case i” leaves the trial at time Ci, and we either know Ti if \(T_i \le C_i\), or that \(T_i > C_i\).

  • we have random censoring if Ti and Ci are independent.

  • type I censoring is when the censoring times are fixed in advance, for example by the end of the clinical trial.

  • type II censoring is when the the experiment is terminated after a fixed number of failures.

In R we code the data as a pair \((t_i, d_i)\) where ti is the observed survival time and \(d_i=0\) if the observation is censored, 1 if a “death” is observed.

Survival data is often presented using a + for the censored observations, for example 35, 67+, 85+, 93, 101, etc.

Let \(t_1 < t_2 < .. < t_m\) be the m distinct survival times. Let \(Y_i(s)\) be an indicator function, which is 1 if person i is still at risk (that is alive) at time s and 0 otherwise, that is \(Y_i(s)=I_{(0,t_i)}(s)\).

Then the number of patients at risk at time s is \(r(s)=\sum Y_i(s)\). We can similarly define d(s) as the number of deaths occurring at time s.

There is also a modern approach to survival data based on stochastic processes, especially martingales and counting processes. In this setup one considers the counting process \(N_i(t)\) associated with the ith subject, so \(N_i(t)\) changes by 1 at each observed event ( for example, a visit to the doctor, a heart attack, a death etc).

How can we estimate the survivor curve in the presence of censoring? One of the most commonly used estimators is the Kaplan-Meier estimator:

\[ \hat{S}(x)= \prod_{t_i<t} \frac{r(t_i)-d(t_i)}{r(t_i)} \] In R this is calculated by the routine survfit. It uses as its argument a “survival” object which in turn is generated using the Surv function.

Example: Leukemia

This data does not have censoring, bit let’s see what Kaplan Meier looks like anyway:

kable.nice(head(leukemia))
wbc ag time
2300 present 65
750 present 156
4300 present 100
2600 present 134
6000 present 16
10500 present 108
library(survival)
fit <- survfit(Surv(time) ~ ag, data = leukemia)
plot(fit)

Example: Gehan data

A data frame from a trial of 42 leukaemia patients. Some were treated with the drug 6-mercaptopurine and the rest are controls. The trial was designed as matched pairs, both withdrawn from the trial when either came out of remission.

kable.nice(head(gehan))
pair time cens treat
1 1 1 control
1 10 1 6-MP
2 22 1 control
2 7 1 6-MP
3 3 1 control
3 32 0 6-MP
fit <- survfit(Surv(time, cens) ~ treat, data = gehan)
plot(fit, xlab = "Time", ylab = "Est. Remission", 
     col = c("blue", "red"))
legend(1, 0.2, c("Control", "6-MP"), 
       col = c("red", "blue"), lty = c(1, 1))

The obvious question is whether there are differences between the remission times of the two groups:

survdiff(Surv(time, cens) ~ treat, data = gehan)
## Call:
## survdiff(formula = Surv(time, cens) ~ treat, data = gehan)
## 
##                N Observed Expected (O-E)^2/E (O-E)^2/V
## treat=6-MP    21        9     19.3      5.46      16.8
## treat=control 21       21     10.7      9.77      16.8
## 
##  Chisq= 16.8  on 1 degrees of freedom, p= 4e-05

and we see that there is.

Confidence intervals for the survival curve are automatically computed by the survfit function. By default it finds intervals of the form \(S \pm 1.96se(S)\). Other options are:

  • conf.type=“log”: \(\exp(\log(S) \pm 1.96se(H))\)
  • conf.type=“log-log”: \(\exp (- \exp(\log(-\log(S) \pm 1.96se(\log(H)))\)

One advantage of log-log is that the limits are always in (0,1).

Let’s find \(90\%\) confidence intervals based on these three methods for t=10 in the 6-MP group.

A <- matrix(0, 3, 2)
dimnames(A) <- list(c("plain", "log", "log-log"), 
                    c("Lower", "Upper"))
fit <- survfit(Surv(time, cens) ~ treat, data = gehan, 
            conf.type = "plain", conf.int = 0.9)
A[1, 1] <- fit$lower[fit$time == "10"]
A[1, 2] <- fit$upper[fit$time == "10"]
fit <- survfit(Surv(time, cens) ~ treat, data = gehan, 
            conf.type = "log", conf.int = 0.9)
A[2, 1] <- fit$lower[fit$time == "10"]
A[2, 2] <- fit$upper[fit$time == "10"]
fit <- survfit(Surv(time, cens) ~ treat, data = gehan, 
            conf.type = "log-log", conf.int = 0.9)
A[3, 1] <- fit$lower[fit$time == "10"]
A[3, 2] <- fit$upper[fit$time == "10"]
round(A, 2)
##         Lower Upper
## plain    0.59  0.91
## log      0.61  0.93
## log-log  0.55  0.87

An important question for new patients is of course the average survival time. As always “average” can be computed in different ways. R uses the median and the estimate is part of the output for a survfit object

survfit(Surv(time, cens) ~ treat, data = gehan)
## Call: survfit(formula = Surv(time, cens) ~ treat, data = gehan)
## 
##                n events median 0.95LCL 0.95UCL
## treat=6-MP    21      9     23      16      NA
## treat=control 21     21      8       4      12

Sometimes it is possible to fit a parametric curve via generalized linear models to the survival curve. Say we wish to fit an exponential model, and suppose we have a covariate vector x for each case. Say \(\lambda_i\) is the rate of the exponential for case i. Then we can connect the rate to the covariates via the model \(\lambda_i = \beta^T x\). This might on occasion be negative, and it is often better to use \(\log(\lambda_i)=\beta^Tx\).

Example: Leukemia

Let’s see whether the Leukemia data can be modeled in this way. First we have a look at the relationship of time and its covariates:

pushViewport(viewport(layout = grid.layout(2, 2)))
print(ggplot(data=leukemia, aes(time, wbc)) +
        geom_point(),
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=leukemia, aes(ag, time)) + 
        geom_boxplot(),
  vp=viewport(layout.pos.row=1, layout.pos.col=2))        
print(ggplot(data=leukemia, aes(log(time), log(wbc))) +
               geom_point(),
  vp=viewport(layout.pos.row=2, layout.pos.col=1))

The relationships do not appear to be linear, and we fix this by using a log transform on time and wbc.

What model might be appropriate for this data set? For the exponential model we have \(-\log S(t)=H(t)=\lambda t\), so if we plot \(-\log S(t)\) vs t we should see a linear relationship:

fit <- survfit(Surv(time, rep(1, 33)) ~ ag, data = leukemia)
df <- data.frame(Time=fit$time,
             y=-log(fit$surv),
             Status=rep(c("present", "absent"), c(12, 15)) )
ggplot(data=df, aes(Time, y, color=Status)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(y= "-logS")

we fit this model via glm:

options(contrasts = c("contr.treatment", "contr.poly"))
fit <- glm(time ~ ag + log(wbc), family = Gamma(log),
           data=leukemia)
summary(fit, dispersion = 1)
## 
## Call:
## glm(formula = time ~ ag + log(wbc), family = Gamma(log), data = leukemia)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1746  -1.2663  -0.4251   0.4962   1.9048  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   5.8154     1.2932   4.497 6.89e-06
## agpresent     1.0176     0.3492   2.914  0.00357
## log(wbc)     -0.3044     0.1319  -2.308  0.02097
## 
## (Dispersion parameter for Gamma family taken to be 1)
## 
##     Null deviance: 58.138  on 32  degrees of freedom
## Residual deviance: 40.319  on 30  degrees of freedom
## AIC: 301.49
## 
## Number of Fisher Scoring iterations: 8

Cox Proportional Hazard Model

A nonparametric approach to survival data was first introduced by Cox in 1972. Here we model the hazard function h as follows: there is a baseline hazard \(h_0(t)\) which is modified by covariates, so the hazard function for any individual case is \[ h(t) = h_0(t)\exp(\beta^Tx) \] and the interest is mainly in \(\beta\).

The vector \(\beta\) is estimated via partial likelihood.

Suppose a death occurs at time tj. Then conditionally on this event the probability that case i died is

\[ \frac{h_0(t)\exp (\beta^Tx_i)}{\sum_j I{[T_j\ge t]}h_0(t)\exp (\beta^Tx_i)}=\frac{\exp (\beta^Tx_i)}{\sum_j I{[T_j\ge t]}\exp (\beta^Tx_i)} \] which does not depend on the baseline hazard.

Example: Leukemia

fit <- coxph(Surv(time) ~ ag + log(wbc), data=leukemia)
summary(fit)
## Call:
## coxph(formula = Surv(time) ~ ag + log(wbc), data = leukemia)
## 
##   n= 33, number of events= 33 
## 
##              coef exp(coef) se(coef)      z Pr(>|z|)
## agpresent -1.0691    0.3433   0.4293 -2.490  0.01276
## log(wbc)   0.3677    1.4444   0.1360  2.703  0.00687
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## agpresent    0.3433     2.9126     0.148    0.7964
## log(wbc)     1.4444     0.6923     1.106    1.8857
## 
## Concordance= 0.726  (se = 0.065 )
## Rsquare= 0.377   (max possible= 0.994 )
## Likelihood ratio test= 15.64  on 2 df,   p=4e-04
## Wald test            = 15.06  on 2 df,   p=5e-04
## Score (logrank) test = 16.49  on 2 df,   p=3e-04

Example: Lung Cancer

Survival in patients with lung cancer at Mayo Clinic. Performance scores rate how well the patient can perform usual daily activities.

Variables:
inst: Institution code time: Survival time in days status: censoring status 1=censored, 2=dead
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score (0=good 5=dead) ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months

Source: Terry Therneau

kable.nice(head(lung.cancer))
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
3 306 2 74 1 1 90 100 1175 NA
3 455 2 68 1 0 90 90 1225 15
3 1010 1 56 1 0 90 90 NA 15
5 210 2 57 1 1 90 60 1150 11
1 883 2 60 1 0 100 90 NA 0
12 1022 1 74 1 1 50 80 513 0
library(GGally)
ggpairs(lung.cancer)

Next we want to fit the Cox proportional hazards model, but there are two issues we need to deal with:

  • what to do with the missing values? we remove them from the data set using na.action=na.omit

  • in this experiment sex was a stratification variable. We can include this fact in our model by using strata(sex) instead of just sex. This allows for non proportional hazards.

fit <- coxph(Surv(time, status) ~ strata(sex) + age + 
       ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, 
                 na.action = na.omit,
          data=lung.cancer)
summary(fit)
## Call:
## coxph(formula = Surv(time, status) ~ strata(sex) + age + ph.ecog + 
##     ph.karno + pat.karno + meal.cal + wt.loss, data = lung.cancer, 
##     na.action = na.omit)
## 
##   n= 168, number of events= 121 
##    (60 observations deleted due to missingness)
## 
##                 coef  exp(coef)   se(coef)      z Pr(>|z|)
## age        9.054e-03  1.009e+00  1.160e-02  0.780   0.4352
## ph.ecog    7.073e-01  2.029e+00  2.228e-01  3.175   0.0015
## ph.karno   2.072e-02  1.021e+00  1.128e-02  1.836   0.0663
## pat.karno -1.330e-02  9.868e-01  8.050e-03 -1.652   0.0985
## meal.cal  -5.268e-06  1.000e+00  2.629e-04 -0.020   0.9840
## wt.loss   -1.520e-02  9.849e-01  7.890e-03 -1.927   0.0540
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## age          1.0091     0.9910    0.9864     1.032
## ph.ecog      2.0285     0.4930    1.3108     3.139
## ph.karno     1.0209     0.9795    0.9986     1.044
## pat.karno    0.9868     1.0134    0.9713     1.002
## meal.cal     1.0000     1.0000    0.9995     1.001
## wt.loss      0.9849     1.0153    0.9698     1.000
## 
## Concordance= 0.638  (se = 0.041 )
## Rsquare= 0.121   (max possible= 0.994 )
## Likelihood ratio test= 21.58  on 6 df,   p=0.001
## Wald test            = 21.9  on 6 df,   p=0.001
## Score (logrank) test = 22.49  on 6 df,   p=0.001

Notice the three hypothesis tests at the bottom. They do the job of the F-test in regression, namely testing whether all the variables together are useful for predicting time. Hear clearly they are.

Checking the individual covariates we see that age and meal.cal are not significant. Let’s remove them

fit <- coxph(Surv(time, status) ~ strata(sex) + ph.ecog + 
            ph.karno + pat.karno + wt.loss, 
            na.action = na.omit, data=lung.cancer)
summary(fit)
## Call:
## coxph(formula = Surv(time, status) ~ strata(sex) + ph.ecog + 
##     ph.karno + pat.karno + wt.loss, data = lung.cancer, na.action = na.omit)
## 
##   n= 210, number of events= 148 
##    (18 observations deleted due to missingness)
## 
##                coef exp(coef)  se(coef)      z Pr(>|z|)
## ph.ecog    0.649467  1.914520  0.200699  3.236  0.00121
## ph.karno   0.017304  1.017454  0.010314  1.678  0.09342
## pat.karno -0.016679  0.983459  0.007255 -2.299  0.02151
## wt.loss   -0.013729  0.986365  0.006906 -1.988  0.04681
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## ph.ecog      1.9145     0.5223    1.2919    2.8372
## ph.karno     1.0175     0.9828    0.9971    1.0382
## pat.karno    0.9835     1.0168    0.9696    0.9975
## wt.loss      0.9864     1.0138    0.9731    0.9998
## 
## Concordance= 0.633  (se = 0.038 )
## Rsquare= 0.115   (max possible= 0.995 )
## Likelihood ratio test= 25.71  on 4 df,   p=4e-05
## Wald test            = 26.57  on 4 df,   p=2e-05
## Score (logrank) test = 27.16  on 4 df,   p=2e-05

Next we want to try and assess whether this model is appropriate. In regression we use the residual vs. fits plot for this purpose. Here we have something similar. There are actually several kinds of residuals in a survival analysis, we will use what are called the martingale residuals. The plots are the residuals with the variable of interest vs. the residuals without the variable of interest. Again we need to deal with the missing values, and we remove them from the data set. Finally we add a loess curve to the graphs. All of this is done in

lung1 <- na.omit(lung.cancer[, -c(1, 4, 9)])
fit <- coxph(Surv(time, status) ~ strata(sex) + ph.karno + 
            pat.karno + wt.loss, data=lung1)
df <- data.frame(ph.ecog=lung1$ph.ecog,
                 pat.karno=lung1$pat.karno,
                 ph.karno=lung1$ph.karno,
                 wt.loss=lung1$wt.loss,
                 Residuals=resid(fit))
pushViewport(viewport(layout = grid.layout(2, 2)))
print(ggplot(data=df, aes(ph.ecog, Residuals)) +
        geom_point() + geom_smooth(se=FALSE),
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(ph.karno, Residuals)) +
        geom_point() + geom_smooth(se=FALSE),
  vp=viewport(layout.pos.row=1, layout.pos.col=2))        
print(ggplot(data=df, aes(pat.karno, Residuals)) +
        geom_point() + geom_smooth(se=FALSE),
  vp=viewport(layout.pos.row=2, layout.pos.col=1))
print(ggplot(data=df, aes(wt.loss, Residuals)) +
        geom_point() + geom_smooth(se=FALSE),
  vp=viewport(layout.pos.row=2, layout.pos.col=2))        

All of the relationships look reasonably linear.

Finally we can have a look whether the assumption of proportional hazards is justified. For this we use the cox.zph function. This plots the rescaled Shoenfeld residuals. A flat appearance indicates that the assumption is ok. The corresponding object carries out hypothesis tests for a significant slope in the scatterplots, which support our assessment.

fit.zph <- cox.zph(fit)
plot(fit.zph)

print(fit.zph)
##              rho chisq      p
## ph.karno  0.1573 3.116 0.0775
## pat.karno 0.0540 0.472 0.4921
## wt.loss   0.0453 0.341 0.5591
## GLOBAL        NA 5.923 0.1154