93 Relative Survival Analysis In R

Partial capture of text on file.
                                                     computer methods and programs in biomedicine 81(2006)272–278
                                                   journal homepage: www.intl.elsevierhealth.com/journals/cmpb
                        Relative survival analysis in R
                        MajaPohara,∗, Janez Stareb
                        a Department of Medical Informatics, University of Ljubljana, Vrazov trg 2, SI-1000 Ljubljana, Slovenia
                        b Department of Medical Informatics, University of Ljubljana, Slovenia
                        article info                               abstract
                        Article history:                           Relative survival techniques are used to compare the survival experience in a study cohort
                        Received 7 March 2005                      with the one expected should they follow the background population mortality rates. The
                        Received in revised form 12 January        techniquesareespeciallyusefulwhenthecause-specicdeathinformationisnotaccurate
                        2006                                       ornotavailablesincetheyprovideameasureofexcessmortalityinagroupofpatientswith
                        Accepted12January2006                      acertaindisease.Thereareseveralapproachestomodelingrelativesurvival,butthereisno
                                                                   widely used statistical package that would incorporate the relevant techniques. The exist-
                        Keywords:                                  ingsoftwarewasmostlywrittenbytheauthorsofdifferentmethods,indifferentcomputer
                        Relative survival                          languages and with different requirements for the data input, which makes it almost im-
                        Rsoftware                                  possibleforausertochoosebetweenavailablemodels.WedescribeourRpackagerelsurv
                        Regression                                 that provides functions for easy and exible tting of several relative survival regression
                                                                   models.
                                                                                                              ©2006ElsevierIrelandLtd. All rights reserved.
                        1.      Introduction                                                population life tables. Obviously, r(t) can be any non-negative
                                                                                            number,althoughthemethodsaremostoftenappliedtodata
                        Survivalanalysisisconcernedwithstudyingtherandomvari-               wherer(t) is less than 1.
                        able T, representing the time between entry to a study and             Severalapproachestomodellingrelativesurvivalexist,but
                        some event of interest (e.g. death, recurrence of disease ...).     all of the existing programmes (for example: Surv3 [2], SAS
                        InsteadofthecumulativedistributionfunctionF(t),weusually            macrosandStatafunctions[3],RSurvRfunction[4])focuson
                        estimate the cumulative survival function S(t), which is de-        only one of the models and use specically organised general
                        nedasS(t)=P(T>t)=1F(t).Inthecaseofthenalevent                    populationtables, makingitdifcultfortheuserstocompare
                        beingdeath,S(t)istheprobabilityofbeingaliveattimet.When             different methods.
                        the nal event of interest is death due to a certain disease,          We present three R [5] functions organised as a package
                        but the causes of death are not known, it is not possible to        called relsurv that largely simplify the usage of relative sur-
                        directly estimate the proportion of dead due to the disease         vival regression models. All the functions require the same
                        in question. We then resort to the methods of relative sur-         basic organisation of the data and can be used with any for-
                        vival. The cumulative relative survival function r(t) is dened     matofthepopulationmortalitydata.
                        [1] as                                                                 Section 2 briey describes the three most commonly used
                                                                                            regression approaches and gives an outline of the theory for
                        r(t):= SO(t)                                                 (1)    the ve tting methods used in the relsurv functions. Section
                               S (t)
                                P                                                           3 describes the functions, their arguments, the preparation
                                                                                            of the data and the returned values. The usage is further ex-
                        whereS (t)denotesobservedsurvivalandSP(t)standsforpop-              plainedthroughanexampleinSection4.Incasetheuserdoes
                                O
                        ulationorexpectedsurvival,whichisestimatedonthebasisof              not yet have the relevant population mortality tables organ-
                         ∗ Corresponding author. Tel.:+386 1 543 7785; fax: +386 1 543 7771.
                           E-mail address: maja.pohar@mf.uni-lj.si (M. Pohar).
                        0169-2607/$ – see front matter © 2006 Elsevier Ireland Ltd. All rights reserved.
                        doi:10.1016/j.cmpb.2006.01.004
                                                                               computer methods and programs in biomedicine 81(2006)272–278                                                                                           273
                                      isedintherequiredratetableformat,theappendixprovides                                                          the form E(t) = exp{′z+
′fu(t)}, where z is the vector of
                                      somequickhelp.                                                                                                the predictor variables, fu the vector of the follow-up in-
                                                                                                                                                    tervalindicators(withcomponentsfu,wherefu(t) = 1on
                                                                                                                                                                                                              i               i
                                                                                                                                                    interval i and 0 elsewhere) and  and 
 are the vectors of
                                      2.           Relative survival regression models                                                              regression coefcients. If we denote the interval-specic
                                                                                                                                                                                                                                           ∗
                                                                                                                                                    observedandexpectedsurvivalproportionsbyp andp
                                                                                                                                                                                                                               ki          ki
                                      The relative survival literature most frequently refers to the                                                respectively, Eqs. (2)–(4) lead us to
                                      additive models, dominating especially in cancer research. The                                                             
                                      mainassumptionisthatthehazardofeveryindividualcanbe                                                           ln     ln pki              =′z+
,                                                  (6)
                                      split into two additive components:                                                                                            p∗                      i
                                                                                                                                                                       ki
                                       (t) =  (t) +  (t),                                                                      (2)               which is a generalized linear model with a binomial er-
                                        O          P          E
                                                                                                                                                    ror structure and a complementary log–log link function
                                      whereP(t)isthehazardeverypatienttakesbecauseofhisage,                                                        combinedwithadivisionbyp∗.
                                                                                                                                                                                                   ki
                                      sex and cohort year (or any other combination of covariates                                             (ii)  The glm model with the Poisson error structure [6]: the
                                      included in the population mortality data).  (t) denotes the                                                 methodissimilartoHakulinen–Tenkanenanditrequires
                                                                                                             E
                                      excesshazard,specicforthediseaseinquestionand (t)is                                                         the same grouping of the data. Again the  is assumed
                                                                                                                             O                                                                                          E
                                      usedfor the observed hazard - the one we can estimate from                                                    to be of the  (t) = exp{′z+
′fu(t)} form. The number of
                                                                                                                                                                         E
                                      our data. Using the equality                                                                                  deaths dki is assumed to be distributed as Poisson(ki),
                                                                                                                                                    where  =                 · y    and y is the person-time at risk
                                                                                                                                                                 ki      E,ki    ki           ki
                                                  (t)dt                                                                                           for group k of observations on time interval i. Denoting
                                      S(t) = e               ,                                                                    (3)                                                                       ∗
                                                                                                                                                    the expected number of deaths as d , Eq. (2) can thus be
                                                                                                                                                                                                            ki
                                           wecanwrite                                                                                               writen as
                                                                                     SO(t)                                                                      ∗
                                      S (t) = S (t) × S (t) =⇒ S (t) =                       ,                                    (4)                        d
                                        O          P          E            E                                                                           ki       ki      ′z+

                                                                                     S (t)                                                                =        +e         i                                                          (7)
                                                                                       P                                                             y        y
                                                                                                                                                       ki       ki
                                      givingthesameexpressionas1.Severalttingapproachesfor                                                         ln( d∗)=′z+
 +ln(y ).                                                             (8)
                                      the additive model exist and a review of the three methods                                                          ki      ki                i         ki
                                      described in this paper is given in [6].                                                                           Thisimpliesageneralizedlinearmodelwithoutcome
                                           Themultiplicativemodelsassumethehazardcomponentsto                                                                                                                           ∗
                                                                                                                                                    dki, Poisson error structure, link ln(ki d ) and offset
                                                                                                                                                                                                                        ki
                                      bemultiplicative:                                                                                             ln(yki).
                                                                                                                                                         Themodelcanberewritten[6]toyieldthesamelike-
                                       (t) =  (t) ·  (t).                                                                      (5)                                                         `
                                        O          P        E                                                                                       lihood function as the Esteve model, meaning that the
                                                                                                                                                    grouping of the variables is not crucial. The idea of split-
                                      Such a model does not assume that the observed hazard is                                                      ting the observations into subject-band observationsand
                                      always greater than the population hazard, but has a less ob-                                                 taking d∗ /y         as a kind of an average of individual  val-
                                                                                                                                                                 ki   ki                                                            P
                                      vious interpretation as the additive model.                                                                   ues however causes the estimates to slightly differ from
                                           Thethirdoptionarethetransformationmodels[7]thatmake                                                                               `
                                                                                                                                                    those in the Esteve model.
                                      noassumptionabouttherelationship between the observed,                                                                    `
                                                                                                                                             (iii)  TheEsteveadditivesurvivalmodel[10]:themethoduses
                                      the population and the excess hazard. All the individual sur-                                                 individual data and estimates the coefcients using the
                                      vival times are rst transformed to a different scale (by taking                                              maximumlikelihoodapproach.Thelikelihoodforthead-
                                      intoaccountthegeneralpopulationmortality),wheretheycan                                                        ditive model is
                                      befurther analysed by any of the ordinary survival models.
                                           Ourprogramsprovideforttingthetransformationmodel,                                                                   n                        tj                          n
                                      the Andersen multiplicative model [8], and three differ-                                                                  dj                                                   
                                                                                                                                                    L() =          ( (t ))     exp               (s)ds         =       ( (t )
                                                                                                                                                                      O j                            O                        P j
                                      ent approaches to tting the additive model. Only the last                                                               j=1                             0                       j=1
                                      four have been widely used in the past, while the trans-                                                                                                                                     
                                      formation approach has only recently been published and                                                                                                       tj
                                                                                                                                                                   ′z+
′fu(t )                                     ′z+
′fu(s)
                                                                                                                                                               +e             j )d exp               ( (s) + e                )ds      ,
                                      given its simplicity, it might become popular in the fu-                                                                                    j                      P
                                                                                                                                                                                                   0
                                      ture, especially in studies where long-term observations are                                                                                                                                       (9)
                                      common.
                                                                                                                                                    where t is the event time for person j and d the status
                                                                                                                                                                j                                                          j
                                        (i)  Hakulinen–Tenkanen additive survival method [9]:inor-                                                  indicator. The log-likelihood function is then
                                             der to t the model, the patients must rst be grouped
                                                                                                                                                                n
                                             into K strata, indexed by k, with one stratum for each                                                                                     ′z+
′fu(tj)
                                                                                                                                                    l() =          d ln(P(t ) +e                    )
                                             combinationoftherelevantpredictorvariables (age, sex,                                                                    j           j
                                             cohort year, stage ...)andalife table must be estimated                                                           j=1
                                             for each stratum, with intervals indexed by i. The  (t)is                                                            n  tj                             n  tj
                                                                                                                             E                                 e′z+
′fu(s)ds                                (s)ds.               (10)
                                             assumedtobeamultiplicativefunctionofthecovariates                                                                                                                    P
                                             and constant within each time interval i. We write it in                                                             j=1    0                          j=1     0
                           274                             computer methods and programs in biomedicine 81(2006)272–278
                                Thelast term on the right-hand side in Eq. (10) does not                 changes—forexampleon1Januaryandontheindividual’s
                                dependonor
 andcanbeomitted.Itfollowsthatonly                           birthday.
                                theexpectedhazardsattimeofeventareneededforeach                       rstrans ts the transformation model as described
                                individual.                                                              in (v). If only the transformation times are needed,
                           (iv) The Andersen multiplicative model [8]: the model as-                     this can be done directly by the survexp function
                                sumes  (t) =  (t)exp{′z}, which is the same as in the                 (survival package) or by function rstrans, where
                                         E       0
                                Coxmodel.Theobservedhazardthusbecomes                                    the transformed times are returned in output value
                                                                                                         y(fit<-rstrans(...), y<-fit$y).
                                                   ′z
                                 (t) =  (t) (t)e  .                                      (11)
                                 O       P    0
                                                                                                         The package also includes the functions for testing
                                This model can be rewritten in the form                              goodness-of-t and plotting relevant graphs for all the above
                                                                                                     modelswithmethodsdescribedin[11].Additionally,twodata
                                              ′z+1 ln( (t))
                                 (t) =  (t)e        P   ,                                 (12)
                                 O       0                                                           setsareincludedinthepackage.Oneiscalledrdataandcon-
                                fromwhichitisobviousthatthisisaCoxmodelwithan                        tainssurvivaldatathatcanbeusedasanexample(seeSection
                                additional time-dependent variable with the coefcient               4 for more details). The slopop data set contains the popula-
                                xedto1.Though (t)isacontinuousprocess,thepop-                      tion mortality tables for Slovenia.
                                                      P
                                ulation mortality tables are usually organised in yearly             3.1.      Usage
                                intervals. In order to t the model using the Cox model
                                procedures, data can be split in yearly intervals, with the          All the functions are called in the same way, for example:
                                 (t) updated on each of the intervals.
                                 P
                           (v) The transformation approach [7]: the individual survival                rstrans(formula, data, ratetable, int, na.action,
                                times t are rst transformed as                                           init, control)
                                y = FP(t),                                                  (13)         Twodatasetsarerequiredforanyrelative survival model.
                                where F is the cumulative distribution function of a                 Oneisourobserveddataset, which we will pass to the func-
                                         P                                                           tion as argument data. The other is the population mortality
                                person of a certain age, sex and cohort year (or any                 tablethatwewanttocompareourobserveddatato,andthisis
                                othercombinationincludedinthepopulationtables)that                   speciedintheargumentratetable.Thepopulationmortal-
                                would apply if the person is a typical representative of             ity tables havetobeorganisedasanobjectofclassratetable
                                the general population. This distribution function is cal-           (dened in package survival), default is the survexp.us ta-
                                culated from the general population mortality data. The              ble that contains the US data (also in the survival package).
                                valuesycanthusbeinterpretedastheachievedvalueson                         Themodelisspeciedwithintheargumentformula.Itis
                                theexpectedcumulativedistributionfunctionforeachin-                  organised following the syntactic rules of an R language for-
                                dividual.Bytransformingtothenewscale,thepopulation                   mulaandthussimilartoanyothersurvivalpackageformula.
                                hazardisautomaticallytakenintoaccount,consequently                   Theleft-handsidemustbeaSurvobject.Forexample,iftime
                                all what is left is precisely the disease-specic hazard,            and status are the survival time variable and the censoring
                                which we can thus directly model. One of the possibili-              indicator, and x is a covariate, then the command may be
                                ties is to use the Cox model
                                             ′z                                                       rstrans(Surv(time,status)∼x)
                                (y) = 0(y)e   .                                           (14)
                                                                                                         If the variables that are used for the expected survival
                           3.       Therelsurv package                                               calculation (for example age, sex and year) are not organised
                                                                                                     and not named in the same way as in the population tables,
                           Thecoreofthepackagearethreefunctionsthattthemodels                       aspecialtermratetableistobeincludedintheformula,for
                           described in the previous section:                                        example:
                           rsadd ts an additive model. As described in Sec-                          Surv(time,status)∼x+ratetable(age=age,sex="male",
                              tion 2, different methods of estimation exist, and the                      year=diag)
                              user can choose among them through the method ar-
                                                                    `
                              gument. The default is the Esteve method (iii) which                   Argument int species the number of yearly intervals to be
                              is specied by "max.lik", the other two options are                    usedinrsaddfunction.intcaneitherbeasinglenumberora
                              method="glm.bin" and method="glm.poi" for models                       vector.Thelatterisusedwhentheintervalsarenotalloneyear
                              described in (i) and (ii) of Section 2. When using one                 long, for example int=c(0,0.5,1,5,10) means a couple of
                              of the glm methods, the observed and expected sur-                     shortintervalsatthebeginningandlongerintervalsthereafter
                              vival proportions for each group are returned as object                with the maximum follow-up time of 10 years. Each interval
                              groups.                                                                includestherightendpoint,butnottheleftone,exceptforthe
                           rsmul ts the multiplicative model as described in (iv).                 rstintervalwhichalsoincludestheleftendpoint.Inthiscase
                              A more computationally intensive alternative is cho-                   the third interval would therefore be (1,5].
                              sen by specifying method="mul1"that splits the intervals                   This option can also be used in rsmul and rstrans func-
                              at every time point in which the population mortality                  tions, serving simply as the maximum follow-up time after
                                                      computer methods and programs in biomedicine 81(2006)272–278                                          275
                          which all the data are censored, enabling in this way a more           rsadd(Surv(time,status)∼x,ratetable=slopop,
                          straightforward comparison of the models.                                  data=my.data)
                             Missing data are handled in the usual way through the ar-
                          gument na.action and the initial values for the model can             assumesthatthedatasetmy.datahasvariablesage,sexand
                          bespeciedviatheargumentinit.Thenumberofiterations                    yearorganisedintheformatthatmatchesslopopratetable.
                          and other tting specications can be specied through the
                          control argument, which follows the glm.control function              3.3.     Values
                          in the rsadd function and the coxph.control function in
                          rsmulandrstrans.                                                      The rsadd function returns an object of class rsadd, while
                                                                                                rsmulandrstransreturnacoxphobject.Allthereturnedval-
                          3.2.     Preparing data                                               uescanbeviewedassuchorbythehelpofsummaryfunction.
                                                                                                Apartfromthecoefcients,theirstandardizedvaluesandthe
                          The functions follow syntactic conventions of the R package           other usually returned values, the common object returned
                          survivalandthedatashouldbeorganisedasrequiredbythe                    byallthefunctionsisnamedyandcontaintsthesurvivaland
                          survivalfunctions. In particular, this means that we need a           censoring times for each individual, organised as a Surv ob-
                          timeandacensoringvalueforeachsubjectandanyadditional                  ject. In the case of rstrans function, these are already the
                          numberofcovariateswewouldliketoincludeinthemodel.All                  transformed times and this function can thus also be used
                          the functions also support time-dependent data, with times            just for making the transformation.
                          andeventsorganisedinthestandardwayinaSurvobject.                         The max.lik method of rsadd function is based on max-
                             The main feature of any relative survival method is the            imum likelihood and therefore the output also contains the
                          comparison between the observed and population mortality              log-likelihoodvaluesstoredasloglik.Theothertwomethods
                          data. The latter have to be organized as a ratetable object           are based on the glm function and the returned objects pre-
                          (see Appendix A for details) and specied in the ratetable            serveallitsvalues.Theoriginalandthegroupeddataareboth
                          argument of the used function. The only thing that remains            stored in the output object as data and groups, respectively.
                          to be done is matching the variables in our data set with the         The original data set contains the demographic variables in
                          namesofthevariablesthattheratetableisgroupedby.Age                    the format that matches the rate table (stored as X1, X2,...)
                          intheratetableobjectisalwaysexpressedindaysandthere-                  and the covariates in the format in which they were used in
                          fore the samemustbedonewiththeageintheobserveddata                    themodel.Foreachofthegroupsinthemodel,objectgroups
                          set. The calendar year must be specied in the date format            contains the number at risk (ld), number of events (nd), and
                          andthecodingofthesexvariablemustbedoneinthematch-                                  ∗       ∗
                                                                                                the values p (ps), d (dstar) and ln(yki)(lny) dened in Sec-
                          ingway(intheUSandSlovenetablessexisdenedasafactor                                 ki      ki
                                                                                                tion 2. This object can be useful in a more detailed analysis of
                          variable with levels “male” and “female”, so either a factor          the data. Additionally, a note about the number of the groups
                          variable or an integer with values 1 and 2 will work).                in which the observed number of deaths is smaller than the
                             Theuser, however, does not have to change the data set—            expected is reported (this gives a very basic indication about
                          all the matching can be done by the help of a special term            goodnessoft).
                          in the formula named ratetable. In order to construct this
                          term, rst check the names of the ratetable variables. For
                          the slopop rate table the names are                                   4.       Example
                           >attributes(slopop)$dimid
                           "age" "year" "sex"                                                   Toillustrate the usage of the program, we use a subset of data
                                                                                                from the study of survival of patients after acute myocardial
                             These names are now to be matched with the variable                infarctionthatisincludedinthepackageasthelerdata.The
                          names(oranyfunctionofthem),forexample:                                data were collected in the study carried out at the University
                                                                                                Clinical Center in Ljubljana and contain 1040 patients diag-
                           Surv(time,status)∼x+ratetable(age=age*365.24,                        nosedbetween1982and1986andfollowedupuntil1997.Dur-
                               sex="male",year=diag)                                            ing this time 547 deaths occurred and as the causes of death
                                                                                                are not given, this is a good example of the need of the rela-
                             Thenamesontheleftofeachequalitysignarethedimen-                    tive survival methodology. The organisation of the data is as
                          sionnamesoftheratetablenames,whilethenamesonthe                       follows:
                          right are the variable names in the observed data set. This ex-          >rdata[1:2,]
                          ampleisforadatasetinwhichnovariablesexisincludedas                       time            cens age sex year         agegr
                          all the patients are male, the calendar year variable is stored       1 2657             1     68   2     24Jun82 62-70
                          as diag (diagnosis year), and age is in years and must there-         2 1097             1     63   2     31Aug82 62-70
                          forebemultipliedby365.24tomatchthedataintheratetable                     Time is measured in days and year of infarction is ex-
                          (note that .24 is used in all R calculations as there are 24 leap     pressed in R date format. Age is measured in years and a cat-
                          years per century).                                                   egoricalvariableagegrcontainingfouragecategories(“under
                             If the variables in the observed data set are of the same          54”, “54–61”, “62–70”, “71–95”) is formed. The censoring indi-
                          type and have the same names as the ratetable variables, the          cator is specied in variable cens and is coded 0 (censoring)
                          termratetablecanbeomittedfromtheformula,i.e.thecall                   and1(event).
The words contained in this file might help you see if this file matches what you are looking for:

...Computer methods and programs in biomedicine journal homepage www intl elsevierhealth com journals cmpb relative survival analysis r majapohara janez stareb a department of medical informatics university ljubljana vrazov trg si slovenia b article info abstract history techniques are used to compare the experience study cohort received march with one expected should they follow background population mortality rates revised form january techniquesareespeciallyusefulwhenthecause specicdeathinformationisnotaccurate ornotavailablesincetheyprovideameasureofexcessmortalityinagroupofpatientswith acceptedjanuary acertaindisease thereareseveralapproachestomodelingrelativesurvival butthereisno widely statistical package that would incorporate relevant exist keywords ingsoftwarewasmostlywrittenbytheauthorsofdifferentmethods indifferentcomputer languages different requirements for data input which makes it almost im rsoftware possibleforausertochoosebetweenavailablemodels wedescribeourrpackagerelsu...
Related files

Share

Help

Related files

Share

Share to social media

Help

Login Area