123x Filetype PDF File size 0.16 MB Source: www.pauldickman.com
computer methods and programs in biomedicine 81(2006)272–278 journal homepage: www.intl.elsevierhealth.com/journals/cmpb Relative survival analysis in R MajaPohara,∗, Janez Stareb a Department of Medical Informatics, University of Ljubljana, Vrazov trg 2, SI-1000 Ljubljana, Slovenia b Department of Medical Informatics, University of Ljubljana, Slovenia article info abstract Article history: Relative survival techniques are used to compare the survival experience in a study cohort Received 7 March 2005 with the one expected should they follow the background population mortality rates. The Received in revised form 12 January techniquesareespeciallyusefulwhenthecause-specicdeathinformationisnotaccurate 2006 ornotavailablesincetheyprovideameasureofexcessmortalityinagroupofpatientswith Accepted12January2006 acertaindisease.Thereareseveralapproachestomodelingrelativesurvival,butthereisno widely used statistical package that would incorporate the relevant techniques. The exist- Keywords: ingsoftwarewasmostlywrittenbytheauthorsofdifferentmethods,indifferentcomputer Relative survival languages and with different requirements for the data input, which makes it almost im- Rsoftware possibleforausertochoosebetweenavailablemodels.WedescribeourRpackagerelsurv Regression that provides functions for easy and exible tting of several relative survival regression models. ©2006ElsevierIrelandLtd. All rights reserved. 1. Introduction population life tables. Obviously, r(t) can be any non-negative number,althoughthemethodsaremostoftenappliedtodata Survivalanalysisisconcernedwithstudyingtherandomvari- wherer(t) is less than 1. able T, representing the time between entry to a study and Severalapproachestomodellingrelativesurvivalexist,but some event of interest (e.g. death, recurrence of disease ...). all of the existing programmes (for example: Surv3 [2], SAS InsteadofthecumulativedistributionfunctionF(t),weusually macrosandStatafunctions[3],RSurvRfunction[4])focuson estimate the cumulative survival function S(t), which is de- only one of the models and use specically organised general nedasS(t)=P(T>t)=1F(t).Inthecaseofthenalevent populationtables, makingitdifcultfortheuserstocompare beingdeath,S(t)istheprobabilityofbeingaliveattimet.When different methods. the nal event of interest is death due to a certain disease, We present three R [5] functions organised as a package but the causes of death are not known, it is not possible to called relsurv that largely simplify the usage of relative sur- directly estimate the proportion of dead due to the disease vival regression models. All the functions require the same in question. We then resort to the methods of relative sur- basic organisation of the data and can be used with any for- vival. The cumulative relative survival function r(t) is dened matofthepopulationmortalitydata. [1] as Section 2 briey describes the three most commonly used regression approaches and gives an outline of the theory for r(t):= SO(t) (1) the ve tting methods used in the relsurv functions. Section S (t) P 3 describes the functions, their arguments, the preparation of the data and the returned values. The usage is further ex- whereS (t)denotesobservedsurvivalandSP(t)standsforpop- plainedthroughanexampleinSection4.Incasetheuserdoes O ulationorexpectedsurvival,whichisestimatedonthebasisof not yet have the relevant population mortality tables organ- ∗ Corresponding author. Tel.:+386 1 543 7785; fax: +386 1 543 7771. E-mail address: maja.pohar@mf.uni-lj.si (M. Pohar). 0169-2607/$ – see front matter © 2006 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2006.01.004 computer methods and programs in biomedicine 81(2006)272–278 273 isedintherequiredratetableformat,theappendixprovides the form E(t) = exp{′z+ ′fu(t)}, where z is the vector of somequickhelp. the predictor variables, fu the vector of the follow-up in- tervalindicators(withcomponentsfu,wherefu(t) = 1on i i interval i and 0 elsewhere) and and are the vectors of 2. Relative survival regression models regression coefcients. If we denote the interval-specic ∗ observedandexpectedsurvivalproportionsbyp andp ki ki The relative survival literature most frequently refers to the respectively, Eqs. (2)–(4) lead us to additive models, dominating especially in cancer research. The mainassumptionisthatthehazardofeveryindividualcanbe ln ln pki =′z+ , (6) split into two additive components: p∗ i ki (t) = (t) + (t), (2) which is a generalized linear model with a binomial er- O P E ror structure and a complementary log–log link function whereP(t)isthehazardeverypatienttakesbecauseofhisage, combinedwithadivisionbyp∗. ki sex and cohort year (or any other combination of covariates (ii) The glm model with the Poisson error structure [6]: the included in the population mortality data). (t) denotes the methodissimilartoHakulinen–Tenkanenanditrequires E excesshazard,specicforthediseaseinquestionand (t)is the same grouping of the data. Again the is assumed O E usedfor the observed hazard - the one we can estimate from to be of the (t) = exp{′z+ ′fu(t)} form. The number of E our data. Using the equality deaths dki is assumed to be distributed as Poisson(ki), where = · y and y is the person-time at risk ki E,ki ki ki (t)dt for group k of observations on time interval i. Denoting S(t) = e , (3) ∗ the expected number of deaths as d , Eq. (2) can thus be ki wecanwrite writen as SO(t) ∗ S (t) = S (t) × S (t) =⇒ S (t) = , (4) d O P E E ki ki ′z+ S (t) = +e i (7) P y y ki ki givingthesameexpressionas1.Severalttingapproachesfor ln( d∗)=′z+ +ln(y ). (8) the additive model exist and a review of the three methods ki ki i ki described in this paper is given in [6]. Thisimpliesageneralizedlinearmodelwithoutcome Themultiplicativemodelsassumethehazardcomponentsto ∗ dki, Poisson error structure, link ln(ki d ) and offset ki bemultiplicative: ln(yki). Themodelcanberewritten[6]toyieldthesamelike- (t) = (t) · (t). (5) ` O P E lihood function as the Esteve model, meaning that the grouping of the variables is not crucial. The idea of split- Such a model does not assume that the observed hazard is ting the observations into subject-band observationsand always greater than the population hazard, but has a less ob- taking d∗ /y as a kind of an average of individual val- ki ki P vious interpretation as the additive model. ues however causes the estimates to slightly differ from Thethirdoptionarethetransformationmodels[7]thatmake ` those in the Esteve model. noassumptionabouttherelationship between the observed, ` (iii) TheEsteveadditivesurvivalmodel[10]:themethoduses the population and the excess hazard. All the individual sur- individual data and estimates the coefcients using the vival times are rst transformed to a different scale (by taking maximumlikelihoodapproach.Thelikelihoodforthead- intoaccountthegeneralpopulationmortality),wheretheycan ditive model is befurther analysed by any of the ordinary survival models. Ourprogramsprovideforttingthetransformationmodel, n tj n the Andersen multiplicative model [8], and three differ- dj L() = ( (t )) exp (s)ds = ( (t ) O j O P j ent approaches to tting the additive model. Only the last j=1 0 j=1 four have been widely used in the past, while the trans- formation approach has only recently been published and tj ′z+ ′fu(t ) ′z+ ′fu(s) +e j )d exp ( (s) + e )ds , given its simplicity, it might become popular in the fu- j P 0 ture, especially in studies where long-term observations are (9) common. where t is the event time for person j and d the status j j (i) Hakulinen–Tenkanen additive survival method [9]:inor- indicator. The log-likelihood function is then der to t the model, the patients must rst be grouped n into K strata, indexed by k, with one stratum for each ′z+ ′fu(tj) l() = d ln(P(t ) +e ) combinationoftherelevantpredictorvariables (age, sex, j j cohort year, stage ...)andalife table must be estimated j=1 for each stratum, with intervals indexed by i. The (t)is n tj n tj E e′z+ ′fu(s)ds (s)ds. (10) assumedtobeamultiplicativefunctionofthecovariates P and constant within each time interval i. We write it in j=1 0 j=1 0 274 computer methods and programs in biomedicine 81(2006)272–278 Thelast term on the right-hand side in Eq. (10) does not changes—forexampleon1Januaryandontheindividual’s dependonor andcanbeomitted.Itfollowsthatonly birthday. theexpectedhazardsattimeofeventareneededforeach rstrans ts the transformation model as described individual. in (v). If only the transformation times are needed, (iv) The Andersen multiplicative model [8]: the model as- this can be done directly by the survexp function sumes (t) = (t)exp{′z}, which is the same as in the (survival package) or by function rstrans, where E 0 Coxmodel.Theobservedhazardthusbecomes the transformed times are returned in output value y(fit<-rstrans(...), y<-fit$y). ′z (t) = (t) (t)e . (11) O P 0 The package also includes the functions for testing This model can be rewritten in the form goodness-of-t and plotting relevant graphs for all the above modelswithmethodsdescribedin[11].Additionally,twodata ′z+1 ln( (t)) (t) = (t)e P , (12) O 0 setsareincludedinthepackage.Oneiscalledrdataandcon- fromwhichitisobviousthatthisisaCoxmodelwithan tainssurvivaldatathatcanbeusedasanexample(seeSection additional time-dependent variable with the coefcient 4 for more details). The slopop data set contains the popula- xedto1.Though (t)isacontinuousprocess,thepop- tion mortality tables for Slovenia. P ulation mortality tables are usually organised in yearly 3.1. Usage intervals. In order to t the model using the Cox model procedures, data can be split in yearly intervals, with the All the functions are called in the same way, for example: (t) updated on each of the intervals. P (v) The transformation approach [7]: the individual survival rstrans(formula, data, ratetable, int, na.action, times t are rst transformed as init, control) y = FP(t), (13) Twodatasetsarerequiredforanyrelative survival model. where F is the cumulative distribution function of a Oneisourobserveddataset, which we will pass to the func- P tion as argument data. The other is the population mortality person of a certain age, sex and cohort year (or any tablethatwewanttocompareourobserveddatato,andthisis othercombinationincludedinthepopulationtables)that speciedintheargumentratetable.Thepopulationmortal- would apply if the person is a typical representative of ity tables havetobeorganisedasanobjectofclassratetable the general population. This distribution function is cal- (dened in package survival), default is the survexp.us ta- culated from the general population mortality data. The ble that contains the US data (also in the survival package). valuesycanthusbeinterpretedastheachievedvalueson Themodelisspeciedwithintheargumentformula.Itis theexpectedcumulativedistributionfunctionforeachin- organised following the syntactic rules of an R language for- dividual.Bytransformingtothenewscale,thepopulation mulaandthussimilartoanyothersurvivalpackageformula. hazardisautomaticallytakenintoaccount,consequently Theleft-handsidemustbeaSurvobject.Forexample,iftime all what is left is precisely the disease-specic hazard, and status are the survival time variable and the censoring which we can thus directly model. One of the possibili- indicator, and x is a covariate, then the command may be ties is to use the Cox model ′z rstrans(Surv(time,status)∼x) (y) = 0(y)e . (14) If the variables that are used for the expected survival 3. Therelsurv package calculation (for example age, sex and year) are not organised and not named in the same way as in the population tables, Thecoreofthepackagearethreefunctionsthattthemodels aspecialtermratetableistobeincludedintheformula,for described in the previous section: example: rsadd ts an additive model. As described in Sec- Surv(time,status)∼x+ratetable(age=age,sex="male", tion 2, different methods of estimation exist, and the year=diag) user can choose among them through the method ar- ` gument. The default is the Esteve method (iii) which Argument int species the number of yearly intervals to be is specied by "max.lik", the other two options are usedinrsaddfunction.intcaneitherbeasinglenumberora method="glm.bin" and method="glm.poi" for models vector.Thelatterisusedwhentheintervalsarenotalloneyear described in (i) and (ii) of Section 2. When using one long, for example int=c(0,0.5,1,5,10) means a couple of of the glm methods, the observed and expected sur- shortintervalsatthebeginningandlongerintervalsthereafter vival proportions for each group are returned as object with the maximum follow-up time of 10 years. Each interval groups. includestherightendpoint,butnottheleftone,exceptforthe rsmul ts the multiplicative model as described in (iv). rstintervalwhichalsoincludestheleftendpoint.Inthiscase A more computationally intensive alternative is cho- the third interval would therefore be (1,5]. sen by specifying method="mul1"that splits the intervals This option can also be used in rsmul and rstrans func- at every time point in which the population mortality tions, serving simply as the maximum follow-up time after computer methods and programs in biomedicine 81(2006)272–278 275 which all the data are censored, enabling in this way a more rsadd(Surv(time,status)∼x,ratetable=slopop, straightforward comparison of the models. data=my.data) Missing data are handled in the usual way through the ar- gument na.action and the initial values for the model can assumesthatthedatasetmy.datahasvariablesage,sexand bespeciedviatheargumentinit.Thenumberofiterations yearorganisedintheformatthatmatchesslopopratetable. and other tting specications can be specied through the control argument, which follows the glm.control function 3.3. Values in the rsadd function and the coxph.control function in rsmulandrstrans. The rsadd function returns an object of class rsadd, while rsmulandrstransreturnacoxphobject.Allthereturnedval- 3.2. Preparing data uescanbeviewedassuchorbythehelpofsummaryfunction. Apartfromthecoefcients,theirstandardizedvaluesandthe The functions follow syntactic conventions of the R package other usually returned values, the common object returned survivalandthedatashouldbeorganisedasrequiredbythe byallthefunctionsisnamedyandcontaintsthesurvivaland survivalfunctions. In particular, this means that we need a censoring times for each individual, organised as a Surv ob- timeandacensoringvalueforeachsubjectandanyadditional ject. In the case of rstrans function, these are already the numberofcovariateswewouldliketoincludeinthemodel.All transformed times and this function can thus also be used the functions also support time-dependent data, with times just for making the transformation. andeventsorganisedinthestandardwayinaSurvobject. The max.lik method of rsadd function is based on max- The main feature of any relative survival method is the imum likelihood and therefore the output also contains the comparison between the observed and population mortality log-likelihoodvaluesstoredasloglik.Theothertwomethods data. The latter have to be organized as a ratetable object are based on the glm function and the returned objects pre- (see Appendix A for details) and specied in the ratetable serveallitsvalues.Theoriginalandthegroupeddataareboth argument of the used function. The only thing that remains stored in the output object as data and groups, respectively. to be done is matching the variables in our data set with the The original data set contains the demographic variables in namesofthevariablesthattheratetableisgroupedby.Age the format that matches the rate table (stored as X1, X2,...) intheratetableobjectisalwaysexpressedindaysandthere- and the covariates in the format in which they were used in fore the samemustbedonewiththeageintheobserveddata themodel.Foreachofthegroupsinthemodel,objectgroups set. The calendar year must be specied in the date format contains the number at risk (ld), number of events (nd), and andthecodingofthesexvariablemustbedoneinthematch- ∗ ∗ the values p (ps), d (dstar) and ln(yki)(lny) dened in Sec- ingway(intheUSandSlovenetablessexisdenedasafactor ki ki tion 2. This object can be useful in a more detailed analysis of variable with levels “male” and “female”, so either a factor the data. Additionally, a note about the number of the groups variable or an integer with values 1 and 2 will work). in which the observed number of deaths is smaller than the Theuser, however, does not have to change the data set— expected is reported (this gives a very basic indication about all the matching can be done by the help of a special term goodnessoft). in the formula named ratetable. In order to construct this term, rst check the names of the ratetable variables. For the slopop rate table the names are 4. Example >attributes(slopop)$dimid "age" "year" "sex" Toillustrate the usage of the program, we use a subset of data from the study of survival of patients after acute myocardial These names are now to be matched with the variable infarctionthatisincludedinthepackageasthelerdata.The names(oranyfunctionofthem),forexample: data were collected in the study carried out at the University Clinical Center in Ljubljana and contain 1040 patients diag- Surv(time,status)∼x+ratetable(age=age*365.24, nosedbetween1982and1986andfollowedupuntil1997.Dur- sex="male",year=diag) ing this time 547 deaths occurred and as the causes of death are not given, this is a good example of the need of the rela- Thenamesontheleftofeachequalitysignarethedimen- tive survival methodology. The organisation of the data is as sionnamesoftheratetablenames,whilethenamesonthe follows: right are the variable names in the observed data set. This ex- >rdata[1:2,] ampleisforadatasetinwhichnovariablesexisincludedas time cens age sex year agegr all the patients are male, the calendar year variable is stored 1 2657 1 68 2 24Jun82 62-70 as diag (diagnosis year), and age is in years and must there- 2 1097 1 63 2 31Aug82 62-70 forebemultipliedby365.24tomatchthedataintheratetable Time is measured in days and year of infarction is ex- (note that .24 is used in all R calculations as there are 24 leap pressed in R date format. Age is measured in years and a cat- years per century). egoricalvariableagegrcontainingfouragecategories(“under If the variables in the observed data set are of the same 54”, “54–61”, “62–70”, “71–95”) is formed. The censoring indi- type and have the same names as the ratetable variables, the cator is specied in variable cens and is coded 0 (censoring) termratetablecanbeomittedfromtheformula,i.e.thecall and1(event).
no reviews yet
Please Login to review.