Vector Differential Calculus in Statistics

M. P. WAND

M. P. Wand is Associate Professor, Department of Biostatistics, School of Public Health, Harvard University, 665 Huntington Avenue, Boston, MA 02115 (E-mail: mwand@hsph.harvard.edu). I am grateful to Mahlet Tadesse, Misha Salganik, Bhaswati Ganguli, and Yannis Jemiai for comments on the article.

Many statistical operations benefit from differential calculus. Examples include optimization of likelihood functions and calculation of information matrices. For multiparameter models, differential calculus suited to vector-argument functions is usually the most efficient means of performing the required calculations. We present a primer on vector differential calculus and demonstrate its application to statistics through several worked examples.

KEY WORDS: Best linear prediction; Generalized linear model; Generalized linear mixed model; Information matrix; Matrix differential calculus; Maximum likelihood estimation; Penalized quasi-likelihood; Score equation.

1. INTRODUCTION

Matrix notation is an indispensable tool in statistics. In particular, the storage of model parameters in a vector allows for economical expression of models that relate the parameters to the data. Functions used for estimation of the parameter vector, such as likelihood functions and posterior densities, are also conveniently expressed using matrix notation. But when it comes to optimization of these functions through differential calculus, the statistician usually reverts to scalar subscript notation and applies ordinary univariate calculus to the entries of the parameter vector. Subsequent calculus steps, such as those required to obtain the information matrix, are also usually done with an element-wise approach, sometimes followed by conversion back to matrix notation.

This article demonstrates that it is often possible to perform the entire operation using matrix algebra. Apart from being more streamlined, this saves conversion between matrix notation and subscript notation. The key is a differential calculus suited to vector-argument, scalar-valued functions.

The book by Magnus and Neudecker (1988) describes an elegant approach to differential calculus in statistics for general matrix-argument functions. However, the simplifications that arise for scalar-valued vector-argument functions are somewhat opaque in that reference. This article aims to redress this deficiency, while otherwise using the tools of Magnus and Neudecker. Other contributions to matrix differential calculus may be found in Dwyer (1967), Searle (1982), Basilevsky (1983), and Harville (1997), but each of these references has similar shortcomings for vector-argument functions.

Section 2 begins with a simple illustrative example. Section 3 presents the fundamentals of vector differential calculus, and Section 4 illustrates their use in statistics. Section 5 describes an extension of the methodology to matrix arguments.

2. SIMPLE ILLUSTRATIVE EXAMPLE

Let (x_i, y_i), 1 ≤ i ≤ n, be a set of measurements on two variables x and y, and consider the problem of fitting a line y = β_0 + β_1 x to the data. The homoscedastic Gaussian regression model is

    y = Xβ + ε,   ε ~ N(0, σ²I),

where

    y = (y_1, …, y_n)⊤,   X = [1 x_1; ⋮ ⋮; 1 x_n]   and   β = (β_0, β_1)⊤.

For σ² known, the log-likelihood for estimation of β is (up to an additive constant)

    ℓ(β) = −(1/(2σ²)) (y − Xβ)⊤(y − Xβ).                                  (1)

The scalar differential calculus approach to maximization of ℓ(β) involves rewriting (1) as

    ℓ(β) = −(1/(2σ²)) Σ_{i=1}^n (y_i − β_0 − β_1 x_i)²

and then obtaining

    ∂ℓ(β)/∂β_0 = (1/σ²) Σ_{i=1}^n (y_i − β_0 − β_1 x_i),
    ∂ℓ(β)/∂β_1 = (1/σ²) Σ_{i=1}^n x_i (y_i − β_0 − β_1 x_i).              (2)

Setting this to zero we obtain the unique stationary point of ℓ(β) occurring at

    β̂_0 = ȳ − β̂_1 x̄,
    β̂_1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)².

As is well known, (2) is algebraically equivalent to

    β̂ = (X⊤X)⁻¹ X⊤y.                                                      (3)
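The equivalence between the element-wise solution following (2) and the matrix form (3) is easy to check numerically. The following Python/NumPy sketch is not part of the original article; the simulated data, seed, and variable names are mine and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 1.5 + 2.0 * x + rng.normal(0, 1, n)   # true beta0 = 1.5, beta1 = 2.0

# Element-wise (scalar calculus) solution
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Matrix solution (3): beta-hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta0, beta1)   # scalar-calculus estimates
print(beta_hat)       # matrix estimates; agree up to rounding
```

Using np.linalg.solve rather than forming (X⊤X)⁻¹ explicitly is the standard, numerically more stable way to evaluate (3).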
The Hessian matrix of second-order partial derivatives is

    [∂²ℓ(β)/∂β_0²   ∂²ℓ(β)/∂β_0∂β_1;  ∂²ℓ(β)/∂β_1∂β_0   ∂²ℓ(β)/∂β_1²]

        = −(1/σ²) [n   Σ_{i=1}^n x_i;  Σ_{i=1}^n x_i   Σ_{i=1}^n x_i²] = −(1/σ²) X⊤X,

which can be shown to be negative definite. Therefore β̂ = [β̂_0 β̂_1]⊤ as defined in (2) is the maximizer of ℓ(β) and returns the least squares line. The information matrix of β is

    I(β) = E{(1/σ²) X⊤X} = (1/σ²) X⊤X

so the covariance matrix of β̂ is

    I(β)⁻¹ = σ² (X⊤X)⁻¹.

Vector differential calculus achieves the same result more directly:

    ℓ(β) = −(1/(2σ²)) (y − Xβ)⊤(y − Xβ)

    ⟹ dℓ(β) = −(1/(2σ²)) [{d(y − Xβ)}⊤(y − Xβ) + (y − Xβ)⊤{d(y − Xβ)}]
             = −(1/(2σ²)) [−(X dβ)⊤(y − Xβ) − (y − Xβ)⊤X dβ]
             = (1/σ²) (y − Xβ)⊤X dβ

    ⟹ Dℓ(β) = (1/σ²) (y − Xβ)⊤X = 0   iff   (y − Xβ)⊤X = 0

    ⟹ β̂ = (X⊤X)⁻¹ X⊤y.

Also,

    d²ℓ(β) = −(dβ)⊤ (1/σ²) X⊤X (dβ)

so

    Hℓ(β) = −(1/σ²) X⊤X   ⟹   I(β)⁻¹ = σ² (X⊤X)⁻¹.

The symbols "Dℓ(β)" and "Hℓ(β)" may be unfamiliar to some readers, and even "dβ" may require clarification. The latter is a vector of differentials, analogous to the numerator and denominator of dy/dx. In vector differential calculus these differentials are often vectors, so they cannot be divided according to the usual rules of algebra. The D and H notation is used throughout Magnus and Neudecker (1988) and, for scalar-valued functions of vectors, is defined by the following.

Definitions: Let f be a scalar-valued function with argument x ∈ R^p. The derivative vector of f, Df(x), is the 1 × p vector whose ith entry is ∂f(x)/∂x_i. The Hessian matrix of f is

    Hf(x) = D{Df(x)⊤}

and is, alternatively, the p × p matrix with (i, j) entry equal to ∂²f(x)/(∂x_i ∂x_j).

Many of our examples are concerned with score and information calculations. If the data vector y has density f(y; θ) then the score vector and information matrix are, respectively,

    D_θ log f(y; θ)   and   I(θ) ≡ E{−H_θ log f(y; θ)}.

Note the necessity of a subscript on the D and H in this instance to specify the argument of log f(y; θ) being differentiated. We use such subscript notation liberally in the examples of Section 4.

3. A PRIMER ON VECTOR DIFFERENTIAL CALCULUS

Consider x ∈ R and the function

    y = x³ sin x.

Then scalar differential calculus leads to

    dy/dx = {d(x³)/dx} sin x + x³ {d(sin x)/dx} = 3x² sin x + x³ cos x.

An alternative piece of calculus is

    dy = d(x³ sin x)
       = (dx³) sin x + x³ (d sin x)
       = (3x² dx) sin x + x³ (cos x dx)
       = (3x² sin x + x³ cos x) dx.

Hence

    dy/dx = 3x² sin x + x³ cos x

as before. This second derivation has used the rules:

1. d(uv) = (du)v + u(dv)
2. du³ = 3u² du
3. d sin u = cos u du.

Each of these rules seems reasonable given the product rule and rules for differentiation of polynomials and trigonometric functions. The only difference is that they do not involve division by differentials such as du. In scalar differential calculus such division is "allowable," but in the vector world division is not standard. Therefore, for vector differential calculus, it is more appropriate to work with products and then compute the derivative vector using (Magnus and Neudecker 1988):

First Identification Theorem: If a is a 1 × p vector for which

    df(x) = a dx,

then

    a = Df(x).

The Hessian matrix can be found from

Second Identification Theorem: If A is a p × p matrix for which

    d²f(x) = (dx)⊤ A dx,

then

    A = Hf(x).
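As a numerical check on the derivative vector Dℓ(β) = (1/σ²)(y − Xβ)⊤X obtained above, the following Python/NumPy sketch compares it with a finite-difference approximation of the gradient of ℓ(β). It is not part of the original article; the simulated data, evaluation point, and step size are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 30, 1.0
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
beta_true = np.array([1.5, 2.0])
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), n)

def loglik(beta):
    r = y - X @ beta
    return -r @ r / (2 * sigma2)     # ell(beta), up to an additive constant

beta = np.array([1.0, 1.0])          # an arbitrary evaluation point

# Derivative vector identified from d ell(beta) = (1/sigma^2)(y - X beta)' X dbeta
D_analytic = (y - X @ beta) @ X / sigma2

# Central finite-difference approximation of the same 1 x 2 derivative vector
eps = 1e-6
D_numeric = np.array([
    (loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.max(np.abs(D_analytic - D_numeric)))   # small discrepancy, e.g. ~1e-5
```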
It follows from these theorems that all that is required are rules for expressing df(x) in the forms

    a dx   and   (dx)⊤ A dx.

3.1 Notation

For vectors

    a = (a_1, …, a_p)⊤   and   b = (b_1, …, b_p)⊤

we will introduce the notation (used by modern programming languages such as S-Plus and Matlab)

    a ⊙ b = (a_1 b_1, …, a_p b_p)⊤,

    a / b = (a_1/b_1, …, a_p/b_p)⊤,

and

    s(a) = (s(a_1), …, s(a_p))⊤,

where s : R → R is a scalar function. We will use 1 to denote a vector of ones, with dimension clear from the context. Another useful notation is

    diag(a) = [a_1 0 ⋯ 0;  0 a_2 ⋯ 0;  ⋮ ⋮ ⋱ ⋮;  0 0 ⋯ a_p].

For a p × q matrix A we define vec(A) to be the pq × 1 vector obtained by stacking the columns of A underneath each other, in order from left to right.

If v is a random vector then we use E(v) to denote its expectation and cov(v) to denote its covariance matrix.

3.2 Some Matrix Algebraic Rules

The following matrix algebraic relationships are useful in vector differential calculus. The first one is particularly crucial.

    diag(a) b = a ⊙ b,
    diag(a) 1 = a,
    tr(AB) = tr(BA),
    tr(A⊤B) = vec(A)⊤ vec(B).

3.3 Rules for Differentials

Let u and v be vector functions and U and V be matrix functions. A will denote a constant matrix and s a scalar function.

3.3.1 Rules for Scalar Functions

    du^α = α u^(α−1) du,   d log u = u⁻¹ du,   de^u = e^u du.

3.3.2 Chain Rule

    d{s(u)} = s′(u) ⊙ (du) = diag{s′(u)} du.

3.3.3 Rules Involving Linear Functions

    d(AU) = A dU,
    d(U + V) = dU + dV,
    d diag(u) = diag(du),
    dU⊤ = (dU)⊤,
    d vec U = vec(dU),
    d(tr U) = tr(dU),
    d(EU) = E(dU).

3.3.4 Product and Quotient Rules

    d(U ⊙ V) = (dU) ⊙ V + U ⊙ (dV),
    d(UV) = (dU)V + U(dV),
    d(u/v) = {(du) ⊙ v − u ⊙ (dv)} / (v ⊙ v).

3.3.5 Rules for Determinant and Matrix Inverse

    d|U| = |U| tr(U⁻¹ dU),
    dU⁻¹ = −U⁻¹ (dU) U⁻¹.

Quadratic forms, particularly those involving symmetric matrices, are very common in statistics, so the following results are worth learning:

3.3.6 Rules Involving Quadratic Forms

    d(u⊤Au) = u⊤(A + A⊤) du,
    d(u⊤Au) = 2u⊤A du,   A symmetric.

4. EXAMPLES

4.1 Generalized Linear Models

Let y be a vector of responses and X be a corresponding design matrix. The one-parameter exponential family model, with canonical link, is characterized by the joint density

    f(y; β) = exp{y⊤(Xβ) − 1⊤b(Xβ) + 1⊤c(y)}

where β is the vector of coefficients (e.g., McCullagh and Nelder 1990). For example, b(x) = log(1 + e^x) corresponds to binary regression with a logit link function.
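To make the exponential-family notation concrete, here is a short Python/NumPy sketch, not taken from the article, that evaluates the canonical-link log-density y⊤(Xβ) − 1⊤b(Xβ) with b(x) = log(1 + e^x) and checks it against the familiar Bernoulli log-likelihood. The simulated data, coefficient values, and variable names are illustrative assumptions of mine; for Bernoulli data, c(y) = 0.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([-0.5, 1.2])

eta = X @ beta                      # linear predictor X beta
p = 1 / (1 + np.exp(-eta))          # logit link: P(y_i = 1)
y = rng.binomial(1, p)

def b(x):
    return np.log1p(np.exp(x))      # b(x) = log(1 + e^x)

# Exponential-family form:  y'(X beta) - 1'b(X beta)   (c(y) = 0 here)
ll_expfam = y @ eta - np.sum(b(eta))

# Direct Bernoulli log-likelihood, for comparison
ll_direct = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(ll_expfam, ll_direct)         # agree up to floating-point rounding
```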
Thesimple kriging model for estimation of E(y|x ), x ∈ R , 0 0 0 is In practice, µ and σ2 are replaced by estimators µ and σ2, and y =µ+S(x)+ε Cparameterized and estimated by C, leading to the kriging i i i formula where the random vectors ⊤ 2 1 y(x )=µ+c (C+σ I) (yµ1). S(x ) ε 0 0 1 1 . . S= . and ε= . . . 4.3 Variance Regression Models S(x ) ε n n Thegeneral Gaussian variance regression model is have the moment structure y∼N(Xβ,V ) θ θθ E(S)=E(ε)=0, cov(S)=C and cov(ε)=σ2I whereVθrepresentsaparameterizationofthecovariancematrix θθ (e.g., Cressie 1993). Kriging involves determination of the best of y in terms of θ. An important special case is the Gaussian mixed model linear predictor, S(x ),ofS(x ) in the sense that 0 0 2 y=Xβ+Zu+ε E[{S(x )S(x )} ] 0 0 with the random effects vector u and the error ε having joint is minimized among all linear combinations of the form distribution ⊤ u 0 Gζ 0 S(x )=a y+b. ζζ 0 ε ∼N 0 , 0R τ ττ Thefunction to be minimized over a ∈ Rd and b ∈ R is then for some parameterized matrices G and R (e.g., Robinson ζ τ ζζ ττ ⊤ 2 1991). In this instance C(a,b)=E[{a y+bS(x )} ] 0 ⊤ 2 ⊤ 2 =(µ1 a+b) +a (C+σ I)a ζ ⊤ θ = and V =ZG Z +R . ⊤ 2 θ ζ τ 2c a+E{S(x0) }, τ θθ ζζ ττ 0 4 Statistical Practice