1 The Matrix Cookbook

Notation: A vector $a = (a_1, a_2, \dots, a_d)^T$ always denotes a column vector. With $a^T = [a_1, a_2, \dots, a_d]$ we denote a row vector.

Transpose:

$(A+B)^T = A^T + B^T, \quad (AB)^T = B^T A^T, \quad (A^{-1})^T = (A^T)^{-1} = A^{-T}$    (1)

Product:

$(AB)_{ij} = \sum_k A_{ik} B_{kj}$    (2)

$(AB)C = A(BC), \quad AB \neq BA$    (3)

Inner product of two vectors:

$a^T b = \sum_k a_k b_k$    (4)

1.1 Matrix derivatives

Gradient of a function $f(x)\colon \mathbb{R}^d \to \mathbb{R}$: the gradient is defined as a row vector:

$\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d}\right]$    (5)

Chain rule:

$\frac{\partial Z}{\partial X} = \frac{\partial Z}{\partial Y}\frac{\partial Y}{\partial X}$    (6)

Product rule:

$\frac{\partial (YZ)}{\partial X} = \frac{\partial Y}{\partial X} Z + Y \frac{\partial Z}{\partial X}$    (7)

Linear derivatives:

$\frac{\partial a^T x}{\partial x} = \frac{\partial x^T a}{\partial x} = a^T$    (8)

$\frac{\partial Ax}{\partial x} = A, \quad \frac{\partial x^T A}{\partial x} = A^T$    (9)

Quadratic derivatives:

$\frac{\partial x^T A x}{\partial x} = x^T A + x^T A^T, \quad \frac{\partial x^T x}{\partial x} = 2x^T$    (10)

Figure 1: Linear Regression. The training data points are given by the blue circles, the original function is plotted by the green line. Based on the knowledge from the data points we want to recover the original function.

2 Linear Regression

We are given a dataset $D = \langle x_i, y_i \rangle_{i=1\dots N}$ (for simplicity we assume that $y_i$ is a scalar and $x_i$ is a vector of dimensionality $d$). We want to find a linear function $f(x; w) = w_0 + \sum_{k=1}^{d} w_k x_k = \tilde{x}^T w$, with $\tilde{x}^T = [1\; x^T]$, which minimizes the quadratic error function:

$E = \frac{1}{N} \sum_i (\tilde{x}_i^T w - y_i)^2$    (11)

The setting is illustrated in Figure 1. The error function can easily be written in matrix form by noting that the sum over the squared error terms can be represented as the inner product of the error vector

$z = \begin{pmatrix} \tilde{x}_1^T w - y_1 \\ \tilde{x}_2^T w - y_2 \\ \vdots \\ \tilde{x}_N^T w - y_N \end{pmatrix} = Xw - y$    (12)

with $X = \begin{pmatrix} \tilde{x}_1^T \\ \tilde{x}_2^T \\ \vdots \\ \tilde{x}_N^T \end{pmatrix}$ and $y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$:

$E = \frac{1}{N} z^T z = \frac{1}{N} (Xw - y)^T (Xw - y)$    (13)

2.1 Least Squares Solution

We now differentiate $E$ w.r.t. $w$ and set the gradient to $0^T$, applying Eqs. (6), (10) and (9):

$\frac{\partial E}{\partial w} = \frac{1}{N}\frac{\partial (Xw - y)^T (Xw - y)}{\partial w} = \frac{2}{N}(Xw - y)^T X = 0^T$,    apply chain rule (6)

Dropping the constant factor $2/N$:

$(Xw - y)^T X = 0^T$

$w^T X^T X - y^T X = 0^T$

$w^T X^T X = y^T X$,    right-multiply by $(X^T X)^{-1}$

$w^T = y^T X (X^T X)^{-1}$,    transpose...

$w = (X^T X)^{-1} X^T y$

$(X^T X)^{-1} X^T$ is also known as the pseudo-inverse of the matrix $X$.

Figure 2: Polynomial basis functions

2.2 Basis functions

Instead of directly using the input vectors $x$ in our regression, we can use arbitrary non-linear basis functions $\phi_k(x)$ for our linear regression:

$f(x; w) = \sum_{k=1}^{d} \phi_k(x) w_k = \phi^T w$    (14)

Polynomial basis functions: For a degree-$l$ polynomial we have $l+1$ basis functions (for scalar $x$):

$\phi_k(x) = x^{k-1}$    (15)

For higher-dimensional $x$ we have to take all products of elements of $x$ up to order $l$, so the number of basis functions quickly increases!

Radial basis functions: Here we uniformly distribute $l$ basis functions (also called receptive fields) in the input space:

$\phi_k(x) = \exp\!\left(-(x - \mu_k)^2 / (2\sigma^2)\right)$,    (16)

where $\mu_k$ is the location of the $k$-th receptive field and $\sigma$ denotes the bandwidth of the basis functions. The number of needed basis functions usually scales exponentially with the number of input dimensions!

Figure 3: RBF basis functions
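To make Sections 2.1 and 2.2 concrete, the following Python sketch builds a design matrix from polynomial (Eq. 15) or radial (Eq. 16) basis functions and computes the least-squares weights. The toy target $\sin(2\pi x)$, the noise level, the polynomial degree, and the helper names poly_features/rbf_features are illustrative assumptions, not part of the original notes; np.linalg.lstsq is used as a numerically stable stand-in for the pseudo-inverse solution $w = (X^T X)^{-1} X^T y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data (assumed for illustration): noisy samples of sin(2*pi*x).
x = rng.uniform(0.0, 1.0, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

def poly_features(x, degree):
    # Polynomial basis phi_k(x) = x^(k-1), k = 1, ..., degree+1 (Eq. 15);
    # the constant column plays the role of the bias term w_0.
    return np.vander(x, degree + 1, increasing=True)

def rbf_features(x, centers, sigma):
    # Radial basis functions phi_k(x) = exp(-(x - mu_k)^2 / (2 sigma^2)) (Eq. 16).
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))

# Design matrix X: one row per data point x_i, one column per basis function.
X = poly_features(x, degree=3)
# Alternatively: X = rbf_features(x, centers=np.linspace(0, 1, 10), sigma=0.1)

# Least-squares weights w = (X^T X)^{-1} X^T y, computed via lstsq.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions f(x; w) = phi(x)^T w on a dense grid.
x_test = np.linspace(0.0, 1.0, 100)
y_hat = poly_features(x_test, degree=3) @ w
```

Swapping poly_features for rbf_features only changes how $X$ is constructed; the least-squares solution from Section 2.1 stays exactly the same.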
3 Decomposition in Structure and Approximation Error

Definitions:

Expected error $E_D[E(D)]$: the expected error if we sample a data set of given size $N$ and use this data set to fit our function. Here, the expectation is taken over all possible data sets of size $N$. $E(D)$ denotes the error if we use training set $D$ to fit the function (see below).

Structure error: the error that comes solely from the structure of the function used to fit the data. It can be estimated by fitting the function on a very large data set (so that the approximation error vanishes).

Approximation error: given by the variance of the function estimates if we sample different training sets of size $N$. The more complex the functions we use, the higher the variance of our estimate gets!

Let $f(x; D)$ denote a function learned from a specific dataset $D$ containing $N$ examples, and let $y(x)$ denote the target function. We will also denote the expected learned function as $E_D[f(x; D)]$, which is calculated by taking the expectation with respect to all possible datasets $D$ of size $N$. The error $E(D)$ for the training set $D$ is given by

$E(D) = \int_x \big(f(x; D) - y(x)\big)^2\, p(x)\, dx$    (17)

We now add and subtract $E_D[f(x; D)]$:

$E(D) = \int_x \big(f(x; D) - E_D[f(x; D)] + E_D[f(x; D)] - y(x)\big)^2\, p(x)\, dx$

$\phantom{E(D)} = \int_x \Big[\big(f(x; D) - E_D[f(x; D)]\big)^2 + \big(E_D[f(x; D)] - y(x)\big)^2 + 2\big(f(x; D) - E_D[f(x; D)]\big)\big(E_D[f(x; D)] - y(x)\big)\Big]\, p(x)\, dx$

If we now calculate the expected error with respect to all data sets, $E_D[E(D)]$, the last (cross) term vanishes, since $E_D\big[f(x; D) - E_D[f(x; D)]\big] = 0$. Therefore

$E_D[E(D)] = \int_x E_D\big[(f(x; D) - E_D[f(x; D)])^2\big]\, p(x)\, dx + \int_x \big(E_D[f(x; D)] - y(x)\big)^2\, p(x)\, dx$

The first term corresponds to the approximation error and the second term to the structure error. This decomposition is also known as the bias-variance tradeoff.
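The decomposition can also be checked numerically by sampling many training sets of size $N$, fitting the same model on each, and averaging. The sketch below is an assumed Monte-Carlo setup (toy target $\sin(2\pi x)$, uniform $p(x)$ on $[0, 1]$, noise level 0.1, degree-3 polynomial features, helper names target/features); none of these choices come from the notes. It estimates the approximation error (variance term) and structure error (bias term) on a grid and compares their sum to the expected error $E_D[E(D)]$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_datasets, degree = 20, 500, 3

def target(x):
    # Assumed noise-free target function y(x).
    return np.sin(2 * np.pi * x)

def features(x):
    return np.vander(x, degree + 1, increasing=True)

# Grid approximating the integral over p(x) (here p(x) uniform on [0, 1]).
x_grid = np.linspace(0.0, 1.0, 200)
preds = np.empty((n_datasets, x_grid.size))

# Sample many datasets D of size N and fit f(x; D) on each of them.
for d in range(n_datasets):
    x = rng.uniform(0.0, 1.0, N)
    y = target(x) + 0.1 * rng.standard_normal(N)
    w, *_ = np.linalg.lstsq(features(x), y, rcond=None)
    preds[d] = features(x_grid) @ w

mean_f = preds.mean(axis=0)                                # E_D[f(x; D)]
approx_error = ((preds - mean_f) ** 2).mean()              # variance term
structure_error = ((mean_f - target(x_grid)) ** 2).mean()  # bias term
expected_error = ((preds - target(x_grid)) ** 2).mean()    # E_D[E(D)]

# Up to floating-point rounding: expected_error == approx_error + structure_error
print(approx_error, structure_error, expected_error)
```

Increasing the polynomial degree raises approx_error (variance) and lowers structure_error (bias), which makes the tradeoff described above directly visible.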