1  The Matrix Cookbook

Notation: A vector
$$ a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix} $$
always denotes a column vector; with $a^T = [a_1, a_2, \ldots, a_d]$ we will denote a row vector.
Transpose:
$$ (A+B)^T = A^T + B^T, \qquad (AB)^T = B^T A^T, \qquad (A^T)^{-1} = (A^{-1})^T = A^{-T} \tag{1} $$
Product:
$$ (AB)_{ij} = \sum_k A_{ik} B_{kj} \tag{2} $$
$$ (AB)C = A(BC), \qquad AB \neq BA \tag{3} $$
Inner product of 2 vectors:
$$ a^T b = \sum_k a_k b_k \tag{4} $$
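As a quick numerical sanity check, the short NumPy sketch below (matrix sizes and random seed are arbitrary choices, not part of the notes) verifies identities (1)-(4) on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((3, 4, 4))   # three random 4x4 matrices
a, b = rng.standard_normal((2, 4))         # two random vectors

# (1) transpose rules
assert np.allclose((A + B).T, A.T + B.T)
assert np.allclose((A @ B).T, B.T @ A.T)
assert np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T)

# (2) element-wise definition of the matrix product
assert np.allclose((A @ B)[1, 2], np.sum(A[1, :] * B[:, 2]))

# (3) associativity, but no commutativity in general
assert np.allclose((A @ B) @ C, A @ (B @ C))
assert not np.allclose(A @ B, B @ A)

# (4) inner product as a sum of element-wise products
assert np.allclose(a @ b, np.sum(a * b))
```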
1.1  Matrix derivatives

Gradient of a function $f(x): \mathbb{R}^d \to \mathbb{R}$: The gradient is defined as a row vector:
$$ \frac{\partial f}{\partial x} = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right] \tag{5} $$
Chain Rule
$$ \frac{\partial Z}{\partial X} = \frac{\partial Z}{\partial Y} \frac{\partial Y}{\partial X} \tag{6} $$
Product Rule
$$ \frac{\partial (YZ)}{\partial X} = \frac{\partial Y}{\partial X} Z + Y \frac{\partial Z}{\partial X} \tag{7} $$
Linear derivatives
$$ \frac{\partial a^T x}{\partial x} = \frac{\partial x^T a}{\partial x} = a^T \tag{8} $$
$$ \frac{\partial Ax}{\partial x} = A, \qquad \frac{\partial x^T A}{\partial x} = A^T \tag{9} $$
Quadratic derivatives
$$ \frac{\partial x^T A x}{\partial x} = x^T A + x^T A^T, \qquad \frac{\partial x^T x}{\partial x} = 2 x^T \tag{10} $$
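The scalar-valued derivative rules, Eqs. (8) and (10), can be checked with finite differences; the sketch below (dimension, step size, and test point are arbitrary choices) compares the analytic row-vector gradients against a central-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
a = rng.standard_normal(d)
x = rng.standard_normal(d)
eps = 1e-6

def num_grad(f, x):
    """Central-difference gradient of a scalar function, returned as a row vector."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Eq. (8): d(a^T x)/dx = a^T
assert np.allclose(num_grad(lambda v: a @ v, x), a)

# Eq. (10): d(x^T A x)/dx = x^T A + x^T A^T  and  d(x^T x)/dx = 2 x^T
assert np.allclose(num_grad(lambda v: v @ A @ v, x), x @ A + x @ A.T)
assert np.allclose(num_grad(lambda v: v @ v, x), 2 * x)
```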
Figure 1: Linear regression: the training data points are given by the blue circles, the original function is plotted as the green line. Based on the knowledge from the data points we want to find the original function.
2  Linear Regression

We are given a dataset $D = \langle x_i, y_i \rangle_{i=1\ldots N}$ (for simplicity we assume that $y$ is a scalar and $x$ is a vector of dimensionality $d$). We want to find a linear function $f(x; w) = w_0 + \sum_{k=1}^{d} w_k x_k = \tilde{x}^T w$ with $\tilde{x}^T = [1 \;\; x^T]$ which minimizes the quadratic error function:
$$ E = 1/N \sum_i (\tilde{x}_i^T w - y_i)^2 \tag{11} $$
The setting is illustrated in Figure 1.
The error function can easily be written in matrix form by noting that the sum over the squared error terms can be represented as the inner product of the error vector with itself. The error vector is
$$ z = \begin{bmatrix} \tilde{x}_1^T w - y_1 \\ \tilde{x}_2^T w - y_2 \\ \vdots \\ \tilde{x}_N^T w - y_N \end{bmatrix} = Xw - y \tag{12} $$
with
$$ X = \begin{bmatrix} \tilde{x}_1^T \\ \tilde{x}_2^T \\ \vdots \\ \tilde{x}_N^T \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, $$
so that
$$ E = 1/N \, z^T z = 1/N \, (Xw - y)^T (Xw - y). \tag{13} $$
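To make Eqs. (11) and (13) concrete, here is a small NumPy sketch (dataset and weight vector are made-up illustrative values) that builds the design matrix $X$ with a leading column of ones and evaluates the error both as the explicit sum and in matrix form:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 20, 3
x = rng.standard_normal((N, d))            # raw inputs x_i
y = rng.standard_normal(N)                 # targets y_i
w = rng.standard_normal(d + 1)             # some weight vector (including the bias w_0)

X = np.hstack([np.ones((N, 1)), x])        # rows are x_tilde_i^T = [1, x_i^T]

E_sum = np.mean((X @ w - y) ** 2)          # Eq. (11): 1/N * sum_i (x_tilde_i^T w - y_i)^2
z = X @ w - y
E_mat = (z @ z) / N                        # Eq. (13): 1/N * z^T z
assert np.isclose(E_sum, E_mat)
```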
2.1  Least Squares Solution

We now differentiate $E$ w.r.t. $w$ and set the gradient to $0^T$, applying Eqs. (6), (10) and (9):
$$ \frac{\partial E}{\partial w} = 1/N \, \frac{\partial (Xw - y)^T (Xw - y)}{\partial w} = 0, \qquad \text{apply chain rule (6)} $$
$$ (Xw - y)^T X = 0 \qquad \text{(the constant factor } 2/N \text{ is dropped, since the right-hand side is zero)} $$
$$ w^T X^T X - y^T X = 0 $$
$$ w^T X^T X = y^T X, \qquad \text{right-multiply by } (X^T X)^{-1} $$
$$ w^T = y^T X \, (X^T X)^{-1}, \qquad \text{transpose} \ldots $$
$$ w = (X^T X)^{-1} X^T y $$
$(X^T X)^{-1} X^T$ is also known as the pseudo-inverse of the matrix $X$.
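A minimal NumPy sketch of this closed-form solution, assuming a made-up noisy linear dataset (in practice np.linalg.solve or np.linalg.lstsq is preferred over forming the explicit inverse, for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 50, 2
x = rng.standard_normal((N, d))
true_w = np.array([1.0, -2.0, 0.5])                  # [w_0, w_1, w_2], made up for the example
X = np.hstack([np.ones((N, 1)), x])                  # design matrix with bias column
y = X @ true_w + 0.1 * rng.standard_normal(N)        # noisy targets

# Closed-form least squares solution w = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable equivalents
w_solve = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_closed, w_lstsq)
```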
2.2  Basis functions

Instead of directly using the input vectors $x$ in our regression, we can use arbitrary non-linear basis functions $\phi_k(x)$ for our linear regression
$$ f(x; w) = \sum_{k=1}^{d} \phi_k(x) w_k = \phi^T w \tag{14} $$
Polynomial basis functions: For a polynomial of degree $l$ we have $l + 1$ basis functions (for scalar $x$)
$$ \phi_k(x) = x^{k-1} \tag{15} $$
For higher-dimensional $x$ we have to take all products of elements of $x$ up to the order $l$, thus the number of basis functions quickly increases!

Figure 2: Polynomial basis functions
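A minimal sketch of Eq. (15) for scalar inputs (degree and data are arbitrary choices); NumPy's np.vander builds exactly this kind of feature matrix:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5])      # scalar inputs, made up for the example
l = 3                                    # polynomial degree

# Columns are phi_k(x) = x^(k-1) for k = 1, ..., l+1
Phi = np.vander(x, N=l + 1, increasing=True)
# Phi[:, 0] is all ones (the bias column), Phi[:, k] equals x**k
```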
Radial basis functions: Here we uniformly distribute $l$ basis functions (also called receptive fields) in the input space
$$ \phi_k(x) = \exp\!\left(-(x - \mu_k)^2 / (2\sigma^2)\right), \tag{16} $$
where $\mu_k$ is the location of the $k$th receptive field and $\sigma$ denotes the bandwidth of the basis functions. The number of needed basis functions usually scales exponentially with the number of input dimensions!
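A minimal sketch of an RBF feature map for scalar inputs following Eq. (16) (the number of centres, their placement, and the bandwidth are arbitrary choices):

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """Return the N x l matrix with entries phi_k(x_i) = exp(-(x_i - mu_k)^2 / (2 sigma^2))."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff ** 2 / (2 * sigma ** 2))

x = np.linspace(0.0, 1.0, 10)            # scalar inputs
centers = np.linspace(0.0, 1.0, 5)       # l = 5 receptive fields, uniformly distributed
Phi = rbf_features(x, centers, sigma=0.2)
```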
Figure 3: RBF basis functions
3  Decomposition in Structure and Approximation Error

Definitions:

Expected Error: $E_D[E(D)]$, the expected error if we sample a data set of a given size $N$ and use this data set to fit our function. Here, the expectation also has to be taken over all possible data sets of size $N$! $E(D)$ denotes the error if we use training set $D$ to fit the function (see below).
Structure Error: The error that comes solely from the structure of the function used to fit the data. It can be estimated by fitting the function on a very large data set (so that the approximation error vanishes).

Approximation Error: Given by the variance of the function estimates if we sample different training sets of size $N$. The more complex the functions we use, the higher the variance of our estimate gets!
Let $f(x; D)$ denote a function learned from a specific dataset $D$ containing $N$ examples, and let $y(x)$ denote the target function. We will also denote the expected learned function as $E_D[f(x; D)]$, which is calculated by taking the expectation with respect to all possible datasets $D$ of size $N$.

The error $E(D)$ for the training set $D$ is given by
$$ E(D) = \int_x \left( f(x; D) - y(x) \right)^2 p(x) \, dx \tag{17} $$
We now add and subtract $E_D[f(x; D)]$:
$$ E(D) = \int_x \left( f(x; D) - E_D[f(x; D)] + E_D[f(x; D)] - y(x) \right)^2 p(x) \, dx $$
$$ = \int_x \Big[ \left( f(x; D) - E_D[f(x; D)] \right)^2 + \left( E_D[f(x; D)] - y(x) \right)^2 + 2 \left( f(x; D) - E_D[f(x; D)] \right) \left( E_D[f(x; D)] - y(x) \right) \Big] p(x) \, dx $$
If we now want to calculate the expected error w.r.t. all data sets, $E_D[E(D)]$, the last (cross) term of this equation will vanish, since $E_D\big[f(x; D) - E_D[f(x; D)]\big] = 0$, and therefore
$$ E_D[E(D)] = \int_x E_D\!\left[ \left( f(x; D) - E_D[f(x; D)] \right)^2 \right] p(x) \, dx + \int_x \left( E_D[f(x; D)] - y(x) \right)^2 p(x) \, dx $$
The first term corresponds to the approximation error and the last term to the structure error. This decomposition is also known as the bias-variance tradeoff.
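As an illustration of this decomposition, the sketch below estimates both terms empirically by refitting a polynomial on many independently sampled training sets (the target function, noise level, model degree, and number of data sets are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(4)

def target(x):                       # the "true" function y(x), chosen for illustration
    return np.sin(2 * np.pi * x)

N, degree, n_datasets = 25, 3, 200
x_test = np.linspace(0.0, 1.0, 200)  # points where we approximate the integral over p(x)

preds = np.empty((n_datasets, x_test.size))
for i in range(n_datasets):
    x = rng.uniform(0.0, 1.0, N)
    y = target(x) + 0.2 * rng.standard_normal(N)          # noisy training set D of size N
    w = np.polyfit(x, y, degree)                          # least-squares polynomial fit f(x; D)
    preds[i] = np.polyval(w, x_test)

mean_pred = preds.mean(axis=0)                            # estimate of E_D[f(x; D)]
approx_error = ((preds - mean_pred) ** 2).mean()          # variance term (approximation error)
structure_error = ((mean_pred - target(x_test)) ** 2).mean()  # bias term (structure error)
print(approx_error, structure_error)
```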