Why And How To Use Ensemble Methods in Financial Machine Learning?

Note Awalee
Study carried out by the Quantitative Practice
Special thanks to Pierre-Edouard THIERY
January 2021

Contents
Introduction
1. From A Single Model To Ensemble Methods: Bagging and Boosting
2. The Three Errors Of A Machine Learning Model
3. Why Is It Better To Rely On Bagging In Finance?
Conclusion
References

Introduction

Machine Learning techniques are gaining currency in finance nowadays; ever more strategies rely on Machine Learning models such as neural networks to detect ever subtler signals. Nonetheless, this rising popularity does not come without shortcomings, the most widespread one being the so-called "overfitting": models learn the training data by heart and are thus unable to cope with unknown data. In our opinion, using Machine Learning algorithms in finance without a deep understanding of their inner logic is highly risky: promising initial results are often misleading, and the real-life implementation turns out to be disappointing because what is really happening is not well understood.

In this paper we focus on a specific category of Machine Learning meta-algorithms: the ensemble methods. Ensemble methods are called meta-algorithms since they provide different ways of combining miscellaneous Machine Learning models in order to build a stronger model. These techniques are well known for being extremely powerful in many areas; however, we believe it is important to understand their advantages from a mathematical point of view, so that they are used purposefully when dealing with a financial Machine Learning problem.

First we set forth how ensemble methods work from a general point of view. We then present the three sources of error in Machine Learning models, before explaining what the advantages of bagging over boosting are in finance and how to use bagging efficiently.

1 From A Single Model To Ensemble Methods: Bagging and Boosting

Machine Learning is mainly premised on predictive models. Once devised, a model is trained thanks to available data; its purpose is to predict the output value, also known as the outcome, corresponding to new input data. Formally, we can define a predictive model in the following manner:

Definition 1 (Predictive Model)
A predictive model is defined as an operator M, based on metaparameters denoted M and on parameters denoted P. It uses a set of inputs, denoted x ∈ R^m, to compute an output, denoted O ∈ R, seen as the predicted value. We can write:

    M(M; P; •) : R^m → R
                 x → O = M(M; P; x)

Thus the idea of a predictive model is simply to predict a value based on several features, which are the inputs. If M is considered to be "the machine", the learning part consists in estimating the parameters P that enable us to use the model. The metaparameters M are chosen, and often optimized, by the user.

For instance, a neural network is a predictive model. The shape of the neural network, i.e. the number of layers and the number of neurons in each layer, as well as the functions within each neuron, forms the metaparameters M. The parameters of the neural network are the weights of the links between neurons of two consecutive layers. Those parameters are estimated thanks to a training set D: formally, P = P(D). From now on, since the training sets which are used are of the utmost importance, we always write P(D) to make clear which training set is used to find the parameters of a given model.

The gist of ensemble methods is fairly simple: we combine several weak models to produce a single output. From now on and for the rest of this paper, the number of models is denoted N.

Ensemble methods can be divided into two main families: the parallel methods, where the N models are independent, and the sequential methods, where the N models are built progressively.

• A Parallel Method: Bootstrap Aggregating

In this section, we set forth the bootstrap aggregating method, also known as "bagging", which is the most widespread of the parallel methods [1]. From now on, we assume a training set, denoted D, is at our disposal:

Definition 2 (Data Set)
A data set D is a set of couples of the following form

    D = {(x_i, y_i) ∈ R^m × R, 1 ≤ i ≤ n}

where n is the cardinal of D. For the i-th element of the data set, x_i ∈ R^m is called the vector of the m features, and y_i is called the output value.

To carry out the bagging, we construct N models M^j with 1 ≤ j ≤ N. To do so, we consider a generic model M(M; •; •), i.e. a predictive model whose metaparameters are fixed, for instance a neural network with a given shape. In order to get N models, the generic model is trained on N different training sets D_j:

    M^j = M(M; P(D_j); •)

Thus, M^j is now a function which, for every input vector x ∈ R^m, outputs a real value y = M^j(x).

The N models are different since they are not trained on the same training set; their sets of parameters therefore differ, and they produce different output values for the same input vector of features x.

The N training sets are created from the data set D. The size of the training sets is chosen by the user and denoted K, with K < n, otherwise the training sets would necessarily contain redundant information. The K elements of the training set D_j, for 1 ≤ j ≤ N, are sampled from D with replacement: for 1 ≤ j ≤ N,

    D_j = {(x_{u(j,k)}, y_{u(j,k)}), 1 ≤ k ≤ K}

where u(j,k) is a uniform random variable on {1, ..., n}.

Once the N models M^j have been trained, they are combined into the final model M_f. For instance, if we consider a regression problem, meaning that the output value y does not belong to a predetermined finite set of values:

    M_f : R^m → R
          x → (1/N) Σ_{j=1}^{N} M^j(x)

If we consider a classification problem, meaning that the output value y belongs to a finite set of values S, the output value of the final model is determined by a vote of the N models M^j: the outcome which appears most often among the N output values produced by the N models is the outcome of the final model.

Such a model can then be tested on a test set of data, as is usually done for every Machine Learning model.

It is also worth noticing that there are many bagging approaches, which all derive from the general principle presented above. Even though we do not delve into the details, we can for instance mention the so-called "feature bagging", where each one of the N models is trained using only a specific subset of the features.
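The bagging procedure above can be summarized in a few lines of code. The sketch below is not taken from the paper: it assumes the data set D is given as NumPy arrays X and y, uses scikit-learn's DecisionTreeRegressor as an arbitrary stand-in for the generic model M(M; •; •), and picks K = n/2 by default; only the structure (bootstrap sampling, independent training, averaging) is the point.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_models=10, K=None, seed=0):
    # Train N copies of the generic model, each on a bootstrap sample D_j of size K.
    rng = np.random.default_rng(seed)
    n = len(X)
    K = K if K is not None else n // 2               # K < n, chosen by the user
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=K)             # u(j, k): uniform indices, drawn with replacement
        model = DecisionTreeRegressor(max_depth=3)   # placeholder for the generic model
        model.fit(X[idx], y[idx])                    # estimate the parameters P(D_j)
        models.append(model)
    return models

def bagging_predict(models, X):
    # Final model M_f: average of the N individual predictions (regression case).
    return np.mean([m.predict(X) for m in models], axis=0)

For a classification problem, the final averaging would be replaced by a majority vote over the N predicted labels; scikit-learn's BaggingRegressor and BaggingClassifier implement the same idea in a production-ready form.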
• A Sequential Method: Boosting

Sequential methods no longer rely on a set of N independent models, but on a sequence of N models, where the order of the models matters:

    {M^1, ..., M^N}          →          (M^1, ..., M^N)
    a mere set: no order                a sequence

So we have to construct the sequence of the N models, beginning with the first one, which will then sway how the second one is defined, and so on and so forth. In the rest of this section we present some of the principal ideas of boosting.

First, as with bagging, we assume we have a training set made of n elements and denoted D. If we choose to consider a generic model M(M; •; •), we can train a first model:

    M^1 = M(M; P(D); •)

For every element within D we can compute M^1(x_i) and compare it to the outcome y_i.

To devise the second model M^2, we are going to train the generic model M(M; •; •) on a new training set D_1, which derives from D. It contains K < n elements, as will all the subsequent training sets.

We attribute to each element within D a weight depending on how far y_i is from M^1(x_i): the larger the error, the higher the weight associated with the element. We then use those weights to randomly sample D in order to generate the new training set D_1, and we train:

    M^2 = M(M; P(D_1); •)

The process then repeats in exactly the same way. Thanks to the errors of the second model, we compute a new weight for each element within D. Those weights are used to create a new training set: we sample D using the new weights to get D_2, which is then used to train the model M^3, and so on.

It is then possible to define the final model M_f as a weighted sum of the N models M^j, where the weight associated with a given model is derived from the error of this model on the data.

We have only presented the main ideas of boosting; the simplest implementation of those guidelines is probably the AdaBoost algorithm [2].
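To illustrate the sequential logic described above, here is a minimal sketch, again with scikit-learn decision trees standing in for the generic model and with a deliberately naive weighting rule (weights proportional to absolute errors). It is not the AdaBoost update of [2]; it only shows the resampling mechanism.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_fit(X, y, n_models=10, K=None, seed=0):
    # Sequentially train N models; each new training set D_j is resampled from D
    # with weights derived from the errors of the previous model.
    rng = np.random.default_rng(seed)
    n = len(X)
    K = K if K is not None else n // 2                         # size of the resampled training sets, K < n
    models = [DecisionTreeRegressor(max_depth=2).fit(X, y)]    # M^1 is trained on the full data set D
    for _ in range(n_models - 1):
        # The larger the error of the previous model on an element of D, the higher its weight.
        errors = np.abs(y - models[-1].predict(X)) + 1e-12
        weights = errors / errors.sum()
        idx = rng.choice(n, size=K, replace=True, p=weights)   # sample D_j from D using the weights
        models.append(DecisionTreeRegressor(max_depth=2).fit(X[idx], y[idx]))
    return models

The final combination, a weighted sum of the N models with weights derived from their respective errors, is omitted here; in practice one would rather rely on an established implementation such as scikit-learn's AdaBoostRegressor or AdaBoostClassifier.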
2 The Three Errors Of A Machine Learning Model

A Machine Learning model can suffer from three sources of error: the bias, the variance and the noise. It is important to understand what lies behind those words in order to understand why and how ensemble methods can prove helpful in finance.

The bias is the error spawned by unrealistic assumptions. When the bias is particularly large, the model fails to recognize the important relations between the features and the outputs. The model is said to be underfitted when such a case occurs.

Figure 1 displays a model with a large bias. The dots represent the training data, which obviously do not exhibit a linear relation. If we assume that there is a linear relationship between the features and the outcomes, such a model clearly fails to capture any relation between the former and the latter. A small numerical sketch of this situation is given at the end of this section.

Figure 1: Underfitted model

The variance stems from the sensitivity to tiny changes in the training set. When the variance is too high, the model is overfitted on the training set: a "learning-by-heart" situation occurs. This explains why even a small change in the training set can lead to widely different predictions.
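Since Figure 1 is only available here as a caption, the following minimal sketch reproduces the situation it depicts numerically: data generated by a quadratic relation, fitted with a model that assumes linearity. The data-generating function and noise level are illustrative choices, not taken from the paper.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=50)    # clearly non-linear ground truth plus noise

linear = LinearRegression().fit(x, y)                   # the linearity assumption biases the model
print("R^2 of the linear fit:", linear.score(x, y))     # low score: the model is underfitted

The low R² reflects the bias: no straight line can capture the underlying relation, however the parameters are estimated.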