Model Fitting 102: Augmenting the Mechanics of Learning for Less-Than-Ideal Data
When standing up a model for use in the real world, there is a long list of considerations to take into account, from checking the integrity of the training data, to understanding the biases and assumptions in different modeling algorithms, to selecting the best metrics for evaluating the model. All of these considerations (and many more) are vital steps in creating a quality model. For each step there are a vast number of solutions, all of which are better or worse depending on the data at hand. Having an intimate understanding of the data (specifically any biases in the training set and the quality of any target metrics) plays an integral part in how we prep the data for modeling, which modeling algorithm we choose, and how we evaluate the model once it’s been trained. So much of the modeling process is bespoke to the training data that I feel comfortable saying no automated ML solution will ever automate me out of a job.
For all the effort put into tailoring the model to the data and its intended application, there is one part of this process that is often overlooked: is our model learning the data in the most appropriate way? This isn’t to say that we need to run another grid search over our optimizer’s hyperparameters to ensure that we are finding a global optimum. It is more of a step back to understand how our model is learning the data and to ask ourselves, “Will this give me the best outcome?” For the remainder of this post, I want to reevaluate the mechanics of learning and explore some alternative methods for more intentionally training a model.
Before we go too far, it helps to review the fundamentals of statistical learning and then point out where they can fall short. Every model has a set of parameters which are updated by the data fed into the model. The aim of fitting the model is to minimize the training error (the error between each training sample’s predicted outcome and actual outcome) by updating these model parameters. If we only focus on the training error we end up with an overfit model that performs terribly on all other datasets, but for the sake of brevity we will only focus on the training error for now. The training error is a differentiable function of the model parameters, so we can use some basic calculus to find the parameter values that minimize it. Certain tree-based models use a slightly different method for updating discrete parameters, but this process covers the vast majority of other machine learning algorithms (for a more in-depth review of this process, please see Chapter 4 in Reference 1). So what is wrong with this process? Nothing is wrong with this method of fitting a model; in fact, I’m suggesting that we continue to use it. What I am suggesting is that we can supplement this process to more intentionally train a model to better serve a specific use case.
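To make the fitting process described above concrete, here is a minimal sketch in plain Python. It assumes a one-feature linear model and mean squared error as the training error; the function name `fit_linear`, the learning rate, and the toy data are illustrative choices, not anything prescribed by this post.

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals: predicted outcome minus actual outcome for each sample.
        residuals = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        # Step against the gradient; every sample contributes equally.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
```

Note that each residual enters the gradient with the same factor of 1/n. That equal weighting is the assumption the rest of this post revisits.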
The key assumption that I want to address is that the traditional method of fitting a model treats all data points as equally valuable. Ideally this would always be true: every sample in your dataset would be equally reliable and equally representative of the population you want to model. We know that this is almost never the case with real-world data. Some labels are assigned by hand, some are imputed, and sometimes more recent data is more relevant than older data. So if we know that certain samples are “better” than others, why do we want to train our model to learn from every sample equally? Wouldn’t it be more effective for our model to learn more from “better” samples and less from others? Of course this could lead to tremendous overfitting if left unchecked, but the following methods offer solutions there too.
Let’s begin with the very common scenario where some samples are more accurate or trustworthy than others. Labeled data is difficult to come by, so it’s common to become more creative with how we source labels (for example, crowd-sourcing or proxy labels based on some heuristic). These methods do yield larger datasets, but they also leave us with a dataset where some samples are closer to ground truth than others. For all the methods we will discuss in this post, it is necessary to document which samples are ground-truth and which are not. We want our model to learn from all the training data, but informing the model of which samples contain more accurate information can lead to a better solution.
Differential Weighted Learning² updates the parameters of an optimizer (in this case the Adam optimizer) based on the fidelity of a sample. For each sample in the training set, we compute its fidelity, or how similar the sample is to the ground-truth samples, and use this degree of fidelity to determine the step size for the optimizer. Ground-truth samples have a very high degree of fidelity to other ground-truth samples and so result in a larger step size, because we can be confident the model is learning the intended information. When a sample has a lower degree of fidelity, we are less confident in the information it contains, so the resulting step size is reduced. The sample is still used, because it is part of the training data and we want to avoid overfitting, but the amount the model learns from it is smaller. To measure the degree of fidelity of a given sample, we can use cosine similarity or another appropriate distance metric³.
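The idea of scaling the step size by fidelity can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the algorithm from the cited paper: it uses plain stochastic gradient descent instead of Adam, represents each sample as the vector [x, y], takes fidelity as the highest cosine similarity to any ground-truth sample, and rescales that similarity from [-1, 1] to [0, 1]. The names `fidelity` and `weighted_sgd_step` and the toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fidelity(sample, ground_truth):
    """Best cosine similarity to any ground-truth sample, rescaled to [0, 1]."""
    best = max(cosine_similarity(sample, g) for g in ground_truth)
    return (1.0 + best) / 2.0


def weighted_sgd_step(w, b, x, y, base_lr, fid):
    """One SGD step on squared error for y ~ w*x + b, step scaled by fidelity."""
    residual = (w * x + b) - y
    lr = base_lr * fid  # lower fidelity -> smaller step
    return w - lr * 2.0 * residual * x, b - lr * 2.0 * residual


# Toy data (assumed): trusted samples plus two weakly labeled ones.
ground_truth = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9]]
weak_samples = [[1.5, 3.2], [2.5, -4.0]]

w, b = 0.0, 0.0
for _ in range(200):
    for x, y in ground_truth + weak_samples:
        fid = fidelity([x, y], ground_truth)
        w, b = weighted_sgd_step(w, b, x, y, base_lr=0.01, fid=fid)
```

Every sample still moves the parameters, but the weak sample whose label disagrees with the trusted data gets a much smaller step than the one that agrees with it.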
In a more specific scenario, some samples are more relevant than others. The samples could have been collected from a system that changes over time, such as a mechanical system that degrades with use or a dynamic market environment. More recent samples from these systems will give us more accurate information about the function we are trying to approximate, but all our samples are still relevant. How do we then best train a model to represent our system’s current state? We could try time series forecasting, but if our time series does not have enough historical data, or displays highly irregular seasonalities, then a forecast will yield poor results. Training our model with all of our data equally will give us a fair approximation of our system over its lifetime, but not of its current state. By biasing our model towards more recent (and therefore more relevant) data, we can better approximate the current state of our system. This introduction of bias during training is known as Weighted Regression⁴, and it can be used both for the scenario we’re in and for highly heteroscedastic data. The idea is that each sample is assigned a value representing how relevant that sample is to our desired outcome, and this value is then applied as a weight to the training error that sample contributes to the model. Then we are free to train the model as we normally would, with the model prioritizing the error of the more relevant samples over the error of the less relevant ones. In the scenario described above, we could weight each sample by time using a linear or even exponential decay, so that our model is able to learn the parametric form of all of our data while most accurately approximating the system’s current state.
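As a rough sketch of time-weighted regression, the example below fits a one-feature linear model by closed-form weighted least squares, with weights that decay exponentially with sample age. The half-life parameterization, the function names, and the toy data (a system whose slope changes from 1 to 2 halfway through) are assumptions made for illustration.

```python
def time_decay_weights(timestamps, half_life):
    """Weight each sample by 0.5 ** (age / half_life), age measured from the newest sample."""
    newest = max(timestamps)
    return [0.5 ** ((newest - t) / half_life) for t in timestamps]


def weighted_linear_fit(xs, ys, weights):
    """Closed-form weighted least squares for y ~ slope * x + intercept."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data (assumed): the first five samples follow y = x, the last five y = 2x.
timestamps = list(range(10))
xs = [1.0, 2.0, 3.0, 4.0, 5.0] * 2
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0]

# Equal weights: an average of the system over its lifetime.
plain_slope, plain_intercept = weighted_linear_fit(xs, ys, [1.0] * len(xs))

# Decayed weights: biased towards the system's current state.
recent_slope, recent_intercept = weighted_linear_fit(
    xs, ys, time_decay_weights(timestamps, half_life=2.0)
)
```

On this toy data the equal-weight fit averages the old and new behavior, while the decayed weights pull the slope towards the recent regime. Many libraries expose the same idea through a per-sample weight argument, for example the `sample_weight` parameter of scikit-learn’s `LinearRegression.fit`.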
As with anything in machine learning, the methods described above are not applicable to every dataset. However, much of machine learning is modeled on our own decision-making processes, so if we do not value every sample equally, neither should our models.
- Rishabh Mehrotra and Ashish Gupta. 2020. Learning with Limited Labels via Momentum Damped & Differentially Weighted Optimization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’20). Association for Computing Machinery, New York, NY, USA, 3416–3425. http://rishabhmehrotra.com/papers/kdd2020-limited-labels.pdf
- Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. 2017. Fidelity-Weighted Learning. arXiv preprint arXiv:1711.02799. https://arxiv.org/pdf/1711.02799.pdf