One of the most important concepts in Machine Learning is overfitting. In this article, we will explore what overfitting is, why it deserves an article of its own, and what you can do to avoid making that mistake.
Overfitting is when you are getting married and you choose not to go retail. Instead, you order a wedding dress made to fit your frame to a T. Creating a tailor-made wedding gown can take up to a year, during which you are not allowed to gain a single pound. The process starts with measuring every inch of your body, followed by regular check-ins with the designer to make sure the design and fit will be perfect on your wedding day.
Because the dress is made for a single occasion, you end up shoving it to the back of your closet, or maybe framing it if you have the money. And despite all the money you will spend on that dress:
You won't be able to wear it again. Hopefully, you don't get divorced and remarry. And even if you do, you'd probably buy another dress, most likely a cheaper one the second time around.
You won't be able to give or sell it to anyone. Since the measurements are exactly yours, and our bodies are as diverse as the people in the world, chances are nobody around you will be interested in buying that dress. Even if someone were, they would have to spend more money to refit it to their own measurements.
In Machine Learning, overfitting is when a model fits its training data too closely, learning everything in it, including unnecessary details like outliers and noise, rather than capturing the general underlying patterns. The result is great performance on the training data (the wedding dress fits you here and now) but poor performance on newly introduced data (the dress won't fit you in the future, and it won't fit anyone else). This is called high variance.
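To see what this looks like in code, here is a minimal sketch in Python, assuming scikit-learn and NumPy are installed; the noisy sine-wave data and the degree-15 polynomial are invented purely for illustration, not taken from any real project:

```python
# A minimal sketch of overfitting: a very flexible model chases every
# noisy training point, so its training error is tiny while its test
# error is much larger. Data and polynomial degree are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=60)  # signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A degree-15 polynomial has enough wiggle room to memorize the noise.
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, overfit.predict(X_train)))  # near zero: the dress fits today
print("test  MSE:", mean_squared_error(y_test, overfit.predict(X_test)))    # far larger: it fits no one else
```

The gap between those two numbers is the high variance the dress analogy describes: a perfect fit in the fitting room, a poor fit everywhere else.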
If your predictive models overfit, they will churn out inaccurate predictions, which can lead to business problems such as mispredicted customer satisfaction scores, low uptake of recommended products, missed fraudulent reviews, and even messier data to work with later on.
Overfitting: Happens when a model learns the noise and details of the training data too well, rather than capturing the general underlying pattern. This leads to high variance—the model performs well on training data but poorly on unseen data.
Bias: Refers to the error due to overly simplistic assumptions in the model. A high-bias model underfits the data, meaning it fails to capture important patterns.
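Extending the earlier sketch, a quick comparison can show both failure modes side by side; the three polynomial degrees below are illustrative picks, assuming the same scikit-learn setup as above:

```python
# A hedged sketch of the bias/variance contrast on synthetic data:
# degree 1 underfits (high bias: both errors high), degree 15 overfits
# (high variance: train error low, test error high), and a middle
# degree usually lands in between. The degrees are arbitrary choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for degree, label in [(1, "high bias"), (4, "balanced"), (15, "high variance")]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d} ({label}): train {train_mse:.3f}, test {test_mse:.3f}")
```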
There are many ways to ensure that you don't overfit your model to the data. In other words, you go retail:
Use more training data. If available, more data helps the model learn the general pattern instead of memorizing a small sample, so nothing is force-fitted.
Cross-validation. This checks that the model performs well on several different subsets of the data, not just on the split it was trained on (see the first sketch after this list).
Simplify the model. One way is to reduce the number of features so that only relevant information is fed into the model, removing noise (see the second sketch after this list).
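For cross-validation, here is a minimal sketch using scikit-learn's cross_val_score; the synthetic data and the candidate polynomial degrees are again invented for the demo:

```python
# K-fold cross-validation averages a model's error over several
# train/validation splits, so a model that merely memorizes one
# split gets exposed. Data and candidate degrees are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=2)
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    # Pick the degree with the lowest averaged CV error, not the one
    # that best fits any single training split.
    print(f"degree {degree:2d}: mean CV MSE {-scores.mean():.3f}")
```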
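And for simplifying the model, a hedged sketch of feature reduction with scikit-learn's SelectKBest; the fabricated dataset has 20 features, of which only 3 carry signal:

```python
# Dropping uninformative features before fitting shrinks the model's
# room to memorize noise. make_regression fabricates a dataset with
# 20 features where only 3 are informative; k=3 is the matching pick.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=10.0, random_state=3)

full = LinearRegression()                                                 # all 20 features
slim = make_pipeline(SelectKBest(f_regression, k=3), LinearRegression())  # top 3 only

for name, model in [("all 20 features", full), ("top 3 features ", slim)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean CV MSE {-scores.mean():.1f}")
```

On data like this, the slimmer pipeline typically scores at least as well while being far less likely to latch onto noise, which is the whole point of simplifying.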
Watch out for the next article on Types of Bias in Machine Learning!
Questions? Feedback? Head over to the About Me page and leave me a message. Thank you!