Submitted by alexander-prince t3_10llg9z in explainlikeimfive
tdscanuck t1_j5xkuj8 wrote
It's not just a machine learning thing, it's a general problem with any model that tries to simplify something. Overfitting is basically when you make the model so "big" (so many adjustable values) that it can perfectly fit *any* training data you feed it. So your model will look *amazing* in terms of performance on the training data, but it may totally fail when you finish training and try to do something useful with it, because it's too hyperspecialized to the training data.
As a trivial/over-simplified example, suppose I want a machine learning widget to recognize pictures of traffic lights so I can automate those stupid captchas (yes, I know that's not how they actually work). I get training data of 10,000 pictures of traffic lights and 10,000 pictures of non-traffic lights and use that to train the model. Except I give the model 10,000 different variables to work with (far too many). The model can "learn" to recognize each of the 10,000 pictures because it can use one variable to match each photo of a traffic light. The results on the training data will be perfect...it recognizes every one of my 10,000 traffic lights and ignores anything that isn't those. 100% success!!! But now I feed it a new picture of a traffic light...and that doesn't match any of the 10,000 I trained it on before. The model will say "not a traffic light" because it got too specific...I overfitted the model so much that it can *only* recognize the training data. It was never forced to learn "traffic-light-ness" with a much smaller number of variables, which would have made it general enough to recognize traffic lights it has never seen.
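Here's a minimal sketch of that "memorize the training set" failure on made-up data (my own toy example, not anything from a real captcha system): a decision tree with no depth limit stands in for the model with far too many variables. It scores 100% on the training data and roughly coin-flip accuracy on anything new.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))    # 200 made-up "pictures", 20 noisy features each
y = rng.integers(0, 2, size=200)  # labels with no real relationship to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit = enough capacity to memorize every training example
model = DecisionTreeClassifier().fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # 1.0 -- "100% success!!!"
print("test accuracy:", model.score(X_test, y_test))        # ~0.5, i.e. chance level
```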
You can do the same trick in Excel with polynomial fits to data points...if you give the polynomial enough free variables it can match basically anything to pretty high accuracy. That doesn't mean you've discovered some amazing 70th-degree polynomial that magically predicts your data; you've just (grossly) overfitted the model.
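A rough Python equivalent of the Excel trick (assuming noisy points sampled from a plain sine curve, which is my invention, not OP's data): a high-degree polynomial nails the training points but does much worse on points it hasn't seen.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)

x_new = np.linspace(0, 1, 100)      # new points the fit never saw
y_true = np.sin(2 * np.pi * x_new)  # the underlying curve we'd like to predict

for degree in (3, 12):  # sensible capacity vs. far too many free coefficients
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - y_true) ** 2)
    print(f"degree {degree}: train error {train_err:.4f}, error on new points {new_err:.4f}")
```

The degree-12 fit wins on training error and loses badly on the new points, which is the whole problem in two lines of output.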
alexander-prince OP t1_j5xlrzp wrote
Thanks for the answer! It covers the follow-up question I had for the first commenter. Just one last question: in your example, what is the rule of thumb for avoiding both too few variables and too many? Like, is there an accepted level of accuracy?
tdscanuck t1_j5xnrlw wrote
It really depends on the dataset and model. In general, the smallest model that meets your requirements is a good idea, but that size can vary widely depending on the data format, richness, and required performance.
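One hedged way to act on "smallest model that meets your requirements" (reusing the noisy sine-curve toy problem from earlier; the 0.1 error threshold is just a stand-in for a real requirement): hold out a validation set and take the lowest-capacity model whose validation error is good enough.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Simple split: even indices for training, odd indices for validation
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

required_mse = 0.1  # hypothetical performance requirement
for degree in range(1, 16):
    coeffs = np.polyfit(x_train, y_train, degree)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    if val_mse <= required_mse:
        print(f"smallest degree that meets the requirement: {degree} (validation MSE {val_mse:.4f})")
        break
else:
    print("no degree up to 15 met the requirement")
```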