Submitted by cthorrez t3_xsq40j in MachineLearning
percevalw t1_iqn20u4 wrote
The use of log loss is related to the maximum entropy principle, which states that the loss should make as few assumptions as possible about the actual distribution of the data. For example, if you only know that your problem has two classes, your loss should make no further assumptions. In the case of binary classification, the functional form derived from this principle is the sigmoid function. You can learn more about it in this short article: https://github.com/WinVector/Examples/raw/main/dfiles/LogisticRegressionMaxEnt.pdf
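A minimal sketch of the two pieces being discussed, the logistic sigmoid and the binary log loss it is paired with (function names are illustrative, not from the linked article):

```python
import math

def sigmoid(z):
    # Logistic sigmoid: the link function the max-entropy
    # argument singles out for binary classification.
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p, eps=1e-12):
    # Binary cross-entropy for one example, y in {0, 1}.
    # Clamp p away from 0 and 1 to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

For a confident, correct prediction the loss is small (`log_loss(1, sigmoid(4.0))` is about 0.018); for a maximally uncertain one it is `log(2)`.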
cthorrez OP t1_iqn3vy4 wrote
Thank you! This pretty much answers my question. Though I don't think it makes sense to bundle log loss and logistic regression. Like I mentioned in my post, probit regression also uses log loss.
The only difference is how the model makes a probability prediction. The paper you linked provides a great motivation for using logistic sigmoid over another sigmoid.
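To make the distinction concrete, here is a sketch of the two link functions side by side; both map a score to a probability, and the same log loss can then be applied to either (the probit link is written with the standard library's `erf` rather than `scipy`, just to keep it self-contained):

```python
import math

def logistic_link(z):
    # Logistic regression's link: the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-z))

def probit_link(z):
    # Probit regression's link: the standard normal CDF,
    # expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Both links agree at z = 0 (probability 0.5) but differ in the tails; the model's probability prediction changes, while the loss used to fit it does not.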
NeilGirdhar t1_iqoovql wrote
This is the correct answer.