Submitted by cthorrez t3_xsq40j in MachineLearning
percevalw t1_iqn20u4 wrote
The use of log loss is related to the maximum entropy principle, which states that the loss should make as few assumptions as possible about the actual distribution of the data. For example, if you only know that your problem has two classes, your loss should make no further assumptions. In the case of binary classification, the functional form derived from this principle is the sigmoid function. You can learn more about it in this short article: https://github.com/WinVector/Examples/raw/main/dfiles/LogisticRegressionMaxEnt.pdf
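A minimal sketch of the two pieces being discussed, the logistic sigmoid and the binary log loss it is paired with (function names are illustrative, not from the linked article):

```python
import math

def sigmoid(z):
    # Logistic sigmoid: the link function the max-entropy
    # argument singles out for binary classification.
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p, eps=1e-12):
    # Binary cross-entropy for one example, y in {0, 1}.
    # Clamp p away from 0 and 1 to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

For a confident, correct prediction the loss is small (`log_loss(1, sigmoid(4.0))` is about 0.018); for a maximally uncertain one it is `log(2)`.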
cthorrez OP t1_iqn3vy4 wrote
Thank you! This pretty much answers my question. Though I don't think it makes sense to bundle log loss and logistic regression. Like I mentioned in my post, probit regression also uses log loss.
The only difference is how the model makes a probability prediction. The paper you linked provides a great motivation for using logistic sigmoid over another sigmoid.
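To make the distinction concrete, here is a sketch of the two link functions side by side; both map a score to a probability, and the same log loss can then be applied to either (the probit link is written with the standard library's `erf` rather than `scipy`, just to keep it self-contained):

```python
import math

def logistic_link(z):
    # Logistic regression's link: the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-z))

def probit_link(z):
    # Probit regression's link: the standard normal CDF,
    # expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Both links agree at z = 0 (probability 0.5) but differ in the tails; the model's probability prediction changes, while the loss used to fit it does not.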
NeilGirdhar t1_iqoovql wrote
This is the correct answer.