mayermensch69

mayermensch69 t1_izk7m7b wrote

I came across this approach to dialog evaluation: https://github.com/Shikib/fed

What I don't understand is how the (more or less) raw loss can be used as a metric, since it is not really bounded. It may work when directly comparing specific examples scored with this method, but how does one compare these scores to other metrics that have a fixed scale?
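
To make the concern concrete, here is a toy sketch of what I mean by "not bounded". It is my own illustration, not code from the FED repo. It assumes the score is built from a language model's per-token negative log-likelihood; the function names and the probabilities are made up.

```python
import math

def token_nll(p: float) -> float:
    """Negative log-likelihood of one token that the model assigns probability p."""
    return -math.log(p)

def mean_nll(probs):
    """Average per-token loss over a (hypothetical) response."""
    return sum(token_nll(p) for p in probs) / len(probs)

# Made-up token probabilities for two hypothetical responses.
likely = [0.9, 0.8, 0.95]
unlikely = [1e-3, 1e-6, 1e-9]

print(mean_nll(likely))    # small, close to 0
print(mean_nll(unlikely))  # much larger, and grows without limit as p -> 0
```

The loss is 0 at best but has no upper limit, so a single value is hard to place next to a metric that lives on, say, a 1 to 5 scale.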
