thomasahle t1_je14a0c wrote on March 28, 2023 at 5:34 PM

Reply to [D] Simple Questions Thread by AutoModerator

Are there any "small" LLMs, like 1MB, that I can include, say, on a website using ONNX to provide a minimal AI chat experience?

thomasahle OP t1_j9nprmt wrote on February 23, 2023 at 7:30 AM

Reply to comment by activatedgeek in Unit Normalization instead of Cross-Entropy Loss [Discussion] by thomasahle

Great example! With Brier scoring we have

loss = norm(x)**2 - x[label]**2 + (1-x[label])**2
     = norm(x)**2 - 2*x[label] + 1

which is basically equivalent to replacing logsumexp with norm^2 in the first code snippet:

def label_cross_entropy_on_logits(x, labels):
    return (-2*x.select(labels) + x.norm(axis=1)**2).sum(axis=0)
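
As a sanity check on the expansion above, here is a minimal plain-Python sketch for a single example. The helper names `brier_on_logits` and `expanded_form` are made up for illustration, and the scores are treated as raw, unnormalized values, as in the comment.

```python
def brier_on_logits(x, label):
    """Squared distance between the raw scores x and the one-hot vector for label."""
    return sum((xi - (1.0 if i == label else 0.0)) ** 2 for i, xi in enumerate(x))


def expanded_form(x, label):
    """norm(x)**2 - 2*x[label] + 1, i.e. the expanded expression."""
    return sum(xi * xi for xi in x) - 2.0 * x[label] + 1.0


x = [0.5, -1.25, 2.0]  # hypothetical scores for three classes
label = 2

direct = brier_on_logits(x, label)
expanded = expanded_form(x, label)
print(direct, expanded)
```

The two values agree, and the constant `+ 1` does not affect gradients, which is why the batched version can drop it.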

This actually works just as well as my original method! The Wikipedia article on proper scoring rules also mentions the "spherical score", which seems to be equivalent to my method of dividing by the norm. So maybe that's the explanation?
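
For reference, a rough sketch of a spherical-style score applied to raw scores. The original unit-normalization code is not shown in this comment, so the form `x[label] / norm(x)` is an assumption based on the description "dividing by the norm", and `spherical_score` is a hypothetical name.

```python
import math


def spherical_score(x, label):
    """Score of the true class after dividing x by its Euclidean norm."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return x[label] / norm


x = [3.0, 4.0]  # hypothetical scores for two classes
scaled = [10.0 * xi for xi in x]

score = spherical_score(x, 1)
score_scaled = spherical_score(scaled, 1)
print(score, score_scaled)
```

One property this makes visible: the score depends only on the direction of `x`, so multiplying all scores by a positive constant leaves it unchanged.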

Note though that I applied Brier Loss directly on the logits, which is probably not how they are meant to be used...

thomasahle OP t1_j9kapw7 wrote on February 22, 2023 at 4:38 PM

Reply to comment by ChuckSeven in Unit Normalization instead of Cross-Entropy Loss [Discussion] by thomasahle

Even with angles you can still have exponentially many vectors that are nearly orthogonal to each other, if that's what you mean...
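
A small experiment illustrating the near-orthogonality claim, as a sketch only: it draws a handful of random sign vectors in a fairly high dimension and reports the largest absolute cosine between any pair. The dimension, count, and seed are arbitrary choices, and this shows the tendency rather than proving the exponential bound.

```python
import math
import random

random.seed(0)
dim, count = 1000, 20

# Random sign vectors: every entry is +1 or -1, so each vector has norm sqrt(dim).
vectors = [[random.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(count)]


def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


max_abs_cosine = max(
    abs(cosine(vectors[i], vectors[j]))
    for i in range(count)
    for j in range(i + 1, count)
)
print(max_abs_cosine)
```

The cosines between independent random directions concentrate around zero with a spread of roughly `1 / sqrt(dim)`, so all pairs end up close to orthogonal.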

I agree the representations will be different. Indeed one issue may be that large negative entries will be penalized as much as large positive ones, which is not the case for logsumexp...
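
To make the asymmetry concrete, a minimal sketch comparing the two penalty terms on a hypothetical two-class score vector with one large entry, positive in one case and negative in the other. The helper names are illustrative only.

```python
import math


def logsumexp(x):
    """Numerically stable log(sum(exp(x_i)))."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))


def squared_norm(x):
    return sum(xi * xi for xi in x)


x_pos = [0.0, 10.0]   # one large positive entry
x_neg = [0.0, -10.0]  # one large negative entry

print(squared_norm(x_pos), squared_norm(x_neg))  # identical penalties
print(logsumexp(x_pos), logsumexp(x_neg))        # about 10 versus about 0
```

The squared norm treats both vectors the same, while logsumexp barely registers the large negative entry.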

But on the other hand more "geometric" representations like this, based on angles, may make the vectors more suitable for stuff like LSH.

thomasahle OP t1_j9iq4rz wrote on February 22, 2023 at 6:42 AM

Reply to comment by cthorrez in Unit Normalization instead of Cross-Entropy Loss [Discussion] by thomasahle

Should have said Accuracy.

Only MNIST though. Went from 3.8% error on a simple linear model to 1.2%, on average, with an 80%/20% train/test split. So in no way amazing, just interesting.

Just wondered if other people had experimented more with it, since it also trains a bit faster.