chaosmosis
chaosmosis t1_j4lnsfe wrote
Reply to comment by jimmymvp in Why is Super Learning / Stacking used rather rarely in practice? [D] by Worth-Advance-1232
Thanks!
chaosmosis t1_j4fpg6g wrote
Reply to comment by jimmymvp in Why is Super Learning / Stacking used rather rarely in practice? [D] by Worth-Advance-1232
I'd love the reference if you can find it.
chaosmosis t1_j47d0ev wrote
In addition to being more straightforward, applying the same total amount of compute to a single model doing end-to-end learning often yields better performance than splitting that compute across multiple models. As far as I'm aware, there's no systematic way to tell which approach will be preferable in a given setting; this is just a rule-of-thumb opinion.
chaosmosis t1_j45vdll wrote
Reply to comment by giga-chad99 in [D] What's your opinion on "neurocompositional computing"? (Microsoft paper from April 2022) by currentscurrents
With enough scale we get crude compositionality, yes. That trend will probably continue, but I don't think it'll take us to the moon.
chaosmosis t1_j0icgvf wrote
Reply to comment by BrisklyBrusque in [R] Are there open research problems in random forests? by SpookyTardigrade
As an example, imagine that Bob and Susan are estimating the height of a dinosaur, and Bob makes errors that are exaggerated versions of Susan's: if Susan underestimates its height by ten feet, Bob underestimates it by twenty; if Susan overestimates by thirty feet, Bob overestimates by forty. You can "artificially construct" a new prediction to average with Susan's by taking the difference between Bob's prediction and hers, flipping its sign, and adding the result to her prediction. Then you conduct traditional linear averaging of the constructed prediction with Susan's.
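To put numbers on that construction, here's a quick NumPy sketch (the true height and all predictions are made up for illustration):

```python
import numpy as np

true_height = 100.0  # made-up ground truth

# Susan's errors: -10 and +30; Bob's exaggerated errors: -20 and +40
susan = np.array([90.0, 130.0])
bob = np.array([80.0, 140.0])

# Flip the sign of Bob's deviation from Susan and add it to her
# prediction: the constructed predictor's errors oppose hers.
constructed = susan + (susan - bob)      # -> [100., 120.]

# Ordinary linear averaging with Susan's prediction:
combined = (susan + constructed) / 2.0   # -> [95., 125.]

print(combined - true_height)  # errors shrink from (-10, +30) to (-5, +25)
```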
Visually, you can think of normal averaging as drawing a straight line between two models' outputs in R^n and choosing some point between them, while control variates extend that line further in both directions and let you choose a more extreme point.
It gets a little more complicated with more than two predictors, and when the predictions live in higher dimensions rather than one, but not by much. Intuitively, you have to avoid "overcounting" certain relationships when you're building a flipped predictor. This is why the financial portfolio framework is helpful: portfolio managers are already used to thinking about correlations between lots of different investments.
The tl;dr version is, you want models with errors that balance each other out.
chaosmosis t1_j0ib3ja wrote
Reply to comment by BrisklyBrusque in [R] Are there open research problems in random forests? by SpookyTardigrade
No problem at all. I'm leaving ML research for at least the next couple of years, and I want my best ideas to get adopted by others. I figured all of the above out during a three-month summer internship in 2020, and nobody there cared because it couldn't immediately be used to blow things up more effectively, which was incredibly disappointing.
As far as I can tell, nobody but me, and one footnote in an obscure economics paper whose citation I've forgotten, has ever noted that ensembles and financial portfolios deal with the same problem once you cast both in terms of control variates. In theory, bridging the two via control variates should allow for stealing lots and lots of ideas from the finance literature for ML papers. I'd really like to see someone make something of the connection someday.
chaosmosis t1_j0i51ka wrote
Reply to comment by BrisklyBrusque in [R] Are there open research problems in random forests? by SpookyTardigrade
> Random forests raise a lot of questions about the relationship between ensemble diversity and ensemble accuracy, about which there are many mysteries.
By way of Jensen's inequality, there's a generalization of the bias-variance decomposition of mean-squared error that holds for all convex loss functions; see the 2021 paper Generalized Negative Correlation Learning. From there, you can view linear averaging of model outputs as a special case of the method of control variates, where model diversity matters insofar as it's harnessed to reduce error due to variance. I think control variates give us a unified theoretical framework for investigating ensembles. They have all sorts of fun generalizations, like nonlinear control variates, that are as yet completely unexplored in the machine learning literature.
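To make the "special case" claim concrete, here's a quick sketch with synthetic errors: the combination f1 + c·(f2 − f1) is plain averaging at c = 0.5, while the control-variate view picks the variance-minimizing coefficient instead (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic errors of two unbiased models on held-out data
e1 = rng.normal(0.0, 1.0, 100_000)
e2 = 0.5 * e1 + rng.normal(0.0, 1.0, 100_000)

# The combination f1 + c * (f2 - f1) has error e1 + c * d with d = e2 - e1.
# c = 0.5 is plain averaging; the control-variate choice is the
# variance-minimizing coefficient c* = -Cov(e1, d) / Var(d).
d = e2 - e1
c_star = -np.cov(e1, d)[0, 1] / np.var(d)

print(np.var(e1))               # single model          (~1.00)
print(np.var(e1 + 0.5 * d))     # plain 50/50 average   (~0.81)
print(np.var(e1 + c_star * d))  # optimal coefficient   (~0.80)
```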
In other words, you should diversify ensembles in exactly the same way you'd diversify a portfolio of financial investments under optimal portfolio theory. See also Philip Tetlock's work on his "extremizing algorithm" for an application of similar ideas to human forecasting competitions.
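To spell out the portfolio analogy: the classic minimum-variance portfolio weights, w ∝ Σ⁻¹1 over the covariance of model errors, transfer directly to ensemble weighting. A quick sketch, assuming unbiased models and synthetic held-out errors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated errors of three models on held-out data (synthetic)
mix = np.array([[1.0, 0.6, 0.2],
                [0.0, 0.8, 0.3],
                [0.0, 0.0, 0.5]])
errors = rng.normal(size=(10_000, 3)) @ mix

# Minimum-variance portfolio: w proportional to Sigma^{-1} 1,
# normalized so the weights sum to one (weights can go negative,
# which is exactly the "extremizing" beyond plain averaging).
cov = np.cov(errors, rowvar=False)
ones = np.ones(cov.shape[0])
w = np.linalg.solve(cov, ones)
w /= w.sum()

print(w)
print(errors.mean(axis=1).var())  # uniform ensemble weights
print((errors @ w).var())         # portfolio-weighted ensemble
```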
The main outstanding question with respect to ensembles, to my mind, is not how to make the most use of a collection of models, but when and whether to invest computational effort into running multiple models in parallel and optimizing the relationships between their errors rather than into training a bigger model.
chaosmosis t1_izawfsz wrote
Reply to comment by ThisIsMyStonerAcount in [D] If you had to pick 10-20 significant papers that summarize the research trajectory of AI from the past 100 years what would they be by versaceblues
Non-monotonic activation functions can let a single layer solve XOR, but they take forever to converge.
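If you hand-pick the weights instead of training (dodging the convergence problem), a single unit with a non-monotonic activation computes XOR exactly; a minimal sketch:

```python
import numpy as np

# z = x1 + x2 is 0, 1, or 2 on binary inputs; the non-monotonic
# activation sin^2(pi * z / 2) maps that to 0, 1, 0 -- i.e., XOR.
def single_unit_xor(x):
    z = x @ np.array([1.0, 1.0])       # weights (1, 1), bias 0
    return np.sin(np.pi * z / 2) ** 2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(single_unit_xor(X).round(6))     # [0. 1. 1. 0.]
```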
chaosmosis t1_iz8fzyo wrote
Reply to comment by Ulfgardleo in [R] The Forward-Forward Algorithm: Some Preliminary Investigations [Geoffrey Hinton] by shitboots
I think the main problem is that they aren't theory-driven except in an ad hoc sense. They'd be fine if they hadn't become a fad that everyone and their mother publishes on.
For actually neat discussions of distributed computing in animals, I don't think it's possible to do better than reading about octopuses. Strong recommend for Other Minds to anyone interested in the area.
chaosmosis t1_iz3ymas wrote
Reply to comment by Ulfgardleo in [R] The Forward-Forward Algorithm: Some Preliminary Investigations [Geoffrey Hinton] by shitboots
Gimmick animal optimization procedures are my guilty pleasure. They're like intellectually cute to me or something. I get happy every time I come across a new one.
chaosmosis t1_ixhl90r wrote
Reply to comment by ReginaldIII in [D] Schmidhuber: LeCun's "5 best ideas 2012-22” are mostly from my lab, and older by RobbinDeBank
"My ideas are the best in {Big Set}" versus "My ideas are the best in {Bigger Set}".
chaosmosis t1_ixhcvxy wrote
Reply to [D] Schmidhuber: LeCun's "5 best ideas 2012-22” are mostly from my lab, and older by RobbinDeBank
Title's a little misleading. Initially I thought the claim was that the best ideas LeCun has had all came from Schmidhuber. Instead, the claim is that the best ideas anyone has had, as listed by LeCun, all came from Schmidhuber.
Amusingly, that's actually a more arrogant claim, but it's less personal, and I don't think the tweet is "striking back" against LeCun.
chaosmosis t1_j4mjxh9 wrote
Reply to [D] The Illustrated Stable Diffusion (Video) by jayalammar
Are the 77 token embedding vectors just concatenated together as ClipText's output? Is there any structure to their ordering as processed by the Image Information Creator? Assuming a trained model, would permuting the vectors' order before passing them forward to the next subcomponent break anything?
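Here's a sketch of how one might test that last question. The two functions are hypothetical stubs standing in for the trained components, not a real API; you'd swap in the actual ClipText encoder and the denoising loop to run the experiment for real:

```python
import torch

def clip_text_encoder(prompt):            # hypothetical stub
    torch.manual_seed(42)
    return torch.randn(1, 77, 768)        # (batch, tokens, dim)

def image_information_creator(text_emb):  # hypothetical stub; the real
    return text_emb.mean(dim=1)           # one is a UNet + scheduler loop

text_embeddings = clip_text_encoder("a photo of an astronaut")
perm = torch.randperm(text_embeddings.shape[1])
shuffled = text_embeddings[:, perm, :]

image_a = image_information_creator(text_embeddings)
image_b = image_information_creator(shuffled)

# If the tokens are read only through cross-attention and no positional
# signal is added downstream, the two outputs should match closely.
print((image_a - image_b).abs().max())
```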
General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come to me with this description of an architecture several years ago, I would have told them it was too complicated to work. I'm not sure what I should change about my intuitions in response to observing that it works despite them.