DigThatData t1_j7tb03a wrote

i'm not sure that's an emergent ability so much as it is explicitly what the model is being trained to learn. it's not surprising to me that there is a "painting signature" concept it has learned and samples from when it generates gibberish of a particular length and size in the bottom right corner (for example). that sounds like one of the easier "concepts" it would have learned.

11

DigThatData t1_j6ynesq wrote

> p(sample | dataset including sample) / p(sample | dataset excluding sample)

which, like I said, is basically identical to statistical leverage. If you haven't seen it before, you can compute LOOCV for a regression model directly from the hat matrix (whose diagonal entries are the leverage values). This isn't a good definition of "memorization" because it's indistinguishable from how we define outliers.
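
for the OLS case, here's a quick numpy sketch of what I mean (toy data of my own, obviously, not anything from the paper):

```python
# LOOCV residuals for OLS straight from the hat matrix -- no refitting.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # design matrix
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix; diag(H) = leverages
resid = y - H @ y                     # in-sample residuals
loo_resid = resid / (1 - np.diag(H))  # e_i / (1 - h_ii), the LOO residuals
print("LOOCV MSE:", np.mean(loo_resid ** 2))
```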

> What's the definition of memorization here? how do we measure it?

I'd argue that what's at issue here is differentiating between memorization and learning. My concern with the density ratio is that a model that had learned to generalize well in the neighborhood of the observation in question would behave the same way: the definition doesn't separate memorization from learning, which I think effectively renders it useless.
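
to illustrate, here's a totally synthetic toy (no model training anywhere in sight) where that leave-one-out density ratio lights up on a garden-variety outlier:

```python
# p(x | data incl. x) / p(x | data excl. x), estimated with a KDE.
# a high ratio just means the point is isolated -- i.e. an outlier.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), [8.0]])  # 200 inliers + 1 outlier

def loo_density_ratio(i, data):
    x = data[i]
    p_with = gaussian_kde(data)(x)[0]
    p_without = gaussian_kde(np.delete(data, i))(x)[0]
    return p_with / p_without

print(loo_density_ratio(0, data))    # ~1 for a typical in-distribution point
print(loo_density_ratio(200, data))  # huge for the outlier
```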

I don't love everything about the paper you linked in the OP, but I think they're on the right track by defining their "memorization" measure in terms of probing the model's ability to regenerate presumably-memorized data, especially since our main concern wrt memorization is the model reproducing memorized values.

1

DigThatData t1_j6y35x2 wrote

It's a startup that evolved out of a community of people who found each other through common interests in open source machine learning for public good (i.e. eleuther and laion), committed to providing the public with access to ML tools that were otherwise gated behind corporate paywalls. For several years, that work was all being done by volunteers in their free time. We're barely a year old as an actual company and we're not perfect. But as far as intentions and integrity go: you're talking about a group of people who were essentially already functioning as a volunteer-run non-profit, and then were given the opportunity to continue that work with a salary, benefits, and resources.

If profit was our chief concern, we wouldn't be giving these models away for free. Simple as that. There're plenty of valid criticisms you could lob our way, but a lack of principles and greed aren't among them. You might not like the way we do things or certain choices we've made, but if you think the intentions behind those decisions are primarily profit-motivated: you should really learn more about the people you're criticizing, because you couldn't be more misinformed.

1

DigThatData t1_j6xexyf wrote

> That models that memorize better generalize better has been observed in large language models

I think this is an incorrect reading. increasing model capacity is a reliable strategy for increasing generalization (Kaplan et al 2020, "Scaling Laws"), and larger capacity models have a higher propensity to memorize (your citations). The correlations discussed in both of those links are to capacity specifically, not to generalization ability broadly. scaling law research has recently been demonstrating that there is probably a lot of wasted capacity in certain architectures, which suggests that the generalization potential of those models could be achieved with a much lower potential for memorization. see for example Tirumala et al 2022 and Chinchilla (Hoffmann et al 2022).

which is to say: you're not wrong that a lot of recently trained models that generalize well have also been observed to memorize. but I don't think it's accurate to suggest that the reason these models generalize well is linked to a propensity/ability to memorize. it's possible this is the case, but I don't think anything suggesting this has been demonstrated. it seems more likely that generalization and memorization are correlated through the confounder of capacity, and contemporary research is actively attacking the problem of excess capacity in part to address the memorization question specifically.
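
the confounding story is easy to see in a toy simulation (numbers entirely made up, just to show the mechanism):

```python
# capacity drives both generalization and memorization; the two correlate
# marginally, but the association vanishes once you condition on capacity.
import numpy as np

rng = np.random.default_rng(0)
capacity = rng.normal(size=10_000)
generalization = capacity + rng.normal(size=10_000)
memorization = capacity + rng.normal(size=10_000)

print(np.corrcoef(generalization, memorization)[0, 1])  # ~0.5, spurious

# residualize out capacity and re-correlate (a crude partial correlation)
print(np.corrcoef(generalization - capacity, memorization - capacity)[0, 1])  # ~0
```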

EDIT: Also... I have some mixed feelings about that last paper. It's new to me and I just woke up, so I'll have to take another look after I've had some coffee. Their approach feels intuitively sound from the direction of the LOO methodology, but I think their probabilistic formulation of memorization is problematic. They formalize memorization using a definition that appears to me to be indistinguishable from an operational definition of generalizability. Not even OOD generalizability: perfectly reasonable in-distribution generalization to unseen data, according to these researchers, would have the same properties as memorization. That's... not helpful. Anyway, need to read this closer, but "lower posterior likelihood" to me seems fundamentally different from "memorized".

Their approach appears to make no effort to distinguish between a model that has "memorized" a training datum and one that has "learned" meaningful features in the neighborhood of a datum that has high [leverage](https://en.wikipedia.org/wiki/Leverage_\(statistics\)). Are they detecting memorization, or outlier samples? If the "outliers" are valid in-distribution samples, removing them harms the diversity of the dataset, and the model may have significantly less opportunity to learn features in the neighborhood of those observations (i.e. they are high leverage). My understanding is that the memorization problem is generally more pathological in high-density regions of the data, which would be undetectable by their approach.

1

DigThatData t1_j6uxsdj wrote

> full image comparison.

that's not actually the metric they used, precisely for the reasons you suggest: they found whole-image comparison unreliable. Specifically, pairs of images with large black backgrounds were getting misleadingly high similarity scores. So they chunked each image into regions and used the score for the most dissimilar (but corresponding) regions to represent the whole image.
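
something like this, if I'm reading their construction right (my paraphrase, not their code; the patch size here is a guess):

```python
# whole-image score = distance between the *most dissimilar* pair of
# corresponding patches, so a shared black background can't mask a
# mismatch elsewhere in the image.
import numpy as np

def patchwise_l2(a: np.ndarray, b: np.ndarray, patch: int = 128) -> float:
    assert a.shape == b.shape
    dists = []
    for i in range(0, a.shape[0] - patch + 1, patch):
        for j in range(0, a.shape[1] - patch + 1, patch):
            pa = a[i:i + patch, j:j + patch].astype(np.float64)
            pb = b[i:i + patch, j:j + patch].astype(np.float64)
            dists.append(np.sqrt(np.mean((pa - pb) ** 2)))
    return max(dists)  # low only if every region matches
```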

Further, I think they demonstrated their methodology probably wasn't too conservative when they were able to use the same approach to get a 2.3% hit rate from Imagen (concretely: 23 memorized images in 1000 tested prompts). This hit rate is very likely a big overestimate of Imagen's overall propensity to memorize, since the prompts were selected to target likely-memorized content, but it demonstrates that the authors' L2 metric is able to do its job.

Also, it's not like the authors didn't look at the images. They did, and found a handful more hits, which that 0.03% figure already accounts for.

2

DigThatData t1_j6ugpgr wrote

very difficult is correct. The authors identified 350,000 candidate prompt/image pairs that were likely to have been memorized because they were duplicated repeatedly in the training data, and even within that 350k they were only able to find 109 cases of memorization in Stable Diffusion (about 0.03%).

EDIT:

Conflict of Interest Disclosure: I'm a Stability.AI employee, and as such I have a financial interest in protecting the reputation of generative models generally and SD in particular. Read the paper for yourself. Everything here is my own personal opinion, and I am not speaking as a representative of Stability AI.

My reading is that yes: they demonstrated these models are clearly capable of memorizing images, but also that they are clearly capable of being trained in a way that makes them fairly robust to this phenomenon. Imagen has a higher capacity and was trained on much less data: it unsurprisingly is more prone to memorization. SD was trained on a massive dataset and has a smaller capacity: after constraining attention to the content we think it had the best excuse to have memorized, it barely memorized any of it.

There's almost certainly a scaling law here, and finding it will permit us to be even more principled about robustness to memorization. My personal reading of this experiment is that SD is probably pretty close to the Pareto boundary here, and we could probably flush out the memorization phenomenon entirely if we trained it on more data, trimmed away at its capacity, or tinkered with the model's topology.

26

DigThatData t1_j46bnn7 wrote

AI has basically become a buzzword that means "this thing is capable of achieving what it does because it's powered by ML", and in this context especially, ML has become synonymous with deep learning.

1

DigThatData t1_j3nvle9 wrote

i just wanted to comment that your solution to the galaxy zoo contest forever ago was the first demonstration to really open my eyes to what was possible with clever data augmentation.

14

DigThatData t1_j2m3iw7 wrote

Reply to comment by fdis_ in [D] Machine Learning Illustrations by fdis_

you should reorder sections and nesting so the major book sections each appear in the sidebar instead of "references" and "todo". it's bad enough that the user has to click on anything to reveal the table of contents, but hiding it behind "machine learning" makes it two clicks deep. there's no reason for that.

10

DigThatData t1_j0kjvw6 wrote

Relevant reference I think you should include for your discussion: a summary of some especially pernicious silent bugs in scikit-learn that were deliberate design choices made by the library authors. Their impact as bugs was a consequence of opaque documentation or deceptive/non-obvious naming choices, in some cases even in spite of user complaints about the undesirable behavior. - https://www.reddit.com/r/statistics/comments/8de54s/is_r_better_than_python_at_anything_i_started/dxmnaef/?context=3

EDIT: also this - https://www.reddit.com/r/MachineLearning/comments/aryjif/d_alternatives_to_scikitlearn/egrctzk/?context=3

like you say, "do not blindly trust the framework"

--- full disclosure: I wrote that under my old account. If you choose to add that comment as a reference, please attribute it to David Marx

2

DigThatData t1_ixkghj5 wrote

sounds like the problem here is the metrics, then. which is also something I'm pretty sure only became a thing fairly recently: for a long time, the only citation-based metric anyone talked about was their Erdős number, which was a tongue-in-cheek thing anyway. Concern over metrics like this is more likely than not going to damage research progress by encouraging gamification. The only "cited by" count I ever concern myself with is for sorting stuff on google scholar, which I never presume is an exact count or directly maps to the ordering I really need.

1

DigThatData t1_ixinfbc wrote

> If they cite GRU, they should cite LSTM as well.

that's not how citations work...

> GRU cite LSTM so it's fine to cite GRU but not LSTM.

but that's literally how citations work. If you cite paper X, you are implicitly citing everything that paper X cited as well. citation graphs are transitive.

1

DigThatData t1_iw7r5ou wrote

this is interesting to me and it sounds like there's probably an opportunity here to develop some pre-trained models specifically to support astrophysicists, especially with JWST hitting the scene.

would you be interested in connecting to discuss the potential opportunity here? just because there aren't currently any 'foundation models' relevant to your work doesn't mean there couldn't be.

1