DigThatData t1_j6ynesq wrote
Reply to comment by pm_me_your_pay_slips in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
> p(sample | dataset including sample) / p(sample | dataset excluding sample)
which, like I said, is basically identical to statistical leverage. If you haven't seen it before, you can compute LOOCV for a regression model directly from the hat matrix (which is another name for the matrix of leverage values). This isn't a good definition for "memorization" because it's indistinguishable from how we define outliers.
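To make the LOOCV/leverage connection concrete, here's a minimal sketch (mine, just numpy and ordinary least squares): the leave-one-out residuals fall straight out of the hat matrix via e_i / (1 - h_ii), no refitting required.

```python
import numpy as np

# minimal sketch: leave-one-out residuals for OLS computed directly
# from the hat matrix (whose diagonal is the leverage of each point)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
leverage = np.diag(H)                       # h_ii
residuals = y - H @ y                       # ordinary residuals
loo_residuals = residuals / (1 - leverage)  # LOO residuals, no model refitting
```

High-leverage points are exactly the ones whose predictions move the most when they're held out, which is why a "drop the sample and see how much the likelihood changes" score looks so much like an outlier detector.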
> What's the definition of memorization here? how do we measure it?
I'd argue that what's at issue here is differentiating between memorization and learning. My concern with the density ratio is that a model that had learned to generalize well in the neighborhood of the observation in question would behave the same way. So this definition doesn't differentiate between memorization and learning, which I think effectively renders it useless.
I don't love everything about the paper you linked in the OP, but I think they're on the right track by defining their "memorization" measure in terms of probing the model's ability to regenerate presumably memorized data, especially since our main concern with memorization is the model actually reproducing memorized values.
DigThatData t1_j6y35x2 wrote
Reply to comment by A_fellow in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
It's a startup that evolved out of a community of people who found each other through common interests in open source machine learning for public good (i.e. eleuther and laion), committed to providing the public with access to ML tools that were otherwise gated by corporate paywalls. For several years, that work was all being done by volunteers in their free time. We're barely a year old as an actual company and we're not perfect. But as far as intentions and integrity go: you're talking about a group of people who were essentially already functioning as a volunteer run non-profit, and then were given the opportunity to continue that work with a salary, benefits, and resources.
If profit were our chief concern, we wouldn't be giving these models away for free. Simple as that. There are plenty of valid criticisms you could lob our way, but a lack of principles and greed aren't among them. You might not like the way we do things or certain choices we've made, but if you think the intentions behind those decisions are primarily profit-motivated, you should really learn more about the people you are criticizing, because you couldn't be more misinformed.
DigThatData t1_j6xexyf wrote
Reply to comment by pm_me_your_pay_slips in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
> That models that memorize better generalize better has been observed in large language models
I think this is an incorrect reading. Increasing model capacity is a reliable strategy for increasing generalization (Kaplan et al. 2020, Scaling Laws), and larger-capacity models have a higher propensity to memorize (your citations). The correlations discussed in both of those links are to capacity specifically, not to generalization ability broadly. Scaling-law research has recently been demonstrating that there is probably a lot of wasted capacity in certain architectures, which suggests that the generalization potential of those models could be achieved with a much lower potential for memorization. See, for example, Tirumala et al. 2022 and Chinchilla.

Which is to say: you're not wrong that a lot of recently trained models that generalize well have also been observed to memorize. But I don't think it's accurate to suggest that the reason these models generalize well is linked to a propensity or ability to memorize. It's possible this is the case, but I don't think anything demonstrating it has been shown. It seems more likely that generalization and memorization are correlated through the confounder of capacity, and contemporary research is actively attacking the problem of excess capacity in part to address the memorization question specifically.
EDIT: Also... I have some mixed feelings about that last paper. It's new to me and I just woke up, so I'll have to take another look after I've had some coffee. Their approach feels intuitively sound from the direction of the LOO methodology, but I think their probabilistic formulation of memorization is problematic. They formalize memorization using a definition that appears to me to be indistinguishable from an operational definition of generalizability. Not even OOD generalizability: perfectly reasonable in-distribution generalization to unseen data would, according to these researchers, have the same properties as memorization. That's... not helpful. Anyway, I need to read this more closely, but "lower posterior likelihood" seems to me fundamentally different from "memorized".

Their approach appears to make no effort to distinguish between a model that has "memorized" a training datum and one that has "learned" meaningful features in the neighborhood of a datum with high [leverage](https://en.wikipedia.org/wiki/Leverage_(statistics)). Are they detecting memorization, or outlier samples? If the "outliers" are valid in-distribution samples, removing them harms the diversity of the dataset, and the model may have significantly less opportunity to learn features in the neighborhood of those observations (i.e. they are high leverage). My understanding is that the memorization problem is generally more pathological in high-density regions of the data, which would be undetectable by their approach.
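Here's a toy illustration of that worry (mine, not from the paper; one-dimensional data and a KDE standing in for the model's density): the point that maximizes p(x | data including x) / p(x | data excluding x) is simply the outlier, even though there's no meaningful sense in which anything has been memorized.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), [6.0]])    # 200 inliers + one outlier

scores = []
for i, x in enumerate(data):
    p_with = gaussian_kde(data)(x)[0]                    # p(x | dataset including x)
    p_without = gaussian_kde(np.delete(data, i))(x)[0]   # p(x | dataset excluding x)
    scores.append(p_with / p_without)

print(data[int(np.argmax(scores))])   # 6.0 -- the outlier wins the "memorization" score
```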
DigThatData t1_j6uxsdj wrote
Reply to comment by IDoCodingStuffs in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
> full image comparison.
that's not actually the metric they used, precisely because of the kind of issue you're raising: they found that a naive whole-image comparison behaved badly. Specifically, images with large black backgrounds were coming back with misleadingly high similarity scores, so they chunked each image into regions and used the score for the most dissimilar (but corresponding) region to represent the whole image.
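Roughly something like this sketch (my own paraphrase of the idea, not the authors' code; the tile size here is arbitrary):

```python
import numpy as np

def tiled_l2(img_a, img_b, tile=64):
    """Score a pair of same-sized images by their *most dissimilar*
    corresponding tile, so a big shared black background can't drag
    the whole-image score down."""
    assert img_a.shape == img_b.shape
    h, w = img_a.shape[:2]
    worst = 0.0
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            a = img_a[y:y + tile, x:x + tile].astype(float)
            b = img_b[y:y + tile, x:x + tile].astype(float)
            worst = max(worst, float(np.sqrt(((a - b) ** 2).mean())))
    return worst  # small only if *every* region matches closely
```

A pair only registers as a near-duplicate if even the worst tile is close, which is what kills the black-background false positives.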
Further, I think they demonstrated that their methodology probably wasn't too conservative when they were able to use the same approach to get a 2.3% hit rate from Imagen (concretely: 23 memorized images out of 1,000 tested prompts). That hit rate is very likely a big overestimate of Imagen's propensity to memorize, but it demonstrates that the authors' L2 metric is able to do its job.
Also, it's not like the authors didn't look at the images. They did, and found a handful more hits, which that 0.03% figure already accounts for.
DigThatData t1_j6uu82y wrote
Reply to comment by ItsJustMeJerk in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
This is true, and also generalization and memorization are not mutually exclusive.
EDIT: I can't think of a better way to articulate this, but the image that keeps coming to my mind is a model that memorizes the full training data and then simulates a nearest-neighbors estimate.
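Something like this caricature (a toy sketch, not from any paper):

```python
import numpy as np

class NearestNeighborMemorizer:
    """Stores the entire training set verbatim and answers every query with
    the nearest stored example: total memorization that can still look like
    generalization wherever the training data is dense enough."""

    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        dists = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
        return self.y[dists.argmin(axis=1)]
```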
DigThatData t1_j6ugpgr wrote
Reply to comment by RandomCandor in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
very difficult is correct. The authors identified 350,000 candidate prompt/image pairs that were likely to have been memorized because they were duplicated repeatedly in the training data, and were only able to find 109 cases of memorization in Stable Diffusion in that 350k.
EDIT:
Conflict of Interest Disclosure: I'm a Stability.AI employee, and as such I have a financial interest in protecting the reputation of generative models generally and SD in particular. Read the paper for yourself. Everything here is my own personal opinion, and I am not speaking as a representative of Stability AI.
My reading is that yes: they demonstrated these models are clearly capable of memorizing images, but also that they are clearly capable of being trained in a way that makes them fairly robust to this phenomenon. Imagen has a higher capacity and was trained on much less data: it unsurprisingly is more prone to memorization. SD was trained on a massive dataset and has a smaller capacity: after constraining attention to the content we think it had the best excuse to have memorized, it barely memorized any of it.
There's almost certainly a scaling law here, and finding it will let us be even more principled about robustness to memorization. My personal reading of this experiment is that SD is probably pretty close to the Pareto boundary here, and we could probably flush out the memorization phenomenon entirely if we trained it on more data, trimmed away at the capacity, or otherwise tinkered with the model's topology.
DigThatData t1_j61zv3l wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
Compute Is All You Need
DigThatData t1_j58zwdc wrote
Reply to [D] Did YouTube just add upscaling? by Avelina9X
do you have images of "baseline" compression artifacts to compare this against?
DigThatData t1_j46bnn7 wrote
Reply to [D] Has ML become synonymous with AI? by Valachio
AI has basically become a buzzword that means "this thing is capable of achieving what it does because it's powered by ML", and in this context especially, ML has become synonymous with deep learning.
DigThatData t1_j3v2gjs wrote
Reply to comment by thecodethinker in [R] Diffusion language models by benanne
attention is essentially a dynamically weighted average: the weights are computed on the fly from dot products between the queries and keys, and then used to mix the value vectors. if you haven't already seen this blog post, it's one of the more popular explanations: https://jalammar.github.io/illustrated-transformer/
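stripped of heads, masking, and batching, it's roughly this (a bare-bones numpy sketch, not any particular library's API):

```python
import numpy as np

def attention(Q, K, V):
    """single-head scaled dot-product attention: query/key dot products
    become softmax weights, which are used to mix the value vectors."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)   # softmax over keys
    return weights @ V                                   # dynamically weighted sum of values
```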
DigThatData t1_j3v26zy wrote
Reply to comment by benanne in [R] Diffusion language models by benanne
I think you read that comment backwards :)
DigThatData t1_j3v21hi wrote
Reply to comment by jimmymvp in [R] Diffusion language models by benanne
Have you read the stable diffusion paper? They discuss the motivations there. https://arxiv.org/abs/2112.10752
DigThatData t1_j3nvle9 wrote
Reply to [R] Diffusion language models by benanne
I just wanted to comment that your solution to the Galaxy Zoo contest forever ago was the first demonstration that really opened my eyes to what's possible with clever data augmentation.
DigThatData t1_j2s71x9 wrote
I'd like to see this evaluated on more than just a single dataset
DigThatData t1_j2m3iw7 wrote
Reply to comment by fdis_ in [D] Machine Learning Illustrations by fdis_
you should reorder the sections and nesting so the major book sections each appear in the sidebar instead of "references" and "todo". it's bad enough that the user has to click on anything to reveal the table of contents, but hiding it behind "machine learning" makes it two clicks deep. there's no reason for that.
DigThatData t1_j0kjvw6 wrote
Reply to [R] Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow by Ok-Teacher-22
Relevant reference I think you should include in your discussion: a summary of some especially pernicious silent bugs in scikit-learn that were deliberate design choices made by the library authors, and whose impact as bugs was a consequence of opaque documentation or deceptive/non-obvious naming choices, in some cases persisting even in spite of user complaints about the undesirable behavior. - https://www.reddit.com/r/statistics/comments/8de54s/is_r_better_than_python_at_anything_i_started/dxmnaef/?context=3
EDIT: also this - https://www.reddit.com/r/MachineLearning/comments/aryjif/d_alternatives_to_scikitlearn/egrctzk/?context=3
like you say, "do not blindly trust the framework"
--- full disclosure: I wrote that under my old account. If you choose to add that comment as a reference, please attribute it to David Marx
DigThatData t1_iy5j8en wrote
Reply to comment by hadaev in [P] Stable Diffusion 2.0 and the Importance of Negative Prompts for Good Results (+ Colab Notebooks + Negative Embedding) by minimaxir
actually it's all text2im, but "text" includes some custom learned tokens.
DigThatData t1_iy5j566 wrote
Reply to [P] Stable Diffusion 2.0 and the Importance of Negative Prompts for Good Results (+ Colab Notebooks + Negative Embedding) by minimaxir
excellent work, thanks for digging so deeply into this phenomenon and writing up!
DigThatData t1_ixswsat wrote
Reply to comment by deepestdescent in [D] Alternatives to the shap explainability package by deepestdescent
lol whoops. did you create issues on those projects to let them know they're depending on an unmaintained repo?
DigThatData t1_ixsggx9 wrote
try poking around this list, surely there's something in there that fits what you need: https://github.com/stars/dmarx/lists/ml-explainability
EDIT: OK, here are probably your best candidates from that list:
- https://github.com/salesforce/OmniXAI
- https://github.com/Trusted-AI/AIX360
- https://github.com/interpretml/interpret
- https://github.com/pytorch/captum
- https://github.com/MAIF/shapash
And another list: https://github.com/jphall663/awesome-machine-learning-interpretability
DigThatData t1_ixkghj5 wrote
Reply to comment by alwayslttp in [D] Schmidhuber: LeCun's "5 best ideas 2012-22” are mostly from my lab, and older by RobbinDeBank
sounds like the problem here is the metrics, then. Citation metrics are also something I'm pretty sure only became a thing fairly recently: for a long time, the only citation-based metric anyone talked about was their Erdős number, which was a tongue-in-cheek thing anyway. Concern over metrics like this is more likely than not going to damage research progress by encouraging gamification. The only "cited by" count I ever concern myself with is for sorting stuff on Google Scholar, and I never presume it's an exact count or that it directly maps to the ordering I actually need.
DigThatData t1_ixinfbc wrote
Reply to comment by crouching_dragon_420 in [D] Schmidhuber: LeCun's "5 best ideas 2012-22” are mostly from my lab, and older by RobbinDeBank
> If they cite GRU, they should cite LSTM as well.
that's not how citations work...
> GRU cite LSTM so it's fine to cite GRU but not LSTM.
but that's literally how citations work. If you cite paper X, you are implicitly citing everything that paper X cited as well. citation graphs are transitive.
DigThatData t1_ixbg446 wrote
Reply to Suggestions for a socially valuable project that would welcome an unpaid contributor [D] by AnthonysEye
https://eleuther.ai/ and https://laion.ai/ generally have several interesting projects going at any time and are always looking for volunteers.
DigThatData t1_iw7r5ou wrote
Reply to comment by moist_buckets in [D] When was the last time you wrote a custom neural net? by cautioushedonist
this is interesting to me and it sounds like there's probably an opportunity here to develop some pre-trained models specifically to support astrophysicists, especially with JWST hitting the scene.
would you be interested in connecting to discuss the potential opportunity here? just because there aren't currently any 'foundation models' relevant to your work doesn't mean there couldn't be.
DigThatData t1_j7tb03a wrote
Reply to comment by edjez in [D] Are there emergent abilities of image models? by These-Assignment-936
I'm not sure that's an emergent ability so much as it's explicitly what the model is being trained to learn. it's not surprising to me that it has learned a "painting signature" concept that it samples from when it generates gibberish of a particular length and size in the bottom-right corner (for example). that sounds like one of the easier "concepts" for it to have learned.