IntelArtiGen t1_izj2jc3 wrote
Reply to comment by OutOfCharm in Representation ability of a MLP network [D] by OutOfCharm
The number of values must be sufficient, and the model must be able to process them. We could imagine a model that doesn't perform well with 10 values because that's too much for it to process, but performs better with 3, even though a "perfect model" would need all 10 values to give the best results.
IntelArtiGen t1_iziveld wrote
I don't know if I entirely got the question but I can try to answer. With one number you can in theory represent an infinite amount of information. In practice, on computers, we don't have infinite precision on one number (fp16, fp32, etc.), and a DL algorithm can't interpret that number with infinite precision either. If -1 and +1 are two different pieces of information, it's fine. If 0.9999999 and 1.000001 are two different pieces of information, a DL algorithm will have trouble learning that distinction.
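To illustrate the precision point, here's a tiny NumPy check (just an illustration of floating-point rounding, not part of any model; the fp16 numbers are ones I picked):

```python
import numpy as np

a, b = np.float32(0.9999999), np.float32(1.000001)
print(a == b)   # False: fp32 can still tell these two values apart

c, d = np.float16(1.0001), np.float16(1.0002)
print(c == d)   # True: fp16 rounds both to exactly 1.0, the distinction is lost
```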
So there is a relationship, because for practical reasons we can't represent everything in one number. But there is also a limit: if you can fit all the information you need in 10 values, using 100,000 values to represent it won't help. And if you want to know the right number of values in theory, I'm afraid you can't, because it depends on the dataset, the model and the training process.
Perhaps this has a bit to do with information theory, but I'm not aware of a branch of information theory that focuses on DL; that area may be under-investigated.
IntelArtiGen t1_izipwih wrote
Reply to [D] When to use 1x1 convolution by Ananth_A_007
1x1 convolutions are practical when you need to change the shape of a tensor. If you have a tensor of shape (B, H, W, 128) you can use a 1x1 to get a tensor of shape (B, H, W, 64) without losing too much information.
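To make the shape change concrete, here's a minimal PyTorch sketch (PyTorch is channels-first, so the (B, H, W, 128) tensor above becomes (B, 128, H, W); all sizes are made up):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 128, 32, 32)              # (B, 128, H, W)
reduce = nn.Conv2d(128, 64, kernel_size=1)   # learned linear mix of the 128 channels at each pixel
print(reduce(x).shape)                       # torch.Size([8, 64, 32, 32])
```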
You can also use a 1x1 with stride 2 in place of max pooling, depending on your constraints. It could perform better, but it could also be more computationally intensive or use extra memory you don't have.
For MobileNetV2 I think you're talking about the inverted residual / linear bottleneck block? The point of this block is to expand and then compress the information, and it's also a residual layer. Because the 1x1 lets you efficiently expand and compress a tensor, you can use it for those steps in this block, and to reshape the tensor so that it can be added as a residual. It seems that "expand / process (depthwise) / compress / residual" requires fewer parameters for the same result than just doing "process / process / process" as we usually do, or even "process / residual ..." as in ResNet. However, it's not easier for the network to learn, so training might take longer while still being more parameter efficient.
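Here's a rough PyTorch sketch of such an inverted residual block, just to show the expand / depthwise / compress / residual structure; the channel count and expansion factor are made up and this isn't the exact MobileNetV2 configuration:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2-style inverted residual block (assumed sizes)."""
    def __init__(self, channels=64, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1x1 "expand": cheaply grow the channel dimension
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise: spatial processing, one filter per channel
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 "compress" (linear bottleneck): shrink back, no activation
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection, input and output shapes match

y = InvertedResidual()(torch.randn(2, 64, 16, 16))
```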
If you're working on new neural network architectures, you have to be able to manipulate tensors of different shapes; the 1x1 essentially lets you change the shape of a tensor while keeping information.
IntelArtiGen t1_iycm9kk wrote
It depends on the accuracy you want. I can train a transformer in 30 min with 30k sentences on an RTX 2070 Super and get meaningful embeddings (similar words are close to each other). It works, but as with any model, it won't be SOTA unless you use billions of sentences and a much larger model with many more GPUs.
I was told the same thing and I wouldn't agree: you need a huge pretraining process if you want SOTA results, but if you can compromise you don't need as much data. An LSTM might still perform better with very little data, though.
IntelArtiGen t1_iw860x6 wrote
Reply to comment by iamnotlefthanded666 in [D] When was the last time you wrote a custom neural net? by cautioushedonist
- Task: reproduce how humans learn new words from images and sounds. I used 3 models. For the autoencoder the task was just to rebuild the input (the loss is a distance between the original spectrogram and the rebuilt spectrogram)
- Input: video (multiple images and sounds in a continuous stream + real-time constraint)
- Input of the audio autoencoder: the sound from the mic (as a mel spectrogram); output: the rebuilt mel spectrogram (autoencoding task).
- Architecture: for audio I just used convolutions to compress the spectrogram and transposed convolutions to rebuild it
So I just stacked multiple convolutions and "deconvolutions". I ran some hyperparameter optimization, but the architecture is not SOTA (that wasn't the goal); I just needed a model that could autoencode mel spectrograms of human voices in real time. I wanted to use a vocal synthesizer, but the ones I found didn't fit my constraints.
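For reference, a heavily simplified sketch of that kind of conv / transposed-conv spectrogram autoencoder (all layer sizes are made up for illustration, this is not my actual model):

```python
import torch
import torch.nn as nn

class SpecAutoencoder(nn.Module):
    """Treats a mel spectrogram as a 1-channel image of shape (1, n_mels, frames)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                       # compress the spectrogram
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                       # rebuild it with transposed convs
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SpecAutoencoder()
spec = torch.randn(4, 1, 80, 128)                           # (batch, 1, mel bins, frames)
loss = nn.functional.mse_loss(model(spec), spec)            # distance between original and rebuilt
```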
IntelArtiGen t1_iw5gjpi wrote
When needed, I usually take an existing architecture and only adapt small parts of it to solve my task. I also wrote a custom autoencoder layer by layer for audio spectrograms (I didn't find an existing model which could do it with my constraints), and a model to convert embeddings from one self-supervised model to another self-supervised model (it's not a complex architecture) while the three models train simultaneously.
Tbh I would prefer to use existing architectures, because a redesigned architecture takes a long time to design, optimize and train, but existing models are often tailored to one task and perform badly on unexpected new tasks. You may also have constraints (real-time, memory efficiency, etc.) that easy-to-reuse published models don't take into account.
Images have pretrained CNNs, but if you want a model to perform self-supervised continual learning and real-time inference on images with just one RTX, it can be harder to find an existing optimized solution for this task.
IntelArtiGen t1_ivj6nih wrote
Reply to comment by blimpyway in [D] At what tasks are models better than humans given the same amount of data? by billjames1685
Probably not, because a 16 y.o. human has 16 years of interactive navigation pretraining in a real world environment in real time before learning to drive. So it depends on how you include this pretraining.
And it also depends on the model's accuracy as a function of dataset size. Say Tesla is 80% accurate (random number) while driving after training on 780M miles, a human is 75% accurate after 3M miles, and the Tesla model trained on 3M miles instead of 780M is also 75% accurate; on those metrics alone, Tesla would be as data-efficient as a human.
No comparison is perfect, but we can't ignore that during the first years of our lives we train to understand the world while not being very good at performing tasks.
IntelArtiGen t1_ivg765c wrote
Reply to comment by billjames1685 in [D] At what tasks are models better than humans given the same amount of data? by billjames1685
Ok, that's one way to put it, and I agree. I tend not to use the concept of "transfer learning" for how we learn, because I think it's more appropriate for well-defined tasks, and we are rarely confronted with tasks as well-defined as the ones we give our models.
And transfer learning implies that you have to retrain part of the model on a new task, which is not exactly how I would describe what we do. When I worked on reproducing how we learn words, I instead implemented it as putting a new label on a representation we were already able to produce from our unsupervised pretraining. I don't know which view is correct; I just know that this works and that you can teach new words/labels to a model without retraining it.
IntelArtiGen t1_ivg4d74 wrote
Reply to comment by billjames1685 in [D] At what tasks are models better than humans given the same amount of data? by billjames1685
I think it's not just "transfer learning" or "image classification"; it's also learning without explicitly using labels, like contrastive learning, self-supervised learning, reinforcement learning, etc.
IntelArtiGen t1_ivg33d5 wrote
Reply to comment by billjames1685 in [D] At what tasks are models better than humans given the same amount of data? by billjames1685
>I’m pretty sure it’s been well established that we can learn after seeing a few images even for things we haven’t seen before
An 18-year-old can do that. Ask a 1-year-old to identify 50 different objects and it won't work, even though that 1-year-old has trained continuously on thousands of images during their first year of life. Of course you were not talking about training a 1-year-old but an adult, and that's why you can't really compare. In order to be an adult you first need to be a 1-year-old: you need to watch the world for thousands of days before you get the "pretraining" that makes adults able to handle all these tasks more easily than most models.
>our brains have way more compute than any model
That's not as well established as many people might think. We want models to do what an 18-year-old can do, yet no deep learning model has been trained on real-world interactions for 18 years.
IntelArtiGen t1_ivfxox3 wrote
Reply to [D] At what tasks are models better than humans given the same amount of data? by billjames1685
For many tasks you can't really compare, because we are fed multiple types of raw data continuously, while most models train on one specific type of data coming from one clean dataset.
>we can distinguish, say, dogs from cats given only a couple input examples.
After we've seen billions of images over months/years of life. We had a very large and long "pretraining" before being able to perform "complex" tasks. So it depends on what you compare: most models need less data, but they train on a cleaner dataset with architectures already optimized for that specific task.
IntelArtiGen t1_iutivin wrote
Reply to [P] Implementation of MagicMix from ByteDance researchers, - New way to interpolate concepts with much more natural, geometric coherency (implemented with Stable Diffusion!) by cloneofsimo
Thanks for this implem, I'll try it out!
IntelArtiGen t1_iu24fuh wrote
I'm not sure it would really learn something from the input if you don't define a more useful task. How would this model penalize a "collapse" situation where, for example, both models always predict 0 or some other constant value?
Contrastive learning algorithms build two embeddings for different parts (or views) of the same input, penalize collapse, and train the model to make those two embeddings as close as possible because they come from the same input, even if they are from different parts of it. It looks a bit like what you describe, but I don't know of an implementation that does exactly what you said.
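As a minimal sketch of that idea, here's a SimCLR-style InfoNCE loss: the two embeddings of the same input are pulled together, and the other samples in the batch act as negatives, which is what prevents collapse. This is one common formulation, not the exact method of any specific paper or repo, and the sizes are made up:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are two views/parts of the same input, encoded into embeddings."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positives are on the diagonal
    return F.cross_entropy(logits, targets) # off-diagonal pairs act as negatives

# in a real setup these would come from the same encoder applied to two views of each sample
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(info_nce(z1, z2))
```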
IntelArtiGen t1_ir88qh5 wrote
>While our internal testing suggest much of explicit and violent content can be filtered out, there still exists social biases and stereotypes which are challenging to detect and filter. We have decided not to release the Imagen Video model or its source code until these concerns are mitigated.
I think they'll never be mitigated, and we'll have to wait for other people to reproduce the results and make them open source.
IntelArtiGen t1_iqv7vu7 wrote
Reply to comment by sharp7 in [P] Small problems to test out transformers? by sharp7
The task is described in the paper I linked (3.1, Task #1: Masked LM). Any implementation of BERT should use it, like this one.
IntelArtiGen t1_iqu74lu wrote
Reply to [P] Small problems to test out transformers? by sharp7
Transformers like the one in BERT already have defined tasks to train on without labels. You can use a corpus like Universal Dependencies if you want to predict labels on words / sentences, but you can also just take any text and do tasks like "predict hidden words" or "predict the next sentence", the way they are defined here: https://arxiv.org/pdf/1810.04805.pdf, or any other way as long as it makes sense for the network. You can also use OPUS if you want to try translating sentences with the full encoder-decoder architecture of the Transformer.
You probably don't need a high-end GPU to train a small transformer on a small corpus. I trained a basic transformer in 30 min with an RTX 2070 Super on Europarl with just the masked word prediction task. If you don't have a GPU it'll be harder, though; I've never tried to train a very small Transformer, so I don't know how they scale. I guess you could try to predict masked words with ~100 sentences and a very small transformer and train that model on a CPU.
If you're only testing the architecture of the transformer and not the embeddings, you can start the model from pretrained embeddings; it should speed up training a lot.
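If it helps, here's a rough sketch of what the masked word prediction setup looks like with a tiny transformer in PyTorch; the vocabulary size, mask id, layer sizes and masking rate are all made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 1000, 128, 0   # toy vocabulary, [MASK] token id assumed to be 0

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
to_vocab = nn.Linear(d_model, vocab_size)

tokens = torch.randint(1, vocab_size, (8, 20))   # batch of 8 "sentences", 20 tokens each
masked = tokens.clone()
mask = torch.rand(tokens.shape) < 0.15           # hide ~15% of tokens
masked[mask] = mask_id

logits = to_vocab(encoder(embed(masked)))        # predict a token at every position
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask]                   # but only score the hidden positions
)
loss.backward()
```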
IntelArtiGen t1_izqc26r wrote
Reply to comment by Ananth_A_007 in [D] When to use 1x1 convolution by Ananth_A_007
The information you have before a layer is conditioned by how it flows into that layer. At first, what goes into the layer is noise; the weights then change based on the loss so that the information flowing through the layer reduces the loss and becomes something meaningful.
So the question would be: is it better for information processing in the network to compare the 2x2 values and take the max, or is it better to train the network to put the right information into one of the 2x2 positions and always keep that one?
I think the answer depends on the dataset, the model and the training process.
And I think the point of that layer isn't necessarily to look at everything, but just to shrink the spatial dimensions without losing too much information. Perhaps looking at everything isn't required to keep enough information.
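For reference, the two options look like this in PyTorch (made-up sizes): both halve the spatial resolution, but the strided 1x1 has learnable weights while max pooling is a fixed rule.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

pool = nn.MaxPool2d(kernel_size=2)                 # fixed rule: keep the max of each 2x2 window
conv = nn.Conv2d(64, 64, kernel_size=1, stride=2)  # learned: re-weight channels, keep 1 of every 4 positions

print(pool(x).shape, conv(x).shape)                # both give torch.Size([1, 64, 16, 16])
```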