Recent comments in /f/deeplearning

thesupernoodle t1_jcsj2iw wrote

For maybe a few hundred bucks, you can test out the exact configurations you want to buy:

https://lambdalabs.com/service/gpu-cloud

You may even decide that you'd rather just use cloud compute, as opposed to spending all that money up front. It would only cost you about $19K to run 2xA100 in the cloud 24/7 for a solid year, and that figure effectively includes electricity costs.
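Back-of-the-envelope check on that figure, assuming an on-demand rate of roughly $1.10 per A100 per hour (my assumption here; check current pricing):

```python
# Rough annual cost of renting 2xA100 around the clock.
hourly_rate_per_gpu = 1.10   # assumed USD/hr per A100; rates vary
num_gpus = 2
hours_per_year = 24 * 365

annual_cost = hourly_rate_per_gpu * num_gpus * hours_per_year
print(f"${annual_cost:,.0f}")  # → $19,272
```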

9

Immarhinocerous t1_jcsdckk wrote

As someone who's mostly self-taught (I have a BSc, but it's in health sciences), I followed a similar route to what they recommended. It gives you:

  • income,

  • experience working in software development - you will hopefully learn a lot from this,

  • exposure to co-workers who you may be able to learn from, especially if your backend role is at a place doing ML.

With income, you can also afford to take more courses. Even if it's only the occasional weekend course, or something you work on a few nights a week, it can help you expand your skillset while gaining other practical skills (backend work with APIs, DBs, cloud infrastructure, etc are all useful).

After doing that a while, you may be able to land a more focused ML role, or do a master's program (which, combined with your SWE experience, will give you a leg up on landing the role you want). If you want to go straight into an ML role after SWE, you will definitely need project experience. But you can build that while working, if you're up for it.

One of the best ML people I know has a maths background, works in risk/finance, and is basically entirely self-taught. But the guy is brilliant and insanely passionate about what he does. I just mention him to show that you don't absolutely need to go the master's route. But it could be worthwhile when you can afford it, especially if you're lacking in maths.

3

alki284 t1_jcpxhj6 wrote

Frankly, you’ll struggle. A lot of junior ML positions require a master's degree as a minimum. But beyond that, what is your ML background like? Are you comfortable with the maths? Do you have side projects to showcase your skills and knowledge?

If you don’t have a background, then going into a backend SWE role and transitioning after a couple of years is also a viable path. You can try to get into ‘ML adjacent’ type roles and gain experience from there.

7

hijacked_mojo t1_jcpstsu wrote

Yes, you have the right idea but also add this to your mental model: the queries and values are influenced by their *own* set of weights. So it's not only the keys getting modified, but also queries and values.

In other words, the query, key and value weights all get adjusted via backprop to minimize the error. So it's entirely possible that during backprop the value weights get modified a lot (for example) while the key weights barely change.

It's all about giving the network the "freedom" to adjust itself to best minimize the error.
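If it helps, here's a minimal NumPy sketch of a single attention head (my own toy example, no training loop), just to make the "own set of weights" point concrete: Q, K and V each come from a separate trainable matrix, so backprop can adjust each one independently.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, seq_len = 8, 4
X = rng.normal(size=(seq_len, d_model))    # token embeddings

W_q = rng.normal(size=(d_model, d_model))  # query weights
W_k = rng.normal(size=(d_model, d_model))  # key weights (separate matrix)
W_v = rng.normal(size=(d_model, d_model))  # value weights (separate matrix)

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)        # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V                       # each row: weighted mix of values

print(weights.shape, output.shape)         # (4, 4) (4, 8)
```

During training, gradients flow back into W_q, W_k and W_v separately, which is exactly the "freedom" I'm talking about.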

1

FunQuarter3511 OP t1_jcpstej wrote

Reply to comment by p0p4ks in Question on Attention by FunQuarter3511

Fully agree!

I think my issue was that, because of the terms query, key, and value, I was trying to relate them to a database or hash table. But in reality, those terms seem to be misnomers, and backprop will set the key/query weights to whatever is needed such that the dot product for important context is large and gets weighted appropriately.

I was overcomplicating it.

1

p0p4ks t1_jcppzf4 wrote

I get these confusions all the time, but then I remember we are backpropagating the errors. Imagine your case happening and the model output being incorrect: backprop will take care of fixing the key value that is too big or small and correct the output.

1

FunQuarter3511 OP t1_jcpmkyt wrote

>I have a video that goes through everything

First off, this video is amazing! You definitely have a new subscriber in me and I will be sharing! Hope you keep making content!!

So I was originally thinking about this like a python dictionary/hash table where you have keys and values, and you retrieve values when the "query" = key.

Rather, what is happening here is that the "loudest" (by magnitude) key is expected to get the highest weight. This is okay, because the key/query (and value) weight matrices are learned anyway, so during backprop the most important key will just learn to be louder (in addition to the value weight matrix learning as well).

In essence, the python dictionary is just the wrong analogy here. We are not necessarily giving greater weight to key/query pairs that happen to be similar; rather, the network learns to make the most important keys large.
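To sketch the contrast I mean (my own toy example, nothing canonical): a dict does a hard, exact-match lookup, while attention does a soft lookup where every value contributes in proportion to softmax(query · key).

```python
import numpy as np

# Hard lookup: exact key match or KeyError.
lookup = {"cat": 1.0, "dog": 2.0}
hard = lookup["cat"]

# Soft lookup: all values blended by softmax of dot products.
query = np.array([1.0, 0.0])
keys = np.array([[0.9, 0.1],   # "loud" key, well aligned with the query
                 [0.1, 0.9]])
values = np.array([1.0, 2.0])

scores = keys @ query                      # dot products: [0.9, 0.1]
w = np.exp(scores) / np.exp(scores).sum()  # softmax weights
soft = w @ values                          # blend, dominated by key 0

print(hard, soft)
```

The soft result lands near 1.0 (the value under the well-aligned key) but never exactly on it, because every value contributes a little.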

Does that sound right?

1