Submitted by soupstock123 t3_106zlpz in deeplearning

Hi there, I'm building a machine for deep learning and ML work, and I wanted some critique on my build. The target is 4x 3090s, but right now I'm just trying to decide on the CPU and the motherboard. There are a few other options I considered, and these were my thoughts on each. Let me know if there's some flaw in my thinking.

  1. AMD Threadripper 3:
  • expensive chip and mobo
  • end of life already and prices still haven't gone down much on these lol
  • 64 PCIe4 lanes so def enough lanes
  2. Intel i9-10980XE and an X299 motherboard:
  • 48 PCIe3 lanes, enough for 4 GPUs
  • kinda old, and a slight premium for the X299 chipset

In the end, I decided to do this build: https://ca.pcpartpicker.com/list/Vmyvtn https://ca.pcpartpicker.com/list/vGkhwc

~~- AMD Ryzen 9 7950X:

  • 16 PCIe5 lanes,
  • with 4 GPUs that's 4 PCIe5 lanes per GPU~~

I'm wondering what your opinion is on my build. Yes, there are only 16 lanes, but they're PCIe5, and 4 lanes of PCIe5 equal 8 lanes of PCIe4, so in theory it should be fine, right? For the case, I'm planning on just using a mining rig frame and putting everything on there for now. Future plans would be to waterblock everything and get a nice case.
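Here's the back-of-the-napkin lane math I'm going off (a rough sketch; the per-lane figures are the usual approximate per-direction bandwidths for each PCIe generation):

```python
# Approximate per-lane, per-direction PCIe bandwidth in GB/s by generation.
PCIE_GBPS_PER_LANE = {3: 1.0, 4: 2.0, 5: 4.0}

def link_bandwidth(gen: int, lanes: int) -> float:
    """Rough one-direction bandwidth of a PCIe link in GB/s."""
    return PCIE_GBPS_PER_LANE[gen] * lanes

print(link_bandwidth(5, 4))  # x4 Gen5 per GPU -> ~16 GB/s
print(link_bandwidth(4, 8))  # x8 Gen4 -> ~16 GB/s, the equivalence I'm counting on
```

One catch I'm aware of: the 3090 itself is a PCIe 4.0 card, so an x4 link would negotiate Gen4 x4 (~8 GB/s) no matter what the slot supports.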

Edit: After reviewing some of the comments, I've decided to get a Threadripper 3960X and an ASRock TRX40 Creator for the mobo.

Also, a question about RAM speed: the ASRock Creator supports DDR4 speeds up to 4666, but is there a need to go that high? I'm planning on 128GB of RAM, and higher speeds are definitely more expensive. Is there a sweet spot of cost/perf, or does RAM speed not even matter for deep learning?

Some things I learned: check the bifurcation/division of lanes on the PCIe slots on the mobo; even if the processor has enough lanes, the mobo might not split them ideally.

7

Comments


rikonaka t1_j3jucus wrote

I think the Threadripper 5990X is better, and for the motherboard you can use a Supermicro server board. I have a machine with an AMD 5950X and a 3090 Ti on an X570 motherboard; if you're running 4x 3090s, it's best to use a server motherboard, for stability and performance.😉

2

rikonaka t1_j3jvvne wrote

I read your shopping list, and there are two problems: the motherboard and the power supply. One 3090 draws 350 watts, so four draw 1400 watts, and your power supply should be at least 2000 watts (the exact calculation can be done once the CPU is decided). The problem with the motherboard is that the B650 does not support four 3090s; it only has two video card slots.😉

1

soupstock123 OP t1_j3jx9ko wrote

Yeah, there's no way to add two PSUs on PCPartPicker, so that's meant to be 2 of the 1000W ones.

The B650 supports 4. It has enough slots. Blocking isn't an issue because I'm going to be using GPU risers to fit the 4 GPUs.

To respond to your first comment, the Threadripper is also very expensive, and I'm waiting until Sept 2023, when Threadripper 7000 comes out and drops prices for Threadrippers.

1

rikonaka t1_j3jyzos wrote

Well, I'm not sure how feasible it is to run two power supplies in one host 😂, and the motherboard is still a problem. I can't comment on the stability of connecting four 3090s through graphics card risers (because I haven't done it myself). I think you should consider your plan carefully; the cost of trial and error is not low.

1

hjups22 t1_j3k2kei wrote

What is the intended use case for the GPUs? I presume you intend to train networks, but which kind and at what scale? Many small models, or one big model at a time?

Or, if you are doing inference, what types of models do you intend to run?

The configuration you suggested is really only good for training / inferencing many small models in parallel, and will not be performant for anything that uses more than 2 GPUs via NVLink.
Also don't forget about system RAM... depending on the models, you may need ~1.5x the total VRAM capacity in system RAM, and deepspeed requires a lot more than that (upwards of 4x) - I would probably go with at least 128GB for the setup you described.
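As a rough sketch of that sizing for the rig you described (the 1.5x/4x factors are just the heuristics above; 24GB is the 3090's VRAM):

```python
# Back-of-the-envelope system RAM sizing from the rule of thumb above.
vram_per_gpu_gb = 24                 # RTX 3090
total_vram_gb = 4 * vram_per_gpu_gb  # 96 GB across four cards
print(1.5 * total_vram_gb)           # ~144 GB for ordinary training
print(4 * total_vram_gb)             # ~384 GB if deepspeed offloads heavily
```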

6

emanresuymsseug t1_j3kdzbv wrote

> - with 4 GPUs that's 4 PCIe5 lanes per GPU

With the Asus PRIME B650M-A AX you are looking at 16 lanes for 1 GPU and 1 lane each for the other 3 GPUs.

PCIEX16_2, PCIEX16_3 and PCIEX16_4 slots are electrically connected in x1 mode.

Bifurcation is only supported via PCIEX16_1 slot.
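If you want to verify what each card actually negotiates once it's installed, a sketch like this (using the pynvml bindings - `pip install nvidia-ml-py` - so it assumes an NVIDIA driver is present) reads back the live link per GPU; a card in one of those x1 slots will report x1 here:

```python
# Sketch: query the negotiated PCIe link generation/width per GPU via NVML.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        print(f"GPU {i}: PCIe Gen{gen} x{width}")
finally:
    pynvml.nvmlShutdown()
```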

4

qiltb t1_j3kjvki wrote

It actually works very well with an ADD2PSU connector (I used like 5 PSUs for one 14x 3090 rig). He should actually think more about getting a 1600W HIGH QUALITY PSU.

The Corsair RM series IS NOT SUITABLE for the workload you are looking at. Preferably use the AXi series, or the HXi if you really want to cheap out. We are talking about really abusing these PSUs. The AX1600i is still unmatched for this use case.

1

Volhn t1_j3kx4ut wrote

Just get a single 3090 and the 7950X or a 13th-gen Intel chip, then put the rest of what you would have spent toward renting bigger GPUs in the cloud.

1

VinnyVeritas t1_j3l0gqt wrote

I don't know if that's going to work well with only 16 PCIe lanes; everyone I've seen here building 4-GPU machines uses CPUs that have 48 or 64 PCIe lanes.

Also, you'll need a lot of watts to power that monster, not to mention a 10-20% margin if you don't want to fry the PSU.

1

soupstock123 OP t1_j3l0lhk wrote

Thanks for the advice. Can you elaborate on why the Corsair RM series is not suitable for the workload? My rationale was that because it's an open-air mining frame instead of a case, I wanted the RM series, which is supposedly quieter.

1

soupstock123 OP t1_j3l0q8f wrote

Yeah, that's basically what I've discovered too. The mobo with the 16 PCIe lanes isn't going to work out, so I've changed my build to a Threadripper. Any advice or suggestions for a PSU that can handle the workload?

1

hjups22 t1_j3l1l6n wrote

That information is very outdated, and also not very relevant...
The 3090 is an Ampere card with 2x faster NVLink, which gives it a significant speed advantage over the older GPUs. I'm not aware of any benchmarks that explicitly tested this, though.

Also, Puget benchmarked what I would consider "small" models. If the model is small enough, the interconnect won't really matter all that much, since you'll spend more time in comm setup than in transfer.
But for the bigger models, you'd better bet it matters!
Although, to be fair, my original statement is based on a node with 4x A6000 GPUs in a pair-wise NVLink configuration. When you jump from 2 paired GPUs to 4 GPUs with batch-parallelism, the training throughput (for big models, ones which barely fit in a 3090) only increases by about 20% rather than the expected 80%.
It's possible that the same scaling won't be seen on 3090s, but I would expect it to be worse in the system the OP described, since that 4x A6000 system allocated a full 16 lanes to each GPU via dual sockets.

Note that this is why I asked about the type of training being done: if the models are small enough (like ResNet-50), then it won't matter - though ResNet-50 training is pretty quick and won't really benefit that much from multiple GPUs in the grand scheme of things.

4

soupstock123 OP t1_j3l2srl wrote

Right now mostly CNNs, RNNs, and playing around with style transfer using GANs. Future plans include running computer vision models trained on video and testing inference, but I'm still researching how demanding that would be.

1

hjups22 t1_j3l3ln2 wrote

Those are all going to be pretty small models (under 200M parameters), so what I said probably won't apply to you. Although, I would still recommend parallel training rather than trying to link them together (4 GPUs means you can run 4 experiments in parallel - or 8 if you double up on a single GPU).
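A minimal sketch of what I mean by running experiments in parallel - pin each one to its own card via CUDA_VISIBLE_DEVICES (`train.py` and its flag are placeholders for your own script):

```python
# Sketch: one experiment per GPU; train.py / --run-id are placeholder names.
import os
import subprocess

procs = []
for gpu in range(4):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))  # each process sees one card
    procs.append(subprocess.Popen(
        ["python", "train.py", "--run-id", f"exp{gpu}"], env=env))

for p in procs:
    p.wait()  # wait for all four runs to finish
```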

Regarding RAM speed: it has an effect, but it probably won't be all that significant given your planned workload. I recently changed the memory on one of my nodes so that it could train GPT-J (reduced the RAM speed so that I could increase the capacity); the speed difference for other tasks is probably within 5%, which I don't think matters (when you expect to run month-long experiments, an extra day is irrelevant).

2

qiltb t1_j3l9hll wrote

Under full load, the AXi series is basically silent. But the main reason is that the RM series is not high enough quality to actually sustain that load (even higher-grade PSUs like the EVGA P2 series have problems with the infamous 3090 under DL loads). Also, take a look at my big comment on this reddit post.

1

VinnyVeritas t1_j3ng2u9 wrote

Do you have some numbers or a link? All the benchmarks I've seen point to the contrary. I'm happy to update my opinion if things have changed and there's data to support it.

1

VinnyVeritas t1_j3nh3g4 wrote

I suppose one PSU will take care of the motherboard + CPU + some GPUs, and the other one will take care of the remaining GPUs.

So if you get 4x 3090, that's 350W x 4 = 1400W just for the GPUs, plus 300 watts for the CPU, plus powering the rest of the components, drives, etc. So let's say we round that up to 2000W, then add at least a 10% margin: 2200W total.

So maybe a 1600W PSU for the mobo and some GPUs, and another 1000W or more for the remaining GPUs. Note: if you go with the 3090 Ti, it's more like 450-500W per card, so you have to do the maths.

Or if you want to go future-proof, just put in two 1600W PSUs; then you can swap your 3090s for 4090s down the line and not worry about upgrading PSUs.
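Same maths as a script, if you want to plug in your own numbers (the wattages are the rough estimates from above, with ~300W as the round-up for drives, fans, and the rest of the system):

```python
# PSU sizing from the rough figures in this thread.
def psu_watts(gpu_w: float, n_gpus: int = 4, cpu_w: float = 300,
              rest_w: float = 300, margin: float = 0.10) -> float:
    """Estimated total draw plus a safety margin so the PSUs aren't pegged at 100%."""
    return (gpu_w * n_gpus + cpu_w + rest_w) * (1 + margin)

print(psu_watts(350))  # 4x 3090    -> 2200 W
print(psu_watts(475))  # 4x 3090 Ti -> 2750 W
```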

1

hjups22 t1_j3nqeim wrote

Then I agree. If you're doing ResNet inference on 8K images, it will probably be quite slow. However, 8K segmentation will probably be even slower (that was the point of comparison I had in mind).
Also, once you get to large images, I suspect PCIe will become a bottleneck (sending data to the GPUs), which will not be helped by the setup the OP described.

1

qiltb t1_j3o7ull wrote

I actually assumed you'd be using 2 PSUs. For the fewest problems, buy 2x AX1600i; for a cheaper option, buy 2x AX1200i. One PSU is actually the worst case, but yeah, you can try with a single SFL 2000.

1

VinnyVeritas t1_j3rrzvr wrote

Actually, I've been sort of looking at ML computers (kind of browsing and dreaming that one day I'd have one, but it's always going to be out of my means and needs anyway). Anyway, they can put two PSUs in a box; obviously these are made by companies, so the total cost is two or three times the cost of the parts alone (i.e., building it yourself would be 2-3x cheaper), but it could inspire you for picking your parts.

https://bizon-tech.com/amd-ryzen-threadripper-up-to-64-cores-workstation-pc

https://shop.lambdalabs.com/gpu-workstations/vector/customize

1