
ReginaldIII t1_j054n38 wrote

Please read my updated comment.

> I think the use case here is pretty obvious

With the greatest of respect, I don't.

> and I tried to just give some basic examples but I’m certainly not an expert and have not been involved in the types of trouble shooting required to get something like this working.

Also with the greatest of respect, I am an expert in this area, and have also worked with blockchains extensively.

I do not think blockchain is a "stream of buzzwords". I think it is the wrong tool to solve "this" problem.

2

ReginaldIII t1_j053c4i wrote

I earnestly believe it solves problems that are described with similar words. But it just does not offer a practical solution to this problem.

We can't put the returned values on the blockchain. It just isn't possible to store them: they are too big and there are too many of them. And there is no reason to store them; we only want to pass them on to the next worker or workers that immediately need them. What we do care about is fault tolerance, making sure they actually reach their destination.

So there's no way for this pool of blockchain nodes to form a consensus over the returned values being "correct" like this. We can't put the relevant information on the blockchain to allow it to be compared.

What you end up with is just a classic non-blockchain vote-by-agreement system between workers of unknown trustworthiness. No blockchain needed.

You are correct that voting by consensus is needed, you just don't need all the rest of the things that turn that into a blockchain.
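
To make that concrete, something like this toy sketch is all the "consensus" the problem actually needs. Everything in it is illustrative, and there is nothing blockchain-shaped anywhere:

```python
import hashlib
from collections import Counter

# Toy sketch of plain vote-by-agreement: dispatch the same chunk to
# several workers, fingerprint each returned result, and accept the
# most common one. No blockchain anywhere.
def majority_result(results: list[bytes]) -> bytes:
    fingerprints = [hashlib.sha256(r).hexdigest() for r in results]
    winner, _votes = Counter(fingerprints).most_common(1)[0]
    # Return one of the raw results matching the winning fingerprint.
    return next(r for r in results
                if hashlib.sha256(r).hexdigest() == winner)
```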

6

ReginaldIII t1_j050qmd wrote

Explain to me the mechanism by which you would encode the "correctness" of a result as a transaction, or even a smart contract, on an idealized blockchain.

> Blockchain technology would absolutely accomplish the issue of trusting your workers. Why else do people invest millions in mining rigs? Because of a system of decentralized trust built on the blockchain, where they won’t gain any benefit from trying to create fake/malicious blocks.

We are talking about "trusting" fundamentally different things. A blockchain would be able to encode that at a specific point in time a worker going by some name returned something. It would be immutably stored in the blockchain, such that in the future we can look back and say "Yes, at that specific point in time a worker going by that name returned something".

And that tells us nothing about whether that worker returned the "correct" result, or a manipulated one.

I am talking about the case where the worker has returned a value it proposes as the result, and we care about having a mechanism to trust that the value itself is "correct", and therefore that the worker has, at least this time, acted in a trustworthy fashion.

So if I am missing something, please, explain to me the mechanism by which you would encode the "correctness" of a set of activations and gradients for a chunk of work on a blockchain?
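
For contrast, the off-chain mechanism I have in mind is just direct comparison of redundant workers' outputs. One wrinkle (my assumption, not something anyone here has described) is that GPU floating-point reductions are not always bit-identical, so "agreement" has to be a tolerance check rather than an exact hash:

```python
import numpy as np

# Hypothetical sketch: two workers "agree" on a chunk's activations or
# gradients if the tensors match within a floating-point tolerance.
# Exact hashing can fail because GPU reductions are not bit-deterministic.
def results_agree(a: np.ndarray, b: np.ndarray,
                  rtol: float = 1e-4, atol: float = 1e-6) -> bool:
    return a.shape == b.shape and np.allclose(a, b, rtol=rtol, atol=atol)
```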

5

ReginaldIII t1_j04y036 wrote

That's what I mean by malicious clients.

You'd be relying on the malicious client to self-report the git hash of the code it is "running". It can just lie to you.

The only defence is duplicating each computation across multiple workers in the pool and having them compare results: most common result wins.

0

ReginaldIII t1_j04vb36 wrote

None of the issues I have raised can be solved by using blockchains.

We don't need to prove immutably that every chunk of work was processed for later auditing. We need to make sure every chunk gets processed "right now" as it is happening.

Blockchains do not present a solution to fault tolerance, they present a solution to auditing.

Blockchains also don't present a solution to trustworthiness here. Just as a wallet appearing in a transaction on the blockchain says nothing about the real identity of the parties, it also says nothing about whether the goods or services the transaction was for were delivered honestly.

Chunks of work encoded on the blockchain would tell you nothing about whether the activations and gradients computed were correct or manipulated; it would only tell you that they had in fact happened.

7

ReginaldIII t1_j04del0 wrote

Could you help me understand the split labels?

What specifically do you mean by "Offloading on 1x A100"? Do you mean each chunk of forward-pass work is dispatched locally to a single GPU in sequence, but without the overhead of going through the full Petals stack?

Is there a difference between "Petals on 3 physical servers" and "Petals on 14 real servers" other than the number?

And what do you mean by "Petals on 12 virtual servers, simulated on 3x A100", and by "Same, but with 8 clients running simultaneously"?

Many thanks :)

1

ReginaldIII t1_j03wlpe wrote

Awesome, thanks for the details!

I like your reputation scaling idea, although dynamic reputation/trust scaling can be tricky to implement nicely in practice, so I don't envy the task.
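
For what it's worth, the version I would naively reach for (a toy formulation on my part, not your actual mechanism) is an exponential moving average over majority-vote agreement:

```python
# Toy reputation update: an exponential moving average of how often a
# worker's result matched the majority vote. Low-reputation workers
# could get their chunks duplicated more aggressively, or be evicted.
def update_reputation(reputation: float, agreed: bool,
                      decay: float = 0.95) -> float:
    return decay * reputation + (1.0 - decay) * (1.0 if agreed else 0.0)
```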

I think vote-by-consensus helps solve the problem, especially when your worker population is large enough that you can duplicate a lot of the work. But that does ultimately limit scaling efficiency as you add worker nodes.
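
Back-of-envelope version of that limit: with N workers and every chunk computed r times for the vote, useful throughput scales like N/r, so e.g. 100 workers with 3x duplication only produce about 33 workers' worth of unique results.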

Can I ask: have you done any scaling experiments for large models, measuring samples per second or training steps per second with an increasing number of workers, compared to the gold-standard environment of a proper HPC cluster running MPI for communication? And also against existing Federated and Split Learning systems?

I realize a crowd-structured compute environment is not aiming to hit the raw performance of those setups, but I think these scalability comparisons would give a strong baseline to compare against, and a way to measure future improvements.

3

ReginaldIII t1_j03sbkj wrote

> Would it be possible to repeat the same training tasks on multiple workers and verify the workers against each other?

That's what I meant here.

>> A nice benefit of building on kafka is that multiple consumers looking at a queue can consume the same messages such that you can get voting by consensus for what the results to be passed on should be.


> OTOH it's more work to create a malicious worker than creating a malicious free LM, no?

Different types of malicious. A malicious worker could leak the data it is passed to someone else, or it could work to destabilize the training, limiting final accuracy or causing overfitting.
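
One generic mitigation for the destabilizing case, independent of anything the Petals authors actually do, is robust aggregation of the replicated results, e.g. a coordinate-wise median instead of a mean:

```python
import numpy as np

# Generic sketch, not Petals' actual scheme: a coordinate-wise median of
# replicated gradients bounds the influence of a poisoned minority.
def robust_aggregate(gradients: list[np.ndarray]) -> np.ndarray:
    return np.median(np.stack(gradients), axis=0)
```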

If you are a company brokering access to privately trained LLMs and you have the opportunity to prevent a crowdsourced LLM from reaching the same quality as your own, there is an incentive to harm that effort. Corporate espionage is a thing.

There are plenty of ways in which a crowd-computing effort could be misused or attacked.

3

ReginaldIII t1_j02utp9 wrote

I've been looking at heterogeneous compute a lot lately for some tasks related to this sort of problem.

Are you assuming that all of your workers are trustworthy all of the time? Do you have any consideration for bad actors poisoning the training? Or potentially encoding hidden/malicious data or leaking training data out of your computation? I'd be interested to hear what you are doing to mitigate these threats if you are looking at them.

Also, related to trustworthiness, is the question of fault tolerance. What mechanism are you using to pass and buffer chunks of inputs/outputs between workers? Do you ensure every chunk of data eventually gets processed by exactly one worker and the results definitely make it to their destination or is it a bit lossy for the sake of throughput?

I had been looking at chaining workers together using a mixture of local (on-worker) and global (centralized in the cloud) Kafka clusters to ensure every chunk of data does eventually make it through properly and nothing gets lost. A nice benefit of building on Kafka is that multiple consumers looking at a queue can consume the same messages, so you can get voting by consensus on what the results to be passed on should be.
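
Rough shape of what I mean, using the kafka-python client (the topic, group, and broker names here are made up): consumers in *different* consumer groups each see every message, which is what lets several verifiers vote on the same chunk:

```python
from kafka import KafkaConsumer  # kafka-python; client choice is illustrative

# Each verifier runs in its own consumer group, so all of them receive
# every message on the topic and can vote on the same chunk.
consumer = KafkaConsumer(
    "chunk-results",                  # hypothetical topic name
    group_id="verifier-1",            # verifier-2, verifier-3 run identically
    bootstrap_servers="broker:9092",  # hypothetical broker address
    enable_auto_commit=False,         # commit offsets only after verification
)
for message in consumer:
    # ...recompute or compare the chunk here, then cast a vote...
    consumer.commit()
```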

Kafka also really helps with buffering, and with keeping workers available to receive work without worrying that they will drop incoming packets because they were busy at the time.

Interested to hear if you've hit any of these issues! :)

27

ReginaldIII t1_izoi4bq wrote

Nothing wrong with exploring new AI technology. But when you are talking about deploying a system for long-term or widespread use, there is absolutely a point where you should stop to consider the environmental impact.

The hostility from people because they've been asked to even consider the environmental impact is telling.

1

ReginaldIII t1_izoevj0 wrote

Sustainability in ML and HPC is a huge part of my job.

If you don't consider that important and think it's BS, that doesn't actually change the fact that an important part of my job is to consider it.

At no point was I mean to OP. I'm not being mean to a person who is littering by telling them not to litter. And I'm not being mean to a person making and distributing confetti as their hobby by pointing out that it is also littering.

−10

ReginaldIII t1_izo2mww wrote

They also cache heavily. Sustainability is a huge problem in ML and HPC.

In my job I spend a lot of time considering the impact of the compute that we do. It is concerning that the general public don't see how many extra and frivolous compute hours we are burning.

It's one thing to have a short flash of people trying out something new, novel, and exciting. It is another to suggest a tool naively built on top of it with the intention of long-term use and widespread adoption.

The question of the environmental impact is legitimate.

3

ReginaldIII t1_iznxeag wrote

A calculator can be significantly more energy efficient than manual calculations.

Crunching a high-end GPU to essentially perform text spinning on a stack trace is not more efficient than directly interpreting the stack trace.

E: See, this is a weird comment to downvote, because it is literally correct. Some usages of energy provide higher utility than others. Radical idea, I know.

−2