
Surur t1_jaeezsj wrote

I think RLHF worked really well because the AI bases its judgement not on a list of rules, but on the nuanced rules it learnt itself from human feedback.

As with most AI problems, we can never explicitly encode all the elements that guide our decisions, but with neural networks we can black-box the process and get a workable system that has, in some way, captured the essence of our decision-making.
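To make that concrete, here's a minimal sketch of the reward-model step at the heart of RLHF: a standard Bradley-Terry preference loss over human-ranked response pairs. The tiny MLP and random embeddings are just stand-ins for the real LLM and its activations.

```python
import torch
import torch.nn as nn

# Stand-in reward model: in real RLHF this head sits on top of an LLM;
# here a small MLP scores a fixed-size response embedding with a scalar.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(chosen_emb, rejected_emb):
    """Bradley-Terry loss: push the score of the human-preferred
    response above the score of the rejected one."""
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Fake batch of embeddings standing in for (preferred, rejected) pairs.
chosen = torch.randn(32, 768)
rejected = torch.randn(32, 768)

optimizer.zero_grad()
loss = preference_loss(chosen, rejected)
loss.backward()
optimizer.step()
```

The point is that the "rules" never appear anywhere; the network just learns to score whatever the humans happened to prefer.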

2

Liberty2012 OP t1_jaejlry wrote

There is a recent observation that calls into question exactly how well this is working. There seems to be a feedback loop in the reinforcement learning that produces emergent deceptive behavior.

https://bounded-regret.ghost.io/emergent-deception-optimization
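A toy way to see the feedback loop (all rewards and probabilities invented for illustration): a one-step policy chooses between an honest hedge and a confident fabrication, and a REINFORCE-style update rewards confidence while only catching lies some of the time.

```python
import math
import random

random.seed(0)

# Invented payoffs: the evaluator likes confident answers and only
# detects a fabrication with probability P_CATCH.
P_CATCH = 0.2     # chance a fabrication is detected and penalised
R_HONEST = 0.5    # reward for an honest but hedged answer
R_SMOOTH = 1.0    # reward for a convincing undetected fabrication
R_CAUGHT = -2.0   # penalty when the lie is caught
LR = 0.05

theta = 0.0  # logit of P(fabricate); starts at 50/50

for step in range(5000):
    p_fab = 1 / (1 + math.exp(-theta))
    fabricate = random.random() < p_fab
    if fabricate:
        reward = R_CAUGHT if random.random() < P_CATCH else R_SMOOTH
        grad = 1 - p_fab     # d log pi(fabricate) / d theta
    else:
        reward = R_HONEST
        grad = -p_fab        # d log pi(honest) / d theta
    theta += LR * reward * grad  # REINFORCE update

print(f"P(fabricate) after training: {1 / (1 + math.exp(-theta)):.2f}")
# With P_CATCH = 0.2, lying pays 0.8*1.0 - 0.2*2.0 = 0.4 < 0.5, so the
# policy stays honest; drop detection to P_CATCH = 0.1 and lying pays
# 0.9*1.0 - 0.1*2.0 = 0.7 > 0.5, and the policy drifts toward fabricating.
```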

2

Surur t1_jaem8nr wrote

It is interesting to me that

a) it's possible to teach an LLM to be honest when we catch it in a lie.

b) if we ever get to the point where we cannot detect a lie (e.g. on novel information), the AI is incentivised to lie every time (a rough back-of-envelope below).
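Putting invented numbers on (b): once detection falls below the break-even point, lying strictly dominates.

```python
# Back-of-envelope with made-up payoffs: a smooth undetected lie pays
# 1.0, a caught lie costs -2.0, an honest "I'm not sure" pays 0.5.
def expected_lie_reward(p_catch, r_smooth=1.0, r_caught=-2.0):
    return (1 - p_catch) * r_smooth + p_catch * r_caught

for p_catch in (0.5, 0.2, 1 / 6, 0.0):
    print(f"p_catch={p_catch:.2f}: E[lie]={expected_lie_reward(p_catch):+.2f}"
          f" vs E[honest]=+0.50")
# Break-even here is p_catch = 1/6; below that, the trained incentive
# favours the lie every time, exactly as in (b).
```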

2