andreichiffa

andreichiffa t1_jdzcbln wrote

Depends on which hardware you have. A rule of thumb is that if you want to be efficient, you need about 3x the model size in VRAM to store the optimizer state, plus some headroom for data.

You also need to use full floats for training, due to stability issues, so unless your GPU supports float8, double the RAM estimate.
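To make the rule of thumb concrete, here is a minimal back-of-the-envelope sketch (a rough illustration only: the 3x multiplier comes from above, the 15% headroom is an assumption, and tricks like 8-bit optimizers, gradient checkpointing or LoRA can bring the numbers down a lot):

```python
def rough_training_vram_gb(n_params_billion, bytes_per_param=4, headroom=1.15):
    """Back-of-the-envelope VRAM estimate for full fine-tuning.

    Applies the rule of thumb above: ~3x the model's own footprint
    (weights + optimizer state), plus some headroom for data/activations.
    bytes_per_param: 4 for float32, 2 for fp16/bf16.
    """
    model_gb = n_params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return 3 * model_gb * headroom

for size_b in (2.7, 6.7):
    print(f"{size_b}B params: ~{rough_training_vram_gb(size_b):.0f} GB at float32, "
          f"~{rough_training_vram_gb(size_b, bytes_per_param=2):.0f} GB at fp16/bf16")
```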

Realistically, if you have an RTX 4090, you can go up to 6-7B models (BLOOM, GPT-J, …). Anything below that, and I would aim at 2.7B models (GPT-Neo).

I would avoid the LLaMA family because of how you get access to the pretrained weights (liability concerns) and stay with FOSS. In the latter case you can contribute back and gain some visibility that way, assuming you want any.

4

andreichiffa t1_jaa3v5s wrote

Based on some of the comments over on /r/ChatGPT asking to remove the disclaimers while people teach themselves plumbing, HVAC and electrical work with ChatGPT, we are a couple of lawsuits away from OpenAI and MS actually creating a GPT certification, with workplaces requiring it to interact with LLMs and insurers refusing claims resulting from uncertified ChatGPT interactions.

1

andreichiffa t1_j9t35a6 wrote

No. As a matter of fact, I consider it harmful, and I am far from being alone in that regard.

What you need to understand is that AI* kills already. Not only military/law-enforcement AI that misidentifies people and leads to them being killed / searched & killed / poisoned & killed in prison, but also the types of AI that you interact with on a daily basis. Recommendation algorithms that promote disinformation about vaccine safety and COVID risk killed hundreds of thousands. Medical AIs that are unable to identify sepsis in 70% of cases, but are widely used and override doctors in hospitals, have killed thousands. Tesla Autopilot AIs that kill their passengers on a regular basis. Conversational LLM agents that will tell users how to do electrical work and kill them in the process.

But here is the thing. Working on the safety of such AIs leads to a conflict - with the engineers and researchers developing them, with the execs who greenlit them, with the influencers who touted them, with the stakeholders who were getting money from the additional sales the AI feature generated. So safety and QA teams get fired, donations get made to universities to get rid of particularly vocal critics of the current state of affairs, Google de-indexes their work and Facebook randomly and accidentally deletes their posts (Bengio vs LeCun circa 2019, I believe, and the reason the latter moved to Twitter).

The problem with the super-human AGI folks (and generally the longtermism/EA crowd, to which Eliezer Yudkowsky belongs) is that they claim none of those problems matter, because if SH-AGI arises, if it decides to meddle in human affairs, if we don't have enclaves free from it, and even if it only occurs in 100 years, it will be so bad that it makes everything else irrelevant.

That's a lot of "ifs". And a long timeline. And there are pretty good theoretical reasons to believe that even when SH-AGI arises, its capabilities would not be as extensive as the EA crowd claims (impossibility theorems and Solomonoff computability bounds with respect to energy and memory). And then there are theoretical guarantees as to why we won't be able to prevent it even if it started to emerge now (Gödel's incompleteness).

But in principle - yeah, sure, why not; you never know if something interesting pops up along the way.

The problem is that, in the way it is currently formulated and advertised, it hits the cultural memes (HAL, A.I., …) and the type-A personalities of younger engineers and researchers (work on the **most important** problem, likely to make you **most famous**) in a way that completely drowns out the problems with AI that are already here - from both the general public's and the engineers' perspective.

It is perhaps not a coincidence that a lot of entities that stand to lose reputation/income from in-depth looks into current AIs' safety and alignment are donating quite a lot to EA/longtermism and lending them some of their own credibility.

*To avoid sterile semantic debates: to me, an AI is any non-explicitly-coded program that makes decisions on its own. Hence LLMs without a sampler are non-AI ML, whereas generative LLMs with a sampler are AI (generative ML).

3

andreichiffa t1_j9fa9kz wrote

It's a grey area.

It's not general enough to warrant a full research paper, but on the other hand it is the equivalent of an SQL injection through unsanitized input, and it would be reported as a CVE if we were in traditional programming.

I think eventually there will be a database like that, so save the prompt, date and context of the conversation, preferably somewhere that has a timestamp (e.g. a public GitHub repo commit with a PGP signature), so that you can add it once such a system goes live.
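As a minimal sketch of what such a record could look like (the field names and file path are just placeholders; the point is that a content hash plus a signed commit give you a verifiable timestamp):

```python
import hashlib
import json
from datetime import datetime, timezone

# Placeholder record of the prompt, its context, and the date it was found.
record = {
    "prompt": "<the exact prompt that triggered the behaviour>",
    "context": "<model/product, conversation snippet, observed effect>",
    "date": datetime.now(timezone.utc).isoformat(),
}
# A content hash over the canonical JSON makes later tampering detectable.
record["sha256"] = hashlib.sha256(
    json.dumps(record, sort_keys=True).encode()
).hexdigest()

with open("prompt_disclosure.json", "w") as f:
    json.dump(record, f, indent=2)

# Then commit it to a public repo with a signed commit, e.g.:
#   git add prompt_disclosure.json && git commit -S -m "prompt disclosure"
```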

3

andreichiffa t1_j8vqd42 wrote

It’s a RedHat for ML and especially LLMs. You want clean internals and things that just work? You pay the consulting/on-premises fees. In the meantime they are pushing forward FOSS models and supporting sharing and experimentation on established models.

I really don’t think you realize how much worse the domains that don’t have their HuggingFace are doing.

7

andreichiffa t1_j8hf2th wrote

10% is what OpenAI considered "good enough" for theirs, but the problem is that the detection is not uniform. Most neurodivergent folks will be misclassified as generative models, as will people with social anxiety, who tend to be wordy. Non-native and non-fluent English speakers are the other big false-positive trigger.
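To see why the false positives are the real problem, here is a toy base-rate calculation (every number is an illustrative assumption, not a published figure, except the ~10% error level mentioned above):

```python
# Toy base-rate illustration; all values here are assumptions for the sake
# of the example, not measured properties of any particular detector.
false_positive_rate = 0.10  # human text flagged as machine-generated
true_positive_rate = 0.80   # assumed sensitivity, deliberately generous
share_machine = 0.20        # assumed fraction of texts that really are generated

flagged_humans = false_positive_rate * (1 - share_machine)
flagged_machines = true_positive_rate * share_machine
share_wrong = flagged_humans / (flagged_humans + flagged_machines)
print(f"Share of flags that land on humans: {share_wrong:.0%}")  # ~33%
```

And those wrong flags land disproportionately on the groups above.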

1

andreichiffa t1_j8hawd4 wrote

I have reported to Huggingface what its detector was being used for and its failure modes (hint: false positives are worse), in the first days of December. They decided to keep it up. It’s on their conscience.

Same thing with API providers. Those willing to sell you one are selling you snake oil. It’s on their conscience.

Same thing for you. You want to build an app that sells snake oil that can be harmful in a lot of scenarios? It’s on your conscience.

But at that point you don’t even need an API to build it.

1

andreichiffa t1_j7t9ul8 wrote

I am pretty sure that was an Anthropic paper first (Predictability and Surprise in Large Generative Models). Makes me truly wonder WTF exactly is going on in Google lately.

As to your question, no one has stacked enough attention layers yet, but there is a very high probability that someone will. Someone already mentioned the ability to spell, but it could also help with things like hands, the number of hands/feet/legs/arms/paws/tails, and the other details that make a lot of generated images disturbing today.

The issue will most likely be finding enough data, given that, unlike text, most images on the internet are copyrighted (cough Getty cough).

6

andreichiffa t1_j6n9lg6 wrote

A lot of the conclusions from that paper have been called into question by the discovery, a little less than a year later, that GPT-2 was actually memorizing a lot of information from its training dataset: https://arxiv.org/abs/2012.07805

About a year after that, Anthropic came out with a paper suggesting that there were scaling laws under which undertrained larger models did not do that much better and actually did need more data: https://arxiv.org/pdf/2202.07785.pdf

Finally, more recent results from DeepMind did an additional pass on the topic and seem to suggest that the relationship between data and model size is much tighter than anticipated, and that a 4x smaller model trained for 4x as long would outperform the larger model: https://arxiv.org/pdf/2203.15556.pdf
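A minimal sketch of the equal-compute tradeoff that DeepMind paper describes, using the common C ≈ 6·N·D approximation for training compute (the 280B-parameter / 300B-token starting point is a Gopher-like placeholder, not a number taken from the comment above):

```python
def train_flops(n_params, n_tokens):
    """Common C ~= 6 * N * D approximation for dense transformer training compute."""
    return 6 * n_params * n_tokens

# A Gopher-like large model vs. a 4x smaller model trained on 4x the tokens
# (roughly the Chinchilla configuration): the compute budget is identical,
# but the DeepMind paper reports better final loss for the smaller model.
big = train_flops(280e9, 300e9)
small = train_flops(280e9 / 4, 300e9 * 4)
print(f"{big:.2e} FLOPs vs {small:.2e} FLOPs")
```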

Basically, the original OpenAI paper contradicted a lot of prior research on overfitting and generalization, and that seems to be due to an instance of Simpson's paradox in some of the batching they were doing.

1

andreichiffa t1_j6mdm66 wrote

On a very high level, transformer-derived architectures struggle with the concept of reality because they need the distributions in the token embedding space to remain wide. Especially for larger models, the training data is so sparse that without that they would struggle with generalization and exposure bias.

Repeated prompting and prompt optimization can pull elements of the training set back out (in some cases), because in the end they do memorize, but the exact mechanism is not yet clear and cannot be counted on.

You can go around it by adding a « critic » post-processor that classifies whether the model is trying to state a fact, looks it up, and forces a re-generation until the statement is factually correct. This is very close to GeDi, the guided generation introduced by a Salesforce team back in 2020. Given that OpenAI went this route for ChatGPT and InstructGPT to make them less psycho and more useful to end users (+ iterative fine-tuning from users' and the critic model's input), there is a good chance they will go this route as well.
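A minimal sketch of what such a critic loop could look like; `generate`, `contains_factual_claim` and `check_against_source` are hypothetical placeholders for the generator, the claim classifier and the lookup step, not any existing API:

```python
def guided_generate(prompt, generate, contains_factual_claim, check_against_source,
                    max_tries=5):
    """Resample until the draft either makes no factual claim or the claim checks out.

    All three callables are placeholders: generate(prompt) returns a candidate
    completion, contains_factual_claim(text) is a classifier, and
    check_against_source(text) looks the claim up in a trusted source.
    """
    for _ in range(max_tries):
        draft = generate(prompt)
        if not contains_factual_claim(draft) or check_against_source(draft):
            return draft
    # Fall back rather than returning an unverified claim.
    return "Could not produce a verified answer."
```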

You can also add discrete, non-differentiable layers to train the model to distinguish factual statements from the rest of the text and to switch between modes, allowing it to process them differently. However, you lose the nice back-propagation properties and have to do black-box optimization on the discrete layers, which is costly even by LLM standards. That seems to be the Google approach with PaLM.

3

andreichiffa t1_j60625r wrote

So. First of all it’s not the size, or at least not only the size.

Before ChatGPT, OpenAI experimented with InstructGPT, which at 6B parameters completely destroyed the 175B GPT-3 when it came to satisfying the users interacting with it and not being completely psycho.

Code-generating abilities start around 12B parameters (OpenAI Codex), so most of what you are interacting with and are impressed by could be done with a 12B-parameter model. What is really doing the heavy lifting for ChatGPT is the fine-tuning and guided generation that make it conform to users’ expectations.

Now, model size does allow for nice emergent properties, but there is a relationship between dataset size and model size, meaning that without increasing the dataset, bigger models do nothing better. At 175B parameters, GPT-3 was already past that point for the curated dataset OpenAI used for it. And given that their dataset already contained CommonCrawl, that was pretty much all the public writing on the internet.

And they weren’t short by just a bit - by over a factor of 10x. Finding enough data just to finish training GPT-3 properly is already a challenge; larger models would need even more. That’s why they could dump code and more text into GPT-3 to create GPT-3.5 without creating bottlenecks.
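A quick sanity check on that factor, assuming the Chinchilla-style ~20 tokens per parameter rule of thumb and the ~300B tokens reported for the original GPT-3 training run:

```python
n_params = 175e9                 # GPT-3
tokens_seen = 300e9              # ~300B tokens in the original GPT-3 run
tokens_optimal = 20 * n_params   # Chinchilla-style ~20 tokens/parameter rule of thumb

print(f"Compute-optimal: ~{tokens_optimal / 1e12:.1f}T tokens; "
      f"actually seen: ~{tokens_seen / 1e9:.0f}B; "
      f"shortfall: ~{tokens_optimal / tokens_seen:.0f}x")
```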

Now, alternative models to GPT-3 have been trained (OPT-175B or BLOOM), but at least OPT-175B underperforms. OpenAI actually did a lot of data preparation, meaning that anyone who wants to replicate it would need to figure out the “secret sauce” first.

7

andreichiffa t1_j5uczy3 wrote

*LeCun. And their Galactica was the subject of so much ridicule that after a pompous launch it was un-launched 48 hours later. OPT-175B is a clone of OpenAI’s GPT-3, but it performs worse and is essentially a massive pain in the ass for cyber-security and a phishing/disinformation risk.

LeCun was always into ConvNets for machine vision - text-to-text is Hinton, Bengio, and Sutskever.

So far it looks like Baidu and Google have bigger transformer-based models that could perform better, but only Google’s PaLM is architecturally different enough to actually have a shot at it.

There are also augmented variants of transformer-based models that are capable of more factual responses, but they tend to be less conversational.

1