andreichiffa t1_jdzcbln wrote
Depends on which hardware you have. A rule of thumb is that if you want to be efficient, you need about 3x the model size in VRAM to store the optimizer state, plus some headroom for data.
You also need to train in higher-precision floats due to stability issues, so unless your GPU supports float8, double that RAM figure.
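For what it's worth, here is a back-of-the-envelope sketch of that rule of thumb (illustrative only; memory-efficient optimizers, offloading and activation checkpointing change the picture a lot):

```python
# Rule-of-thumb training VRAM: ~3x the model footprint for weights, gradients
# and optimizer state; activations and data headroom are not counted here.
def rule_of_thumb_vram_gb(n_params_billion: float, bytes_per_param: float) -> float:
    model_gb = n_params_billion * bytes_per_param  # 1e9 params * N bytes = N GB
    return 3 * model_gb

for size in (1.5, 2.7, 6.0):
    half = rule_of_thumb_vram_gb(size, 2)  # 16-bit weights
    full = rule_of_thumb_vram_gb(size, 4)  # 32-bit weights ("double the RAM")
    print(f"{size}B params: ~{half:.0f} GB at 16-bit, ~{full:.0f} GB at 32-bit")
```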
Realistically, if you have an RTX 4090, you can go up to 6-7B models (Bloom-6B, GPT-J, …). Anything below that, and I would aim at 2.7B models (GPT-Neo).
I would avoid the LLaMA family because of how you get access to the pretrained weights (liability), and stay with FOSS models. With the latter you can also contribute back and gain some visibility that way, assuming you want it.
andreichiffa t1_jdvojfg wrote
Reply to comment by shanereid1 in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
It's a common result from the flat-minima literature: to train well, the model needs to be overparameterized, both to avoid getting stuck in local minima and to smooth the loss landscape.
However, the overparameterization needed at the training stage can be trimmed away at the inference stage.
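As a toy illustration of that trimming, a minimal magnitude-pruning sketch with PyTorch (the model here is a made-up stand-in, not anything from the thread):

```python
import torch
import torch.nn.utils.prune as prune

# Toy model standing in for a trained, overparameterized network.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# After training, magnitude-prune half the weights in every Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

zeroed = sum((m.weight == 0).sum().item()
             for m in model.modules() if isinstance(m, torch.nn.Linear))
total = sum(m.weight.numel()
            for m in model.modules() if isinstance(m, torch.nn.Linear))
print(f"zeroed {zeroed}/{total} weights for inference")
```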
andreichiffa t1_jdu5wmj wrote
Reply to [D] Will prompting the LLM to review it's own answer be any helpful to reduce chances of hallucinations? I tested couple of tricky questions and it seems it might work. by tamilupk
Yes, that's the mechanism the GPT-4 paper showed they were using for a bunch of things in the annex. It was initially discovered in the toxicity-detection domain (the RealToxicityPrompts paper, I believe).
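For reference, a minimal sketch of that self-review loop; `generate` is a placeholder for whatever LLM call you use, not a real API:

```python
def generate(prompt: str) -> str:
    """Placeholder: plug in your LLM call of choice here."""
    raise NotImplementedError

def answer_with_self_review(question: str, max_rounds: int = 2) -> str:
    answer = generate(question)
    for _ in range(max_rounds):
        critique = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            "Review the answer above and list any factual errors, or reply 'no errors'."
        )
        if "no errors" in critique.lower():
            break
        # Ask the model to rewrite its own answer using its own critique.
        answer = generate(
            f"Question: {question}\nDraft: {answer}\nReviewer notes: {critique}\n"
            "Rewrite the draft, fixing the issues listed."
        )
    return answer
```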
andreichiffa t1_jajuk03 wrote
Reply to comment by LetterRip in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
That, and the fact that OpenAI/MS want to completely dominate the LLM market, in the same way Microsoft dominated the OS/browser market in the late 90s/early 2000s.
andreichiffa t1_jadwt07 wrote
Reply to An anti-seizure medication shows promise in reducing the likelihood of heavy drinking, desire to drink, and positive alcohol expectancies, according to new research by chrisdh79
A lot of anti-epileptic medication also has a stabilizing effect on milder forms of bipolar disorders. I wonder if that might play a role here.
andreichiffa t1_jaa3v5s wrote
Reply to [D] What do you think of this AI ethics professor suggestion to force into law the requirement of a license to use AI like chatGPT since it's "potentially dangerous"? by [deleted]
Based on some of the comments over on /r/ChatGPT asking to remove the disclaimers while people teach themselves plumbing, HVAC and electrical work with ChatGPT, we are a couple of lawsuits away from OpenAI and MS actually creating a GPT certification, and from workplaces requiring it to interact with LLMs / insurances refusing claims resulting from ChatGPT interactions without certification.
andreichiffa t1_j9t35a6 wrote
Reply to [D] To the ML researchers and practitioners here, do you worry about AI safety/alignment of the type Eliezer Yudkowsky describes? by SchmidhuberDidIt
No. As a matter of fact, I consider it harmful, and I am far from being alone in that regard.
What you need to understand is that AI* kills already. Not only military/law-enforcement AI that misidentifies people and leads to them being killed, or searched and killed, or poisoned and killed in prison, but also the types of AI you interact with on a daily basis. Recommendation algorithms that promote disinformation about vaccine safety and COVID risk have killed hundreds of thousands. Medical AIs that are unable to identify sepsis in 70% of cases, yet are widely used and override doctors in hospitals, have killed thousands. Tesla Autopilot AIs that kill their passengers on a regular basis. Conversational LLM agents that will tell users how to do electrical work and kill them in the process.
But here is the thing. Working on the safety of such AIs creates conflict - with the engineers and researchers developing them, with the execs who greenlight them, with the influencers who touted them, with the stakeholders who were getting money from the additional sales the AI feature generated. So safety and QA teams get fired, donations get made to universities to get rid of particularly vocal critics of the current state of affairs, Google de-indexes their work and Facebook randomly and accidentally deletes their posts (Bengio vs LeCun circa 2019, I believe, and the reason the latter moved to Twitter).
The problem with the super-human AGI folks (and longtermism/EA more generally, to which Eliezer Yudkowsky belongs) is that they claim none of those problems matter, because if SH-AGI arises, if it decides to meddle in human affairs, if we don't have enclaves free from it, and even if it only occurs in 100 years, it will be so bad that it will make everything else irrelevant.
That's a lot of "ifs". And a long timeline. And there are pretty good theoretical reasons to believe that even if SH-AGI arises, its capabilities would not be as extensive as the EA crowd claims (impossibility theorems and Solomonoff computability constraints with respect to energy and memory). And then there are theoretical guarantees as to why we wouldn't be able to prevent it even if it started to emerge right now (Gödel's incompleteness).
But in principle - yeah, sure, why not; you never know if something interesting pops up along the way.
The problem is that in the way it is currently formulated and advertised, it hits the cultural memes (HAL, A.I., …) and the type-A personalities of younger engineers and researchers (work on the **most important** problem, likely to make you **most famous**) in a way that completely drowns out the problems with AI that are already here - from both the general public's and the engineers' perspective.
It is perhaps not a coincidence that a lot of entities that would stand to lose reputation/income from in-depth looks into current AIs' safety and alignment are donating quite a lot to EA/longtermism and lending them some of their own credibility.
*To avoid sterile semantic debates: to me, an AI is any non-explicitly-coded program that makes decisions on its own. Hence LLMs without a sampler are non-AI ML, whereas generative LLMs with a sampler are AI (generative ML).
andreichiffa t1_j9fa9kz wrote
Reply to [D] Maybe a new prompt injection method against newBing or ChatGPT? Is this kind of research worth writing a paper? by KakaTraining
It's a grey area.
It's not general enough to warrant a full research paper, but on the other hand it is equivalent to an SQL injection through unsanitized input, and would be reported as a CVE if this were traditional programming.
I think there will eventually be a database like that, so save the prompt, the date and the context of the conversation, preferably somewhere that carries a timestamp (e.g. a public GitHub repo commit with a PGP signature), so that once such a system goes live you can add to it.
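The record-keeping part can be as simple as the sketch below (field values and filename are placeholders):

```python
import datetime
import hashlib
import json

# Minimal record of the prompt, date and context; the hash lets a later public
# commit (or any other timestamping channel) vouch that it existed at this date.
record = {
    "prompt": "<the injection prompt goes here>",
    "context": "<which system, which mode, what was on screen>",
    "date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
record["sha256"] = hashlib.sha256(
    json.dumps(record, sort_keys=True).encode("utf-8")
).hexdigest()

with open("prompt_injection_record.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)
```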
andreichiffa t1_j95d303 wrote
Reply to [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
I really think we need an intermediate between conference papers and arxiv, to just evaluate how reproducible/sane the paper is without evaluating whether it is important or not.
Because at this stage I genuinely can't tell whether it's a press release, a report in paper form, or an actual paper.
andreichiffa t1_j8vqd42 wrote
It's a Red Hat for ML, and especially for LLMs. You want clean internals and things that just work? You pay the consulting/on-premises fees. In the meantime they are pushing FOSS models forward and supporting sharing and experimentation on established models.
I really don’t think you realize how much worse the domains that don’t have their HuggingFace are doing.
andreichiffa t1_j8hf2th wrote
Reply to comment by ateqio in [D] Looking for recommendations for an affordable API service to classify AI-generated text by ateqio
10% is what OpenAI considered "good enough" for theirs, but the problem is that the detection is not uniform. Most neurodivergent folks will be misclassified as generative models, as will people with social anxiety, who tend to be wordy. Non-native and non-fluent English speakers are the other big false-positive trigger.
andreichiffa t1_j8hawd4 wrote
Reply to comment by ateqio in [D] Looking for recommendations for an affordable API service to classify AI-generated text by ateqio
I reported to Hugging Face what its detector was being used for and its failure modes (hint: false positives are worse) in the first days of December. They decided to keep it up. It's on their conscience.
Same thing with API providers. Those willing to sell you one are selling you snake oil. It's on their conscience.
Same thing for you. You want to build an app that sells snake oil that can be harmful in a lot of scenarios? It's on your conscience.
But at that point you don't even need an API to build it.
andreichiffa t1_j8h43hh wrote
Reply to [D] Looking for recommendations for an affordable API service to classify AI-generated text by ateqio
You can't. Anyone with enough technical knowledge will not want to go anywhere near the legal ramifications and responsibility it implies (in addition to looking like a clown within about 10 minutes of uptime, once bypasses are found).
There are fundamental limitations on detectability as of now.
andreichiffa t1_j7t9ul8 wrote
I am pretty sure that was an Anthropic paper first (Predictability and Surprise in Large Generative Models). Makes me truly wonder WTF exactly is going on in Google lately.
As to your question, no one has stacked enough attention layers yet, but there is a very high probability that someone will. The ability to spell has already been mentioned, but it could also help with things such as hands, the number of hands/feet/legs/arms/paws/tails, and other details that make a lot of generated images disturbing today.
The issue will most likely be with funding enough data, given that, unlike text, most images on the internet are copyrighted (cough Getty cough).
andreichiffa t1_j6n9lg6 wrote
Reply to comment by visarga in Few questions about scalability of chatGPT [D] by besabestin
A lot of the conclusions from that paper were called into question by the discovery, a little less than a year later, that GPT-2 was actually memorizing a lot of information from its training dataset: https://arxiv.org/abs/2012.07805
About a year after that, Anthropic came out with a paper suggesting that there were scaling laws which meant undertrained larger models did not do that much better and actually did need more data: https://arxiv.org/pdf/2202.07785.pdf
Finally, more recent results from DeepMind did an additional pass on the topic and suggest that the relationship between data and model size is much tighter than anticipated, and that a 4x smaller model trained for 4x the time would outperform the larger model: https://arxiv.org/pdf/2203.15556.pdf
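For context, the back-of-the-envelope compute argument behind that result, using the standard C ≈ 6·N·D approximation and the headline Gopher/Chinchilla configurations from the DeepMind papers:

```python
# Training compute scales roughly as C ~ 6 * N * D (N parameters, D tokens).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)      # 280B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)  # 4x smaller, ~4.7x more tokens

print(f"compute ratio (Chinchilla / Gopher): {chinchilla / gopher:.2f}")
# ~1.17: roughly the same compute budget, yet the smaller model wins on benchmarks.
```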
Basically, the original OpenAI paper contradicted a lot of prior research on overfitting and generalization, and its result seems to be due to an instance of Simpson's paradox in some of the batching they were doing.
andreichiffa t1_j6mojfv wrote
Reply to comment by Blutorangensaft in [Discussion] ChatGPT and language understanding benchmarks by mettle
Most likely as a post-processor, along the lines of guided generation; pretty much the GeDi proposed by Salesforce in 2020.
andreichiffa t1_j6mdm66 wrote
On a very high level, transformer-derived architectures struggle with the concept of reality because they need the distributions in the token-embedding space to remain wide. Especially for larger models, the training data is so sparse that without this they would struggle with generalization and exposure bias.
Repeated prompting and prompt optimization can pull elements of the training set out of the model (in some cases), because in the end they do memorize, but the exact mechanism is not yet clear and cannot be counted on.
You can get around it by adding a "critic" post-processor that classifies whether the model is trying to state a fact, looks it up, and forces a re-generation until the statement is factually correct. This is very close to GeDi, the guided generation introduced by a Salesforce team back in 2020. Given that OpenAI went this route for ChatGPT and InstructGPT to make them less psycho and more useful to end users (plus iterative fine-tuning from user and critic-model input), there is a good chance they will go this route here as well.
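A bare-bones sketch of such a critic loop; both helpers are placeholders for an LLM call and a fact-checking/lookup step, not real APIs:

```python
def generate(prompt: str) -> str:
    """Placeholder for the base LLM."""
    raise NotImplementedError

def passes_fact_check(text: str) -> bool:
    """Placeholder critic: detect factual claims, look them up, return a verdict."""
    raise NotImplementedError

def guided_generate(prompt: str, max_tries: int = 5) -> str:
    draft = generate(prompt)
    for _ in range(max_tries):
        if passes_fact_check(draft):
            return draft
        draft = generate(prompt)  # force a re-generation and re-check
    return draft  # give up after max_tries and return the last attempt
```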
You can also add discrete, non-differentiable layers to train the model to distinguish factual statements from the rest of the text and to switch between modes, allowing it to process them differently. However, you lose the nice back-propagation properties and have to do black-box optimization on the discrete layers, which is costly even by LLM standards. That seems to be Google's approach with PaLM.
andreichiffa t1_j6c9xf1 wrote
Reply to comment by visarga in Few questions about scalability of chatGPT [D] by besabestin
That’s a very bold claim that flies in the face of pretty much all the research on the subject to the date.
Surely you have extraordinary evidence to support such extraordinary claims?
andreichiffa t1_j60625r wrote
So. First of all it’s not the size, or at least not only the size.
Before ChatGPT, OpenAI experimented with InstructGPT, which at 6B parameters completely destroyed the 175B GPT-3 when it came to satisfying the users interacting with it and not being completely psycho.
Code-generating abilities start around 12B parameters (OpenAI Codex), so most of what you are interacting with and impressed by could be done with a 12B-parameter model. What is really doing the heavy lifting for ChatGPT is the fine-tuning and guided generation that make it conform to users' expectations.
Now, model size does allow for nice emergent properties, but there is a relationship between dataset size and model size, meaning that without increasing the dataset, bigger models do nothing better. At 175B parameters, GPT-3 was already past that point relative to the curated dataset OpenAI used for it. And given that their dataset already contained CommonCrawl, that was pretty much all public writing on the internet.
They weren't short by a little - it was over a factor of 10x. Finding enough data just to finish training GPT-3 was already a challenge; larger models would need even more. That's also why they could dump code and more text into GPT-3 to create GPT-3.5 without creating bottlenecks.
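The 10x figure is roughly what you get by putting the Chinchilla-style ~20 tokens-per-parameter heuristic against what GPT-3 was reportedly trained on:

```python
params = 175e9                 # GPT-3
tokens_used = 300e9            # reported GPT-3 training tokens
tokens_optimal = 20 * params   # Chinchilla-style ~20 tokens per parameter

print(f"compute-optimal tokens: ~{tokens_optimal / 1e12:.1f}T")
print(f"shortfall: ~{tokens_optimal / tokens_used:.0f}x")
```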
Now, alternative models to GPT-3 have been trained (OPT-175B, BLOOM), but at least OPT-175B underperforms. OpenAI actually did a lot of data preparation, meaning that anyone who wants to replicate it would need to figure out the "secret sauce".
andreichiffa t1_j5uczy3 wrote
Reply to [D]Are there any known AI systems today that are significantly more advanced than chatGPT ? by Xeiristotle
*LeCun. And their Galactica was the subject of so much ridicule that, after a pompous launch, it was un-launched 48 hours later. OPT-175B is a clone of OpenAI's GPT-3, but it performs worse and is essentially a massive cyber-security and phishing/disinformation pain in the ass.
LeCun was always into ConvNets for machine vision - text-to-text is Hinton, Bengio, and Sutskever.
So far it looks like Baidu and Google have bigger transformer-based models that could perform better, but only Google's PaLM is architecturally different enough to plausibly do so.
There are also augmented variants of Transformer-based model that are capable of more factual response, but they tend to be less conversational.
andreichiffa t1_j5ivwgc wrote
Reply to comment by TheTerrasque in [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
Or OPT-175B.
However, 7B is more than large enough to do a lot of the shady stuff that 175B models can do. Even 1.5B models are already starting to do a good job in the hands of a minimally competent user.
andreichiffa t1_j5il76n wrote
Reply to comment by e-rexter in [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
Yup. And then they will also flag human texts that start the same way as the MS COCO dataset as GPT-generated.
andreichiffa t1_j5fd581 wrote
Reply to [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models? by scarynut
They kind of exist (e.g. the GPT-2 detector from Hugging Face), based on the data they were trained on (which is the limiting factor). However, ultimately every model can be modified (fine-tuned) to evade them. Even for large models (>7B parameters), that can be done reasonably fast on commodity hardware these days.
andreichiffa t1_j50x4ky wrote
Reply to comment by DaLameLama in [D] Inner workings of the chatgpt memory by terserterseness
The reported context size is 2048 tokens, but they likely apply a hard attention mask on top of that. In words that is roughly a quarter less, since a token averages about three quarters of a word.
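A quick way to check the token/word ratio, assuming the `tiktoken` package is available (the sample sentence is arbitrary):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "The context window is counted in tokens, which are shorter than words."
tokens = enc.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```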
andreichiffa t1_jec26vk wrote
Reply to comment by Jadien in [D] Turns out, Othello-GPT does have a world model. by Desi___Gigachad
Which is basically the self-attention mechanism plus the universal-approximator nature of NNs. So I am not sure what that proves or what is new about it.