Submitted by austintackaberry t3_120usfk in MachineLearning

Databricks shows that anyone can take a dated off-the-shelf open-source large language model (LLM) and give it magical ChatGPT-like instruction-following ability by training it in less than three hours on one machine, using high-quality training data.

They fine-tuned GPT-J using the Alpaca dataset.

Blog: https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
Github: https://github.com/databrickslabs/dolly

593

Comments

Civil_Collection7267 t1_jdjce9w wrote

This is really impressive. If anyone's curious, here's a comparison with results from Alpaca Native:

Write a love letter to Edgar Allan Poe.

>My dearest Edgar,
>
>I have been walking in a dream these past few days, unable to believe that you are no longer with us. Your words have been on my mind constantly, and I find myself looking for your presence in every corner of our world. Everywhere I go, I see your influence, from the stars above to the rivers below. Even now, as I write this letter, your spirit is with me.
>
>Your poetry was like a beacon of light in the darkness, illuminating the depths of my soul. Through your words, I discovered a new way of seeing the world, one filled with mystery and beauty. You showed me that life could be full of surprises and that nothing should ever be taken at face value.
>
>You were so ahead of your time, my love, and I know that even after all these years, your teachings will continue to live on. I will never forget everything you’ve given me, and I hope that one day we can meet again in another realm. Until then, I will keep your memory alive in my heart.
>
>Forever yours,
>
>[Name]

Write a tweet announcing Dolly, a large language model from Databricks.

>We're thrilled to announce Dolly, our latest language model from Databricks! Dolly is a large-scale language model with state-of-the-art performance on many tasks, including text classification and question answering. #LangModeling #NLP #AI

I'm selling my Nikon D-750, write a short blurb for my ad.

>I am selling my Nikon D-750 camera body with 16GB memory card and the original box and all original paperwork. The camera has been used lightly and is in excellent condition. It comes with the original lens, a Nikkor 35mm f/1.8G. This camera is perfect for any photographer looking to capture stunning images and videos. It offers an impressive array of features including a 24MP DX CMOS sensor, EXPEED 4 image processing engine, 51 point AF system, 1080p HD video recording, and built-in WiFi. The camera is also dust and splash proof, making it ideal for both amateur and professional photographers alike. I am selling this camera due to upgrading to a newer model, and it is in perfect condition.

Explain to me the difference between nuclear fission and fusion.

>Nuclear fission is a process in which a large atom is split into smaller atoms, releasing energy in the form of heat and radiation. Nuclear fusion is a process in which two or more small atoms are combined to form a larger one, releasing energy in the form of heat and light. Fission occurs naturally in some types of radioactive decay, while fusion can occur when certain elements combine in a controlled reaction. The primary difference between these processes is the type of energy released. Fission releases kinetic energy, while fusion releases less energy but more light. Fission also produces more neutrons, which can be used for power production, while fusion produces fewer neutrons and is not as useful for power generation.

Give me a list of 5 science fiction books I should read next.

>1.Dune by Frank Herbert - This classic science fiction novel tells the story of Paul Atreides, whose family is betrayed and given control of the desert planet Arrakis. As he matures, Paul must face dangerous political intrigues and physical threats as he struggles to save his people.
>
>2.The Hitchhiker's Guide to the Galaxy by Douglas Adams - This comedic sci-fi novel follows Arthur Dent, who is told Earth will be destroyed to make way for an intergalactic bypass and escapes with his alien friend Ford Prefect. They travel together through space, encountering strange creatures and learning about the universe.
>
>3.The Foundation Trilogy by Isaac Asimov - This series follows the life of psychohistorian Hari Seldon and his efforts to create a foundation based on psychology and mathematics to preserve human knowledge after the fall of the Empire. It includes the novels Foundation, Foundation and Empire, and Second Foundation.
>
>4.The Martian Chronicles by Ray Bradbury - This collection of stories follows the colonization of Mars by humans, their interactions with the native Martians, and their eventual departure from the planet. It examines themes such as technology, religion, and the human spirit.
>
>5.Ender’s Game by Orson Scott Card - This novel follows Andrew “Ender” Wiggin, a gifted young boy recruited by the military to train as a leader of an army of genetically-engineered children. He must use his intelligence and empathy to fight a war against an alien race.

70

big_ol_tender t1_jdjcfc8 wrote

The Alpaca dataset has a non-commercial license, so idk what they are doing. I’ve asked Stanford to change it but have heard nothing back.

13

MasterEpictetus t1_jdjdobi wrote

This type of work needs to get a lot of attention. It's a great demonstration of getting instruction-following behavior without the biggest LLMs that only companies with major resources can train.

154

machineko t1_jdjeh6y wrote

We have a similar open-source project focused on personalization of LLMs and efficient fine-tuning: https://github.com/stochasticai/xturing

We actually released code for GPT-J, LLaMA, and GPT-2 before these guys, but we are a small team. You can run it on your local machine too.
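
For anyone who wants to try it, the basic flow looks roughly like this. This is a minimal sketch based on the repo's README at the time; class names like `InstructionDataset`, the dataset field names, and the `"llama_lora"` model key may have changed, so check the current docs:

```python
# Minimal sketch of instruction fine-tuning with xturing (stochasticai/xturing).
# Dataset path and model key are illustrative; see the repo README for the
# currently supported models and the exact dataset format.
from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel

# Alpaca-style instruction records (see the repo for the exact field names)
dataset = InstructionDataset("./alpaca_data")

# LoRA-adapted LLaMA; smaller keys like "gpt2" follow the same API
model = BaseModel.create("llama_lora")

model.finetune(dataset=dataset)                 # fine-tune on the instruction data
print(model.generate(texts=["Explain LoRA in one sentence."]))

model.save("./llama_lora_finetuned")            # reload later with
                                                # BaseModel.create("llama_lora", "./llama_lora_finetuned")
```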

182

Colecoman1982 t1_jdjkgjy wrote

When you asked, did you clarify that you were asking about the training data versus the whole project? The final Alpaca project was built, in part, on top of Meta's LLaMA. Since LLaMA has a strictly non-commercial license, there is no way that Stanford can ever release the final project for commercial use (as they've already stated in their initial release of the project). On the other hand, any training data they've created on their own (without needing any code from LLaMA) should be within their power to re-license. If they think you are asking for the whole project to be re-licensed, they are likely to just ignore your request.

23

MjrK t1_jdjqz9h wrote

> We emphasize that Alpaca is intended only for academic research and any commercial use is prohibited. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial license, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.

https://crfm.stanford.edu/2023/03/13/alpaca.html

22

ZetaReticullan t1_jdjrecp wrote

What a time to be alive! jointly terrifying and exciting!

18

Reeeeeeeeedit t1_jdjryxm wrote

Where is the instruction training data? I couldn’t find it in the GitHub repo.

6

SWESWESWEh t1_jdk8rtn wrote

Doing the lord's work, my friend. Does it work with Apple Silicon's Metal shaders? I've trained my own models, as both TF and PyTorch support it, but I've noticed a lot of people use CUDA-only methods, which makes it hard to use open-source stuff.

17

__Maximum__ t1_jdkdtp2 wrote

ClosedAI is feeding off of our data. If we start using/supporting Open Assistant instead, it will beat ChatGPT in a month or two.

54

__Maximum__ t1_jdkepie wrote

Also, it's very shady for a company called OpenAI. They claimed they became for-profit because they needed the money to grow, but these restrictions just show that they are filthy liars who only care about keeping power and making profit. I'm sure they already have a strategy for getting around that 30B cap, just like they planned on stealing money and talent by calling themselves a non-profit first.

17

kromem t1_jdkfj5w wrote

> The model underlying Dolly only has 6 billion parameters, compared to 175 billion in GPT-3, and is two years old, making it particularly surprising that it works so well. This suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models.

The exciting thing here is the idea that progress in language models is partially contagious backwards to earlier ones: newer models generate the data used to update older ones, not in pre-training but in fine-tuning (and I expect, based on recent research into in-context learning, that this would extend to additional few-shot prompting).

I'm increasingly wondering if we'll see LLMs develop into rolling releases, particularly in the public sector, possibly with an emphasis on curating the dataset for fine-tuning and a platform-agnostic stance towards the underlying pre-trained model powering it.

In any case, it looks more and more like the AI war between large firms will trickle down into open alternatives whether they'd like it to or not.

38

light24bulbs t1_jdks13d wrote

Question: I notice there's a focus here on fine-tuning for instruction following, which is clearly different from the main training, where the LLM just reads text and tries to predict the next word.

Is there any easy way to continue that bulk part of the training with some additional data? Everyone seems to be trying to get there by injecting embedded chunks of text into prompts (my team included), but that approach just stinks for a lot of uses.
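
(For context, a rough sketch of the "inject embedded chunks into the prompt" approach being described here, using sentence-transformers; the chunk store, model name, and question are just illustrative:)

```python
# Sketch of retrieval-augmented prompting: embed document chunks, retrieve the
# closest ones for a question, and paste them into the prompt. Illustrative only.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["...chunk 1 of your documents...", "...chunk 2...", "...chunk 3..."]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

question = "What does section 3 say about termination?"
hits = util.semantic_search(
    embedder.encode(question, convert_to_tensor=True), chunk_embeddings, top_k=2)[0]

context = "\n".join(chunks[hit["corpus_id"]] for hit in hits)
prompt = (f"Use the context to answer the question.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
# `prompt` then goes to the LLM; the question above is whether there's an easy
# way to keep training the base model on the bulk text instead.
```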

8

mxby7e t1_jdktvqr wrote

The license won’t change. The dataset was collected in a way that violates OpenAI's terms of service, since their API was used to generate the data. If Stanford allowed commercial use, it would open them up to a lawsuit.

8

throwaway2676 t1_jdl0y80 wrote

Alpaca was only trained on 50k instructions, right? A large group of grad students, or a forum like Reddit, could construct that many manually in a couple of weeks. I'm surprised they even had to resort to using ClosedAI.

8

mxby7e t1_jdl18t6 wrote

Maybe. Open Assistant, by Stability.ai, is doing this type of manual dataset collection. The training data and the model weights are supposed to be released once training is complete.

11

dreamingleo12 t1_jdl3qgp wrote

It’s just a shameless copy of Stanford’s work. The innovative thing about Stanford Alpaca is that it builds a ChatGPT-style assistant from a language model, Meta's LLaMA, at low cost. Databricks just followed Stanford’s approach, used a different base model, and claims it’s a big innovation. Alpaca can actually be fine-tuned on the same dataset in 3 hours and performs better than Databricks’ model.

4

Educational_Ice151 t1_jdl47lq wrote

Hello Dolly. This looks pretty interesting. I have been playing with creating cross-model feedback loops that iterate for several cycles using few-shot prompts and chain-of-thought models. This would work really well for my concept. I’ll likely publish my code in a day or two.

Shared to r/aipromptprogramming

10

WarAndGeese t1_jdl5t0z wrote

Boo hoo to OpenAI; people should do it anyway. Are the terms of service the only reason not to, or are there actual material barriers? If it's a problem of money, then as long as people know how much, it can be crowdfunded. If it's a matter of people power, then there are already large volunteer networks. Or is it just something that isn't practical or feasible?

7

elbiot t1_jdlgxnz wrote

In my understanding, if you have text, it's not a challenge to train on next-word prediction. Just keep the learning rate low. The reason there's a focus on instruction-based fine-tuning is that that data is harder to come by.

My only experience is that I've done this with a sentence-embedding model (using SBERT): I trained on my new text and the original training data 50/50, and the model both got better at embedding my text and didn't forget how to do what it was originally trained on.
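
A minimal sketch of that kind of continued next-word-prediction training with Hugging Face Transformers (model name, file name, and hyperparameters are placeholders, not anything from this thread):

```python
# Sketch: continued causal-LM (next-word prediction) training on your own text.
# Keep the learning rate low, and optionally mix in original-domain data 50/50
# as described above to limit forgetting.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "EleutherAI/gpt-neo-125M"          # small model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Plain text file(s) of the domain data you want to keep training on
raw = load_dataset("text", data_files={"train": "my_domain_text.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretrain",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=1e-5),   # low learning rate
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```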

5

master3243 t1_jdlhb8c wrote

I have a theory that the main reason OpenAI decided to start keeping its training and architectural details private is that, through minor modifications to training data and data augmentation, they were able to gain significant improvements in the qualitative output of GPT.

Thus any competitor could replicate the pipeline with ease and reproduce the improvements, so they decided to keep it as a trade secret.

Glad more research like this is being done and shared to the rest of the community.

29

Disastrous_Elk_6375 t1_jdlix6j wrote

The demo was up for a couple of days. The first hours of it being online were rough (80-200 people in the queue). It got better the following day, and better still on the third day. I believe they removed the demo about a week later. IMO they've proven a point: the demo was extremely impressive for a 7B model.

1

Disastrous_Elk_6375 t1_jdlj4rn wrote

> and uses a different base model and claims it’s a big innovation

Huh? My read of their blog was that they wanted to highlight the fact that you can fine-tune a ~2-year-old LLM and still get decent results. I don't think they claimed this is innovative, or that the innovation is theirs to boast about...

I've played with GPT-Neo (non-X) and GPT-J when they were released, and the results were rough. You had to do a ton of prompt-engineering work and exploration to find useful cases. This shows that even smaller, older models can be fine-tuned with the method proposed in Alpaca.

14

lexcess t1_jdlj8tf wrote

Classy, especially when they are breezing past any copyright on the datasets they are training on. I wonder if they can legally enforce that without setting a potentially bad precedent for themselves, or whether it could be worked around if the training were done indirectly, through something like Alpaca.

3

dreamingleo12 t1_jdlkbxl wrote

WSJ:

“Databricks Launches ‘Dolly,’ Another ChatGPT Rival The data-management startup introduced an open-source language model for developers to build their own AI-powered chatbot apps” (Apparently DB paid them)

DB’s blog:

“Democratizing the magic of ChatGPT with open models”

Introduced? ChatGPT rival? Didn’t you just follow Stanford’s approach? You used Stanford’s dataset, which was generated by GPT, right? Huh? This is Stanford’s achievement, not DB’s. DB went too far with the marketing.

1

SeymourBits t1_jdlkln7 wrote

I second this. I was able to extract fairly useful results from Neo, but it took a huge amount of prompt trial and error, eventually getting decent/stable results, though not in the same ballpark as GPT-3+. The Dolly training results here seem good, if not expected. I'm now ready to move to a superior model like LLaMA/Alpaca, though. What are you running?

7

Disastrous_Elk_6375 t1_jdllii0 wrote

> https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html

This is the blog post that I read. I can't comment on the WSJ article, and your original message implied a bunch of things that, IMO, were not found in the blog post. If you don't like the WSJ angle, your gripe should be with them, not Databricks. shrug

From the actual blog:

> We show that anyone can take a dated off-the-shelf open source large language model (LLM) and give it magical ChatGPT-like instruction following ability by training it in 30 minutes on one machine, using high-quality training data.

> Acknowledgments
>
> This work owes much to the efforts and insights of many incredible organizations. This would have been impossible without EleutherAI open sourcing and training GPT-J. We are inspired by the incredible ideas and data from the Stanford Center for Research on Foundation Models and specifically the team behind Alpaca. The core idea behind the outsized power of small dataset is thanks to the original paper on Self-Instruct. We are also thankful to Hugging Face for hosting, open sourcing, and maintaining countless models and libraries; their contribution to the state of the art cannot be overstated.

More to the point of your original message, I searched for "innovative", "innovation", and "innovate" and found 0 results in the blog post. I stand by my initial take: the blog post was fair, informative, and pretty transparent about what they've done, how, and why.

7

dreamingleo12 t1_jdllxww wrote

Well, if you'd ever worked with marketing or communications teams, you'd know that DB co-authored the WSJ article. My point is that the democratization is an achievement of the Stanford Alpaca team, not DB. DB marketed it as if they had done the major work, which is untrue.

−6

visarga t1_jdlonpq wrote

One way to speed this up is to make an extension for voluntary contributions of LLM interactions to open source. A user decides when a chat deserves to be donated to open source and pushes a button to share. I don't think OpenAI can object to users donating their data.

7

visarga t1_jdloqee wrote

Most of our pre-2020 NLP skills are worthless now; what used to require bespoke models and datasets is just another emergent LLM ability. It's like a new starting line and we don't know what human skills will be valuable in the future.

15

sdmat t1_jdm0pmi wrote

> It's like a new starting line and we don't know what human skills will be valuable in the future.

With each passing day, the creature stirs, growing hungrier and more restless. The ground trembles beneath our feet, but we dismiss the warning signs.

Text above naturally written by GPT4.

Maybe we should start flipping the assumption - why would you want a human if inexpensive and dependable AI competence is the default?

5

ephemeralentity t1_jdm6wkc wrote

Playing around with this. Running BaseModel.create("llama_lora") seems to return "Killed". I'm running it on WSL2 from Windows 11, so I'm not sure if that could be the issue. It's on an RTX 3070 with only 8GB of VRAM, so maybe that's the real problem...

EDIT - Side note: I first tried directly on Windows 11, but it seems the deepspeed dependency is not fully supported: https://github.com/microsoft/DeepSpeed/issues/1769
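
(Not from the thread, just a guess: a bare "Killed" on WSL2 is usually the Linux OOM killer running out of system RAM while the weights load, rather than a VRAM error. If so, raising the WSL2 memory/swap limits in `%UserProfile%\.wslconfig` can help; the values below are purely illustrative.)

```ini
# %UserProfile%\.wslconfig  (illustrative values; match them to your machine)
[wsl2]
memory=24GB
swap=32GB
```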

2

light24bulbs t1_jdmad5n wrote

Hey, I've been looking at this more and it's very cool. One thing I REALLY like is that I see self-training using dataset generation on your roadmap. This is essentially the technique Facebook used to train Toolformer, if I'm reading their paper correctly.

I'd really love to use your library to try to reimplement Toolformer's approach someday.

2

SDRealist t1_jdmdwkl wrote

Users could certainly donate their questions, but I believe the TOS for ChatGPT forbid using the generated output to train competing models (at least for commercial purposes).

8

Daveboi7 t1_jdme4qf wrote

Can we just download the model?

1

gamerx88 t1_jdmndip wrote

Food for thought: is this really surprising, considering that the InstructGPT paper in early 2022 already showed that even a 1.3B model could, after RLHF, beat a much larger 175B model?

I guess what this shows is that it's the data that matters rather than SFT vs. RLHF. Wondering if any ablation studies have been done here.

2

RiyazRockz t1_jdnbroi wrote

Hey, I want to fine-tune a model to solve a pharma-related problem, and I'd like to know if I can fine-tune my model with this. Could you please share your contact details so that I can learn more about it?

1

ebolathrowawayy t1_jdnc05i wrote

But what if you're training a model for a narrow use case and don't intend for anyone to use it except a niche set of users? Is that enough to be in the clear? Or is any use of OpenAI's model output to train a model, for any purpose, a no-no?

1

Daveboi7 t1_jdnedrd wrote

But which cloud service did you use to train them?

I tried using databricks to train a model but the setup was too complicated.

I’m wondering is there a more straightforward platform to train on?

1

baffo32 t1_jdnppmp wrote

This is the same task as instruction tuning; instruction tuning just uses specific datasets where instructions are followed. It's called "fine-tuning", but nowadays people are using adapters and PEFT to do this on low-end systems.
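
A minimal sketch of that adapter/PEFT route (the base model and LoRA hyperparameters are illustrative, not from this comment):

```python
# Sketch: wrapping a causal LM with a LoRA adapter via the PEFT library so
# instruction tuning only updates a small fraction of the weights.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                            torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # GPT-J attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the weights
# ...then run your usual training loop / Trainer on instruction-formatted data.
```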

1

biggieshiba t1_jdojnn6 wrote

I don't understand why anyone would care; in a few years half the internet will be AI-generated. If someone uses GPT-4 to generate a sentence posted on Wikipedia, how will you know before using it? Don't you think many models will use that sentence?

Plus, how will they know? Training data is not easy to extract from a model. Unless you are a direct OpenAI competitor, they won't ever care or even look at you (well, maybe their superAI will).

Lastly, the dataset is full of errors; it would be better to generate it again, or even pay people, which would be quite cheap for 50k examples. It's quite a bad dataset when you really look at it: empty inputs or outputs, unclear instructions, instructions that don't fit the model... The fact that it is bad and small is very encouraging, BTW, since the model still performs pretty well.

2

ephemeralentity t1_jdp2pu8 wrote

Thanks, looks like gpt2 worked! Sorry, stupid question, but how do I save/reuse the results of my model fine-tune? When I re-finetune for 0:2 epochs it gives a reasonable response, but if I try to skip model.finetune, it responds with newlines only (\n\n\n\n\n\n\n\n ...).

1

SatoshiNotMe t1_jdpgj80 wrote

I hope this is not closely tied to the Databricks ecosystem (i.e., their notebooks, Spark clusters, etc.). Running things in DB notebooks is not a pleasant experience.

1

machineko t1_jdqzmyq wrote

`model.save("path/to/your/weights")` saves it to the directory. After that, you can load it with:

`model = BaseModel.create("gpt2", "path/to/your/weights")`

Can you share the input text you have used? It is possible that GPT-2 is too small and needs custom generation parameters.

2

baffo32 t1_jdrhj77 wrote

I was still confused by your response, and I'm thinking that if you wanted a model to behave as if it had been given different pretraining data, you would probably first fine-tune on the new bulk data, and then after that fine-tune on the target task, such as instruction following.

Instruction following is indeed, of course, just predicting the next word: on data where the next word is obedient to the instructions preceding it.

1

light24bulbs t1_jdrm9kh wrote

That's the part I wasn't getting. I assumed the fine-tuning involved a different process. I see now that it is in fact just more training data, often templated into a document in such a way that it's framed clearly for the LLM.

The confusing thing is that most of the LLM-as-a-service companies, OpenAI included, will ONLY take data in the question-answer format, as if that's the only data you'd want to use for fine-tuning.

What if i want to feed a book in so we can talk about the book? A set of legal documents? Documentation of my project? Transcriptions of TV shows?

There are so many use cases for training on top of an already pre-trained LLM that aren't just question answering.

I'm into training LLaMA now. I simply took some training code I found, removed the JSON-parsing question-answer templating stuff, and was done.
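
For reference, the "templating stuff" being removed is roughly this: Alpaca-style code flattens each instruction record into one document and then does ordinary next-word-prediction training on it. The template below is paraphrased from the Stanford Alpaca repo, so treat the exact wording as approximate:

```python
# Sketch: flattening an Alpaca-style record into a single training document.
# (Alpaca actually uses a second, shorter template when "input" is empty.)
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

record = {
    "instruction": "Summarize the plot of Dune in two sentences.",
    "input": "Dune is a 1965 science fiction novel by Frank Herbert.",
    "output": "Paul Atreides and his family take control of the desert planet Arrakis...",
}
training_text = PROMPT_TEMPLATE.format(**record)
# For a book, legal documents, or transcripts, skip the template entirely and
# train on the raw text, exactly as in ordinary pre-training.
```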

1

SatoshiNotMe t1_jdtemml wrote

So if the notebook is tuning on a fixed dataset, anyone running it will arrive at the same weights after an expensive compute run, which seems wasteful. Why not just share the weights, i.e. the final trained + tuned model? Or is that already available?

1

nemorocksharder t1_jdz8kt5 wrote

What you're describing is exactly what I have been looking to do too, and I'm really surprised I'm not hearing more about it. Have you found any useful approaches to essentially adding to the LLM's corpus with target material/text? Or is anyone else trying to do this?

1

SatoshiNotMe t1_jealb7d wrote

Is there a "nice" way to use this model, (say, via the command-line like in the GPT4All or alpaca.cpp repos), rather than in a databricks notebook or in HG spaces? For example I'd like to chat with it on my M1 MacBook Pro. Any pointers appreciated!
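
(One possible route, not an official one: if the fine-tuned weights end up on the Hugging Face Hub (the model id below, `databricks/dolly-v1-6b`, is an assumption), you can load them with plain Transformers and generate on the M1's MPS backend. A 6B model in fp16 wants roughly 12-16 GB of memory, so it's tight on smaller MacBooks.)

```python
# Sketch: a bare-bones command-line chat loop on Apple Silicon via PyTorch MPS.
# The model id is an assumption; substitute whatever checkpoint is published.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dolly-v1-6b"
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.float16).to(device)

while True:
    prompt = input("You: ")
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=200,
                            do_sample=True, top_p=0.9)
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
    print("Dolly:", reply)
```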

1