Submitted by jaxolingo t3_125qztx in MachineLearning
Im_Unlucky t1_je6hn68 wrote
Reply to comment by LetGoAndBeReal in [D] The best way to train an LLM on company data by jaxolingo
This is the way ^
- shove data into vector db
- embed incoming query
- return topk matches from vector db
- inject those into prompt
- make api call to openai
- display response
- ???
- you are now an ML engineer
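In code, the whole pipeline is maybe 30 lines. A rough sketch, assuming the openai Python client and a plain in-memory cosine-similarity search standing in for a real vector DB (the docs and numbers are made up):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return np.array(resp.data[0].embedding)

# step 1: the "vector db" -- a list of (chunk, embedding) pairs is enough for a toy
docs = [
    "Apple is in the Information Technology sector.",
    "AAPL returned 4.1% last month; the IT sector averaged 2.3%.",  # made-up numbers
]
index = [(doc, embed(doc)) for doc in docs]

def top_k(query: str, k: int = 2) -> list[str]:
    # steps 2-3: embed the incoming query, return top-k matches by cosine similarity
    q = embed(query)
    scored = sorted(
        ((float(np.dot(q, e)) / (np.linalg.norm(q) * np.linalg.norm(e)), doc)
         for doc, e in index),
        reverse=True,
    )
    return [doc for _, doc in scored[:k]]

# steps 4-6: inject the matches into the prompt, call the API, display the response
question = "What was Apple's return compared to its sector last month?"
context = "\n".join(top_k(question))
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(reply.choices[0].message.content)  # ??? -> you are now an ML engineer
```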
Hands0L0 t1_je6o5t5 wrote
Off topic but I love how underpants gnomes memes are still relevant 25 years later
visarga t1_je6k74d wrote
Often it seems the model can't properly synthesise information from a bunch of snippets: it lacks the context around those snippets, so it combines the information incorrectly or hallucinates an explanation.
Retrieval + loading data in the context is far from solved.
Im_Unlucky t1_je6vjqd wrote
This is true. Good prompt engineering has definitely gone a long way for me in improving RAG performance, though. I don't think shoving snippets into the prompt is the long-term solution, but at this point in time it's the best option in my experience.
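For what it's worth, the kind of prompt that's worked for me is just being very explicit about grounding. An illustrative template, nothing official:

```python
RAG_PROMPT = """Answer the question using ONLY the context snippets below.
If the snippets don't contain the answer, say "I don't know" -- do not guess.
Quote the snippet(s) you relied on.

Context:
{context}

Question: {question}
Answer:"""

snippets = ["AAPL returned 4.1% last month.", "The IT sector averaged 2.3%."]
prompt = RAG_PROMPT.format(context="\n---\n".join(snippets),
                           question="How did Apple do versus its sector?")
```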
kromem t1_je6uv46 wrote
"Moar layers" doesn't only need to apply to the NN.
CoT prompting works by breaking analysis down into smaller steps that each generate their own additional context.
Doing something similar with DB analysis is absolutely possible, such as preemptively summarizing schema and using that summary as part of the retrieval to contextualize the specific fragments.
Additionally, keeping static analysis examples on hand for related tables and feeding them in to go from zero-shot to few-shot would go a long way toward reducing some of the issues you highlight.
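A sketch of what that prompt assembly could look like, with the table summaries and worked examples generated offline (every table name and string here is an illustrative stand-in):

```python
# Generated ahead of time, e.g. by asking the LLM to summarize each table's DDL.
SCHEMA_SUMMARIES = {
    "stocks": "Daily per-company returns. stock_return is a percent; date is the trading day.",
    "companies": "One row per company. company_name is the join key; sector uses GICS names.",
}

# Static worked examples for these tables, turning zero-shot into few-shot.
FEW_SHOT = [(
    "Which sector had the best average return yesterday?",
    "SELECT c.sector, AVG(s.stock_return) AS avg_ret FROM stocks s "
    "JOIN companies c ON s.company = c.company_name "
    "WHERE s.date = CURRENT_DATE - 1 GROUP BY c.sector ORDER BY avg_ret DESC LIMIT 1;",
)]

def build_prompt(question: str, tables: list[str]) -> str:
    summaries = "\n".join(f"- {t}: {SCHEMA_SUMMARIES[t]}" for t in tables)
    shots = "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in FEW_SHOT)
    return (f"Relevant tables:\n{summaries}\n\n"
            f"Worked examples:\n{shots}\n\n"
            f"Q: {question}\nSQL:")

print(build_prompt("What was Apple's return vs its sector last month?",
                   ["stocks", "companies"]))
```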
Tostino t1_je847jg wrote
Literally just worked through this today manually as a proof of concept, using the LLM to augment the DB schema with comments describing any relevant info or corner cases. I'm essentially just manually feeding it as context to my prompts when I need to know something related to that set of tables, but it seems pretty powerful. Automating this is going to be nuts.
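The automated version is basically a loop like this (a rough sketch, assuming the openai client; get_ddl is a hypothetical helper you'd write against your own catalog):

```python
from openai import OpenAI

client = OpenAI()

def get_ddl(table: str) -> str:
    # Hypothetical stand-in: in practice, pull this from information_schema or pg_dump.
    return "CREATE TABLE stocks (company text, date date, stock_return numeric);"

def comment_statements(table: str) -> str:
    # Ask the LLM to document purpose and corner cases as schema comments.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Write COMMENT ON TABLE / COMMENT ON COLUMN statements "
                       "documenting the purpose and any corner cases of this table. "
                       f"Return only SQL.\n\n{get_ddl(table)}",
        }],
    )
    return reply.choices[0].message.content

# Review before executing; once applied, the comments ride along whenever the
# schema gets dumped into a prompt as context.
print(comment_statements("stocks"))
```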
kromem t1_je84zam wrote
> Automating this is going to be nuts.
Yes, yes it is.
Cantareus t1_je6vn79 wrote
>shove data into vector db
>embed incoming query
>return topk matches from vector db
Can these steps be done through the openai API?
You only reply using JSON. Write 5 queries that will return useful data to aid you in answering the user's question "What was Apple’s return compared to its sector last month". Return a JSON array with no explanations. Each query must run independently. Use the return format [{"Reason":"Why the query is useful","query":"The sql query"}]. The schema is:
(Imagine a valid schema is here)
GPT-4 Response:
[
{
"Reason": "Retrieve Apple's stock return last month",
"query": "SELECT stock_return FROM stocks WHERE company = 'Apple' AND date >= DATEADD(month, -1, GETDATE()) AND date < GETDATE();"
},
{
"Reason": "Retrieve sector for Apple",
"query": "SELECT sector FROM companies WHERE company_name = 'Apple';"
},
{
"Reason": "Retrieve companies in Apple's sector",
"query": "SELECT company_name FROM companies WHERE sector = (SELECT sector FROM companies WHERE company_name = 'Apple');"
},
{
"Reason": "Calculate average return for Apple's sector last month",
"query": "SELECT AVG(stock_return) as sector_return FROM stocks WHERE company IN (SELECT company_name FROM companies WHERE sector = (SELECT sector FROM companies WHERE company_name = 'Apple')) AND date >= DATEADD(month, -1, GETDATE()) AND date < GETDATE();"
},
{
"Reason": "Compare Apple's return to the average return of its sector last month",
"query": "SELECT (SELECT stock_return FROM stocks WHERE company = 'Apple' AND date >= DATEADD(month, -1, GETDATE()) AND date < GETDATE()) - (SELECT AVG(stock_return) as sector_return FROM stocks WHERE company IN (SELECT company_name FROM companies WHERE sector = (SELECT sector FROM companies WHERE company_name = 'Apple')) AND date >= DATEADD(month, -1, GETDATE()) AND date < GETDATE()) as comparison;"
}
]
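Driving that prompt through the API and parsing the result might look roughly like this (a sketch, assuming the openai Python client):

```python
import json
from openai import OpenAI

client = OpenAI()
schema = "(Imagine a valid schema is here)"
prompt = (
    'You only reply using JSON. Write 5 queries that will return useful data '
    'to aid you in answering the user\'s question "What was Apple\'s return '
    'compared to its sector last month". Return a JSON array with no explanations. '
    'Each query must run independently. Use the return format '
    '[{"Reason":"Why the query is useful","query":"The sql query"}] '
    f"The schema is:\n{schema}"
)

reply = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}]
)
plan = json.loads(reply.choices[0].message.content)  # the array shown above
for step in plan:
    print(step["Reason"], "->", step["query"])  # run each against the DB instead
```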
jaxolingo OP t1_je7c63z wrote
Cool, this is exactly the approach I had in mind for the model to take
gstatton t1_je7mwmv wrote
I’ve done something similar to this: I created a view in Snowflake, dumped the output of 'describe table <table-name>' to CSV, used that CSV descriptor at the head of the prompt, and asked the LLM to generate a SQL query from the user's question using the provided table schema. I then take the model's output, pass it directly to Snowflake, return the result as a data frame, and send that back to the LLM to get a text summary of the query results.
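Stripped down, the loop looks roughly like this (a sketch, assuming snowflake-connector-python and the openai client; credentials and the file name are placeholders):

```python
import snowflake.connector
from openai import OpenAI

client = OpenAI()
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<db>", schema="<schema>",
)

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content

table_desc = open("table_desc.csv").read()  # dump of `describe table <table-name>`
question = "What was Apple's return compared to its sector last month?"

sql = ask(f"Given this table schema:\n{table_desc}\n\n"
          f"Write one Snowflake SQL query that answers: {question}\nReturn only SQL.")
df = conn.cursor().execute(sql).fetch_pandas_all()  # run the model's SQL
print(ask(f"Summarize these query results for '{question}':\n{df.to_string()}"))
```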
You can also use LangChain to set up the chain to do this automatically.
qncapper t1_jebjy9s wrote
Cool, but how can I be confident the model isn't spewing sh*t or making things up on the fly? What it gives out has an impact on my stakeholders.