5 ways to deploy your own large language model

“Right now, we’re converting everything to a vector database,” says Ellie Fields, chief product and engineering officer at Salesloft, a sales engagement platform vendor. “And yes, they’re working.”

And it’s more effective than using simple documents to provide context for LLM queries, she says.

The company primarily uses ChromaDB, an open-source vector store, whose primary use is for LLMs. Another vector database Salesloft uses is Pgvector, a vector similarity search extension for the PostgreSQL database.

“But we’ve also done some research using FAISS and Pinecone,” she says. FAISS, or Facebook AI Similarity Search, is an open-source library provided by Meta that supports similarity searches in multimedia documents.

And Pinecone is a proprietary cloud-based vector database that’s also become popular with developers, and its free tier supports up to 100,000 vectors. Once the relevant information is retrieved from the vector database and embedded into a prompt, the query gets sent to OpenAI running in a private instance on Microsoft Azure.

“We had Azure certified as a new sub-processor on our platform,” says Fields. “We always let customers know when we have a new processor for their information.”

But Salesloft also works with Google and IBM, and is working on a gen AI functionality that uses those platforms as well.

“We’ll definitely work with different providers and different models,” she says. “Things are changing week by week. If you’re not looking at different models, you’re missing the boat.” So RAG allows enterprises to separate their proprietary data from the model itself, making it much easier to swap models in and out as better models are released. In addition, the vector database can be updated, even in real time, without any need to do more fine-tuning or retraining of the model.

“We’ve switched out models, from OpenAI to OpenAI on Azure,” says Fields. “And we’ve switched among different OpenAI models. We may even support different models for different parts of our customer base.”

Sometimes different models have different APIs, she adds. “It’s not trivial,” she says. But switching out a model is still easier than retraining. “We haven’t yet found a use case that’s better served by fine tuning rather than a vector database,” Fields adds. “I believe there are use cases out there, but so far, we haven’t found one that performs better.”

One of the first applications of LLMs that Salesloft rolled out was adding a feature that lets customers generate a sales email to a prospect. “Customers were taking a lot of time to write those emails,” says Fields. “It was hard to start, and there’s a lot of writer’s block.” So now customers can specify the target persona, their value proposition, and the call to action — and they get three different draft emails back they can personalize. Salesloft uses OpenAI’s GPT 3.5 to write the email, says Fields.

Locally run open source models

Boston-based Ikigai Labs offers a platform that allows companies to build custom large graphical models, or AI models designed to work with structured data. But to make the interface easier to use, Ikigai powers its front end with LLMs. For example, the company uses the seven billion parameter version of the Falcon open source LLM, and runs it in its own environment for some of its clients.

To feed information into the LLM, Ikigai uses a vector database, also run locally. It’s built on top of the Boundary Forest algorithm, says co-founder and co-CEO Devavrat Shah.

“At MIT four years ago, some of my students and I experimented with a ton of vector databases,” says Shah, who is also a professor of AI at MIT. “I knew it would be useful, but not this useful.”

Keeping both the model and the vector database local means no data can leak out to third parties, he says. “For clients who are okay with sending queries to others, we use OpenAI,” says Shah. “We are LLM agnostic.”

PricewaterhouseCoopers, which built its own ChatPWC tool, is also LLM agnostic. “ChatPWC makes our associates more capable,” says Bret Greenstein, the firm’s partner and leader of the gen AI go-to-market strategy. For example, it includes pre-built prompts to generate job descriptions. “It has all my formats, templates, and terminology,” he says. “We have an HR, data and prompt experts, and we design something that generates very good job postings. Now nobody needs to know how to do the amazing prompting that generates job descriptions.”

The tool is built on top of Microsoft Azure, but the company also built it for Google Cloud Platform and AWS. “We have to serve our clients, and they exist on every cloud,” Says Greenstein. Similarly, it’s optimized to use different models on the back end, because that’s how clients want it. “We have every model working,” he adds. “Llama 2, Falcon — we have everything.”

The market is changing quickly, of course, and Greenstein suggests enterprises adopt a “no regrets” policy to their AI deployments.

“There’s a lot people can do,” he says, “like building up their data that’s independent of models, and building up the governance.” Then, when the market changes, and a new model comes out, the data and governance structure will still be relevant.

The fine tuning

Management consulting company AArete took open source model GPT 2 and fine tuned it on its own data. “It was lightweight,” says Priya Iragavarapu, the company’s VP of digital technology services. “We wanted an open source one to be able to take it and post it internally in our environment.”

If AArete used a hosted model and connected to it via API, trust issues come up. “We’re concerned where the data from the prompting might end up,” she says. “We don’t want to take those risks.”

When choosing an open source model, she looks at how many times it was previously downloaded, its community support, and its hardware requirements.

“The foundational model should also have some task relevancy,” she says. “There are some models for specific tasks. For example, I recently looked at a Hugging Face model that parses content from PDFs into a structured format.”

Many companies in the financial world and in the health care industry are fine-tuning LLMs based on their own additional data sets.

“The basic LLMs are trained on the whole internet,” she says. With fine tuning, a company can create a model specifically targeted at their business use case.

A common way of doing this is by creating a list of questions and answers and fine tuning a model on those. In fact, OpenAI began allowing fine tuning of its GPT 3.5 model in August, using a Q&A approach, and unrolled a suite of new fine tuning, customization, and RAG options for GPT 4 at its November DevDay.

This is particularly useful for customer service and help desk applications, where a company might already have a data bank of FAQs.

Also in the Dell survey, 21% of companies prefer to retrain existing models, using their own data in their own environment.

“The most popular option seems to be Llama 2,” says Andy Thurai, VP and principal analyst at Constellation Research Inc. Llama 2 comes in three different sizes, and is free for companies with fewer than 700 million monthly users. Companies can fine-tune it on their own data sets and have a new, custom model fairly quickly, he says. In fact, the Hugging Face LLM leaderboard is currently dominated by different fine-tunings and customizations of Llama 2. Before Llama 2, Falcon was the most popular open source LLM, he adds. “It’s an arms race right now.” Fine tuning can create a model that’s more accurate for specific business use cases, he says. “If you’re using a generalized Llama model, the accuracy can be low.”

And there are some advantages to fine-tuning over RAG embedding. With embedding, a company has to do a vector database search for every query. “And you’ve got the implementation of the database,” Thurai says. “That’s not going to be easy, either.”

There are no context window limits on fine tuning, either. With embedding, there’s only so much information that can be added to a prompt. If a company does fine tune, they wouldn’t do it often, just when a significantly improved version of the base AI model is released.

Finally, if a company has a quickly-changing data set, fine tuning can be used in combination with embedding. “You can fine tune it first, then do RAG for the incremental updates,” he says.

Rowan Curran, analyst at Forrester Research, expects to see a lot of fine-tuned, domain-specific models arising over the next year or so, and companies can also distil models to make them more efficient at particular tasks. But only a small minority of companies — 10% or less — will do this, he says.

Software companies building applications such as SaaS apps, might use fine tuning, says PricewaterhouseCoopers’ Greenstein. “If you have a highly repeatable pattern, fine tuning can drive down your costs,” he says, but for enterprise deployments, RAG is more efficient in 90 to 95% of cases.

“We’re actually looking into fine-tuning models for specific verticals,” adds Sebastien Paquet, VP of ML at Coveo, a Canadian enterprise search and recommendations company. “We have some specialized verticals with specialized vocabulary, like the medical vertical. Enterprises selling truck parts have their own way of how the parts are named.”

For now, however, the company is using OpenAI’s GPT 3.5 and GPT 4 running on a private Azure cloud, with the LLM API calls isolated so Coveo can switch to different models if needed. It also uses some open source LLMs from Hugging Face for specific use cases.

Build an LLM from scratch

Few companies are going to build their own LLM from scratch. After all, they are, by definition, quite large. OpenAI’s GPT 3 has 175 billion parameters and was trained on a data set of 45 terabytes and cost $4.6 million to train. And according to OpenAI CEO Sam Altman, GPT 4 cost over $100 million.

That size is what gives LLMs their magic and ability to process human language, with a certain degree of common sense, as well as the ability to follow instructions.

“You can’t just train it on your own data,” says Carm Taglienti, distinguished engineer at Insight. “There’s value that comes from training on tens of millions of parameters.”

Today, nearly all LLMs come from the big hyperscalers or AI-focused startups like OpenAI and Anthropic.

Even companies with extensive experience building their own models are staying away from creating their own LLMs.

Salesloft, for example, has been building their own AI and machine learning models for years, including gen AI models using earlier technologies, but is hesitant about building a brand-new, cutting edge foundation model from scratch.

“It’s a massive computational step that, at least at this stage, I don’t see us embarking on,” says Fields.

About Author

AndyC

Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.

See author's posts