AI products such as OpenAI's ChatGPT, Google Gemini, and Anthropic's Claude are good at answering questions about topics that are in the public domain and generally known, thanks to all of the web and open source data used to train the models. However, if you want these models to answer questions about data that you have, for your own reference or for your employees or customers, you will need custom AI.
There are a few different ways to build custom AI models depending on your use case and the type and amount of data you have. We will discuss the pros and cons of each approach, starting with the simplest.
Utilize long context windows to upload data
If you are looking to analyze or understand one or more documents - say legal contracts or a lab report - most AI products offer the option to upload the documents. You can then ask questions about them and have the AI explain them to you. This approach relies on a feature of Large Language Models (LLMs) called the context window (or context length), which represents the amount of data you can include in your prompt; the uploaded data becomes part of the AI’s working memory. It is measured in tokens, where a token is the basic unit that text or input data is broken down into for ingestion by the AI model; one token is roughly 3/4 of a word. The larger the context window, the more data you can upload. The Llama 4 Scout model has one of the largest context windows, at 10M tokens (~7.5M words) as of this writing. GPT 4.1 and Gemini 1.5 Pro support up to 1M tokens.
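To get a feel for how text maps to tokens, here is a minimal sketch using OpenAI's tiktoken library. The exact tokenizer (and hence the count) varies by model; the encoding name below is just one common choice, not the tokenizer every product uses.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by several OpenAI models; other models
# (Llama, Gemini, Claude) use their own tokenizers, so counts will differ.
enc = tiktoken.get_encoding("cl100k_base")

text = "Large context windows let you upload entire documents into the prompt."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# At roughly 3/4 of a word per token on typical English text, a 1M-token
# context window corresponds to ~750k words.
```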
A key benefit of this approach is that it is easy to set up. It can sometimes offer better performance than RAG (discussed in the next section), particularly when full-document comprehension is needed, though this depends on the specific task and the quality of retrieval.
There are a few critical downsides:
- There are limitations on the amount of data that can be uploaded. While 1M tokens (~750k words) is fairly generous - for example, War and Peace by Leo Tolstoy has ~600k words - internal data sources can often exceed that limit
- It is expensive relative to alternatives, given that AI products often charge based on the number of tokens used, and this approach uses up a large number of tokens
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) was introduced in 2020 by Facebook AI and gained traction in practical applications starting around 2021. The basic idea involves taking your data, chunking it (i.e. breaking it down) into small components that are converted into embeddings (machine-interpretable vectors). These embeddings become part of your retriever. When you prompt your AI, instead of directly querying the LLM, the query goes to the retriever, which pulls the most relevant content from your sources; that retrieved content, formatted with instructions, is then passed to the LLM, which generates the response. To summarize, you introduce a ‘retriever’ stage between your query and a standard AI model, and this retriever holds your custom data.
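Here is a minimal sketch of the retrieval step. For readability it uses scikit-learn's TF-IDF vectors as a stand-in for real embeddings (in practice you would use an embedding model and a vector database), and the chunks, prompt template, and `call_llm` call are illustrative placeholders.

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Chunk your data (here, each string is already one small chunk).
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm ET.",
    "Enterprise plans include a dedicated account manager.",
]

# 2. Convert chunks into vectors (TF-IDF stands in for an embedding model).
vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [chunks[i] for i in ranked]

# 3. Build the prompt: retrieved context + instructions + the user's question.
question = "Can I return a product after three weeks?"
context = "\n".join(retrieve(question))
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# response = call_llm(prompt)  # hypothetical call to a standard LLM
print(prompt)
```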
Relative to uploading your data within the prompt context as described in the previous section, RAG is more cost-efficient: for each query, only the most relevant parts of your data are passed on to the LLM, so it uses far fewer tokens and hence costs less. RAG is also not limited by the size constraints imposed by the context window. As noted earlier, RAG may perform worse than uploading data within the context, likely because only selected parts of the data are sent to the LLM instead of all of it. And while token usage is lower with RAG, infrastructure costs can be higher due to the need to maintain an embedding database and a retrieval pipeline.
Finetuning
Finetuning refers to training the base model on your data. We discussed finetuning in our previous post on the Llama family of AI models. Open source and open-weights models often come with a base LLM and an ‘instruct’ or chatbot version; the latter is generated by finetuning the base model on human-curated conversational data. The same approach works for your custom data as well.
LoRA (Low-Rank Adaptation) is a popular method that keeps the original model weights frozen and trains only small, low-rank adapter matrices added to selected layers. This makes finetuning far more efficient and less resource-intensive.
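A minimal sketch of a LoRA setup using Hugging Face transformers and peft is shown below. The base model (facebook/opt-125m, chosen only because it is small and openly available), the target modules, and the LoRA hyperparameters are illustrative assumptions, not recommendations for your workload.

```python
# pip install transformers peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "facebook/opt-125m"  # small open model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, train as usual (e.g. with transformers' Trainer) on your custom
# data; only the adapter weights are updated, the base weights stay frozen.
```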
Finetuning is typically the preferred way to train the base model for specialized tasks on custom data - for example, building a custom chatbot, a sentiment classifier, or a model for legal or medical queries. It requires additional training configuration and compute, though at much lower cost than training a base model from scratch. It is also faster at inference time than RAG, since the query goes directly to the finetuned model instead of passing through an intermediate retriever.
There are a few caveats:
- Finetuning is static: as your data changes or is augmented with new information, the model won’t automatically update unless you finetune it again. In such cases, RAG is a better option since the retriever can easily be set up to ingest new information.
- The responses of a finetuned model to queries on custom data are also less interpretable than with RAG. RAG lets you see what context was passed to the LLM for a query, so you can review the response against that context.
- Finetuning for task-specific behavior (e.g., classifiers, assistants) often works well. But finetuning for factual retrieval (e.g. "when was the customer invoice for order #37564 generated?") can be brittle unless your data is highly curated and relatively stable.
---
The three approaches described above - large context uploads, RAG, and finetuning - follow distinct paths to incorporating custom data into custom AI models. However, they are not mutually exclusive. One can build a finetuned model that is further enhanced with RAG to allow for data augmentation, and also allow user uploads that take advantage of large context windows. One can also set up a system for each of the above and, based on the query, have the AI system pick between them, optimizing for performance and cost, as in the sketch below.
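As a rough illustration of such routing, here is a minimal sketch. The thresholds, the token estimate, and the `route_query` logic are purely illustrative heuristics, and the returned labels stand in for hypothetical handlers built on the three systems described above.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters (about 3/4 of a word) per token.
    return len(text) // 4

def route_query(query: str, attached_docs: list[str], needs_fresh_data: bool) -> str:
    """Pick an approach for a query, trading off performance and cost."""
    doc_tokens = sum(estimate_tokens(d) for d in attached_docs)

    if attached_docs and doc_tokens < 100_000:
        # Documents fit comfortably in the context window: upload them directly.
        return "long_context"
    if needs_fresh_data or doc_tokens >= 100_000:
        # Too large for the window, or requires up-to-date internal data:
        # retrieve the relevant chunks and pass only those to the LLM.
        return "rag"
    # A stable, specialized task: send the query straight to the finetuned model.
    return "finetuned"

print(route_query("Summarize this contract.",
                  attached_docs=["<contract text>"],
                  needs_fresh_data=False))
```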
AI products such as ChatGPT allow users to build custom GPTs, and when knowledge retrieval is enabled during custom GPT setup, it uses RAG. Meta's Llama API supports finetuning. As AI products get more sophisticated, we expect them to abstract away these distinctions by providing a standard interface to link one’s data sources and specify use cases. The product could then pick the optimal approach based on performance and cost, or allow the user to do so. For now, choosing the right approach or combination of approaches requires an understanding of the different pieces involved - data, use cases, cost, and infrastructure - along with the know-how to set up one or more systems and an experimentation framework to identify the right choice.