The Llama series of models, built and released by Meta, is notable because these are among the most popular open weights models available. This means we can take advantage of the millions of dollars Meta has spent building the base model and fine-tune it for our own purposes at a reasonable cost.
Meta released the first model in the series, Llama 1, in February 2023. Since then it has released Llama 2 and Code Llama through the second half of 2023, Llama 3 and its point releases in 2024, and Llama 4 in April 2025.
Llama 4 was released in two versions, Llama 4 Scout and Llama 4 Maverick, both mixture-of-experts (MoE) models. They are multimodal (they can process images, PDFs, and documents in addition to text prompts), multilingual (they support 12 languages), and have long context windows (10M tokens for Scout, 1M for Maverick). Llama 4 Scout, with 17B active parameters and 16 experts, is the lightweight version designed to run on a single Nvidia H100 GPU. Llama 4 Maverick has 17B active parameters and 128 experts, which enables advanced reasoning and multi-image understanding. Both Scout and Maverick are distilled from Llama 4 Behemoth, a model with 288B active parameters and 16 experts; distillation is the process of transferring knowledge from a large model to a smaller one. According to Meta, the use cases that benefit most from the large context windows of the Llama 4 models include legal document analysis, scientific and academic research summarization, financial analysis, and multilingual document analysis.
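To make distillation concrete, here is a minimal sketch of the classic soft-label distillation loss (Hinton et al., 2015). This illustrates the general technique only; Meta has not published its actual distillation recipe, so the temperature value and setup below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened
    output distributions, scaled by T^2 (Hinton et al., 2015)."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Toy example: a batch of 4 samples over a 32-token vocabulary.
teacher = torch.randn(4, 32)   # stand-in for the large model's logits
student = torch.randn(4, 32)   # stand-in for the small model's logits
print(distillation_loss(student, teacher))
```

The student is trained to match the teacher's full output distribution rather than just hard labels, which is how a smaller model like Scout can inherit capability from a much larger one like Behemoth.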
There are a few key advantages to Llama (and, more generally, to open weights/open source models):
- On-prem setup: If you are concerned about sharing or uploading proprietary information to third-party services such as ChatGPT or Claude, Llama is one solution. You can download and install the model on your own servers, and once set up, it performs all processing locally without communicating with any third-party service. One remarkable example is the installation of Llama on the International Space Station (ISS), where astronauts in low Earth orbit can query Llama without internet access or a connection to computers on Earth.
- Build and deploy custom models: If you want a model customized to your use cases, such as a custom chatbot for your customers or an FAQ assistant built on your documentation, Llama enables you to do so through a process called fine-tuning. Meta's API platform, released this year, lets you fine-tune the model through an API, which is helpful if you lack the compute resources or the coding know-how to do it yourself. Note that this runs the fine-tuning on Meta's servers, so if you are concerned about sharing proprietary information, you may want to fine-tune your models locally instead.
To see if Llama 4 is right for you, start with meta.ai, where you can query the model for free; there are no pricing tiers. When we asked it about limits, it mentioned 150 queries every 24 hours. You can also attach files and generate images.
If you would now like to use Llama models, you can do so in a few different ways. Note that there are a few caveats to each approach.
A. Use locally for querying
If you plan to query the model and get responses without sending any information to a remote server, you can use Ollama, an open source platform designed to run LLMs locally. Supported models include Llama 3.3 as well as others such as Qwen3 and DeepSeek-R1, though as of this writing it does not include the latest Llama 4 models. You also need to be mindful of the storage and inference requirements of each model: for example, Llama 3.2 (1B parameters) is about 2 GB, while Llama 3.3 (70B parameters) is a much larger 43 GB.
Setup is straightforward. We recommend starting with Llama 3.2 since it is smaller. First, download Ollama for your machine here. You can then download the Llama 3.2 model and start chatting with it by running the following command in your terminal:
ollama run llama3.2
At the prompt, type your query and press Enter to see the response. To exit, enter /bye.
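If you would rather query the local model from code than from the terminal, Ollama also exposes a local API with an official Python client. A minimal sketch, assuming the Ollama server is running on your machine and you have already pulled the llama3.2 model as above:

```python
# pip install ollama
# Assumes the Ollama server is running locally (it starts with the
# desktop app, or via `ollama serve`) and that `ollama run llama3.2`
# has already downloaded the model.
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user",
               "content": "In one sentence, what is an open weights model?"}],
)
print(response["message"]["content"])
```

Everything here runs against the local server, so no data leaves your machine.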
You can find the commands for other open source/open weights models supported by Ollama here. If you prefer a UI to the terminal for interacting with the model, you can find options in the community integrations section.
B. Finetune to create custom models
Fine-tuning refers to training the base model on your own data. There are a couple of options for fine-tuning your model.
- Fine-tuning locally: This is the preferable option if you want to avoid uploading your data to third-party services. It is relatively more complex to set up, though there are a fair number of examples and resources to help. Unsloth provides notebooks that you can set up on your own server and run to fine-tune models. Fine-tuning requires you to arrange your dataset in a format the model can consume (see the sketch after this list); you can find the details of the data format specified by Unsloth here.
- Fine-tuning using APIs: In April 2025, Meta released an API with a dashboard that you can use to build custom models on Llama 4; it allows you to fine-tune models without relying on code. Note that this approach involves transferring your data to Meta's servers. If you go with API-based fine-tuning, you also have alternatives to Meta, such as OpenAI, Anthropic, Google Gemini, and Mistral, among others. Depending on your comfort with a UI versus code, you may prefer one over the others.
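As a concrete example of the dataset preparation mentioned in the first option above, here is a minimal sketch of an instruction-tuning dataset in the widely used Alpaca-style instruction/input/output format. The field names, example content, and file layout are illustrative assumptions; check Unsloth's documentation for the exact template your notebook expects.

```python
import json

# Illustrative records in the Alpaca-style format. In practice these
# would come from your own data: support tickets, documentation Q&A, etc.
examples = [
    {
        "instruction": "Answer the customer's question about our return policy.",
        "input": "Can I return an item after opening it?",
        "output": "Yes. Opened items can be returned within 30 days with proof of purchase.",
    },
    {
        "instruction": "Summarize the following passage from our documentation.",
        "input": "To reset your password, click 'Forgot password' on the login page...",
        "output": "The passage explains how to trigger a password reset from the login page.",
    },
]

# Write one JSON object per line (JSONL), a common layout for fine-tuning data.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

The key point is that each record pairs a prompt (instruction plus optional input) with the exact response you want the fine-tuned model to produce.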
In summary, the Llama models are a useful set of open weights models for use cases where you would like to run a model in a closed, on-premise environment without sharing data with third-party services. Querying is the simplest use case to set up, while building custom (fine-tuned) models needs a bit more work.