Running a ChatGPT-Style Chatbot Locally with Llama 2: A Comprehensive Step-by-Step Tutorial

Prasanna Brabourame
5 min read · Aug 18, 2023

Just a few days have passed since the launch of Llama 2, yet within this short span a range of methods for running it locally has already emerged. This blog entry delves into the open-source tools available for bringing Llama 2 to life on your personal devices, without any network dependency.

Here is how to run Llama 2 locally, step by step. Start by cloning the two repositories listed below, one after the other.

git clone git@github.com:facebookresearch/llama.git
git clone git@github.com:ggerganov/llama.cpp.git
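The clone URLs above use SSH; if you don't have SSH keys set up with GitHub, the equivalent HTTPS URLs work just as well:

git clone https://github.com/facebookresearch/llama.git
git clone https://github.com/ggerganov/llama.cpp.git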

The first repository is Meta's official Llama repo.

The second repository, llama.cpp, provides inference of the LLaMA model in pure C/C++.

Next, we need to register our details on Meta's Llama download page in order to get access to the Llama 2 weights.

Once the details are submitted, we will receive an email at the registered email address containing the download URL. This URL will be valid for 24 hours; after that, it will expire. However, you can re-register using the same link to receive a new download URL.
The email from Meta, containing the download URL, looks like the image below.

Email from Meta with the Llama 2 download URL

Next, we need to navigate into the first cloned repository, facebookresearch/llama, and run the download.sh script using the following command.

sh download.sh

When we run the above command in the terminal, the script first prompts us for the download URL from the email. After we paste it in, it asks which models we want to download, as shown below.

Enter the URL from email:
Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all:

The models will download inside the Llama repository, with each model in a separate folder.
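For example, after downloading only the 7B model, the llama repository should look roughly like this (the file names are those shipped by Meta at the time of writing, so treat this as a sketch rather than a guaranteed layout):

ls llama-2-7b
# checklist.chk  consolidated.00.pth  params.json

The tokenizer files (tokenizer.model and tokenizer_checklist.chk) land in the repository root and are needed later by the conversion script.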

Here we are going to use the llama-2-7b model.
Details about Models

Llama 2 is an auto-regressive language model built on an optimised transformer architecture. The tuned (chat) versions employ supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

For the Llama 2 family, the published token counts refer to pre-training data only. All models were trained with a global batch size of 4M tokens. The largest model, the 70B variant, uses Grouped-Query Attention (GQA) to improve inference scalability.
Model dates: Llama 2 was trained between January 2023 and July 2023.

Next, we need to navigate to the second cloned repository, ‘ggerganov/llama.cpp,’ and then run the following command to execute the Makefile.

make

The build output then appears in the terminal.
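Once the build finishes, the main and quantize binaries used in the next steps should sit in the repository root (this assumes the default make targets at the time of writing; newer versions of llama.cpp have since renamed these binaries). A quick check:

ls -l main quantize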

Next, we need to convert the original model weights into a format that llama.cpp can load. Before proceeding with these steps, ensure that Python 3 is already installed on the machine.
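A quick way to check:

python3 --version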

After that, we need to install the Python dependencies for llama.cpp by executing the following command inside the llama.cpp repository.

python3 -m pip install -r requirements.txt

The above command will install the required packages as mentioned in the ‘requirements.txt’ file, with the exact versions specified.

Next, we need to convert the downloaded 7B model to the GGML FP16 format. GGML is the tensor library that llama.cpp is built on, and FP16 stands for “Floating Point 16”: weights are stored as 16-bit floating-point numbers rather than the more common 32-bit representation (FP32). This is also referred to as half-precision.

python3 convert.py --outfile models/7B/ggml-model-f16.bin --outtype f16 ../llama/llama-2-7b 

The conversion output then appears in the terminal.

Next, we need to quantize the model to 4 bits, i.e. reduce the precision of the numerical values stored in the model to 4-bit integers. This makes the model much smaller and more efficient, and therefore better suited to deployment in resource-constrained environments: instead of the conventional 32-bit or 16-bit representation, each weight is stored using only 4 bits of data (plus a small amount of per-block scaling metadata).

./quantize  ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0

The quantization output then appears in the terminal, reporting the original and quantized model sizes:

Original (FP16) model size: 12853.02 MB
Quantized (q4_0) model size: 3647.87 MB
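As a rough sanity check: the 7B model actually has about 6.7 billion parameters, so FP16 storage (2 bytes per parameter) works out to roughly 6.7 × 10⁹ × 2 bytes ≈ 12.5 GiB, while 4-bit storage, a little over half a byte per parameter once the per-block scaling factors are included, comes to roughly 3.6 GiB. That lines up with the sizes reported above.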

Next, we need to execute the quantized model with the given prompt.

./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt

The model then loads and starts an interactive chat session in the terminal.

At the interactive prompt in the terminal we can pose questions to our own Llama 2 model. Llama 2 responds to each query, and thanks to the --color flag our own input is shown in green.
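For reference, here is what each flag in the command above does (a brief sketch based on the options of llama.cpp's main example at the time of writing; flag names and defaults may have changed in later versions):

# -m  : path to the model file to load (our 4-bit quantized 7B model)
# -n 1024 : generate at most 1024 tokens per response
# --repeat_penalty 1.0 : penalty applied to repeated tokens (1.0 effectively disables it)
# --color : colourize the output; our own input is highlighted in green
# -i : run in interactive mode
# -r "User:" : reverse prompt; generation pauses and control returns to us whenever this string appears
# -f : file containing the initial prompt (here, the bundled chat-with-bob example)
./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt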

Note

To save both time and space, you have the option to download the pre-converted and quantized models from TheBloke on Hugging Face.
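As an illustration only (the repository and file names below follow TheBloke's naming conventions at the time of writing and are assumptions that may well have changed, so check the model page on Hugging Face first), a pre-quantized 7B chat model could be pulled straight into the models folder:

wget -P ./models/7B/ https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin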

Moving Forward

Incremental refinement of the process yields quicker results. If you come across a solution that can benefit others, kindly share it in your response.

Another challenge is keeping this information up to date. If you find anything outdated, please mention the old description and what it should now say; that will help me a lot in tracking the changes.

If you know a better way to do things, let me know as well. I want to keep things simple and apply the 80/20 rule: cover what is important, but not every angle.

Cheers Happy Coding…!!!


Prasanna Brabourame

AI Engineer | Researcher | Open Source Enthusiast | Full Stack Dev | Blockchain | DEVOPS | Learner | Blogger | Speaker | Tech Geek