
Low GPU Memory

If your GPU doesn't have enough memory to load the model, you can try the following methods.

Load the model in 8-bit mode

Add --load-in-8bit to the startup parameters.

python server.py --load-in-8bit

This reduces memory usage by roughly half, with no noticeable loss in quality. Note that only newer GPUs support 8-bit mode.
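
Under the hood, this flag roughly corresponds to 8-bit quantization via the bitsandbytes library. Here is a minimal sketch of the equivalent Hugging Face transformers call, assuming transformers, accelerate, and bitsandbytes are installed (the model name is only an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # example model; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # place layers on available devices automatically
    load_in_8bit=True,   # quantize weights to int8 with bitsandbytes
)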

Split the model across multiple GPUs and CPUs

python server.py --auto-devices

If you can load the model with this command, but it runs out of memory when you try to generate text, try limiting the amount of memory allocated to the GPU until the error no longer occurs:

python server.py --auto-devices --gpu-memory 10
python server.py --auto-devices --gpu-memory 9
python server.py --auto-devices --gpu-memory 8
...

The number is the maximum amount of GPU memory to allocate, in GiB.

For finer control, you can also specify the amount in MiB:

python server.py --auto-devices --gpu-memory 8722MiB
python server.py --auto-devices --gpu-memory 4725MiB
python server.py --auto-devices --gpu-memory 3500MiB
...
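
Both forms translate into a per-device memory cap. In Hugging Face transformers/accelerate terms, this is roughly a max_memory mapping; a minimal sketch, assuming those libraries are installed (model name and caps are examples):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",                      # example model
    device_map="auto",                        # let accelerate split layers across devices
    max_memory={0: "8GiB", "cpu": "32GiB"},   # cap GPU 0 at 8 GiB; overflow goes to CPU RAM
)

Layers that don't fit under the GPU cap are kept in CPU RAM and moved to the GPU on demand, which is why a lower cap trades speed for stability.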

In addition, you can pass the --no-cache flag to reduce GPU memory usage while generating text, at a performance cost. This may allow you to set a higher value for --gpu-memory and obtain a net performance gain.
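
For reference, disabling the cache corresponds to the standard use_cache option of transformers' generate(); a minimal, self-contained sketch (the model name and prompt are examples):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # example model; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Hello there,", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=50,
    use_cache=False,  # recompute attention keys/values each step: less VRAM, slower
)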

Cache some layers of the model on disk

As a last resort, you can split the model across your GPU, CPU, and disk:

python server.py --auto-devices --disk
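
In transformers/accelerate terms, disk offloading is expressed with an offload folder: layers that fit neither under the GPU cap nor in CPU RAM are stored on disk and loaded on demand. A minimal sketch, assuming those libraries are installed (folder, caps, and model name are examples):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",                      # example model
    device_map="auto",                        # split across GPU, CPU, and disk as needed
    max_memory={0: "8GiB", "cpu": "16GiB"},   # example caps for GPU 0 and CPU RAM
    offload_folder="offload",                 # weights that fit nowhere else go here
)

Expect generation to be much slower when layers have to be read back from disk.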