DeepSeek-R1-0528, the latest iteration of DeepSeek's R1 reasoning model, requires 715GB of disk space in full precision, making it one of the largest open-source models available. Thanks to Unsloth's dynamic quantization techniques, the model can be shrunk to just 162GB, roughly an 80% reduction. This lets users run the model on far more modest hardware while retaining most of its capability, albeit with some loss in quality.
In this tutorial, we will:
- Set up Ollama and Open Web UI for running the DeepSeek-R1-0528 model locally.
- Download and configure the model's 162GB dynamically quantized version (TQ1_0).
- Run the model using both GPU+CPU and CPU-only configurations.
Step 0: Prerequisites
To run the TQ1_0 quantized variant, your system must meet the following specifications:
- GPU Requirements: A single 24GB GPU (e.g., NVIDIA RTX 4090 or A6000) plus 128GB of system RAM. With this configuration, you can expect a generation speed of roughly 5 tokens per second.
- RAM Requirements: At least 64GB of RAM to run the model without a GPU, although throughput will be limited to about 1 token per second.
- Optimal Setup: For the best performance (over 5 tokens per second), you need at least 180GB of unified memory or a combined 180GB of RAM + VRAM.
- Storage: At least 200GB of free disk space for the model and its dependencies.
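Before downloading anything, it is worth confirming that your machine actually meets these numbers. A quick sanity check using standard Linux utilities (this assumes an NVIDIA GPU with drivers already installed):

```bash
# Report GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

# Report total and available system RAM
free -h

# Report free disk space on the current filesystem
df -h .
```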
Step 1: Install Dependencies and Ollama
Update your system and install the required tools. Ollama is a lightweight server for running large language models locally. Use the following commands to install it on an Ubuntu distribution:
```bash
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
```
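If the install script completes successfully, the `ollama` binary should be on your PATH; a quick check:

```bash
ollama --version
```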
Step 2: Download and Run the Model
Use the following commands to start the Ollama server and run the TQ1_0 dynamic quant of the DeepSeek-R1-0528 model:
```bash
ollama serve &
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0
```
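The first run downloads roughly 162GB of weights, so expect a long wait. Once the model is loaded, you can also query it through Ollama's REST API, which listens on port 11434 by default (the prompt below is just a placeholder):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```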
Step 3: Set Up and Run Open Web UI
Pull the Open Web UI Docker image with CUDA support. Launch the container with GPU acceleration and Ollama integration enabled.
This command will:
- Start the Open Web UI server on container port 8080, mapped to port 9783 on the host
- Enable GPU acceleration with the `--gpus all` flag
- Mount a persistent data volume (`-v open-webui:/app/backend/data`)

```bash
docker pull ghcr.io/open-webui/open-webui:cuda
docker run -d -p 9783:8080 --gpus all -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:cuda
```
After the container is up and running, access the Open Web UI interface in your browser at http://localhost:9783/ (the host port mapped to the container's port 8080).
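If Open Web UI cannot see the Ollama model, the container may be unable to reach the Ollama server running on the host. A common fix, following the Open Web UI documentation, is to recreate the container with the host gateway mapped and `OLLAMA_BASE_URL` pointed at it:

```bash
# Remove the old container, then relaunch with host networking hints
docker rm -f open-webui
docker run -d -p 9783:8080 --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data --name open-webui \
  ghcr.io/open-webui/open-webui:cuda
```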
Step 4: Run DeepSeek-R1-0528 in Open Web UI
Select the `hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0` model from the model menu.
If the Ollama server does not properly use the GPU, you can switch to CPU execution. Although this drastically reduces performance (around 1 token per second), it ensures that the model remains operational.
```bash
# Kill any existing Ollama processes
pkill ollama

# Check which processes are still holding GPU memory
sudo fuser -v /dev/nvidia*

# Restart Ollama in CPU-only mode by hiding the GPU from it
CUDA_VISIBLE_DEVICES="" ollama serve &
```
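To confirm which processor the model actually landed on, `ollama ps` lists the loaded models along with whether they are running on CPU or GPU:

```bash
ollama ps
```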