Run the Complete DeepSeek-R1-0528 Model Locally

2025-06-10

DeepSeek-R1-0528, the latest iteration of DeepSeek's R1 reasoning model, requires roughly 715GB of disk space in its original form, making it one of the largest open-source models currently available. Thanks to Unsloth's dynamic quantization techniques, however, the model can be shrunk to about 162GB, a reduction of almost 80%. This lets you run the model on significantly more modest hardware while retaining most of its capability, albeit with some trade-off in speed and output quality.

In this tutorial, we will:

  1. Set up Ollama and Open Web UI for running the DeepSeek-R1-0528 model locally.
  2. Download and configure the model’s 1.66-bit dynamically quantized version (TQ1_0).
  3. Run the model using both GPU+CPU and CPU-only configurations.

Step 0: Prerequisites

To run the TQ1_0 quantized variant, your system must meet the following specifications:

GPU Requirements: A minimum of one 24GB GPU (e.g., NVIDIA RTX 4090 or A6000) and 128GB of RAM. With this configuration, you can expect a generation speed of approximately 5 tokens per second.

RAM Requirements: At least 64GB of RAM is necessary to run the model without a GPU, although the performance will be limited to 1 token per second.

Optimal Setup: For the best performance (over 5 tokens per second), you’ll need at least 180GB of unified memory or a combination of 180GB RAM + VRAM.

Storage: Ensure that you have at least 200GB of free disk space available for the model and its dependencies.
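
Before downloading anything, it is worth verifying that your machine actually meets these numbers. The following standard Linux commands print the relevant figures (nvidia-smi is only available if NVIDIA drivers are installed):

# GPU model and total VRAM (requires NVIDIA drivers)
nvidia-smi --query-gpu=name,memory.total --format=csv

# Total and available system RAM
free -h

# Free disk space on the current filesystem
df -h .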

Step 1: Install Dependencies and Ollama

Update your system and install the required tools. Ollama acts as a lightweight server designed for running large language models locally; pciutils provides lspci, which Ollama's installer uses to detect your GPU. Use the following commands to install both on an Ubuntu distribution:

# Refresh package lists and install pciutils (provides lspci for GPU detection)
apt-get update
apt-get install pciutils -y

# Download and run the official Ollama install script
curl -fsSL https://ollama.com/install.sh | sh
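
To confirm the installation succeeded, check that the CLI responds (the exact version string will vary):

# Print the installed Ollama version
ollama --version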

Step 2: Download and Run the Model

Use the following commands to start the Ollama server and run the TQ1_0 quantized version of DeepSeek-R1-0528 directly from Unsloth's Hugging Face repository:

# Start the Ollama server in the background
ollama serve &

# Download the weights (on first run) and launch the quantized model
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0
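
Note that the first run downloads roughly 162GB of weights, so it can take a while. Once the model is loaded, you can also query it programmatically through Ollama's REST API, which listens on port 11434 by default; the prompt below is just a placeholder:

# Send a single non-streaming generation request to the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0",
  "prompt": "Why is the sky blue?",
  "stream": false
}'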

Step 3: Set Up and Run Open Web UI

Pull the Open Web UI Docker image with CUDA support, then launch the container with GPU acceleration enabled and connected to the Ollama server running on the host.

These commands will:

  • Start the Open Web UI server on port 8080
  • Enable GPU acceleration with the --gpus all flag
  • Persist Open Web UI data in a named volume (-v open-webui:/app/backend/data)
  • Connect the UI to the host's Ollama server via the OLLAMA_BASE_URL environment variable

docker pull ghcr.io/open-webui/open-webui:cuda
docker run -d -p 8080:8080 --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:cuda

After the container is up and running, access the Open Web UI interface in your browser at http://localhost:8080/.
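
If the page does not load, check the container's status and logs with standard Docker commands:

# Confirm the container is running
docker ps --filter name=open-webui

# Follow the container logs for errors
docker logs -f open-webui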

Step 4: Run DeepSeek-R1-0528 in Open Web UI

Select the hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0 model from the model menu.

If the Ollama server does not properly use the GPU, you can switch to CPU-only execution. Although this drastically reduces performance (to roughly 1 token per second), it keeps the model operational:

# Stop any running Ollama processes
pkill ollama

# List processes holding the GPU devices (kill them if needed to free VRAM)
sudo fuser -v /dev/nvidia*

# Restart Ollama with GPUs hidden to force CPU-only execution
CUDA_VISIBLE_DEVICES="" ollama serve &
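
After restarting, load the model again and verify that it is actually running on the CPU; ollama ps reports where each loaded model's weights reside:

# Reload the model (in one terminal)
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0

# In another terminal: the PROCESSOR column should read "100% CPU"
ollama ps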