Dual-GPU Ollama on Nobara: Warm 7b, Cold 14b

5060 Ti plus 4060 — pin the fast model, unload before the heavy one, and keep cloud tokens out of the loop.

  • ollama
  • nobara
  • gpu
  • local-llm

Running two GPUs for local AI work is like having two chefs in the kitchen: one who is quick and always ready (the RTX 4060, which keeps a 7b model warm) and one who handles more complex recipes but needs time to get going (the RTX 5060 Ti, which loads a 14b model on demand). Splitting the work this way lets me balance responsiveness against capability and keep everyday tasks on my own hardware instead of spending cloud tokens. For solo developers or small studio operators who want capable AI tools running locally, it is a practical middle ground.

Warm 7b on RTX 4060: Quick and Nimble

The RTX 4060 handles the tasks where speed matters most. I run it through the Ollama alias qwen2.5:7b-vram, tuned to run smoothly with a 32K context on this GPU. This covers most of my day-to-day AI needs, from quick text generation to summarizing content. The point is to keep this lane lightweight and fast: there is no need for heavy lifting when a smaller model does the job.
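One way to keep the 7b alias warm is to send Ollama a generate request with no prompt and a negative keep_alive, which loads the model and leaves it resident until it is explicitly unloaded. The sketch below is a minimal illustration, not my exact script: it assumes Ollama is listening on its default local address, and the helper names warm_payload and post_json are made up for this example. It does not decide which GPU the model lands on; that placement is handled outside this code.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"


def warm_payload(model: str) -> dict:
    """Build a request body that loads `model` and keeps it resident."""
    return {
        "model": model,
        "keep_alive": -1,  # negative value: stay loaded until explicitly unloaded
        "stream": False,   # ask for a single JSON reply instead of a stream
    }


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the local Ollama server and decode the reply."""
    request = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling post_json("/api/generate", warm_payload("qwen2.5:7b-vram")) once at the start of a session should be enough to pin the fast model, since a generate request without a prompt only loads the model.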

Cold 14b on RTX 5060 Ti: Powerhouse in Reserve

When I need more power, the RTX 5060 Ti comes into play with qwen2.5:14b or R1 models loaded directly onto it. This GPU is reserved for tasks that demand deeper context and larger models, such as complex document generation or intricate data analysis. Switching between the two takes a bit of orchestration, though: the 7b model on the RTX 4060 has to be unloaded before the 14b model is loaded onto the RTX 5060 Ti.

Unloading Between Phases: Efficient Workflow

To stay within VRAM limits and keep transitions smooth, I follow a simple workflow. Before switching from the 7b model to the 14b model, I unload the 7b model from the RTX 4060, which leaves enough room for the larger model to load onto the RTX 5060 Ti without issues. Going the other way, I unload the 14b model first and then return to the 7b model on the RTX 4060 for lighter tasks.
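The unload-then-load order can be sketched against Ollama's HTTP API, where a generate request with keep_alive set to 0 asks the server to drop a model, and a request with no prompt loads one. This is a hedged sketch under a few assumptions: the default local address, a hypothetical switch_model helper, and an illustrative 10-minute keep_alive for the heavy model that you would tune to your own sessions.

```python
import json
import urllib.request
from typing import Callable

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"


def unload_payload(model: str) -> dict:
    """keep_alive of 0 asks Ollama to drop the model right after this request."""
    return {"model": model, "keep_alive": 0, "stream": False}


def load_payload(model: str, keep_alive: str = "10m") -> dict:
    """No prompt: the request only loads the model and sets its keep_alive.

    The "10m" default is an illustrative value, not a recommendation.
    """
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the local Ollama server and decode the reply."""
    request = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read().decode("utf-8"))


def switch_model(
    unload: str,
    load: str,
    send: Callable[[str, dict], dict] = post_json,
) -> None:
    """Unload one model first, then load the other, in that order."""
    send("/api/generate", unload_payload(unload))
    send("/api/generate", load_payload(load))
```

Going to the heavy phase would then be switch_model("qwen2.5:7b-vram", "qwen2.5:14b"), and the reverse call brings the light model back. Passing the sender as a parameter keeps the ordering logic easy to dry-run without a live server.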

Context-Stack Lanes: A/B/C for Balanced Efficiency

I’ve set up three lanes in my context-stack workflow to manage different types of AI tasks efficiently:

  • Lane A: Handles basic grep/ctx operations without invoking large language models.
  • Lane B: Uses the Ollama 7b model (warm) on the RTX 4060 for quick, efficient processing.
  • Lane C: Offloads heavier tasks to a queue or an MSI system when needed.

This setup ensures that I can handle a wide range of AI tasks without overloading my GPUs. It’s particularly useful during full documentation passes across multiple workspace domains, where efficiency and resource management are crucial.
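As a rough illustration of the lane split, a router can be as small as one function. Everything in this sketch is hypothetical: the Task fields needs_llm and heavy are stand-ins for whatever signals a real workflow uses, and the rules are a guess at the simplest policy consistent with the three lanes described above.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    A = "grep/ctx, no LLM"
    B = "warm 7b on the RTX 4060"
    C = "queue or MSI system"


@dataclass
class Task:
    name: str
    needs_llm: bool      # hypothetical flag: does the task need a model at all?
    heavy: bool = False  # hypothetical flag: is it too big for the warm 7b?


def pick_lane(task: Task) -> Lane:
    """Send each task to the cheapest lane that can handle it."""
    if not task.needs_llm:
        return Lane.A  # plain text search, no model invoked
    if task.heavy:
        return Lane.C  # defer to the queue instead of blocking the warm model
    return Lane.B      # default: quick pass through the warm 7b
```

Checking the cheapest lane first means the GPUs are only touched when a task actually needs them.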

Next Steps: Optimizing VRAM Usage

While this dual-GPU setup works well for most tasks, there is room for improvement. One area I want to optimize is VRAM usage for more demanding models such as LLaDA diffusion (Lane D, a fourth lane beyond the three above). At the moment it is constrained by VRAM whenever the 14b model is loaded on the RTX 5060 Ti. Managing VRAM more carefully, and possibly offloading some tasks to cloud resources, could unlock more from my local AI stack.
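A first step toward managing VRAM is simply measuring it before loading anything large. The sketch below shells out to nvidia-smi with its CSV query flags and reports free memory per GPU. The helper names and the idea of checking a required size in MiB are assumptions for illustration; the actual footprint of any given model has to be measured on your own hardware.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU, memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Turn nvidia-smi CSV output into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append(
            {
                "index": int(index),
                "name": name,
                "used_mib": int(used),
                "free_mib": int(total) - int(used),
            }
        )
    return gpus


def read_gpu_memory() -> list[dict]:
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def fits(gpus: list[dict], index: int, needed_mib: int) -> bool:
    """True if the GPU with this index has at least `needed_mib` free."""
    return any(g["index"] == index and g["free_mib"] >= needed_mib for g in gpus)
```

A check like fits(read_gpu_memory(), index, needed_mib) before starting a Lane D job would make the VRAM constraint explicit instead of surfacing as a failed load.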

By balancing speed with power through a dual-GPU setup, I’m able to maintain an efficient workflow that supports both quick tasks and complex projects. This approach is particularly valuable for solo developers and small studio operators who need flexibility without sacrificing performance.