Bridging the Gap: Achieving 10x Token Savings by Integrating Cloud and Local LLM Code Flows
This proof-of-concept pairs local LLMs (specifically Qwen models) with structured workflow management to drastically reduce token consumption during complex code-development tasks, minimizing reliance on expensive cloud APIs like Claude.
The challenge of deploying Large Language Models (LLMs) locally, particularly for complex generative tasks like code writing, often comes down to efficient context management. Traditional usage patterns, especially agentic ones, can quickly flood the context window and degrade the model's effectiveness. To address this, a new workflow strategically manages context and task decomposition, yielding significant operational cost savings.
Workflow Architecture: Structured Decomposition and Local Execution
The core innovation lies in a hybrid architecture that partitions the development process. Instead of relying on a single powerful cloud model for the entire task, the work is broken down into small, self-contained tasks described in a structured format: TOML (Tom's Obvious, Minimal Language).
Cloud-to-Local Task Orchestration
The initial phase uses a cloud model to define the overall feature set and decompose it into distinct, granular tasks, which are then serialized into the TOML file. This cloud interaction is deliberately minimal, serving only to define the scope. From there, a custom Python script takes over, reading the TOML file and dispatching each task to a local LLM setup.
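The article doesn't publish the exact task schema, but a plan file along these lines would fit the described flow (all field names here are illustrative, not taken from the project):

```toml
# tasks.toml: hypothetical schema; field names are illustrative
[[task]]
id = "add-rate-limiter"
description = "Implement a token-bucket rate limiter in src/ratelimit.py"
test_command = "pytest tests/test_ratelimit.py -q"

[[task]]
id = "wire-middleware"
description = "Register the rate limiter as HTTP middleware in src/app.py"
test_command = "pytest tests/test_app.py -q"
```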
Leveraging Qwen Code CLI for Heavy Lifting
The heavy lifting (the actual code writing, file navigation, and implementation) is handled by the Qwen Code CLI, identified as the best-fitting harness for the Qwen3.6 MoE model. This setup lets the local hardware perform the computationally intensive generation, which is the primary driver of the reported 10x token savings over a purely cloud-based workflow.
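As a rough sketch, the orchestrator could shell out to the CLI once per task. This assumes the `qwen` binary is on PATH and accepts a non-interactive prompt flag (here `-p`, an assumption rather than a documented detail of this project):

```python
import subprocess

def run_qwen_task(description: str, repo: str) -> str:
    """Hand one task to the local coding agent.

    Assumes the `qwen` binary accepts a non-interactive prompt via -p;
    the flag name is an assumption, not taken from the project.
    """
    result = subprocess.run(
        ["qwen", "-p", description],
        cwd=repo,              # run inside the target repo so the agent can edit files
        capture_output=True,
        text=True,
        timeout=1800,          # generous budget for multi-file edits
    )
    return result.stdout
```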
Built-in Quality Assurance and Iteration
A critical component of the system is the integrated unit testing phase. After the local LLM generates code, the Python orchestrator executes unit tests defined within the TOML configuration. This introduces a self-correction loop:
- Success: If the code passes the tests, the system proceeds to the next task defined in the TOML structure.
- Failure: If a test fails, the associated stack trace and relevant context are fed back into the Qwen Code CLI, allowing the local model to self-correct and iteratively refine the code until the test passes.
This iterative refinement cycle is crucial for ensuring code quality while maintaining efficiency.
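A minimal sketch of that loop, assuming the hypothetical TOML schema and `run_qwen_task` helper from the earlier sketches, plus Python 3.11+ for `tomllib`:

```python
import subprocess
import tomllib

MAX_RETRIES = 3  # illustrative cap; the project's actual retry policy isn't specified

def run_tests(test_command: str, repo: str) -> subprocess.CompletedProcess:
    """Run one task's unit tests inside the target repo."""
    return subprocess.run(test_command.split(), cwd=repo, capture_output=True, text=True)

def execute_plan(plan_path: str, repo: str) -> None:
    with open(plan_path, "rb") as f:
        plan = tomllib.load(f)
    for task in plan["task"]:
        prompt = task["description"]
        for _ in range(MAX_RETRIES):
            run_qwen_task(prompt, repo)            # local model writes/edits the code
            result = run_tests(task["test_command"], repo)
            if result.returncode == 0:
                break                              # success: move on to the next task
            # failure: feed the stack trace back so the model can self-correct
            prompt = (
                f"{task['description']}\n\nThe unit tests failed with:\n"
                f"{result.stdout}\n{result.stderr}\nFix the implementation."
            )
        else:
            raise RuntimeError(f"task {task['id']} still failing after {MAX_RETRIES} attempts")
```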
Technical Implementation Details and Performance
The proof-of-concept (PoC) was validated using a Q4_K_M quantized build of the Qwen3.6 35B model, running in 28 GB of VRAM split across two GPUs and tested with a maximum context window of 48k tokens, demonstrating that high-context local models are workable on modest hardware. Output quality from the local Qwen3.6 model was validated against Claude's results on a real codebase.
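The article doesn't name the inference backend. Assuming a llama.cpp `llama-server` deployment, a launch matching the reported footprint might look like this (the model filename and split ratio are placeholders):

```bash
# Hypothetical llama.cpp launch fitting the reported setup:
#   Q4_K_M quant (placeholder filename), 48k-token context (-c 49152),
#   all layers offloaded to GPU (-ngl 99), weights split across two GPUs.
llama-server -m qwen3.6-35b-Q4_K_M.gguf -c 49152 -ngl 99 --tensor-split 0.5,0.5
```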
This project is open-sourced, providing a framework for developers interested in combining the strengths of cloud-based planning with the efficiency and privacy of local AI execution.