Hybrid AI Coding Agent Architecture

Designing a Hybrid Local/Cloud Coding Agent for Scalable Development Workflows

This article reviews a proposed architecture for a hybrid AI coding workflow that supports a small team of developers. The system uses large cloud-hosted models for reasoning, keeps code execution local, and reduces operating costs by minimizing cloud calls. The core question is whether a high-context, multi-user execution environment can run on consumer-grade hardware.

Architectural Overview: The Hybrid Coding Pipeline

The proposed system is a multi-stage pipeline that decouples high-level planning and architectural design from implementation and code execution. This separation is what allows performance and cost to be optimized independently: planning goes to a large cloud model only when needed, while execution stays on local hardware.

Workflow Breakdown

The workflow is built around a central router. A developer submits a request, which passes through a custom local router. The router conditionally calls a cloud planner (e.g., Codex or Claude CLI) to generate a structured execution plan. The plan is then handed to a dedicated local executor model (such as Qwen 27B in FP8 precision, served by vLLM), which implements the code changes directly in the repository.
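The routing step described above can be sketched in a few lines. This is a minimal illustration, not the author's router: the `Task` fields, the `files_touched` heuristic, and the `plan_in_cloud` / `execute_locally` callables are all hypothetical stand-ins for whatever the real system uses to call the cloud planner and the local executor.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    description: str
    files_touched: int  # hypothetical complexity signal; the source names none


def needs_cloud_plan(task: Task, max_local_files: int = 2) -> bool:
    """Assumed heuristic: only multi-file work is escalated to the cloud planner."""
    return task.files_touched > max_local_files


def route(
    task: Task,
    plan_in_cloud: Callable[[str], str],
    execute_locally: Callable[[str, Optional[str]], str],
    max_local_files: int = 2,
) -> str:
    """Conditionally request a plan, then always execute on the local model."""
    plan: Optional[str] = None
    if needs_cloud_plan(task, max_local_files):
        plan = plan_in_cloud(task.description)
    # The executor receives the plan when there is one, otherwise None.
    return execute_locally(task.description, plan)
```

Passing the planner and executor in as callables keeps the router testable without network access or a GPU; a real deployment would replace them with a CLI or HTTP call to the planner and a request to the local vLLM server.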

Crucially, the design requires that the full repository is never sent to the cloud. The cloud planner receives only a reduced context: the repository tree, selected files or chunks, and the initial task description. Because execution stays local, the executor model retains full access to the codebase and the necessary tooling.
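A rough sketch of how such a reduced planner payload might be assembled is shown below. The function names, the skipped directory names, and the per-file character cap are assumptions made for illustration; the source does not say how files or chunks are selected or truncated.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, List


def repo_tree(
    root: Path,
    skip: Iterable[str] = (".git", "node_modules", "__pycache__"),
) -> List[str]:
    """Return relative file paths only (no contents), skipping assumed noise dirs."""
    skip_set = set(skip)
    paths: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_set)
        for name in sorted(filenames):
            paths.append((Path(dirpath) / name).relative_to(root).as_posix())
    return paths


def build_planner_payload(
    root: Path,
    task: str,
    selected: Iterable[str],
    max_chars_per_file: int = 4000,  # hypothetical cap, not from the source
) -> Dict[str, object]:
    """Bundle the task, the tree, and truncated contents of selected files only."""
    chunks: Dict[str, str] = {}
    for rel in selected:
        text = (root / rel).read_text(encoding="utf-8", errors="replace")
        chunks[rel] = text[:max_chars_per_file]
    return {"task": task, "tree": repo_tree(root), "chunks": chunks}
```

Only the paths in `selected` have their contents included; every other file appears in the payload as a path in the tree and nothing more.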

Technical Constraints and Feasibility Assessment

The Hardware Bottleneck: 2x RTX 3090

The central technical question is whether a dual-GPU setup (2x RTX 3090, 24GB VRAM each) can support the specified load. The requirements are: concurrent tasks for approximately five developers, a 64k context window, vLLM as the serving engine, a medium-sized executor model (Qwen 27B in FP8 or 4-bit quantization), and heavy use of Retrieval-Augmented Generation (RAG).
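To make those requirements concrete, the sketch below assembles a `vllm serve` command line that mirrors them. The model path is a placeholder, the 0.90 memory fraction is an arbitrary starting value, and none of this is taken from the original post. The flags themselves are standard vLLM engine arguments, but whether `--kv-cache-dtype fp8` is accepted on a particular GPU generation and vLLM release should be checked against the installed version.

```python
import shlex
from typing import List


def vllm_serve_command(
    model: str,
    gpus: int = 2,                  # 2x RTX 3090, split with tensor parallelism
    context_tokens: int = 65536,    # the 64k context window
    max_sessions: int = 5,          # roughly five concurrent developers
    gpu_memory_utilization: float = 0.90,  # assumed value, tune per machine
) -> List[str]:
    """Build an argv list for `vllm serve` that reflects the stated requirements."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(gpus),
        "--max-model-len", str(context_tokens),
        "--kv-cache-dtype", "fp8",
        "--max-num-seqs", str(max_sessions),
        "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
    ]


if __name__ == "__main__":
    # "path/to/qwen-27b-checkpoint" is a placeholder, not a real model id.
    print(shlex.join(vllm_serve_command("path/to/qwen-27b-checkpoint")))
```

Capping `--max-num-seqs` at the team size is one way to bound how many sequences compete for KV cache at once; it does not by itself guarantee that five full-length contexts fit.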

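Feasibility hinges on whether FP8 KV caching and the planner/executor split shrink the memory footprint enough to serve five concurrent, long-context, retrieval-heavy sessions from 48GB of total VRAM. One caveat: the RTX 3090 (Ampere) has no native FP8 tensor-core arithmetic, so on this hardware FP8 is primarily a memory saving rather than a compute speedup.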
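A back-of-the-envelope estimate helps frame that question. The sketch below uses the standard per-token KV cache size for grouped-query attention (two tensors, keys and values, per layer). The layer count, KV head count, and head dimension are illustrative assumptions only; they are not confirmed specifications of the Qwen 27B model named in the post and should be replaced with values from the actual model config.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Keys and values (factor 2) stored for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens: int, sessions: int, per_token_bytes: int) -> float:
    """KV cache if every session fills its whole context window."""
    return tokens * sessions * per_token_bytes / GIB


def weight_gib(params_billion: float, bits_per_weight: int) -> float:
    """Weights only; ignores activations, CUDA context, and quantization overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    # ASSUMED dimensions for illustration: 64 layers, 8 KV heads, head_dim 128.
    per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128, bytes_per_value=1)
    print(per_token)                                  # 131072 bytes (128 KiB) with FP8 KV
    print(kv_cache_gib(65536, 1, per_token))          # 8.0 GiB for one full 64k session
    print(kv_cache_gib(65536, 5, per_token))          # 40.0 GiB for five full sessions
    print(round(weight_gib(27, 8), 2))                # 25.15 GiB of FP8 weights
    print(round(weight_gib(27, 4), 2))                # 12.57 GiB of 4-bit weights
```

Under these assumed dimensions, five sessions that each fill a 64k window would need about 40 GiB of FP8 KV cache, which exceeds the 48GB budget once weights are added, even at 4-bit. That is a worst case: vLLM allocates KV cache in blocks as sequences grow, so the setup may still be workable if sessions rarely approach the full window at the same time.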

Key Architectural Design Goals

Cost Optimization: Avoiding massive, always-on cloud infrastructure costs.
Reasoning Quality: Leveraging the superior reasoning capabilities of large cloud models for planning.
Data Locality: Keeping the full codebase and execution logic strictly on local hardware.
VRAM Management: Avoiding the VRAM demands of dumping the full repository into a long context window.

Open Questions and Recommendations

The inquiry seeks expert input on several critical technical decisions to validate or

Techyon - AI News Aggregator

Building a Hybrid Local/Cloud Coding Agent for 5 Devs — Are 2x RTX 3090 Enough for 64k Context?
