Open Dungeon: Achieving Full 256K Context Local Roleplay via Gemma 4 QAT and FLUX Integration

A new local implementation called "Open Dungeon" enables private, high-fidelity roleplay by combining a quantized Gemma 4 model for narration and FLUX for inline image generation, maintaining a minimal memory footprint even at maximum context windows.

Architectural Overview

Open Dungeon is designed as a fully local, privacy-centric alternative to cloud-based AI roleplay platforms. The system architecture eliminates the need for external API keys or cloud dependencies, ensuring that all data processing remains on the user's hardware. The project integrates two primary AI components: a Large Language Model (LLM) for narrative generation and a diffusion model for visual representation.

Narrative Engine: Gemma 4 QAT

The core narrator is powered by Gemma 4, utilizing a QAT (Quantization-Aware Training) Q4 quantization. This specific optimization allows the 12B parameter model to operate efficiently without significant degradation in coherence or creative quality. The model is deployed via the Ollama framework, facilitating streamlined local inference.

Visual Integration: Uncen-FLUX

To enhance the immersive experience, the system incorporates FLUX for inline image generation. When the narrative reaches a scene deemed visually significant, the system triggers a local image generation request, producing visuals that align with the current story state without sending data to external servers.

Memory Optimization and Context Window

One of the most significant technical achievements of this implementation is the efficient management of the Key-Value (KV) cache. Despite utilizing the model's full 256K context window, the system maintains a remarkably low memory footprint, consuming approximately 7.7GB of RAM.

This efficiency is attributed to the architectural characteristics of Gemma 4, which exhibits minimal KV cache growth as the context expands. This allows the narrator to maintain a comprehensive "memory" of the entire story progression without the typical exponential increase in VRAM/RAM usage associated with long-context windows in other LLMs.

Technical Specifications Summary

  • LLM: Gemma 4 (12B)
  • Quantization: QAT Q4
  • Inference Engine: Ollama
  • Image Generation: FLUX (Uncen)
  • Maximum Context: 256,000 tokens
  • Memory Usage: ~7.7GB RAM at full context

Note: Detailed implementation code and specific system requirements beyond the RAM usage were not provided in the source material.

Original Source
Gemma 4 Local LLM Quantization-Aware Training (QAT) FLUX KV Cache Optimization Long Context Window