Open Dungeon: Achieving Full 256K Context Local Roleplay via Gemma 4 QAT and FLUX Integration

A new local implementation called "Open Dungeon" enables private, high-fidelity roleplay by combining a quantized Gemma 4 model for narration and FLUX for inline image generation, maintaining a minimal memory footprint even at maximum context windows.

Architectural Overview

Open Dungeon is designed as a fully local, privacy-centric alternative to cloud-based AI roleplay platforms. The system architecture eliminates the need for external API keys or cloud dependencies, ensuring that all data processing remains on the user's hardware. The project integrates two primary AI components: a Large Language Model (LLM) for narrative generation and a diffusion model for visual representation.

Narrative Engine: Gemma 4 QAT

The core narrator is powered by Gemma 4, utilizing a QAT (Quantization-Aware Training) Q4 quantization. This specific optimization allows the 12B parameter model to operate efficiently without significant degradation in coherence or creative quality. The model is deployed via the Ollama framework, facilitating streamlined local inference.

Visual Integration: Uncen-FLUX

To enhance the immersive experience, the system incorporates FLUX for inline image generation. When the narrative reaches a scene deemed visually significant, the system triggers a local image generation request, producing visuals that align with the current story state without sending data to external servers.

Memory Optimization and Context Window

One of the most significant technical achievements of this implementation is the efficient management of the Key-Value (KV) cache. Despite utilizing the model's full 256K context window, the system maintains a remarkably low memory footprint, consuming approximately 7.7GB of RAM.

This efficiency is attributed to the architectural characteristics of Gemma 4, which exhibits minimal KV cache growth as the context expands. This allows the narrator to maintain a comprehensive "memory" of the entire story progression without the typical exponential increase in VRAM/RAM usage associated with long-context windows in other LLMs.

Technical Specifications Summary

LLM: Gemma 4 (12B)
Quantization: QAT Q4
Inference Engine: Ollama
Image Generation: FLUX (Uncen)
Maximum Context: 256,000 tokens
Memory Usage: ~7.7GB RAM at full context

Note: Detailed implementation code and specific system requirements beyond the RAM usage were not provided in the source material.

Original Source

Gemma 4 Local LLM Quantization-Aware Training (QAT) FLUX KV Cache Optimization Long Context Window

Techyon

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS)

Open Dungeon: Achieving Full 256K Context Local Roleplay via Gemma 4 QAT and FLUX Integration

Architectural Overview

Narrative Engine: Gemma 4 QAT

Visual Integration: Uncen-FLUX

Memory Optimization and Context Window

Technical Specifications Summary

Open Dungeon: local roleplay with Gemma 4 QAT + inline Uncen-FLUX images, running at full 256K context under 8GB RAM (OS)

Open Dungeon: Achieving Full 256K Context Local Roleplay via Gemma 4 QAT and FLUX Integration

Architectural Overview

Narrative Engine: Gemma 4 QAT

Visual Integration: Uncen-FLUX

Memory Optimization and Context Window

Technical Specifications Summary

Related Articles

Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split

DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

langchain-ai /langchain

browser-use /browser-use

Ukraine's one-time test used fully autonomous drones to kill Russian soldiers