KALAVAI / Cooperative Specialist Fusion

When does independent specialist fusion work?

A protocol for building one routable model from independently trained specialists. Shared initialization. No communication during training. Lightweight routing after the fact.

Abstract

KALAVAI treats cooperation as a routing problem. Specialists diverge from a shared base; the fused model learns when each divergence is useful.

The core result is not just that fusion can work. It is that a practical measurement, the mean specialist divergence from the base model, predicts before full evaluation whether the cooperative is worth building.

Core Thesis

Shared origin

All specialists start from the same checkpoint, preserving enough representational compatibility for post-hoc fusion.

Joint inference

All specialists run at inference. Single-expert dispatch fails because each specialist forgets what lies outside its own domain.
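
As a rough illustration of joint inference, the sketch below runs every specialist on the same input and mixes their next-token distributions with softmax gate weights. The function names, the toy logits, and the choice to mix probabilities rather than logits are assumptions made for illustration; the text here does not specify the router's form.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(specialist_logits, router_scores):
    """Mix every specialist's token distribution using router gate weights.

    specialist_logits: one list of vocabulary logits per specialist.
    router_scores: one unnormalized router score per specialist (hypothetical).
    """
    gates = softmax(router_scores)
    probs = [softmax(logits) for logits in specialist_logits]
    vocab_size = len(probs[0])
    fused = [
        sum(g * p[v] for g, p in zip(gates, probs))
        for v in range(vocab_size)
    ]
    return gates, fused


# Toy example: two specialists, three-token vocabulary (values invented).
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
gates, fused = fuse(logits, router_scores=[3.0, -3.0])
```

Even when one gate weight dominates, every specialist still contributes to `fused`, which is the difference from dispatching to a single expert.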

Divergence governs gain

Specialists must become meaningfully different from the base. Below the divergence floor, they stay too close to the base for fusion to have much to select.
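
The divergence check can be sketched as follows. This is a minimal example that assumes divergence is measured as each specialist's relative loss reduction against the base model on its own domain; the metric definition, the loss values, and the `DIVERGENCE_FLOOR` threshold are all placeholders, not values taken from this page.

```python
def relative_divergence(base_loss, specialist_loss):
    """Relative loss reduction of a specialist versus the base on one domain."""
    if base_loss <= 0:
        raise ValueError("base_loss must be positive")
    return (base_loss - specialist_loss) / base_loss


def mean_divergence(base_losses, specialist_losses):
    """Average the per-specialist divergence over all domains."""
    values = [
        relative_divergence(base_losses[d], specialist_losses[d])
        for d in sorted(base_losses)
    ]
    return sum(values) / len(values)


# Hypothetical held-out losses, invented for illustration only.
base_losses = {"code": 2.0, "science": 2.5, "fiction": 3.0}
specialist_losses = {"code": 1.6, "science": 2.0, "fiction": 2.7}

DIVERGENCE_FLOOR = 0.05  # placeholder threshold, not a value from the source

score = mean_divergence(base_losses, specialist_losses)
worth_building = score >= DIVERGENCE_FLOOR
```

Because the check only needs per-domain losses for the base and for each specialist on its own domain, it can run before the fused model is assembled or evaluated.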

Frozen anchors

At longer training horizons, freezing early layers preserves routing compatibility while still allowing useful specialization.
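
Freezing early layers can be illustrated with a framework-free toy. The `Layer` class, `freeze_depth` parameter, and plain gradient step below are hypothetical stand-ins and do not reflect the project's actual training code.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    weights: list
    trainable: bool = True


def freeze_early_layers(layers, freeze_depth):
    """Mark the first `freeze_depth` layers as frozen; later layers stay trainable."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= freeze_depth


def sgd_step(layers, grads, lr=0.1):
    """Apply one gradient step, skipping frozen layers."""
    for layer, grad in zip(layers, grads):
        if layer.trainable:
            layer.weights = [w - lr * g for w, g in zip(layer.weights, grad)]


# Four toy layers; the first two act as the frozen anchor shared by all specialists.
layers = [Layer(f"block_{i}", [1.0, 1.0]) for i in range(4)]
freeze_early_layers(layers, freeze_depth=2)
sgd_step(layers, grads=[[1.0, 1.0]] * 4)
```

In PyTorch, the same effect is usually obtained by setting `requires_grad = False` on the parameters of the layers to be frozen.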

Findings

KALAVAI improvement across model scales

Scale ladder across Pythia model sizes. The fusion mechanism remains measurable across the tested scales.

Cross-domain divergence heatmap

Specialists form a diagonal cross-domain structure: each is strongest on its own domain.

Router gate weight distribution

Router confidence approaches a hard switch, yet joint inference still covers every specialist.

Freeze depth ablation

Freeze depth becomes important as training duration increases.

Downstream benchmark results

Benchmark movement is modest; the perplexity and routing findings are the primary evidence.
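
The diagonal structure described in the cross-domain heatmap finding above can be checked mechanically. The sketch below assumes a square matrix whose rows are specialists and whose columns are domains in the same order, with each entry holding the specialist's gain over the base on that domain; the layout and the numbers are invented for illustration.

```python
def has_diagonal_structure(divergence):
    """True if every specialist (row) shows its largest gain on its own domain."""
    for row_index, row in enumerate(divergence):
        best_domain = max(range(len(row)), key=lambda d: row[d])
        if best_domain != row_index:
            return False
    return True


# Hypothetical gains: positive on the diagonal, negative off it (forgetting).
divergence = [
    [0.20, -0.05, -0.08],
    [-0.04, 0.18, -0.06],
    [-0.07, -0.03, 0.10],
]
diagonal = has_diagonal_structure(divergence)
```

Negative off-diagonal entries in such a matrix would correspond to the out-of-domain forgetting that makes single-expert dispatch fail.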

Artifacts

Slides