Introduction
The rapid advancement of large language models has revealed a persistent gap between small, efficient models (0.6B–4B parameter range) and their significantly larger counterparts in terms of complex reasoning capabilities. While compact models offer advantages in deployment efficiency, latency, and resource consumption, they typically exhibit verbose, uncertain reasoning patterns characterized by stream-of-consciousness exploration, self-correction loops, and high noise-to-signal ratios in their outputs.
Knowledge distillation—the process of transferring capabilities from a larger "teacher" model to a smaller "student" model—has emerged as a promising approach to bridge this gap. However, most distillation efforts focus on general capability transfer rather than specifically targeting reasoning structure and quality. Furthermore, when multiple distilled models are merged using standard techniques, catastrophic forgetting and degradation of specialized capabilities often occur.
In this work, we present a comprehensive pipeline that addresses these challenges through:
- Domain-specific distillation datasets generated by two distinct teacher models (Qwen3.6-plus and Kimi-2.5-thinking), each emphasizing different reasoning strengths.
- QLoRA fine-tuning of Qwen3-4B-Thinking on each dataset, producing two specialized reasoning-distilled models.
- A novel SLERP merge strategy with layer-wise gradient attention and vocabulary pinning that preserves RAG capabilities while combining complementary reasoning strengths.
- Comprehensive evaluation on CMDR-Bench, a 100-test-case benchmark spanning 10 cognitive domains with graduated difficulty levels.
Our results demonstrate that the merged model achieves superior performance in logical reasoning, mathematical problem-solving, and code analysis compared to both individual distilled models and the base model, while maintaining acceptable performance in creative writing tasks.
Dataset Construction
We constructed two complementary distillation datasets, each leveraging a different teacher model to capture diverse reasoning patterns and domain expertise.
The first dataset, khazarai/kimi-2.5-high-reasoning-250x, was generated using Kimi-2.5-thinking as a teacher model. This dataset contains detailed reasoning traces and final answers for complex questions spanning multiple technical, scientific, historical, and strategic domains.
The Kimi teacher model was selected for its demonstrated strength in deep analytical reasoning, causal inference, and structured problem decomposition. The generated reasoning traces emphasize systematic analysis, hypothesis evaluation, and evidence-based conclusion derivation.
The second dataset, khazarai/qwen3.6-plus-high-reasoning-500x, was prepared using Qwen3.6-plus as the teacher model, covering topics in coding, mathematics, finance, medicine, and economics.
This dataset was designed to emphasize mathematical precision, algorithmic thinking, and structured solution formulation. The Qwen3.6-plus teacher model contributes strong capabilities in formal reasoning, quantitative analysis, and domain-specific technical knowledge.
Key Design Principle: Both datasets prioritize detailed reasoning traces over simple question-answer pairs, ensuring that the student models learn not just what the correct answer is, but how to arrive at it through structured, step-by-step reasoning.
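As a sketch of this design principle, a teacher-generated example can be serialized so that the full reasoning trace precedes the final answer in the training target. The `<|im_start|>` and `<think>` delimiters below follow the Qwen chat and thinking-model conventions; the function name and the sample question are illustrative, not taken from the actual datasets:

```python
def format_distillation_example(question: str, reasoning: str, answer: str) -> str:
    """Serialize one teacher example so the student is trained on the
    full reasoning trace, not just the final answer."""
    return (
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n<think>\n{reasoning}\n</think>\n"
        f"{answer}<|im_end|>"
    )

sample = format_distillation_example(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    "408",
)
```

Training on strings of this shape is what lets the student imitate the teacher's step-by-step derivation rather than memorize answer surface forms.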
Methodology
Our training pipeline follows a multi-stage approach designed to maximize reasoning quality while maintaining model efficiency. Both distilled models were fine-tuned using the following shared configuration:
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Framework | Unsloth |
| Fine-tuning Method | QLoRA (PEFT) |
| Precision | bfloat16 |
| Training Objective | Next-token prediction with reasoning traces |
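The training objective in the table, next-token prediction over reasoning traces, is typically implemented by masking the prompt positions out of the loss so that gradients come only from the assistant's reasoning and answer tokens. A minimal sketch (the exact masking inside the Unsloth training loop may differ; `IGNORE_INDEX` follows the PyTorch cross-entropy convention):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def mask_prompt_labels(input_ids, prompt_len):
    """Build next-token-prediction labels that supervise only the
    assistant's reasoning trace and answer, not the prompt tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: 4 prompt tokens followed by 3 response tokens.
ids = [101, 5, 9, 2, 42, 43, 44]
labels = mask_prompt_labels(ids, prompt_len=4)
# labels -> [-100, -100, -100, -100, 42, 43, 44]
```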
Qwen3-4B-Qwen3.6-plus-Reasoning-Distilled
This model, available as khazarai/Qwen3-4B-Qwen3.6-plus-Reasoning-Distilled, is a reasoning-distilled variant of Qwen3-4B-Thinking, fine-tuned to replicate the advanced reasoning capabilities of the larger Qwen3.6-plus teacher model. The distillation process focuses on reducing the "rambling" and "uncertainty" often found in smaller models during complex tasks, replacing them with concise, structured, and actionable solution paths.
The primary improvement in this model is the qualitative leap in reasoning structure. The transformation from the base model to the distilled variant is best understood through comparison:
🔴 Base Model (Qwen3-4B-Thinking)
- Stream-of-consciousness, exploratory, and verbose
- Self-talk ("Hmm, interesting", "Wait, no")
- Struggles with problem constraints on first attempt
- Enters loops of self-correction
- High noise-to-signal ratio
- Solution paths buried under hesitation
🟢 Distilled Model
- Structured, professional, report-oriented
- Immediate problem analysis and constraint separation
- Concrete algorithm formulation (e.g., State-Space Dijkstra)
- Confident progression without logical dead-ends
- Clean output: Analysis → Intuition → Algorithm → Complexity
- Engineering-grade tool quality
Verdict: The distilled model transforms the raw potential of the base model into an engineering-grade reasoning tool, eliminating hesitation and producing structured, actionable solution paths.
| Specification | Value |
|---|---|
| Model ID | khazarai/Qwen3-4B-Qwen3.6-plus-Reasoning-Distilled |
| Model Type | Reasoning Distillation (QLoRA) |
| Framework | Unsloth |
| Teacher Model | Qwen3.6-plus |
| Dataset | khazarai/qwen3.6-plus-high-reasoning-500x |
Qwen3-4B-Kimi2.5-Reasoning-Distilled
This model, available as khazarai/Qwen3-4B-Kimi2.5-Reasoning-Distilled, is fine-tuned for structured, long-form reasoning using a specialized distillation dataset generated by Kimi-2.5-thinking. It is designed to bridge the gap between small, efficient models and the complex reasoning capabilities typically found in much larger models.
- Problem Decomposition: Excels at breaking down complex problems into manageable sub-components.
- Self-Correction: Demonstrates improved ability to identify and correct reasoning errors mid-generation.
- Analytical Depth: Provides detailed analytical answers with strong causal reasoning.
- Domain Versatility: Trained on diverse domains including technical, scientific, historical, and strategic reasoning.
| Specification | Value |
|---|---|
| Model ID | khazarai/Qwen3-4B-Kimi2.5-Reasoning-Distilled |
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Training Technique | Unsloth + QLoRA |
| Teacher Model | Kimi-2.5-thinking |
| Dataset | khazarai/kimi-2.5-high-reasoning-250x |
SLERP Merge Strategy
The merged model, khazarai/Qwen3-4B-Qwen3.6-plus-Reasoning-Slerp, represents a highly experimental and optimized reasoning model created through a surgical SLERP (Spherical Linear Interpolation) merge of the two distilled models. The goal was to combine the deep analytical capabilities of Kimi with the mathematical and structural precision of Qwen, while mitigating the catastrophic forgetting commonly seen in standard SFT model merges.
Standard SLERP merges often destroy RAG capabilities and syntax adherence. To solve this, we developed a custom merge configuration with two key innovations:
The embed_tokens and lm_head layers are strictly pinned to t = 1.0, i.e. taken entirely from the Qwen-distilled model (the second model in the merge). This ensures the merged model reads and generates using purely Qwen's vocabulary weights, eliminating the RAG degradation problem that plagues standard merges.
The intermediate attention and MLP layers follow a smooth gradient [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1] across model depth. This prevents weight interference in deep reasoning steps by allowing earlier layers to retain more of the base model's knowledge while progressively incorporating the distilled model's specialized capabilities in deeper layers.
```yaml
# SLERP Merge Configuration (V5 - Golden Path)
models:
  - model: khazarai/Qwen3-4B-Kimi2.5-Reasoning-Distilled
  - model: khazarai/Qwen3-4B-Qwen3.6-plus-Reasoning-Distilled
merge_method: slerp
base_model: khazarai/Qwen3-4B-Kimi2.5-Reasoning-Distilled
parameters:
  t:
    - filter: embed_tokens
      value: 1  # Pin to Qwen vocabulary
    - filter: lm_head
      value: 1  # Pin to Qwen vocabulary
    - filter: self
      value: [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1]  # Gradient attention
    - value: 1  # Default interpolation for remaining tensors
dtype: bfloat16
```
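The interpolation itself is performed by the merge tooling; for intuition, spherical interpolation and the layer-wise t schedule can be sketched in NumPy. This is an illustrative sketch, assuming (hypothetically) that the nine gradient points are stretched evenly across the model's transformer layers:

```python
import numpy as np

def slerp(t: float, a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two flattened weight tensors.
    Falls back to linear interpolation when the directions nearly coincide."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    dot = np.clip(np.dot(a_n, b_n), -1.0, 1.0)
    omega = np.arccos(dot)  # angle between the two weight directions
    if omega < eps:  # nearly identical directions: plain lerp is safer
        return (1 - t) * a + t * b
    so = np.sin(omega)
    return (np.sin((1 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

# Layer-wise t: stretch the 9-point gradient over the model depth.
GRADIENT = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1]

def t_for_layer(layer_idx: int, num_layers: int) -> float:
    xs = np.linspace(0, num_layers - 1, num=len(GRADIENT))
    return float(np.interp(layer_idx, xs, GRADIENT))
```

With this schedule, the first layer stays entirely on the base (Kimi-distilled) weights (t = 0) and the last layer is fully the Qwen-distilled weights (t = 1), matching the gradient described above.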
Synergy Effect: After multiple iterations and layer-by-layer tensor analysis, we achieved a "1+1=3 Synergy Effect" in Logical Inference and Planning, with the merged model outperforming both base models and the official Qwen Thinking model in reasoning benchmarks.
Trade-off: The sharp drop in "Creative Writing" performance is an expected and accepted trade-off to maximize extreme logical reasoning and coding precision. This model is optimized for analytical tasks, not creative generation.
Not recommended for: Creative writing, poetry, or highly imaginative storytelling.
Benchmark Design and Evaluation
We developed khazarai/Multi-Domain-Reasoning-Benchmark (CMDR-Bench), a comprehensive evaluation suite comprising 100 meticulously curated test cases across 10 distinct cognitive domains. Each domain features a graduated difficulty scale (Levels 1–10), enabling fine-grained analysis of capability thresholds from elementary to expert-level complexity.
Each test case is evaluated on a binary success metric (pass/fail) based on whether the model's output satisfies the task requirements. The success rate for each category is calculated as the percentage of passed test cases. Models evaluated include:
| Model | Type |
|---|---|
| Qwen/Qwen3-4B-Thinking-2507 | Base (unfine-tuned) |
| khazarai/Qwen3-4B-Kimi2.5-Reasoning-Distilled | Distilled (Kimi teacher) |
| khazarai/Qwen3-4B-Qwen3.6-plus-Reasoning-Distilled | Distilled (Qwen teacher) |
| khazarai/Qwen3-4B-Qwen3.6-plus-Reasoning-Slerp | Merged (SLERP) |
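The binary pass/fail scoring described above reduces to a simple per-category aggregation, sketched below; the result records shown are illustrative, not actual benchmark data:

```python
from collections import defaultdict

def success_rates(results):
    """Compute per-category pass rates (%) from binary (category, passed) records."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        if ok:
            passed[category] += 1
    return {c: 100.0 * passed[c] / total[c] for c in total}

records = [
    ("Logical Reasoning", True),
    ("Logical Reasoning", False),
    ("Mathematical Reasoning", True),
]
rates = success_rates(records)
# rates -> {"Logical Reasoning": 50.0, "Mathematical Reasoning": 100.0}
```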
Results and Analysis
The following figure presents the performance comparison across all four models on CMDR-Bench. Each bar represents the success rate (%) for a specific benchmark category.
Mathematical Reasoning: The Qwen-distilled and merged models match the base model's 100% success rate, while the Kimi-distilled model reaches 90.9%, indicating that distillation largely preserves mathematical capabilities while improving reasoning structure.
Python Code Analysis & Debugging: The Qwen-distilled and merged models both achieve 95.5%, a significant improvement over the base model's 69.1% (the Kimi-distilled model reaches 79.1%). This validates the effectiveness of transferring Qwen3.6-plus's structured algorithmic reasoning for code-related tasks.
Logical Reasoning: The merged model achieves 76.4%, outperforming both individual distilled models (68.2% Kimi, 60.0% Qwen) and the base model (60.0%). This demonstrates the synergy effect of combining complementary reasoning strengths.
Scientific Explanation (RAG): All four models achieve 100%, indicating that the vocabulary pinning strategy successfully preserves RAG capabilities in the merged model.
Constrained Creative Writing: As expected, the merged model shows reduced performance (26.4%) compared to the base model (34.5%). This trade-off is intentional and acceptable given the model's focus on analytical reasoning.
Planning and Optimization: The merged model achieves 72.7%, significantly outperforming both distilled models (43.6% Kimi, 56.4% Qwen) and the base model (38.2%). This represents the strongest evidence of the synergy effect.
| Category | Base | Kimi-Distilled | Qwen-Distilled | Merged |
|---|---|---|---|---|
| Causal Reasoning (RAG) | 100.0% | 90.9% | 98.2% | 91.8% |
| SQL Query Generation | 100.0% | 85.5% | 81.8% | 81.8% |
| Python Code Analysis | 69.1% | 79.1% | 95.5% | 95.5% |
| Planning & Optimization | 38.2% | 43.6% | 56.4% | 72.7% |
| Mathematical Reasoning | 100.0% | 90.9% | 100.0% | 100.0% |
| Logical Reasoning | 60.0% | 68.2% | 60.0% | 76.4% |
| Constrained Creative Writing | 34.5% | 52.7% | 36.4% | 26.4% |
| Complex Scenario Analysis | 60.9% | 61.8% | 61.8% | 77.3% |
| Ethical Dilemma | 74.5% | 72.7% | 66.4% | 65.5% |
| Scientific Explanation (RAG) | 100.0% | 100.0% | 100.0% | 100.0% |
The merged model matches or exceeds both individual distilled models in 5 of 10 categories (Logical Reasoning, Complex Scenario Analysis, Planning & Optimization, Python Code Analysis, and Scientific Explanation), strictly outperforming them in the first three. In Planning & Optimization specifically, the merged model (72.7%) exceeds both distilled models (43.6% and 56.4%) by a substantial margin, providing the strongest evidence of the claimed "1+1=3" synergy effect.
The gradient attention strategy appears to successfully combine the analytical depth of Kimi's distillation with the mathematical precision of Qwen's distillation, producing a model that leverages the strengths of both teacher models while mitigating their individual weaknesses.
Conclusion
This paper demonstrates a comprehensive pipeline for enhancing the reasoning capabilities of compact language models through knowledge distillation and strategic model merging. Our key contributions are:
- Two domain-specific distillation datasets generated by Qwen3.6-plus and Kimi-2.5-thinking, covering complementary reasoning domains.
- Two QLoRA-distilled models that transform the base model's stream-of-consciousness reasoning into structured, professional-grade analytical output.
- A novel SLERP merge strategy with gradient attention and vocabulary pinning that achieves synergistic performance improvements while preserving RAG capabilities.
- CMDR-Bench, a comprehensive 100-test-case benchmark across 10 cognitive domains for evaluating multi-domain reasoning capabilities.
Our results show that the merged model achieves superior performance in logical reasoning, mathematical problem-solving, code analysis, and planning tasks compared to both individual distilled models and the base model. The trade-off in creative writing performance is intentional and acceptable for a model optimized for analytical reasoning.
Future work includes exploring additional merge strategies, expanding the benchmark to more domains, and investigating the transferability of this pipeline to other base model architectures.
Model Availability: All models, datasets, and the benchmark are publicly available on Hugging Face under the khazarai organization.
References
- Qwen Team. (2025). Qwen3 Technical Report. Alibaba Group.
- Moonshot AI. (2025). Kimi-2.5 Technical Report.
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
- Wortsman, M., et al. (2022). Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time. ICML.
- Ilharco, G., et al. (2022). Merging Models for Free: No Additional Training Required. arXiv:2207.06469.
- Qwen3-4B-Thinking-2507. Hugging Face. https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507
- Qwen3.6-plus. Hugging Face.
- khazarai/kimi-2.5-high-reasoning-250x. Hugging Face Dataset.
- khazarai/qwen3.6-plus-high-reasoning-500x. Hugging Face Dataset.
- khazarai/Multi-Domain-Reasoning-Benchmark. Hugging Face Dataset.