I Wanted My LLM to Reason. I Only Had a 4GB GPU

This weekend I was thinking about what reasoning models and thinking models actually are — and whether the difference is just marketing or something real.

The way I understand it now: reasoning models (o1, DeepSeek-R1, and similar) are built or shaped to produce long internal chains before they commit to an answer. They pause, backtrack, and sometimes correct themselves mid-trace. Thinking models, or plain instruct models like Qwen2.5-Instruct, will answer in one pass. You can prompt them with chain-of-thought, but they were not optimized for reliable multi-step deliberation.

That distinction matters on a concrete task. Take a GSM8K word problem: Mimi picks up 2 dozen seashells. Kyle finds twice as many. Leigh takes one-third of Kyle’s shells. How many does Leigh have? A reasoning model is supposed to hold the relationships straight — Kyle is relative to Mimi, Leigh is relative to Kyle — and only then do the arithmetic. A base instruct model might get it right once and wrong the next time, depending on temperature and luck.
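As a worked check of the seashell problem, the relationships can be written down before any arithmetic is done. This is only an illustration of the dependency chain; the variable names are chosen here for readability.

```python
# Each quantity is defined relative to the previous one, mirroring the problem text.
DOZEN = 12

mimi = 2 * DOZEN     # Mimi picks up 2 dozen seashells
kyle = 2 * mimi      # Kyle finds twice as many as Mimi
leigh = kyle // 3    # Leigh takes one-third of Kyle's shells

print(mimi, kyle, leigh)  # 24 48 16
```

Getting Leigh's count right depends on resolving Kyle first, which is exactly the kind of chained reference a small model tends to drop.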

That led me to a harder question: can you turn a normal LLM into a reasoning model, and how? SFT on chain-of-thought data? Distillation from a stronger teacher? Inference tricks like self-consistency? Reinforcement learning the way DeepSeek did with R1?

I did not have a cluster. I had a laptop with an NVIDIA RTX 3050 (4GB VRAM), good enough for games but not for training anything large. So I ran a controlled experiment I called TinyReason: Qwen2.5-0.5B-Instruct on GSM8K, measuring pass@1 with seed=42 and treating the 600-question eval as the north star (not the quick 100-question subsets, which turned out to lie).

The Setup

Constraint	Choice
GPU	RTX 3050, 4GB VRAM
Model	Qwen2.5-0.5B-Instruct
Benchmark	GSM8K grade-school math
Primary metric	pass@1 on 600 questions (seed=42)
Training	QLoRA SFT (fits in 4GB)
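For readers who want to pin down the primary metric in the table above, here is a minimal sketch of a fixed-seed 600-question pass@1. The helper names (`pick_eval_indices`, `pass_at_1`) and the sampling method are assumptions for illustration; the post only states the seed and the subset size. The pool size is the 1,319 problems of the public GSM8K test split.

```python
import random

GSM8K_TEST_SIZE = 1319  # size of the public GSM8K test split


def pick_eval_indices(n_questions=600, seed=42, pool_size=GSM8K_TEST_SIZE):
    """Draw a reproducible subset of question indices (hypothetical helper)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), n_questions))


def pass_at_1(predictions, references):
    """Fraction of questions whose single predicted answer equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


indices = pick_eval_indices()
print(len(indices), indices == pick_eval_indices())  # 600 True
print(pass_at_1(["16", "19", "7"], ["16", "19", "8"]))  # 2 of 3 correct
```

Fixing the seed means every method is scored on the same questions, so differences between rows reflect the method rather than the draw.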

The standard assumption going in: fine-tune on GSM8K chain-of-thought traces and accuracy goes up. I wanted to test that against the alternatives people actually reach for on small models — distillation, reflection, self-consistency, and structured prompting.

What I Tried

#	Method	Type	Description
1	Greedy decode	inference	Single sample, baseline
2	Self-consistency (n=5)	inference	Sample 5 answers at temperature 0.7, majority vote
3	Reflection	inference	Model reviews and revises its own solution
4	SC + disagreement reflect	inference	Reflect only when the 5 samples disagree
5	Structured SC v2	inference	Force Facts → Relationships → Calculations → Answer before voting
6	QLoRA SFT	training	Fine-tune on 500–2000 human GSM8K traces (1–2 epochs)
7	Teacher distillation	training	100 structured API teacher traces, 1 epoch
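Rows 2 and 4 of the table above share one core step: a majority vote over sampled answers, plus a flag for whether the samples agreed. A minimal sketch, assuming the final answers have already been extracted from each sample; `self_consistency_vote` is a hypothetical name, not the function used in the experiment.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Majority vote over already-extracted final answers.

    Returns (winner, unanimous). Samples with no parseable answer (None) are
    ignored for the vote. Ties go to the answer that appeared first.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, False
    counts = Counter(valid)
    winner, _ = counts.most_common(1)[0]
    # Unanimous only if every sample parsed and all of them agree.
    unanimous = len(counts) == 1 and len(valid) == len(answers)
    return winner, unanimous


samples = ["16", "16", "24", "16", None]  # five samples at temperature 0.7
winner, unanimous = self_consistency_vote(samples)
print(winner, unanimous)  # 16 False
```

In the "SC + disagreement reflect" variant, a `False` in the second slot would be the trigger for a reflection pass; when all five samples agree, the vote is returned as-is.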

I also tried a prompt-based verifier (CORRECT / INCORRECT labels on candidate chains). The audit gate required 60% accuracy; it scored 50% — essentially random — so I skipped the full SC+verifier eval.

The Headline Result

On this hardware and this model, spending compute at inference beat spending it on training — and it wasn’t close.

Method	pass@1 (600q)
Self-consistency (n=5)	47.0% (282/600)
SC + disagreement reflect	46.2% (277/600)
Greedy	43.7% (262/600)
Structured SC v2	39.8% (239/600)
Reflection	39.0% (100q only)
Distillation (100 traces, 1ep)	26.7% (160/600)
QLoRA (2000 examples, 1ep)	24.2% (145/600)

Self-consistency added +3.3 percentage points over greedy (43.7% → 47.0%). Modest, but consistent and cheap relative to a full QLoRA run.

The result that fooled me early: Structured SC v2 hit 49% on 100 questions but dropped to 39.8% on 600. That is a quick-eval mirage: the structured prompt looked helpful on a short run, but the gain did not survive the larger eval. I did not isolate the cause. Two plausible explanations are that longer generations added noise, or that the four-step format was harder to sustain across the full test set. Lesson: always scale eval before celebrating.
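A rough way to see how much a 100-question run can wobble is the normal-approximation confidence interval for a proportion. This is a back-of-envelope sketch, not part of the original analysis; the function name is hypothetical and the approximation is crude at small n.

```python
import math


def normal_ci_half_width(accuracy, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


for label, acc, n in [("100q", 0.49, 100), ("600q", 0.398, 600)]:
    hw = normal_ci_half_width(acc, n)
    print(f"{label}: {acc:.1%} +/- {hw:.1%}")
# 100q: 49.0% +/- 9.8%
# 600q: 39.8% +/- 3.9%
```

On 100 questions the interval spans roughly 39% to 59%, so a 49% reading never ruled out a true accuracy near 40%. At 600 questions the band shrinks to about four points either way.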

Why Training Failed

Regression taxonomy

I compared cases where the base model was correct but QLoRA was wrong. There were 154 regressions; I sampled 50 for manual taxonomy:

Failure type	Share
Relationship misunderstanding	48%
Arithmetic error	38%
Unit confusion	8%
Other	6%

Fine-tuning did not fix relationship errors — it often introduced them. The model learned surface patterns from training traces but lost calibration on the base model’s already-fragile reasoning.
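The regression set described above (base correct, fine-tuned wrong) and the 50-item review sample can be sketched as follows. The data layout (a dict of question id to correctness) and both function names are assumptions for illustration.

```python
import random


def find_regressions(base_correct, tuned_correct):
    """Question ids the base model solved but the fine-tuned model did not."""
    return sorted(
        q for q, ok in base_correct.items()
        if ok and not tuned_correct.get(q, False)
    )


def sample_for_review(question_ids, k=50, seed=42):
    """Reproducible random sample for manual labelling."""
    rng = random.Random(seed)
    return rng.sample(question_ids, min(k, len(question_ids)))


# Toy data: q1 and q3 regress, q2 stays correct, q4 was wrong in both runs.
base = {"q1": True, "q2": True, "q3": True, "q4": False}
tuned = {"q1": False, "q2": True, "q3": False, "q4": False}
regressions = find_regressions(base, tuned)
print(regressions)  # ['q1', 'q3']
print(len(sample_for_review(list(range(154)))))  # 50
```

Sampling with a fixed seed keeps the manual taxonomy reproducible: anyone rerunning the comparison labels the same 50 cases.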

One concrete example: Roy has saved 40% more in money earned by chores than his brother Anthony. Anthony has saved $10 more than Eva, who saved $20. How much does Roy have? Ground truth: 42. Base model: 42. QLoRA: 12 — it treated “40% more” as “40% of Anthony’s amount” instead of “Anthony’s amount plus 40%.”
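The two readings of "40% more" differ by exactly Anthony's base amount, which a few lines of integer arithmetic make explicit. This is a worked check of the example above, with variable names chosen for illustration.

```python
eva = 20
anthony = eva + 10  # Anthony saved $10 more than Eva

roy_correct = anthony + anthony * 40 // 100  # "40% more than" Anthony
roy_wrong = anthony * 40 // 100              # "40% of" Anthony (the QLoRA reading)

print(anthony, roy_correct, roy_wrong)  # 30 42 12
```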

The strangest result: memorization

Before blaming a pipeline bug, I audited the training setup and ran a memorization experiment.

20_10ep (20 examples, 10 epochs): 15% train pass@1. Pipeline checks passed — train/eval prompt parity confirmed, LoRA weights active, teacher-forcing NLL at 0.58 (not memorized). Failures were wrong reasoning, not format extraction bugs.

5_20ep (5 examples, 20 epochs, 100 optimizer steps): train loss dropped to 0.14, teacher-forcing NLL to 0.14. Yet generation pass@1 on those same 5 questions: 1/5 (20%).

Example	Expected	Generated	Match
1 (seashells)	16	16 (exact GSM8K format)	Yes
2 (pets)	19	17	No
3 (toy cars)	196	177	No
4 (bank)	4	14	No
5 (hike)	15	10	No

Example 1 was memorized verbatim, including <<calc=result>> markers and #### 16. Examples 2–5 produced plausible-looking chains with wrong arithmetic or relationship errors.
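To check raw generations for the GSM8K markers mentioned above, a small parser is enough. This is a sketch under assumptions: the regular expressions and function names are hypothetical, and the sample string is made up to show the format rather than copied from a model output.

```python
import re

# <<expression=result>> calculator annotations used in GSM8K solutions
CALC_RE = re.compile(r"<<([^=<>]+)=([^<>]+)>>")
# Final answer line: "#### 16" (optionally negative, with commas or decimals)
FINAL_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text):
    """Return the number after the last '####' marker, or None if absent."""
    matches = FINAL_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def extract_calc_steps(text):
    """Return (expression, result) pairs from <<expression=result>> annotations."""
    return [(expr.strip(), res.strip()) for expr, res in CALC_RE.findall(text)]


sample = "Kyle finds 2*24=<<2*24=48>>48 shells. Leigh takes 48/3=<<48/3=16>>16.\n#### 16"
print(extract_final_answer(sample))  # 16
print(extract_calc_steps(sample))    # [('2*24', '48'), ('48/3', '16')]
```

Separating "marker missing" (`None`) from "marker present but wrong number" is what lets a failure be attributed to reasoning rather than to format extraction.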

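The model learns the target tokens under teacher forcing but cannot reliably reproduce them at generation time. That points to exposure bias (during generation the model conditions on its own earlier tokens rather than the gold prefix, so one early slip compounds) plus reasoning drift, not a missing #### or an eval format mismatch.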
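One back-of-envelope way to see why a low teacher-forcing NLL does not guarantee reproduction: the mean NLL is a per-token average, and the log-probability of the whole target is that average multiplied by the number of tokens. The sketch below assumes the reported 0.14 is in nats per token and uses hypothetical trace lengths; it describes sampling at temperature 1, not greedy decoding, so treat it as intuition only.

```python
import math


def per_token_probability(mean_nll):
    """Geometric-mean probability assigned to each target token."""
    return math.exp(-mean_nll)


def sequence_probability(mean_nll, n_tokens):
    """Probability of sampling the exact target sequence at temperature 1."""
    return math.exp(-mean_nll * n_tokens)


print(f"average per-token probability: {per_token_probability(0.14):.3f}")
for n_tokens in (20, 50, 100):  # hypothetical trace lengths
    p = sequence_probability(0.14, n_tokens)
    print(f"{n_tokens:>3} tokens: P(exact target) = {p:.2e}")
```

A per-token probability near 0.87 sounds high, but the probability of the full chain shrinks geometrically with length, and a reasoning trace only has to go wrong at one number to end in the wrong answer.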

Reflection and verifier: search beats judgment

Reflection alone hurt: greedy 49% → reflection 39% on 100 questions. Transition matrix: 12 regressions (correct → wrong) vs 2 fixes (wrong → correct). The 0.5B model “fixes” correct solutions into wrong ones more often than it rescues bad ones.
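The transition matrix is just a count of (before, after) correctness pairs. The sketch below rebuilds the 100-question reflection run from the reported numbers to confirm they are consistent; the per-question ordering is synthetic and the function name is hypothetical.

```python
from collections import Counter


def transition_counts(before, after):
    """Count (before_correct, after_correct) pairs across questions."""
    return Counter(zip(before, after))


# 49 correct under greedy; reflection breaks 12 of them and fixes 2 wrong ones.
before = [True] * 49 + [False] * 51
after = [False] * 12 + [True] * 37 + [True] * 2 + [False] * 49

counts = transition_counts(before, after)
regressions = counts[(True, False)]
fixes = counts[(False, True)]
print(regressions, fixes, sum(after))  # 12 2 39
```

The net change is 2 - 12 = -10 questions, which is the drop from 49% to 39%.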

The prompt verifier labeled all wrong QLoRA chains as CORRECT — 50% audit accuracy, below the 60% gate. If a model cannot verify its own mistakes, search beats judgment at this scale.
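A verifier that answers CORRECT for everything scores exactly 50% whenever the audit set holds equal numbers of right and wrong chains. The sketch below illustrates that; the balanced composition and the set size of 20 are assumptions, since the post does not state how the audit set was built.

```python
AUDIT_GATE = 0.60  # minimum accuracy required before running the full eval


def audit_accuracy(verdicts, labels):
    """Fraction of candidate chains where the verifier's verdict matches the label."""
    hits = sum(v == l for v, l in zip(verdicts, labels))
    return hits / len(labels)


labels = ["CORRECT"] * 10 + ["INCORRECT"] * 10  # assumed balanced audit set
verdicts = ["CORRECT"] * 20                     # verifier approves everything
acc = audit_accuracy(verdicts, labels)
print(acc, acc >= AUDIT_GATE)  # 0.5 False
```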

DeepSeek-R1: What RL Found (That I Could Not Replicate)

They found RL could create reasoning traces that never existed in the training data. The model started writing things like: Wait… Let’s reconsider. That doesn’t seem right. — without being taught.

That quote captures one of the most surprising findings from DeepSeek-R1-Zero. Researchers applied reinforcement learning (GRPO) with simple rule-based rewards: mainly whether the final answer was correct, plus a check on output format. The model was not explicitly taught to critique its own work. Instead, self-reflection emerged during reward optimization: backtracking, pausing, and generating internal dialogue when it hit a dead end.

Why that shocked people: standard internet training data rarely contains messy trial-and-error reasoning. Most written content online is the final answer, not the wrong turns that got there. DeepSeek essentially forced the model to produce multi-step chains and rewarded correctness at the end — and the model figured out how to think logically to maximize its own rewards.
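The "reward correctness at the end" idea can be sketched with the group-relative advantage that GRPO is built around: sample several answers to one question, score each one, and normalise the scores within the group. This is a simplified illustration of that single step, not DeepSeek's training code; the policy update, clipping, and KL term are left out, and the function name and `eps` are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of samples for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight sampled answers to one question; 1.0 means the final answer was correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# If every sample gets the same reward, the group carries no learning signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

Correct samples end up with positive advantages and incorrect ones with negative advantages, so whatever intermediate text tends to precede correct answers is reinforced without anyone labelling that text.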

Approach	DeepSeek-R1 (frontier scale)	TinyReason (0.5B, 4GB)
RL / GRPO	Emergent self-reflection	Deferred — infra cost too high on 4GB
SFT on CoT	Part of the R1 pipeline	Regressed to 24–27%
Inference search	Used at scale	SC +3.3 pp — only reliable win

RL is the path frontier labs took. On my hardware, inference-time search was the only thing that worked. I am not claiming RL cannot work on small models — I am saying I could not run it here, and everything I could run on the training side made things worse.

What I Would Try Next: Knowledge Transfer

Based on these results, knowledge distillation from a stronger teacher looks like a more practical next step than RL on a 4GB GPU — but not in the naive form of “more epochs on 0.5B.”

I already distilled 100 structured API teacher traces (Facts → Relationships → Calculations → Verification → Answer). Teacher quality was perfect: 100/100 traces valid, all step labels present, all answers matching ground truth. The student still landed at 26.7% — better than raw QLoRA (24.2%) but far below the base model’s greedy 43.7%.
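The trace-quality check described above (all step labels present, final answer matching ground truth) could look like the sketch below. It assumes each step is introduced by its label followed by a colon; that layout, the function name, and the example trace are illustrative rather than taken from the actual teacher outputs.

```python
REQUIRED_STEPS = ("Facts", "Relationships", "Calculations", "Verification", "Answer")


def validate_trace(trace, expected_answer):
    """Check that step labels appear in order and the final answer matches."""
    position = 0
    for step in REQUIRED_STEPS:
        found = trace.find(step + ":", position)
        if found == -1:
            return False  # a label is missing or out of order
        position = found + len(step) + 1
    tokens = trace[position:].split()
    return bool(tokens) and tokens[0].rstrip(".") == str(expected_answer)


trace = (
    "Facts: Mimi has 2 dozen shells.\n"
    "Relationships: Kyle = 2 * Mimi; Leigh = Kyle / 3.\n"
    "Calculations: 2 * 24 = 48; 48 / 3 = 16.\n"
    "Verification: 16 * 3 = 48, which is twice 24.\n"
    "Answer: 16"
)
print(validate_trace(trace, 16))  # True
print(validate_trace(trace.replace("Verification:", "Check:"), 16))  # False
```

A filter like this only confirms the teacher data is clean. It says nothing about whether the student can absorb it, which is where the 26.7% result comes from.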

The bottleneck looks like student capacity, not teacher quality: the traces were clean and the 0.5B student still regressed. More traces are unlikely to help much without a larger student. What might:

Distill into a larger student (1.5B–3B) if GPU budget allows
Use a stronger teacher (frontier model traces, not just structured v2 format)
Skip SFT entirely and try RL with cloud credits — the DeepSeek path, deferred here for infrastructure reasons

Hyperparameter sweeps (learning rate, LoRA rank, 500–2000 trace counts) would likely land in the same 24–29% band and teach nothing new. Publishing the negative results felt more useful than another week of GPU time proving the same thing.

Key Takeaways

Try self-consistency first on sub-1B models — best ROI per line of code (+3.3 pp at n=5).
Always eval at scale — 100-question subsets can lie (Structured SC v2: 49% → 39.8%).
Measure against base after every fine-tune — QLoRA and distillation both regressed.
Check raw generations, not just pass@1 — memorization tests expose SFT failure modes that aggregate metrics hide.
Publish negative results — reflection, verifier, and SFT all failed here; that saves everyone else a week.
Reasoning at frontier scale ≠ reasoning at 0.5B — different toolkits, different ceilings.

Reproduce

Full leaderboard and per-experiment artifacts live under results/:

results/LEADERBOARD.md
results/OVERFIT_REPORT.md
results/INFERENCE_ANALYSIS.md

.venv\Scripts\Activate.ps1
.\scripts\reproduce_leaderboard.ps1

Memorization audit:

.venv\Scripts\python.exe scripts\inspect_memorization.py --experiment-id 5_20ep --print-raw

When I get access to more compute, the first thing I want to try is distillation into a bigger student — not another QLoRA sweep on 0.5B. RL stays on the list too, but DeepSeek already showed that path works at scale; my experiment showed what happens when you try to brute-force reasoning into a model that barely has room to breathe.

Until next time.