I Wanted to Win Kaggle. I Built a Memory System Instead.

I did not set out to build a memory system. I set out to win Kaggle.

That sounds naive written down, but it is the honest origin. I kept reading the same advice in winner writeups (stratified K-Fold, target encoding, pseudo-labeling, weighted blends) and thinking: this is not magic, it is a checklist. Top competitors are not discovering new mathematics every week. They are applying known tricks faster, with better feature creativity, and with CV discipline that most beginners skip.

So I started building AutoKaggle: an agent that reads winning solutions, extracts what worked, and generates ML pipelines for new competitions. Crawl notebooks. Parse strategies. Embed them. Retrieve them at plan time. Run LightGBM. Submit.

Simple architecture. Hard execution.

The First Wall Was Speed, Not Intelligence

My first ingest pipeline sent every crawled document to a local LLM. Ollama, Qwen, HTTP round-trips. Two to six minutes per notebook. Ninety documents meant an evening gone, and most of those notebooks were EDA tutorials with three useful lines buried in markdown.

The fix was not a bigger model. It was not calling the model at all for most documents.

I added a three-layer extractor:

Regex pulls # Feature Engineering, # Model, # Validation sections instantly.
A keyword signal filter counts competition-relevant terms: LightGBM, GroupKFold, stacking, target encoding.
A tier classifier sends only medal notebooks, winner writeups, and high-signal discussions to the LLM. Everything else gets regex-only strategy JSON.

That cut LLM calls by roughly ninety percent. Ingest went from hours to minutes. This was the first laptop-GPU lesson: compute is scarce; spend it deliberately.
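Roughly, the three layers fit together like this. It is a minimal sketch, not the repo code: the section names and keywords come from the examples above, while the source-type labels, the `min_signal` threshold, and every function name are assumptions for illustration.

```python
import re

# Section headings and keywords taken from the examples in the text.
SECTION_RE = re.compile(
    r"^#+[ \t]*(Feature Engineering|Model|Validation)[ \t]*$(.*?)(?=^#+\s|\Z)",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,
)
KEYWORDS = ("lightgbm", "groupkfold", "stacking", "target encoding")
# Hypothetical source-type labels; these always earn an LLM call.
LLM_SOURCES = {"medal_notebook", "winner_writeup"}


def extract_sections(text: str) -> dict[str, str]:
    """Layer 1: pull named markdown sections out with a regex, no model call."""
    return {m.group(1).lower(): m.group(2).strip() for m in SECTION_RE.finditer(text)}


def keyword_signal(text: str) -> int:
    """Layer 2: count occurrences of competition-relevant terms."""
    lowered = text.lower()
    return sum(lowered.count(keyword) for keyword in KEYWORDS)


def classify_tier(source_type: str, signal: int, min_signal: int = 3) -> int:
    """Layer 3: tier 1 goes to the LLM, tier 2 stays regex-only."""
    if source_type in LLM_SOURCES:
        return 1
    if source_type == "discussion" and signal >= min_signal:
        return 1
    return 2
```

The point of the ordering is cost: the regex and the keyword count run on every document, and only the tier decision determines whether the model is called at all.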

The Second Wall Was Memory

Once ingest worked, retrieval failed in subtler ways.

The vector DB returned plausible chunks. The agent still made bad plans. Duplicated tricks piled up. One notebook said PCA helped; another said it destroyed tree models. Low-quality extractions became “observations” and polluted future runs. I ran audits and found fake observations, extraction failures, and semantic near-duplicates everywhere.

Around that time Karpathy posted about using LLMs to compile personal wikis: raw sources in, structured markdown concepts out, LLM as curator not search engine. Lex Fridman described a similar setup. I had been grinding on the same problem without the celebrity tweet, and it was validating and annoying in equal measure.

The insight I had missed: RAG is not memory.

Retrieval finds similar text. Memory decides what is true, current, and applicable. That requires gatekeeping, contradiction handling, confidence decay, and an experiment loop. Not another embedding model.
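To make gatekeeping and confidence decay concrete, here is a small sketch. The half-life, the thresholds, and both function names are assumptions I picked for illustration, not values from the codebase.

```python
def decayed_confidence(confidence: float, age_days: float,
                       half_life_days: float = 180.0) -> float:
    """Halve confidence every `half_life_days` unless the observation is re-confirmed."""
    return confidence * 0.5 ** (age_days / half_life_days)


def passes_gate(confidence: float, support: int, contradictions: int,
                min_confidence: float = 0.4, min_support: int = 2) -> bool:
    """Admit an observation only if it is confident, supported, and not out-voted."""
    return (
        confidence >= min_confidence
        and support >= min_support
        and contradictions < support
    )
```

Neither function looks at embeddings. That is the difference between the two layers: retrieval ranks text by similarity, while the gate decides whether a claim is allowed to influence a plan.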

AutoKaggle grew a second brain:

SQLite for structured strategy records, trials, and atomic observations
Chroma for chunked semantic search
NetworkX for technique relationships (LightGBM → pairs with → target encoding)
A wiki compiler that clusters observations into concept pages, Karpathy-style, but for Kaggle tricks
Progressive RAG that scans high-confidence observations before dumping full vector context into the prompt

The competition pipeline stayed: analyze → plan → features → CV → train → ensemble → submit. But planning stopped being “nearest notebook chunk” and started being “verified observation + graph neighbor + retrieved evidence.”
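The staged lookup can be sketched in a few lines. This is an illustration under assumptions: a plain dict stands in for the NetworkX graph, `vector_search` is a hypothetical callable wrapping the vector store, and the confidence threshold and budget are made up.

```python
from typing import Callable


def progressive_context(
    technique: str,
    observations: list[dict],
    graph: dict[str, list[str]],
    vector_search: Callable[[str, int], list[str]],
    min_confidence: float = 0.7,
    budget: int = 5,
) -> list[str]:
    """Fill the prompt context in stages: observations, graph neighbors, then vectors."""
    trusted = sorted(
        (o for o in observations if o["confidence"] >= min_confidence),
        key=lambda o: o["confidence"],
        reverse=True,
    )
    # Stage 1: verified observations about the technique itself.
    context = [o["text"] for o in trusted if o["technique"] == technique]
    # Stage 2: verified observations about techniques linked in the graph.
    neighbors = set(graph.get(technique, []))
    context += [o["text"] for o in trusted if o["technique"] in neighbors]
    context = context[:budget]
    # Stage 3: only if there is room left, fall back to raw vector retrieval.
    remaining = budget - len(context)
    if remaining > 0:
        context += vector_search(technique, remaining)[:remaining]
    return context
```

Vector chunks only fill whatever space the verified observations leave, so a well-covered technique never triggers a vector query at all.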

Three Memory Designs (and Why I Did Not Pick Just One)

Once I started comparing approaches, three patterns kept showing up. They look similar from a distance. They solve different problems.

Dimension	Karpathy wiki	Supermemory	AutoKaggle
Primary goal	Personal research knowledge base	General agent memory OS	Competition strategy + experiment memory
Raw storage	`raw/` articles, papers, repos	Conversations + documents	Crawled JSONL (notebooks, writeups, discussions)
Compiled layer	LLM-written markdown wiki with backlinks	Memory graph with profiles	Observations + wiki compiler + strategy graph
LLM role	Curator, editor, researcher	Memory policy engine	Extractor, compiler, planner (tier-gated)
Retrieval	Index files + summaries; light RAG at scale	Hybrid memory + RAG + profiles	Progressive RAG: observations → graph → vectors
Truth mechanism	Human Q&A + incremental wiki linting	Contradiction handling, forgetting, temporal updates	Confidence scoring, FAISS merge, experiment outcomes
Validation	Lint wiki for inconsistencies	Memory importance + decay policies	Cross-source support, CV score delta, observation gate
What gets forgotten	Manual / LLM-driven cleanup	Automatic temporal forgetting	Prune low-confidence, low-support, stale rows
Ontology	Concepts and links	User-centric memories	`{technique, task_type, context, impact, evidence}`
Best at	Thinking and research synthesis	Conversational agent personalization	“What trick for this tabular CV problem?”
Weak at	Automated ML execution	Domain-specific experiment tracking	Generic chat memory, user profiles
My takeaway	Steal the compiler idea	Steal policies (contradict, decay, forget)	Keep custom ontology + experiment loop

Karpathy’s system is for thinking. Supermemory is for remembering users. AutoKaggle is for remembering what wins, and that needs experiment validation, not just good summaries.

I did not plug in Supermemory as the core. I borrowed its policy ideas and built the kind of compiler layer Karpathy described, but grounded in Kaggle-specific observations and CV outcomes.

I Only Had a 4GB GPU

Same machine as my TinyReason experiments: an NVIDIA RTX 3050 laptop, four gigabytes of VRAM, perfectly adequate for games and completely inadequate for the models I wished I could run.

I wanted Mistral-7B or Qwen-7B on GPU with batch inference. Reality: 4-bit quantization or nothing. Batch size one or OOM. Serial Promptfoo evals or VRAM spikes. Never run ingest and eval at the same time.

The codebase encodes these constraints directly. local_llm.py detects VRAM below eight gigabytes and downgrades 7B/8B/9B names to a 4B fallback. BitsAndBytes 4-bit loading is default on CUDA. OOM triggers cache clear and retry, then regex emergency fallback. Promptfoo runs with -j 1, one test at a time, fifty-three minutes for fifteen tests, zero harness errors.
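A stripped-down sketch of those two guards follows. It is not the contents of local_llm.py: the fallback model id, the regex, and the function names are assumptions, and `MemoryError` stands in for `torch.cuda.OutOfMemoryError` so the example runs without a GPU. In a real setup the VRAM figure could come from `torch.cuda.get_device_properties(0).total_memory` and the cache clear from `torch.cuda.empty_cache()`.

```python
import re
from typing import Callable

FALLBACK_MODEL = "Qwen/Qwen3-4B"  # assumed id for the 4B fallback
# Matches 7B, 8B, or 9B in a model name, but not 17B or 0.7B.
LARGE_MODEL_RE = re.compile(r"(?<![\d.])[789]B\b", re.IGNORECASE)


def pick_model(requested: str, vram_gb: float, min_vram_gb: float = 8.0) -> str:
    """Downgrade 7B/8B/9B model names to the 4B fallback on small GPUs."""
    if vram_gb < min_vram_gb and LARGE_MODEL_RE.search(requested):
        return FALLBACK_MODEL
    return requested


def extract_with_fallback(
    run_llm: Callable[[str], dict],
    run_regex: Callable[[str], dict],
    clear_cache: Callable[[], None],
    text: str,
    retries: int = 1,
) -> dict:
    """Try the LLM; on out-of-memory clear the cache and retry, then fall back to regex."""
    for _ in range(retries + 1):
        try:
            return run_llm(text)
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            clear_cache()
    return run_regex(text)
```

Passing the LLM call, the regex extractor, and the cache clear in as callables keeps the retry policy testable on a machine with no GPU at all.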

What I wanted	What I shipped
Mistral-7B batch extraction	Qwen3-4B 4-bit, batch_size=1
LLM on every notebook	LLM on ~10–20% (Tier 1 only)
Parallel eval	Serial eval, shared model singleton
Cloud-free everything	OpenRouter consensus optional, off by default

The tradeoff I accept: local autonomy over raw quality. RAG synthesis hit 100% on Promptfoo rubrics. Extraction still struggles with JSON validity and coarse task types on a 4B model. I know exactly where the ceiling is, and it is GPU-shaped.

What Would Change With 12GB+ VRAM

Four gigabytes forced a specific operating mode. Twelve gigabytes or more would not just make things faster. It would change what I turn on.

Capability	RTX 3050 (4GB, today)	12GB+ VRAM (e.g. RTX 3060/4070 laptop)
Primary model	Qwen3-4B 4-bit	Qwen2.5-Coder-7B or Mistral-7B 4-bit
VRAM fallback	Auto-downgrade 7B→4B in `local_llm.py`	Rarely triggered; 7B becomes default
Extraction batch	`EXTRACTION_BATCH_SIZE=1`	Batch 4–8 on Tier 1 docs
Tier 2 LLM	Regex-only for ~80–90% of corpus	Enable LLM on Tier 2 for richer strategy JSON
Code-only notebooks	`NOTEBOOK_LLM_INTELLIGENCE` off	Code-to-knowledge pass for medal notebooks without markdown
Multi-pass extraction	Focused retries on a tight budget	Full missing-section retries without emergency regex
Promptfoo eval	`-j 1` serial, ~53 min / 15 tests	`-j 2–4` parallel, model loaded once in worker
Ingest + eval	Never on same GPU	Still risky with a shared singleton, but less OOM fear
Quantization	Required (4-bit)	Optional 8-bit or fp16 for 7B on 12GB
Extraction quality	JSON glitches, coarse `task_type`	Longer context, stabler structured output
Consensus	OpenRouter only (local GPU reserved)	Room for local 7B + cloud consensus in parallel

Concrete config shift I would make on a 12GB machine:

EXTRACTION_MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
EXTRACTION_BATCH_SIZE=4
EXTRACTION_USE_4BIT=true
EXTRACTION_USE_LLM_FOR_TIER2=true
NOTEBOOK_LLM_INTELLIGENCE_ENABLED=true
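For context, this is one way settings like these could be read on the Python side. The variable names are the ones listed above; the dataclass, the loader, and the defaults (which mirror the 4GB column of the table) are assumptions for illustration, not the project's actual config module.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Treat common truthy strings as True, everything else as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ExtractionSettings:
    model: str
    batch_size: int
    use_4bit: bool
    llm_for_tier2: bool
    notebook_intelligence: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ExtractionSettings:
    """Read extraction settings from the environment, with assumed 4GB defaults."""
    return ExtractionSettings(
        model=env.get("EXTRACTION_MODEL", "Qwen/Qwen3-4B"),
        batch_size=int(env.get("EXTRACTION_BATCH_SIZE", "1")),
        use_4bit=_as_bool(env.get("EXTRACTION_USE_4BIT", "true")),
        llm_for_tier2=_as_bool(env.get("EXTRACTION_USE_LLM_FOR_TIER2", "false")),
        notebook_intelligence=_as_bool(
            env.get("NOTEBOOK_LLM_INTELLIGENCE_ENABLED", "false")
        ),
    )
```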

And for eval:

npx promptfoo@latest eval -c promptfoo/autokaggle_eval.yaml -j 2

The design docs originally assumed Mistral-7B batch inference on GPU, five to ten times faster per document than Ollama over HTTP. I could not run that comfortably on 4GB. On 12GB it becomes the intended architecture, not a wishlist item.

The tier system would also relax. Today Tier 2 exists partly because I cannot afford an LLM call per EDA notebook. With headroom, Tier 2 gets a cheap LLM pass too: regex for structure, model for interpretation.

Promptfoo already showed the cost of regex-only Tier 2: on the Spaceship Titanic writeup, the 4B model extracted TabNet/neural nets but missed the LightGBM/CatBoost weighted blend the writeup actually used. A 7B Tier-2 pass is the first extraction upgrade I would benchmark after a hardware swap (the 4B model is currently at 0/6 on P0-EXT).

What Winning Actually Requires

Something I should have written on a sticky note earlier:

Reading gives you ideas. Experiments give you truth.

The Kaggle playbook is learnable: CV strategy, feature engineering, ensembling, leakage suspicion, metric alignment, pseudo-labeling near deadline. An agent does not need to discover these from first principles. It needs to store them as {technique, context, impact}, detect when context matches, and validate on real CV scores. Not on public leaderboard hope.
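As a data structure, the store-and-match part is small. The sketch below assumes context is a set of tags and that impact is a CV score delta where higher is better; the class, the tag names, and the numbers in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    technique: str
    context: frozenset[str]  # tags such as {"tabular", "high_cardinality"}
    impact: float            # assumed: CV score delta, higher is better


def applicable(obs: Observation, competition_tags: set[str]) -> bool:
    """An observation applies when every tag in its context holds for the competition."""
    return obs.context <= competition_tags


def rank_candidates(memory: list[Observation],
                    competition_tags: set[str]) -> list[Observation]:
    """Keep only applicable observations and order them by measured impact."""
    matches = [o for o in memory if applicable(o, competition_tags)]
    return sorted(matches, key=lambda o: o.impact, reverse=True)
```

For a competition tagged `{"tabular", "high_cardinality", "imbalanced"}`, an observation whose context is `{"tabular", "linear_model"}` is filtered out no matter how large its recorded impact, because one of its preconditions does not hold.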

That is the loop I am building toward:

Question → targeted crawl → extract → validate → compile → experiment → update memory

Not random Playwright browsing. Not “scrape everything, store everything.” Search with intent. Cross-source agreement. Promote to verified memory only after experiments confirm impact.

The Research Agent Loop (Playwright in Practice)

Playwright is already in the codebase, but not for random Google surfing. Today it does three concrete jobs on Kaggle:

Command	Role
`autokaggle capture-session`	Browser login → save `storage_state` + Cookie/XSRF headers
`autokaggle discover-endpoints`	Watch network traffic → register `/api/i/` crawler paths
`autokaggle crawl` / `self-learn`	Authenticated fetch → `data/crawled/documents.jsonl` → ingest

Without capture-session, winner writeups often return shell pages and extraction quality collapses. Playwright here is authenticated Kaggle ingestion, not a general web scraper.

That is the Kaggle half of the research loop. The open-web half (papers, GitHub repos, blogs) is where query templates come in. The mistake I almost made was Playwright → Google → store everything. Noise at scale. The structured version looks like this:

Question
  → targeted queries (templates)
  → crawl / fetch (Playwright + APIs)
  → ingest (tier classify → extract)
  → validate (cross-source + quality gate)
  → compile (observations → wiki)
  → experiment (CV on held-out competition)
  → update memory (confidence, support, contradictions)

Query templates

Instead of open-ended browsing, I start from a question tied to a gap in memory:

“How to improve tabular models with categorical data?”
“Best augmentation techniques for small image datasets?”
“How to prevent overfitting in time series?”

Those become templated searches:

queries = [
    "kaggle {task} tricks",
    "{technique} machine learning paper",
    "{task} github solution kaggle",
    "{task} feature engineering techniques",
]
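Filling the templates is plain string formatting. A minimal sketch, where `fill_queries` is a hypothetical helper and a single `task` and `technique` are used for all templates (in practice each template may deserve its own wording):

```python
QUERY_TEMPLATES = [
    "kaggle {task} tricks",
    "{technique} machine learning paper",
    "{task} github solution kaggle",
    "{task} feature engineering techniques",
]


def fill_queries(task: str, technique: str,
                 templates: list[str] = QUERY_TEMPLATES) -> list[str]:
    """Fill each template; str.format ignores keyword arguments a template does not use."""
    filled = [t.format(task=task, technique=technique) for t in templates]
    # Drop exact duplicates while keeping the original order.
    return list(dict.fromkeys(filled))
```

Calling `fill_queries("tabular classification", "target encoding")` yields four searches, one per template.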

Worked example: tabular competition with categoricals

Suppose I am planning for a new binary classification competition with heavy categoricals and imbalanced targets. A memory query returns low confidence for alternatives to target encoding. The active-learning question becomes:

“What works better than target encoding for high-cardinality categoricals?”

Generated queries:

Template	Filled query
`kaggle {task} tricks`	`kaggle tabular classification tricks`
`{technique} machine learning paper`	`target encoding leakage machine learning paper`
`{task} github solution kaggle`	`tabular binary classification github solution kaggle`
`{task} feature engineering techniques`	`high cardinality categorical feature engineering techniques`

Source priority when results arrive:

Kaggle notebooks/discussions > GitHub repos > arXiv papers > blogs > random web

Each hit goes through the same pipeline as crawled Kaggle docs: parse, tier, extract to {technique, context, why, impact}, gate into observations. Cross-source agreement bumps confidence:

if technique appears in 5+ independent sources:
    increase_confidence()
if technique improves CV on proxy competition:
    mark as "proven"
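The same two rules as runnable Python, under assumptions: sources count as independent when their ids differ, the boost is a fixed 0.1 applied once when the fifth source arrives, and any positive CV delta on the proxy competition counts as proof. The record class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TechniqueRecord:
    name: str
    confidence: float = 0.5
    status: str = "candidate"
    sources: set[str] = field(default_factory=set)


def add_source(record: TechniqueRecord, source_id: str,
               min_sources: int = 5, boost: float = 0.1) -> None:
    """Register a source; boost confidence once, when the support threshold is first reached."""
    before = len(record.sources)
    record.sources.add(source_id)  # a repeated id does not add support
    if before < min_sources <= len(record.sources):
        record.confidence = min(1.0, record.confidence + boost)


def record_experiment(record: TechniqueRecord, baseline_cv: float, new_cv: float,
                      higher_is_better: bool = True) -> float:
    """Mark the technique as proven only if it improved CV on the proxy competition."""
    delta = new_cv - baseline_cv if higher_is_better else baseline_cv - new_cv
    if delta > 0:
        record.status = "proven"
    return delta
```

Keeping the two updates separate matters: agreement across sources raises confidence, but only a measured CV improvement changes the status.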

Active learning queries

The loop also searches for what the system does not know:

Memory signal	Generated query
Low confidence on `target_encoding`	`better alternatives to target encoding tabular`
High contradiction count on `PCA`	`PCA feature engineering tree models when to avoid`
Failed experiment on pseudo-labeling	`pseudo labeling confidence threshold best practices`
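Since this stage is still planned, the following is only a sketch of how the signals in the table could drive query generation. The thresholds, the stats layout, and the query wording are assumptions; the table's hand-written queries are richer than what these templates produce.

```python
def gap_queries(memory_stats: dict[str, dict],
                low_confidence: float = 0.4,
                max_contradictions: int = 2) -> list[str]:
    """Turn weak spots in memory into search queries, one rule per signal."""
    queries = []
    for technique, stats in memory_stats.items():
        name = technique.replace("_", " ")
        if stats.get("confidence", 1.0) < low_confidence:
            queries.append(f"better alternatives to {name} tabular")
        if stats.get("contradictions", 0) > max_contradictions:
            queries.append(f"{name} when to avoid")
        if stats.get("failed_experiments", 0) > 0:
            queries.append(f"{name} best practices")
    return queries
```

A technique with healthy confidence, few contradictions, and no failed experiments produces no query, so the crawl budget goes only to gaps.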

Playwright handles pages that need a browser (Kaggle auth, JS-rendered listings). arXiv and GitHub often go through direct HTTP. The unifying piece is not the browser. It is that every fetch enters the same ingest → validate → compile path, not a separate junk drawer.

Implementation stages (where I am)

Stage	Status
Step 1: Hardcoded 5–10 queries, scrape top results, run ingest	Designed; Kaggle crawl + ingest live
Step 2: LLM-generated queries from memory gaps	Planned
Step 3: Fully autonomous research → experiment loop	Target

The principle I keep repeating: reading gives ideas, experiments give truth. Playwright is a fetch tool. The research agent is the loop around it.

Where It Stands

AutoKaggle today is a research prototype, not a Kaggle Grandmaster in a box. It crawls, ingests with hybrid extraction, maintains hybrid memory, serves RAG, plans strategies, trains tabular models, and records what it tried. The eval harness keeps me honest with separate tracks for seeded fixture retrieval vs generic crawl honesty.

I still want to win competitions. But the project taught me that the path there runs through memory engineering, the unglamorous layer Karpathy and half the industry are suddenly building in public.

I started by asking how to beat the leaderboard. I ended up asking how an agent remembers what works, forgets what does not, and compounds knowledge across competitions on a gaming laptop and a four-gigabyte budget.

That was not the plan. It might be the more interesting project.