I Wanted to Win Kaggle. I Built a Memory System Instead.

I did not set out to build a memory system. I set out to win Kaggle.

That sounds naive written down, but it is the honest origin. I kept reading the same advice in winner writeups (stratified K-Fold, target encoding, pseudo-labeling, weighted blends) and thinking: this is not magic, it is a checklist. Top competitors are not discovering new mathematics every week. They are applying known tricks faster, with better feature creativity, and with CV discipline that most beginners skip.

So I started building AutoKaggle: an agent that reads winning solutions, extracts what worked, and generates ML pipelines for new competitions. Crawl notebooks. Parse strategies. Embed them. Retrieve them at plan time. Run LightGBM. Submit.

Simple architecture. Hard execution.

The First Wall Was Speed, Not Intelligence

My first ingest pipeline sent every crawled document to a local LLM. Ollama, Qwen, HTTP round-trips. Two to six minutes per notebook. Ninety documents meant an evening gone, and most of those notebooks were EDA tutorials with three useful lines buried in markdown.

The fix was not a bigger model. It was skipping the model entirely for most documents.

I added a three-layer extractor:

  1. Regex pulls # Feature Engineering, # Model, # Validation sections instantly.
  2. A keyword signal filter counts competition-relevant terms: LightGBM, GroupKFold, stacking, target encoding.
  3. A tier classifier sends only medal notebooks, winner writeups, and high-signal discussions to the LLM. Everything else gets regex-only strategy JSON.

That cut LLM calls by roughly ninety percent. Ingest went from hours to minutes. This was the first laptop-GPU lesson: compute is scarce; spend it deliberately.
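A minimal sketch of the three layers, not the actual AutoKaggle extractor. The term list, the document kind labels (medal_notebook, winner_writeup, discussion), and the min_signal threshold are all assumptions made for illustration.

```python
import re

# Hypothetical term list and labels, chosen only for illustration.
SIGNAL_TERMS = ("lightgbm", "groupkfold", "stacking", "target encoding")
TIER1_KINDS = {"medal_notebook", "winner_writeup"}
SECTION_RE = re.compile(
    r"^#+\s*(Feature Engineering|Model|Validation)\s*$(.*?)(?=^#+\s|\Z)",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,
)


def extract_sections(markdown: str) -> dict[str, str]:
    """Layer 1: pull the named sections with a regex, no model involved."""
    return {
        m.group(1).lower(): m.group(2).strip()
        for m in SECTION_RE.finditer(markdown)
    }


def signal_score(text: str) -> int:
    """Layer 2: count occurrences of competition-relevant terms."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in SIGNAL_TERMS)


def needs_llm(kind: str, text: str, min_signal: int = 3) -> bool:
    """Layer 3: only medal notebooks, writeups, and high-signal discussions."""
    if kind in TIER1_KINDS:
        return True
    return kind == "discussion" and signal_score(text) >= min_signal
```

Anything for which needs_llm returns False would keep only the output of extract_sections as its strategy record.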

The Second Wall Was Memory

Once ingest worked, retrieval failed in subtler ways.

The vector DB returned plausible chunks. The agent still made bad plans. Duplicated tricks piled up. One notebook said PCA helped; another said it destroyed tree models. Low-quality extractions became “observations” and polluted future runs. I ran audits and found fake observations, extraction failures, and semantic near-duplicates everywhere.

Around that time Karpathy posted about using LLMs to compile personal wikis: raw sources in, structured markdown concepts out, LLM as curator not search engine. Lex Fridman described a similar setup. I had been grinding on the same problem without the celebrity tweet, and it was validating and annoying in equal measure.

The insight I had missed: RAG is not memory.

Retrieval finds similar text. Memory decides what is true, current, and applicable. That requires gatekeeping, contradiction handling, confidence decay, and an experiment loop. Not another embedding model.
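To make "gatekeeping" and "confidence decay" concrete, here is one way such a policy could look. It is a sketch under assumptions: the half-life decay formula, the field names, and the thresholds are illustrative choices, not AutoKaggle's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    technique: str
    confidence: float    # 0.0 to 1.0
    support: int         # sources or experiments that agree
    contradictions: int  # sources or experiments that disagree
    age_days: float      # time since the last confirmation


def decayed_confidence(obs: Observation, half_life_days: float = 180.0) -> float:
    """Halve confidence every half_life_days without fresh confirmation."""
    return obs.confidence * 0.5 ** (obs.age_days / half_life_days)


def keep(obs: Observation, min_confidence: float = 0.2) -> bool:
    """Gate: drop observations that are mostly contradicted or too stale."""
    if obs.contradictions > obs.support:
        return False
    return decayed_confidence(obs) >= min_confidence
```

A vector index has no equivalent of keep: it returns a contradicted or stale chunk as readily as a confirmed one, as long as the text is similar.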

AutoKaggle grew a second brain:

  • SQLite for structured strategy records, trials, and atomic observations
  • Chroma for chunked semantic search
  • NetworkX for technique relationships (LightGBM → pairs with → target encoding)
  • A wiki compiler that clusters observations into concept pages, Karpathy-style, but for Kaggle tricks
  • Progressive RAG that scans high-confidence observations before dumping full vector context into the prompt

The competition pipeline stayed: analyze → plan → features → CV → train → ensemble → submit. But planning stopped being “nearest notebook chunk” and started being “verified observation + graph neighbor + retrieved evidence.”
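The ordering "verified observation, then graph neighbor, then retrieved evidence" can be sketched in a few lines. This is an illustration under assumptions: a plain dict stands in for the NetworkX graph, vector_search is a hypothetical callable standing in for a Chroma query, and the field names, threshold, and budget are made up.

```python
def progressive_context(task_type, observations, graph, vector_search,
                        min_confidence=0.7, budget=5):
    """Fill the prompt budget from the most trusted layer first."""
    context = []

    # Layer 1: high-confidence observations that match the task type.
    trusted = [o for o in observations
               if o["task_type"] == task_type and o["confidence"] >= min_confidence]
    trusted.sort(key=lambda o: o["confidence"], reverse=True)
    context += [("observation", o["technique"]) for o in trusted[:budget]]

    # Layer 2: graph neighbors of the techniques already selected.
    seen = {name for _, name in context}
    for _, name in list(context):
        for neighbor in graph.get(name, []):
            if neighbor not in seen and len(context) < budget:
                context.append(("graph", neighbor))
                seen.add(neighbor)

    # Layer 3: vector hits only fill whatever room is left.
    if len(context) < budget:
        for chunk in vector_search(task_type, budget - len(context)):
            context.append(("vector", chunk))
    return context
```

The design choice is that raw chunks never displace verified observations; they only fill the remaining budget.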

Three Memory Designs (and Why I Did Not Pick Just One)

Once I started comparing approaches, three patterns kept showing up. They look similar from a distance. They solve different problems.

| Dimension | Karpathy wiki | Supermemory | AutoKaggle |
|---|---|---|---|
| Primary goal | Personal research knowledge base | General agent memory OS | Competition strategy + experiment memory |
| Raw storage | raw/ articles, papers, repos | Conversations + documents | Crawled JSONL (notebooks, writeups, discussions) |
| Compiled layer | LLM-written markdown wiki with backlinks | Memory graph with profiles | Observations + wiki compiler + strategy graph |
| LLM role | Curator, editor, researcher | Memory policy engine | Extractor, compiler, planner (tier-gated) |
| Retrieval | Index files + summaries; light RAG at scale | Hybrid memory + RAG + profiles | Progressive RAG: observations → graph → vectors |
| Truth mechanism | Human Q&A + incremental wiki linting | Contradiction handling, forgetting, temporal updates | Confidence scoring, FAISS merge, experiment outcomes |
| Validation | Lint wiki for inconsistencies | Memory importance + decay policies | Cross-source support, CV score delta, observation gate |
| What gets forgotten | Manual / LLM-driven cleanup | Automatic temporal forgetting | Prune low-confidence, low-support, stale rows |
| Ontology | Concepts and links | User-centric memories | {technique, task_type, context, impact, evidence} |
| Best at | Thinking and research synthesis | Conversational agent personalization | “What trick for this tabular CV problem?” |
| Weak at | Automated ML execution | Domain-specific experiment tracking | Generic chat memory, user profiles |
| My takeaway | Steal the compiler idea | Steal policies (contradict, decay, forget) | Keep custom ontology + experiment loop |

Karpathy’s system is for thinking. Supermemory is for remembering users. AutoKaggle is for remembering what wins, and that needs experiment validation, not just good summaries.

I did not plug in Supermemory as the core. I borrowed its policy ideas and built the kind of compiler layer Karpathy described, grounded in Kaggle-specific observations and CV outcomes.

I Only Had a 4GB GPU

Same machine as my TinyReason experiments: an NVIDIA RTX 3050 laptop, four gigabytes of VRAM, perfectly adequate for games and completely inadequate for the models I wished I could run.

I wanted Mistral-7B or Qwen-7B on GPU with batch inference. Reality: 4-bit quantization or nothing. Batch size one or OOM. Serial Promptfoo evals or VRAM spikes. Never run ingest and eval at the same time.

The codebase encodes these constraints directly. local_llm.py detects VRAM below eight gigabytes and downgrades 7B/8B/9B names to a 4B fallback. BitsAndBytes 4-bit loading is default on CUDA. OOM triggers cache clear and retry, then regex emergency fallback. Promptfoo runs with -j 1, one test at a time, fifty-three minutes for fifteen tests, zero harness errors.
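The two behaviors described above, the name-based downgrade and the OOM retry chain, might look roughly like this. It is a sketch, not the contents of local_llm.py: the function names, the regex, and the fallback identifier are assumptions.

```python
import re

FALLBACK_MODEL = "Qwen/Qwen3-4B"  # assumed identifier for the 4B fallback
# Matches 7B/8B/9B but not 70B, 17B, or 2.7B.
LARGE_MODEL_RE = re.compile(r"(?<![\d.])[789]B\b", re.IGNORECASE)


def choose_model(requested: str, vram_gb: float) -> str:
    """Downgrade 7B/8B/9B model names to the 4B fallback below 8 GB of VRAM."""
    if vram_gb < 8 and LARGE_MODEL_RE.search(requested):
        return FALLBACK_MODEL
    return requested


def extract_with_fallback(doc, llm_extract, regex_extract, clear_cache,
                          oom_error=MemoryError, retries=1):
    """Try the LLM; on OOM clear the cache and retry, then fall back to regex."""
    for _ in range(retries + 1):
        try:
            return llm_extract(doc)
        except oom_error:
            clear_cache()
    return regex_extract(doc)
```

With PyTorch, vram_gb could come from torch.cuda.get_device_properties(0).total_memory / 1024**3, oom_error could be torch.cuda.OutOfMemoryError, and clear_cache could be torch.cuda.empty_cache. They are injected here so the sketch runs without a GPU.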

| What I wanted | What I shipped |
|---|---|
| Mistral-7B batch extraction | Qwen3-4B 4-bit, batch_size=1 |
| LLM on every notebook | LLM on ~10–20% (Tier 1 only) |
| Parallel eval | Serial eval, shared model singleton |
| Cloud-free everything | OpenRouter consensus optional, off by default |

The tradeoff I accept: local autonomy over raw quality. RAG synthesis hit 100% on Promptfoo rubrics. Extraction still struggles with JSON validity and coarse task types on a 4B model. I know exactly where the ceiling is, and it is GPU-shaped.

What Would Change With 12GB+ VRAM

Four gigabytes forced a specific operating mode. Twelve gigabytes or more would not just make things faster. It would change what I turn on.

| Capability | RTX 3050 (4GB, today) | 12GB+ VRAM (e.g. RTX 3060/4070 laptop) |
|---|---|---|
| Primary model | Qwen3-4B 4-bit | Qwen2.5-Coder-7B or Mistral-7B 4-bit |
| VRAM fallback | Auto-downgrade 7B→4B in local_llm.py | Rarely triggered; 7B becomes default |
| Extraction batch | EXTRACTION_BATCH_SIZE=1 | Batch 4–8 on Tier 1 docs |
| Tier 2 LLM | Regex-only for ~80–90% of corpus | Enable LLM on Tier 2 for richer strategy JSON |
| Code-only notebooks | NOTEBOOK_LLM_INTELLIGENCE off | Code-to-knowledge pass for medal notebooks without markdown |
| Multi-pass extraction | Focused retries budget-tight | Full missing-section retries without emergency regex |
| Promptfoo eval | -j 1 serial, ~54 min / 15 tests | -j 2–4 parallel, model loaded once in worker |
| Ingest + eval | Never on same GPU | Still risky shared singleton, but less OOM fear |
| Quantization | Required (4-bit) | Optional 8-bit or fp16 for 7B on 12GB |
| Extraction quality | JSON glitches, coarse task_type | Longer context, stabler structured output |
| Consensus | OpenRouter only (local GPU reserved) | Room for local 7B + cloud consensus in parallel |

Concrete config shift I would make on a 12GB machine:

EXTRACTION_MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
EXTRACTION_BATCH_SIZE=4
EXTRACTION_USE_4BIT=true
EXTRACTION_USE_LLM_FOR_TIER2=true
NOTEBOOK_LLM_INTELLIGENCE_ENABLED=true

And for eval:

npx promptfoo@latest eval -c promptfoo/autokaggle_eval.yaml -j 2

The design docs originally assumed Mistral-7B batch inference on GPU, five to ten times faster per document than Ollama over HTTP. I could not run that comfortably on 4GB. On 12GB it becomes the intended architecture, not a wishlist item.

The tier system would also relax. Today Tier 2 exists partly because I cannot afford an LLM call per EDA notebook. With headroom, Tier 2 gets a cheap LLM pass too: regex for structure, model for interpretation.

Promptfoo already showed the cost of regex-only Tier 2: on the Spaceship Titanic writeup, the 4B model extracted TabNet/neural nets but missed the LightGBM/CatBoost weighted blend the writeup actually used. A 7B Tier-2 pass is the first extraction upgrade I would benchmark after a hardware swap (currently 0/6 on P0-EXT with 4B).

What Winning Actually Requires

Something I should have written on a sticky note earlier:

Reading gives you ideas. Experiments give you truth.

The Kaggle playbook is learnable: CV strategy, feature engineering, ensembling, leakage suspicion, metric alignment, pseudo-labeling near deadline. An agent does not need to discover these from first principles. It needs to store them as {technique, context, impact}, detect when context matches, and validate on real CV scores. Not on public leaderboard hope.
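The {technique, context, impact} record and its two jobs, matching context and validating on CV, fit in a small class. This is an illustrative sketch: the class name, the exact-match rule for context, and the mean-delta test for "verified" are assumptions, and it presumes a higher-is-better metric.

```python
from dataclasses import dataclass, field


@dataclass
class Technique:
    technique: str
    context: dict   # conditions under which it was reported to help
    impact: float   # reported gain, from reading only
    cv_deltas: list = field(default_factory=list)  # measured gains

    def matches(self, competition: dict) -> bool:
        """Applies when every context key agrees with the competition."""
        return all(competition.get(k) == v for k, v in self.context.items())

    def record_experiment(self, baseline_cv: float, new_cv: float) -> None:
        """Store the CV delta; assumes higher is better (e.g. AUC)."""
        self.cv_deltas.append(new_cv - baseline_cv)

    @property
    def verified(self) -> bool:
        """Verified only once experiments exist and the mean delta is positive."""
        return bool(self.cv_deltas) and sum(self.cv_deltas) / len(self.cv_deltas) > 0
```

The reported impact never feeds into verified. Only measured CV deltas do, which is the distinction between ideas and truth.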

That is the loop I am building toward:

Question → targeted crawl → extract → validate → compile → experiment → update memory

Not random Playwright browsing. Not “scrape everything, store everything.” Search with intent. Cross-source agreement. Promote to verified memory only after experiments confirm impact.

The Research Agent Loop (Playwright in Practice)

Playwright is already in the codebase, but not for random Google surfing. Today it does three concrete jobs on Kaggle:

| Command | Role |
|---|---|
| autokaggle capture-session | Browser login → save storage_state + Cookie/XSRF headers |
| autokaggle discover-endpoints | Watch network traffic → register /api/i/ crawler paths |
| autokaggle crawl / self-learn | Authenticated fetch → data/crawled/documents.jsonl → ingest |

Without capture-session, winner writeups often return shell pages and extraction quality collapses. Playwright here is authenticated Kaggle ingestion, not a general web scraper.

That is the Kaggle half of the research loop. The open-web half (papers, GitHub repos, blogs) is where query templates come in. The mistake I almost made was Playwright → Google → store everything. Noise at scale. The structured version looks like this:

Question
  → targeted queries (templates)
  → crawl / fetch (Playwright + APIs)
  → ingest (tier classify → extract)
  → validate (cross-source + quality gate)
  → compile (observations → wiki)
  → experiment (CV on held-out competition)
  → update memory (confidence, support, contradictions)

Query templates

Instead of open-ended browsing, I start from a question tied to a gap in memory:

  • “How to improve tabular models with categorical data?”
  • “Best augmentation techniques for small image datasets?”
  • “How to prevent overfitting in time series?”

Those become templated searches:

queries = [
    "kaggle {task} tricks",
    "{technique} machine learning paper",
    "{task} github solution kaggle",
    "{task} feature engineering techniques",
]

Worked example: tabular competition with categoricals

Suppose I am planning for a new binary classification competition with heavy categoricals and imbalanced targets. A memory query returns low confidence on alternatives to target encoding. The active-learning question becomes:

“What works better than target encoding for high-cardinality categoricals?”

Generated queries:

| Template | Filled query |
|---|---|
| kaggle {task} tricks | kaggle tabular classification tricks |
| {technique} machine learning paper | target encoding leakage machine learning paper |
| {task} github solution kaggle | tabular binary classification github solution kaggle |
| {task} feature engineering techniques | high cardinality categorical feature engineering techniques |
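Filling the templates is plain string formatting. A minimal sketch, with fill_queries as a hypothetical helper name:

```python
TEMPLATES = [
    "kaggle {task} tricks",
    "{technique} machine learning paper",
    "{task} github solution kaggle",
    "{task} feature engineering techniques",
]


def fill_queries(templates, task, technique):
    """Fill each template; str.format ignores keyword arguments it does not use."""
    return [t.format(task=task, technique=technique) for t in templates]


queries = fill_queries(TEMPLATES, "tabular classification", "target encoding leakage")
```

This version uses one task string for every template. The worked example varies the task slot per template ("tabular binary classification", "high cardinality categorical"), which would need a per-template value instead of a single argument.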

Source priority when results arrive:

Kaggle notebooks/discussions > GitHub repos > arXiv papers > blogs > random web

Each hit goes through the same pipeline as crawled Kaggle docs: parse, tier, extract to {technique, context, why, impact}, gate into observations. Cross-source agreement bumps confidence:

if technique appears in 5+ independent sources:
    increase_confidence()
if technique improves CV on proxy competition:
    mark as "proven"
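A runnable version of that rule, as a sketch. Everything beyond the two conditions is an assumption: independence is approximated by distinct (site, author) pairs, the confidence bump is a fixed 0.1, and the field names are hypothetical.

```python
def independent_sources(hits):
    """Count distinct (site, author) pairs so forks by one author count once."""
    return {(h["site"], h["author"]) for h in hits}


def update_observation(obs, hits, cv_delta=None, min_sources=5, bump=0.1):
    """Bump confidence on broad agreement; mark proven only on a measured CV gain."""
    updated = dict(obs)  # leave the caller's record untouched
    if len(independent_sources(hits)) >= min_sources:
        updated["confidence"] = min(1.0, updated["confidence"] + bump)
    if cv_delta is not None and cv_delta > 0:
        updated["status"] = "proven"
    return updated
```

Agreement and proof stay separate on purpose: five sources repeating the same claim raise confidence, but only a positive CV delta on the proxy competition changes the status.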

Active learning queries

The loop also searches for what the system does not know:

| Memory signal | Generated query |
|---|---|
| Low confidence on target_encoding | better alternatives to target encoding tabular |
| High contradiction count on PCA | PCA feature engineering tree models when to avoid |
| Failed experiment on pseudo-labeling | pseudo labeling confidence threshold best practices |
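The signal-to-query mapping can be rule-based before any LLM is involved. A hedged sketch: the memory layout, the thresholds, and the query wording are assumptions, and the real queries in the table above are richer than these templates.

```python
def gap_queries(memory, low_confidence=0.4, max_contradictions=3):
    """Turn weak spots in memory into search queries, one per technique."""
    queries = []
    for name, row in memory.items():
        label = name.replace("_", " ")
        if row.get("failed_experiments", 0) > 0:
            queries.append(f"{label} best practices")
        elif row.get("contradictions", 0) >= max_contradictions:
            queries.append(f"{label} when to avoid")
        elif row.get("confidence", 1.0) < low_confidence:
            queries.append(f"better alternatives to {label}")
    return queries
```

Techniques with high confidence and no contradictions produce no query, so the crawl budget goes only to what memory is unsure about.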

Playwright handles pages that need a browser (Kaggle auth, JS-rendered listings). arXiv and GitHub often go through direct HTTP. The unifying piece is not the browser. It is that every fetch enters the same ingest → validate → compile path, not a separate junk drawer.

Implementation stages (where I am)

| Stage | Status |
|---|---|
| Step 1: Hardcoded 5–10 queries, scrape top results, run ingest | Designed; Kaggle crawl + ingest live |
| Step 2: LLM-generated queries from memory gaps | Planned |
| Step 3: Fully autonomous research → experiment loop | Target |

The principle I keep repeating: reading gives ideas, experiments give truth. Playwright is a fetch tool. The research agent is the loop around it.

Where It Stands

AutoKaggle today is a research prototype, not a Kaggle Grandmaster in a box. It crawls, ingests with hybrid extraction, maintains hybrid memory, serves RAG, plans strategies, trains tabular models, and records what it tried. The eval harness keeps me honest with separate tracks for seeded fixture retrieval vs generic crawl honesty.

I still want to win competitions. But the project taught me that the path there runs through memory engineering, the unglamorous layer Karpathy and half the industry are suddenly building in public.

I started by asking how to beat the leaderboard. I ended up asking how an agent remembers what works, forgets what does not, and compounds knowledge across competitions on a gaming laptop and a four-gigabyte budget.

That was not the plan. It might be the more interesting project.