I Wanted to Win Kaggle. I Built a Memory System Instead.
I did not set out to build a memory system. I set out to win Kaggle.
That sounds naive written down, but it is the honest origin. I kept reading the same advice in winner writeups (stratified K-Fold, target encoding, pseudo-labeling, weighted blends) and thinking: this is not magic, it is a checklist. Top competitors are not discovering new mathematics every week. They are applying known tricks faster, with better feature creativity, and with CV discipline that most beginners skip.
So I started building AutoKaggle: an agent that reads winning solutions, extracts what worked, and generates ML pipelines for new competitions. Crawl notebooks. Parse strategies. Embed them. Retrieve them at plan time. Run LightGBM. Submit.
Simple architecture. Hard execution.
The First Wall Was Speed, Not Intelligence
My first ingest pipeline sent every crawled document to a local LLM. Ollama, Qwen, HTTP round-trips. Two to six minutes per notebook. Ninety documents meant an evening gone, and most of those notebooks were EDA tutorials with three useful lines buried in markdown.
The fix was not a bigger model. It was not calling the model at all for most documents.
I added a three-layer extractor:
- Regex pulls # Feature Engineering, # Model, and # Validation sections instantly.
- A keyword signal filter counts competition-relevant terms: LightGBM, GroupKFold, stacking, target encoding.
- A tier classifier sends only medal notebooks, winner writeups, and high-signal discussions to the LLM. Everything else gets regex-only strategy JSON.
That cut LLM calls by roughly ninety percent. Ingest went from hours to minutes. This was the first laptop-GPU lesson: compute is scarce; spend it deliberately.
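A minimal sketch of how the three layers could fit together. Every name here (SECTION_RE, SIGNAL_TERMS, route, the document type labels, the min_signal threshold) is hypothetical; the real extractor is not shown in this post, so treat this as an illustration of the idea rather than the actual code.

```python
import re
from dataclasses import dataclass, field

# Hypothetical names and thresholds; this only illustrates the three layers.
SECTION_RE = re.compile(
    r"^#+[ \t]*(Feature Engineering|Model|Validation)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)
SIGNAL_TERMS = ("lightgbm", "groupkfold", "stacking", "target encoding")
LLM_DOC_TYPES = {"medal_notebook", "winner_writeup"}


@dataclass
class Routed:
    sections: dict = field(default_factory=dict)
    signal: int = 0
    use_llm: bool = False


def pull_sections(markdown: str) -> dict:
    """Layer 1: cut the text at the three known headings.

    A section runs until the next *known* heading or the end of the text.
    """
    matches = list(SECTION_RE.finditer(markdown))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections[m.group(1).lower()] = markdown[m.end():end].strip()
    return sections


def signal_score(text: str) -> int:
    """Layer 2: count occurrences of competition-relevant terms."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in SIGNAL_TERMS)


def route(doc_type: str, text: str, min_signal: int = 3) -> Routed:
    """Layer 3: only Tier 1 documents are flagged for the LLM."""
    score = signal_score(text)
    tier1 = doc_type in LLM_DOC_TYPES or (
        doc_type == "discussion" and score >= min_signal
    )
    return Routed(pull_sections(text), score, tier1)
```

With this shape, an EDA notebook still yields regex sections and a signal score, but use_llm stays False, which is where the saving in LLM calls would come from.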
The Second Wall Was Memory
Once ingest worked, retrieval failed in subtler ways.
The vector DB returned plausible chunks. The agent still made bad plans. Duplicated tricks piled up. One notebook said PCA helped; another said it destroyed tree models. Low-quality extractions became “observations” and polluted future runs. I ran audits and found fake observations, extraction failures, and semantic near-duplicates everywhere.
Around that time Karpathy posted about using LLMs to compile personal wikis: raw sources in, structured markdown concepts out, LLM as curator not search engine. Lex Fridman described a similar setup. I had been grinding on the same problem without the celebrity tweet, and it was validating and annoying in equal measure.
The insight I had missed: RAG is not memory.
Retrieval finds similar text. Memory decides what is true, current, and applicable. That requires gatekeeping, contradiction handling, confidence decay, and an experiment loop. Not another embedding model.
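To make "gatekeeping" and "confidence decay" concrete, here is a small sketch. The Observation fields, the exponential half-life rule, and the thresholds are all assumptions chosen for illustration; the post does not specify how AutoKaggle actually decays or gates observations.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    technique: str
    confidence: float  # 0.0 to 1.0 when the observation was last confirmed
    support: int       # number of independent sources backing it
    age_days: float    # days since it was last confirmed


def decayed_confidence(obs: Observation, half_life_days: float = 180.0) -> float:
    """Assumed decay rule: confidence halves every half_life_days."""
    return obs.confidence * 0.5 ** (obs.age_days / half_life_days)


def passes_gate(
    obs: Observation, min_confidence: float = 0.4, min_support: int = 2
) -> bool:
    """Admit an observation only if it is both supported and still fresh."""
    return (
        obs.support >= min_support
        and decayed_confidence(obs) >= min_confidence
    )
```

The point of the gate is that a retrieved chunk can be perfectly similar to the query and still be rejected, because similarity never enters the decision.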
AutoKaggle grew a second brain:
- SQLite for structured strategy records, trials, and atomic observations
- Chroma for chunked semantic search
- NetworkX for technique relationships (LightGBM → pairs with → target encoding)
- A wiki compiler that clusters observations into concept pages — Karpathy-style, but for Kaggle tricks
- Progressive RAG that scans high-confidence observations before dumping full vector context into the prompt
The competition pipeline stayed: analyze → plan → features → CV → train → ensemble → submit. But planning stopped being “nearest notebook chunk” and started being “verified observation + graph neighbor + retrieved evidence.”
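The ordering "verified observation + graph neighbor + retrieved evidence" can be sketched as a three-stage context builder. This is an illustration under assumptions: the function name, the dict shapes, and the thresholds are hypothetical, a plain dict stands in for the NetworkX graph, and vector_search is any callable that wraps the Chroma query.

```python
from typing import Callable


def progressive_context(
    task_type: str,
    observations: list[dict],
    neighbors: dict[str, list[str]],
    vector_search: Callable[[str, int], list[str]],
    max_items: int = 5,
    min_confidence: float = 0.7,
) -> list[str]:
    context: list[str] = []

    # Stage 1: high-confidence observations for this task type, best first.
    trusted = [
        o for o in observations
        if o["task_type"] == task_type and o["confidence"] >= min_confidence
    ]
    trusted.sort(key=lambda o: o["confidence"], reverse=True)
    selected = trusted[:max_items]
    for o in selected:
        context.append(f"observation: {o['technique']} ({o['impact']})")

    # Stage 2: techniques the graph links to the ones already selected.
    seen = {o["technique"] for o in selected}
    for o in selected:
        for other in neighbors.get(o["technique"], []):
            if other not in seen and len(context) < max_items:
                seen.add(other)
                context.append(f"graph neighbor: {other}")

    # Stage 3: raw vector chunks only fill whatever room is left.
    remaining = max_items - len(context)
    if remaining > 0:
        context.extend(f"chunk: {c}" for c in vector_search(task_type, remaining))
    return context
```

Vector chunks are fetched last and only for the leftover budget, so a well-populated observation store naturally crowds out unverified text.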
Three Memory Designs (and Why I Did Not Pick Just One)
Once I started comparing approaches, three patterns kept showing up. They look similar from a distance. They solve different problems.
| Dimension | Karpathy wiki | Supermemory | AutoKaggle |
|---|---|---|---|
| Primary goal | Personal research knowledge base | General agent memory OS | Competition strategy + experiment memory |
| Raw storage | raw/ articles, papers, repos | Conversations + documents | Crawled JSONL (notebooks, writeups, discussions) |
| Compiled layer | LLM-written markdown wiki with backlinks | Memory graph with profiles | Observations + wiki compiler + strategy graph |
| LLM role | Curator, editor, researcher | Memory policy engine | Extractor, compiler, planner (tier-gated) |
| Retrieval | Index files + summaries; light RAG at scale | Hybrid memory + RAG + profiles | Progressive RAG: observations → graph → vectors |
| Truth mechanism | Human Q&A + incremental wiki linting | Contradiction handling, forgetting, temporal updates | Confidence scoring, FAISS merge, experiment outcomes |
| Validation | Lint wiki for inconsistencies | Memory importance + decay policies | Cross-source support, CV score delta, observation gate |
| What gets forgotten | Manual / LLM-driven cleanup | Automatic temporal forgetting | Prune low-confidence, low-support, stale rows |
| Ontology | Concepts and links | User-centric memories | {technique, task_type, context, impact, evidence} |
| Best at | Thinking and research synthesis | Conversational agent personalization | “What trick for this tabular CV problem?” |
| Weak at | Automated ML execution | Domain-specific experiment tracking | Generic chat memory, user profiles |
| My takeaway | Steal the compiler idea | Steal policies (contradict, decay, forget) | Keep custom ontology + experiment loop |
Karpathy’s system is for thinking. Supermemory is for remembering users. AutoKaggle is for remembering what wins, and that needs experiment validation, not just good summaries.
I did not plug in Supermemory as the core. I borrowed its policy ideas and built a compiler layer that Karpathy described, but grounded in Kaggle-specific observations and CV outcomes.
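As a rough illustration of the AutoKaggle column, the ontology {technique, task_type, context, impact, evidence} maps naturally onto a SQLite table, and pruning becomes a single DELETE. The schema, the extra bookkeeping columns, and the thresholds below are assumptions, as is the choice to prune a row when any one of the three conditions holds.

```python
import sqlite3

# Hypothetical schema: the five ontology fields plus bookkeeping columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observations (
        id INTEGER PRIMARY KEY,
        technique TEXT NOT NULL,
        task_type TEXT NOT NULL,
        context TEXT,
        impact TEXT,
        evidence TEXT,
        confidence REAL NOT NULL,
        support INTEGER NOT NULL,
        age_days INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO observations "
    "(technique, task_type, context, impact, evidence, confidence, support, age_days) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("target_encoding", "tabular", "high-cardinality cats", "positive", "writeup", 0.8, 4, 30),
        ("pca", "tabular", "tree models", "mixed", "notebook", 0.2, 3, 30),
        ("pseudo_labeling", "tabular", "near deadline", "positive", "discussion", 0.7, 3, 900),
    ],
)


def prune(
    db: sqlite3.Connection,
    min_confidence: float = 0.3,
    min_support: int = 2,
    max_age_days: int = 365,
) -> int:
    """Delete low-confidence, low-support, or stale rows; return the count."""
    cur = db.execute(
        "DELETE FROM observations "
        "WHERE confidence < ? OR support < ? OR age_days > ?",
        (min_confidence, min_support, max_age_days),
    )
    db.commit()
    return cur.rowcount
```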
I Only Had a 4GB GPU
Same machine as my TinyReason experiments: an NVIDIA RTX 3050 laptop, four gigabytes of VRAM, perfectly adequate for games and completely inadequate for the models I wished I could run.
I wanted Mistral-7B or Qwen-7B on GPU with batch inference. Reality: 4-bit quantization or nothing. Batch size one or OOM. Serial Promptfoo evals or VRAM spikes. Never run ingest and eval at the same time.
The codebase encodes these constraints directly. local_llm.py detects VRAM below eight gigabytes and downgrades 7B/8B/9B names to a 4B fallback. BitsAndBytes 4-bit loading is default on CUDA. OOM triggers cache clear and retry, then regex emergency fallback. Promptfoo runs with -j 1, one test at a time, fifty-three minutes for fifteen tests, zero harness errors.
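The downgrade and fallback logic described above could look roughly like this. It is a sketch, not the contents of local_llm.py: the function names, the regex, and the fallback model identifier are assumptions, and MemoryError stands in for the GPU out-of-memory error so the example runs without PyTorch (in PyTorch the counterparts would presumably be torch.cuda.OutOfMemoryError and torch.cuda.empty_cache()).

```python
import re
from typing import Callable

FALLBACK_MODEL = "Qwen/Qwen3-4B"  # assumed identifier for the 4B fallback
# Matches 7B, 8B, or 9B as a size tag, but not 0.7B or 17B.
LARGE_MODEL_RE = re.compile(r"(?<![\d.])[789]B\b", re.IGNORECASE)


def choose_model(requested: str, vram_gb: float, min_vram_gb: float = 8.0) -> str:
    """Downgrade 7B/8B/9B model names when VRAM is below the threshold."""
    if vram_gb < min_vram_gb and LARGE_MODEL_RE.search(requested):
        return FALLBACK_MODEL
    return requested


def generate_with_fallback(
    generate: Callable[[str], dict],
    regex_extract: Callable[[str], dict],
    clear_cache: Callable[[], None],
    text: str,
    retries: int = 1,
) -> dict:
    """Try the model; on OOM clear the cache and retry, then fall back to regex."""
    for _ in range(retries + 1):
        try:
            return generate(text)
        except MemoryError:  # stand-in for the GPU out-of-memory exception
            clear_cache()
    return regex_extract(text)
```

Passing the VRAM size in as an argument keeps the decision testable on a machine with no GPU at all.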
| What I wanted | What I shipped |
|---|---|
| Mistral-7B batch extraction | Qwen3-4B 4-bit, batch_size=1 |
| LLM on every notebook | LLM on ~10–20% (Tier 1 only) |
| Parallel eval | Serial eval, shared model singleton |
| Cloud-free everything | OpenRouter consensus optional, off by default |
The tradeoff I accept: local autonomy over raw quality. RAG synthesis hit 100% on Promptfoo rubrics. Extraction still struggles with JSON validity and coarse task types on a 4B model. I know exactly where the ceiling is, and it is GPU-shaped.
What Would Change With 12GB+ VRAM
Four gigabytes forced a specific operating mode. Twelve gigabytes or more would not just make things faster. It would change what I turn on.
| Capability | RTX 3050 (4GB, today) | 12GB+ VRAM (e.g. desktop RTX 3060 12GB or RTX 4070) |
|---|---|---|
| Primary model | Qwen3-4B 4-bit | Qwen2.5-Coder-7B or Mistral-7B 4-bit |
| VRAM fallback | Auto-downgrade 7B→4B in local_llm.py | Rarely triggered; 7B becomes default |
| Extraction batch | EXTRACTION_BATCH_SIZE=1 | Batch 4–8 on Tier 1 docs |
| Tier 2 LLM | Regex-only for ~80–90% of corpus | Enable LLM on Tier 2 for richer strategy JSON |
| Code-only notebooks | NOTEBOOK_LLM_INTELLIGENCE off | Code-to-knowledge pass for medal notebooks without markdown |
| Multi-pass extraction | Focused retries budget-tight | Full missing-section retries without emergency regex |
| Promptfoo eval | -j 1 serial, ~54 min / 15 tests | -j 2–4 parallel, model loaded once in worker |
| Ingest + eval | Never on same GPU | Still risky shared singleton, but less OOM fear |
| Quantization | Required (4-bit) | Optional: 8-bit fits a 7B model in 12GB; fp16 (~14GB of weights) needs more |
| Extraction quality | JSON glitches, coarse task_type | Longer context, stabler structured output |
| Consensus | OpenRouter only (local GPU reserved) | Room for local 7B + cloud consensus in parallel |
Concrete config shift I would make on a 12GB machine:
EXTRACTION_MODEL=Qwen/Qwen2.5-Coder-7B-Instruct
EXTRACTION_BATCH_SIZE=4
EXTRACTION_USE_4BIT=true
EXTRACTION_USE_LLM_FOR_TIER2=true
NOTEBOOK_LLM_INTELLIGENCE_ENABLED=true
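One way these settings might be read at startup is a small typed loader. The variable names come from the config above; the ExtractionConfig class, the load_config function, and the default values (chosen to mirror the 4GB setup described earlier) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean setting such as 'true' or '1'."""
    return env.get(name, str(default)).strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class ExtractionConfig:
    model: str
    batch_size: int
    use_4bit: bool
    llm_for_tier2: bool
    notebook_llm: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> ExtractionConfig:
    """Read settings from the environment; defaults mirror the 4GB setup."""
    env = os.environ if env is None else env
    return ExtractionConfig(
        model=env.get("EXTRACTION_MODEL", "Qwen/Qwen3-4B"),  # assumed default
        batch_size=int(env.get("EXTRACTION_BATCH_SIZE", "1")),
        use_4bit=_flag(env, "EXTRACTION_USE_4BIT", True),
        llm_for_tier2=_flag(env, "EXTRACTION_USE_LLM_FOR_TIER2", False),
        notebook_llm=_flag(env, "NOTEBOOK_LLM_INTELLIGENCE_ENABLED", False),
    )
```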
And for eval:
npx promptfoo@latest eval -c promptfoo/autokaggle_eval.yaml -j 2
The design docs originally assumed Mistral-7B batch inference on GPU, five to ten times faster per document than Ollama over HTTP. I could not run that comfortably on 4GB. On 12GB it becomes the intended architecture, not a wishlist item.
The tier system would also relax. Today Tier 2 exists partly because I cannot afford an LLM call per EDA notebook. With headroom, Tier 2 gets a cheap LLM pass too: regex for structure, model for interpretation.
Promptfoo already showed the cost of regex-only Tier 2: on the Spaceship Titanic writeup, the 4B model extracted TabNet/neural nets but missed the LightGBM/CatBoost weighted blend the writeup actually used. A 7B Tier-2 pass is the first extraction upgrade I would benchmark after a hardware swap (currently 0/6 on P0-EXT with 4B).
What Winning Actually Requires
Something I should have written on a sticky note earlier:
Reading gives you ideas. Experiments give you truth.
The Kaggle playbook is learnable: CV strategy, feature engineering, ensembling, leakage suspicion, metric alignment, pseudo-labeling near deadline. An agent does not need to discover these from first principles. It needs to store them as {technique, context, impact}, detect when context matches, and validate on real CV scores. Not on public leaderboard hope.
That is the loop I am building toward:
Question → targeted crawl → extract → validate → compile → experiment → update memory
Not random Playwright browsing. Not “scrape everything, store everything.” Search with intent. Cross-source agreement. Promote to verified memory only after experiments confirm impact.
The Research Agent Loop (Playwright in Practice)
Playwright is already in the codebase, but not for random Google surfing. Today it does three concrete jobs on Kaggle:
| Command | Role |
|---|---|
| autokaggle capture-session | Browser login → save storage_state + Cookie/XSRF headers |
| autokaggle discover-endpoints | Watch network traffic → register /api/i/ crawler paths |
| autokaggle crawl / self-learn | Authenticated fetch → data/crawled/documents.jsonl → ingest |
Without capture-session, winner writeups often return shell pages and extraction quality collapses. Playwright here is authenticated Kaggle ingestion, not a general web scraper.
That is the Kaggle half of the research loop. The open-web half (papers, GitHub repos, blogs) is where query templates come in. The mistake I almost made was Playwright → Google → store everything. Noise at scale. The structured version looks like this:
Question
→ targeted queries (templates)
→ crawl / fetch (Playwright + APIs)
→ ingest (tier classify → extract)
→ validate (cross-source + quality gate)
→ compile (observations → wiki)
→ experiment (CV on held-out competition)
→ update memory (confidence, support, contradictions)
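The structural point of this diagram, that every fetched document passes through the same ordered stages, can be sketched as a tiny driver. All names here are hypothetical, and each stage is just a function from a list of documents to a list of documents; the real stages would do the tiering, extraction, and validation described above.

```python
from typing import Callable

Doc = dict
Stage = Callable[[list[Doc]], list[Doc]]


def run_research_loop(
    question: str,
    make_queries: Callable[[str], list[str]],
    fetch: Callable[[str], list[Doc]],
    stages: list[tuple[str, Stage]],
) -> tuple[list[Doc], list[tuple[str, int]]]:
    """Fetch for every query, then push all documents through each stage."""
    docs: list[Doc] = []
    for query in make_queries(question):
        docs.extend(fetch(query))

    # The trace records how many documents survive each step.
    trace = [("fetch", len(docs))]
    for name, stage in stages:
        docs = stage(docs)
        trace.append((name, len(docs)))
    return docs, trace
```

Keeping a per-stage count makes it easy to see where noise is being dropped, for example how many fetched pages fail the quality gate.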
Query templates
Instead of open-ended browsing, I start from a question tied to a gap in memory:
- “How to improve tabular models with categorical data?”
- “Best augmentation techniques for small image datasets?”
- “How to prevent overfitting in time series?”
Those become templated searches:
queries = [
    "kaggle {task} tricks",
    "{technique} machine learning paper",
    "{task} github solution kaggle",
    "{task} feature engineering techniques",
]
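Filling the templates is a one-liner with str.format, which ignores keyword arguments a template does not use. The fill_queries helper is a hypothetical name, and this sketch assumes a single task string and a single technique string per question.

```python
QUERY_TEMPLATES = [
    "kaggle {task} tricks",
    "{technique} machine learning paper",
    "{task} github solution kaggle",
    "{task} feature engineering techniques",
]


def fill_queries(task: str, technique: str, templates=QUERY_TEMPLATES) -> list[str]:
    """Substitute the task and technique into every template."""
    return [t.format(task=task, technique=technique) for t in templates]


filled = fill_queries("tabular classification", "target encoding")
```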
Worked example: tabular competition with categoricals
Suppose I am planning for a new binary classification competition with heavy categoricals and imbalanced targets. Memory query returns low-confidence on target encoding alternatives. The active-learning question becomes:
“What works better than target encoding for high-cardinality categoricals?”
Generated queries:
| Template | Filled query |
|---|---|
| kaggle {task} tricks | kaggle tabular classification tricks |
| {technique} machine learning paper | target encoding leakage machine learning paper |
| {task} github solution kaggle | tabular binary classification github solution kaggle |
| {task} feature engineering techniques | high cardinality categorical feature engineering techniques |
Source priority when results arrive:
Kaggle notebooks/discussions > GitHub repos > arXiv papers > blogs > random web
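That ordering can be applied with a stable sort, so hits from the same kind of source keep the order the search returned them in. The source labels and the rank_hits helper are assumptions made for this sketch.

```python
# Assumed labels; anything not listed is treated as "random web" and ranks last.
SOURCE_PRIORITY = ("kaggle", "github", "arxiv", "blog")


def rank_hits(hits: list[dict]) -> list[dict]:
    """Order search hits by source priority; the sort is stable within a source."""
    order = {name: i for i, name in enumerate(SOURCE_PRIORITY)}
    return sorted(hits, key=lambda h: order.get(h["source"], len(SOURCE_PRIORITY)))
```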
Each hit goes through the same pipeline as crawled Kaggle docs: parse, tier, extract to {technique, context, why, impact}, gate into observations. Cross-source agreement bumps confidence:
if technique appears in 5+ independent sources:
    increase_confidence()
if technique improves CV on proxy competition:
    mark as "proven"
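A runnable version of that rule might look like the following. The record shape, the bump size, and the choice to bump once for each new source at or beyond the threshold are assumptions; the pseudocode above does not pin those details down.

```python
from dataclasses import dataclass, field


@dataclass
class TechniqueRecord:
    name: str
    confidence: float = 0.5
    sources: set = field(default_factory=set)
    status: str = "candidate"


def add_source(
    rec: TechniqueRecord, source_id: str, min_sources: int = 5, bump: float = 0.1
) -> None:
    """Count each source once; bump confidence for new sources at the threshold."""
    is_new = source_id not in rec.sources
    rec.sources.add(source_id)
    if is_new and len(rec.sources) >= min_sources:
        rec.confidence = min(1.0, rec.confidence + bump)


def record_experiment(
    rec: TechniqueRecord,
    baseline_cv: float,
    new_cv: float,
    higher_is_better: bool = True,
    min_delta: float = 0.0,
) -> float:
    """Mark the technique proven only if CV improved by more than min_delta."""
    delta = new_cv - baseline_cv if higher_is_better else baseline_cv - new_cv
    if delta > min_delta:
        rec.status = "proven"
    return delta
```

Tracking source identifiers in a set, rather than a counter, is what keeps ten forks of the same notebook from counting as ten independent confirmations.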
Active learning queries
The loop also searches for what the system does not know:
| Memory signal | Generated query |
|---|---|
| Low confidence on target_encoding | better alternatives to target encoding tabular |
| High contradiction count on PCA | PCA feature engineering tree models when to avoid |
| Failed experiment on pseudo-labeling | pseudo labeling confidence threshold best practices |
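The mapping from memory signals to queries can be sketched as a few rules over per-technique statistics. The stats keys, the thresholds, and the generic "best practices" wording for failed experiments are assumptions; the first two query patterns are taken from the table above.

```python
def gap_queries(
    memory: dict[str, dict],
    low_confidence: float = 0.4,
    max_contradictions: int = 2,
) -> list[str]:
    """Turn weak spots in memory into search queries."""
    queries = []
    for technique, stats in memory.items():
        name = technique.replace("_", " ")
        if stats.get("confidence", 1.0) < low_confidence:
            queries.append(f"better alternatives to {name} tabular")
        if stats.get("contradictions", 0) > max_contradictions:
            queries.append(f"{name} feature engineering tree models when to avoid")
        if stats.get("failed_experiments", 0) > 0:
            queries.append(f"{name} best practices")
    return queries
```

Step 2 in the implementation stages would presumably replace these fixed rules with LLM-generated queries, but a rule-based version is enough to drive the loop from real gaps instead of open-ended browsing.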
Playwright handles pages that need a browser (Kaggle auth, JS-rendered listings). arXiv and GitHub often go through direct HTTP. The unifying piece is not the browser. It is that every fetch enters the same ingest → validate → compile path, not a separate junk drawer.
Implementation stages (where I am)
| Stage | Status |
|---|---|
| Step 1: Hardcoded 5–10 queries, scrape top results, run ingest | Designed; Kaggle crawl + ingest live |
| Step 2: LLM-generated queries from memory gaps | Planned |
| Step 3: Fully autonomous research → experiment loop | Target |
The principle I keep repeating: reading gives ideas, experiments give truth. Playwright is a fetch tool. The research agent is the loop around it.
Where It Stands
AutoKaggle today is a research prototype, not a Kaggle Grandmaster in a box. It crawls, ingests with hybrid extraction, maintains hybrid memory, serves RAG, plans strategies, trains tabular models, and records what it tried. The eval harness keeps me honest with separate tracks for seeded fixture retrieval vs generic crawl honesty.
I still want to win competitions. But the project taught me that the path there runs through memory engineering, the unglamorous layer Karpathy and half the industry are suddenly building in public.
I started by asking how to beat the leaderboard. I ended up asking how an agent remembers what works, forgets what does not, and compounds knowledge across competitions on a gaming laptop and a four-gigabyte budget.
That was not the plan. It might be the more interesting project.