cat reading_list.txt
Papers, posts, repos, and people I keep coming back to. Updated as I read.
ls papers/
-
paper
Attention Is All You Need
The one that started it all. Still the clearest explanation of why scaled dot-product attention works.
-
paper
LoRA: Low-Rank Adaptation of Large Language Models
The insight that fine-tuning weight updates are low-rank changed how I think about parameter efficiency.
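The low-rank idea fits in a few lines of pure Python. A minimal sketch, with toy sizes I made up (the paper's actual ranks and targets vary by model): instead of learning a full d x d update to a weight matrix, learn two thin factors whose product has rank at most r.

```python
# Minimal LoRA sketch in pure Python: the weight update delta_W = B @ A
# has rank at most r, so you train d*r + r*d numbers instead of d*d.
# d and r below are illustrative toy values, not the paper's settings.

def matmul(X, Y):
    """Naive matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 8, 2                                                   # hidden size, LoRA rank
A = [[0.01 * (i + j) for j in range(d)] for i in range(r)]    # r x d, trainable
B = [[0.1 * (i - j) for j in range(r)] for i in range(d)]     # d x r, trainable
delta_W = matmul(B, A)                                        # d x d, rank <= r

full_params = d * d            # parameters in a full-rank update
lora_params = d * r + r * d    # parameters LoRA actually trains
print(full_params, lora_params)  # 64 vs 32, even at this tiny scale
```

At realistic sizes (d in the thousands, r of 8 or 16) the ratio is dramatic rather than 2x.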
-
paper
Adam: A Method for Stochastic Optimization
Bias correction in the moment estimates is one of those small ideas that made a massive practical difference.
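The bias correction is easy to see numerically. A toy illustration, assuming the standard update from the paper with a constant gradient: the exponential moving average m starts at zero and crawls upward, but dividing by (1 - beta1**t) recovers the true mean immediately.

```python
# Adam's first-moment bias correction on a constant gradient.
# The EMA m is biased toward its zero init early in training;
# m_hat = m / (1 - beta1**t) undoes that exactly in this case.

beta1 = 0.9
g = 1.0            # pretend every step sees the same gradient
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))
# m climbs slowly from 0.1, but m_hat is 1.0 (the true mean)
# from the very first step.
```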
-
paper
Fast Transformer Decoding: One Write-Head is All You Need
Multi-query attention. Sharing a single KV head across all query heads cuts KV cache memory without hurting quality much.
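The memory win is just arithmetic. A back-of-envelope sketch using illustrative 7B-class shapes (32 layers, 32 heads, head_dim 128, fp16), not any specific model's config:

```python
# KV cache size for MHA vs multi-query attention (MQA).
# Shapes are illustrative 7B-class values, not a real model's config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    # 2 tensors (K and V) per layer, each [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per

layers, heads, head_dim, ctx = 32, 32, 128, 4096
mha = kv_cache_bytes(layers, heads, head_dim, ctx)  # one KV head per query head
mqa = kv_cache_bytes(layers, 1, head_dim, ctx)      # single shared KV head
print(mha // 2**20, "MiB vs", mqa // 2**20, "MiB")  # 2048 MiB vs 64 MiB
```

A 32x smaller cache is the difference between batching dozens of requests and batching one.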
-
paper
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
The middle ground between MHA and MQA. Most production models (Llama 3, Mistral) use this now.
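The grouping itself is a one-liner. A sketch assuming 32 query heads sharing 8 KV heads, the kind of ratio Llama-3-style configs use (exact numbers vary by model): each query head looks up the KV head of its group.

```python
# GQA head grouping: consecutive query heads share one KV head.
# 32 query heads / 8 KV heads is an illustrative Llama-3-style ratio.

n_q_heads, n_kv_heads = 32, 8
group_size = n_q_heads // n_kv_heads       # 4 query heads per KV head

kv_head_for = [q // group_size for q in range(n_q_heads)]
print(kv_head_for[:8])   # [0, 0, 0, 0, 1, 1, 1, 1]
# n_kv_heads == n_q_heads recovers MHA; n_kv_heads == 1 recovers MQA.
```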
-
paper
Generating Long Sequences with Sparse Transformers
First serious attempt at breaking the quadratic attention bottleneck with fixed sparse patterns.
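What "fixed sparse pattern" means in practice: each position attends to its local block plus a strided set of summary positions, so the mask has roughly O(n * sqrt(n)) entries instead of O(n^2). A sketch with toy block size and sequence length, not the paper's exact configuration:

```python
# Fixed sparse attention pattern (sketch): local block attention plus
# strided attention to the last slot of each earlier block.
# n and block are toy values for illustration.

n, block = 16, 4

def allowed(i, j):
    if j > i:
        return False                       # causal mask
    if j // block == i // block:
        return True                        # local: attend within own block
    return (j % block) == block - 1        # strided: last slot of each block

ones = sum(allowed(i, j) for i in range(n) for j in range(n))
print(ones, "of", n * n, "entries attended")  # 64, vs 136 for dense causal
```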
-
paper
Mistral 7B
Sliding window attention + GQA in a single 7B model that punches above its weight class.
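The sliding window idea in two lines, assuming a window of 4096 like Mistral's: token i only attends to the previous w tokens, but stacking layers lets information hop window-by-window, so the effective reach grows with depth.

```python
# Sliding-window attention (sketch): token i attends to [i - w + 1, i].
# Window and layer count mirror Mistral 7B's published values.

w = 4096
def attends(i, j, window=w):
    return 0 <= i - j < window      # causal, and within the window

n_layers = 32
effective_reach = n_layers * w      # upper bound on how far info can flow
print(effective_reach)              # 131072 tokens of theoretical reach
```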
-
paper
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Reframing attention as an IO problem rather than a compute problem. Tiling to SRAM is the key trick.
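The piece of math that makes the tiling legal is the online softmax: you can stream through scores one tile at a time, keeping only a running max and a rescaled running sum, and never materialize the full row. A pure-Python sketch (the real kernel fuses this with the value accumulation; this shows just the softmax part):

```python
# Online (streaming) softmax: one pass, O(1) extra memory per row.
# This is the identity that lets FlashAttention process scores in tiles.

import math

def online_softmax(scores):
    m, s = float("-inf"), 0.0
    for x in scores:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return [math.exp(x - m) / s for x in scores]

scores = [2.0, -1.0, 0.5, 3.0]
ref = [math.exp(x - max(scores)) for x in scores]
ref = [x / sum(ref) for x in ref]                 # standard two-pass softmax
out = online_softmax(scores)
print(all(abs(a - b) < 1e-12 for a, b in zip(out, ref)))  # True
```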
-
paper
Efficient Training of Language Models to Fill in the Middle
FIM training costs almost nothing extra and gives you code infilling for free. The basis of how Copilot-style tools work.
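The data transform is almost trivially cheap, which is the paper's point. A sketch of PSM-order FIM prep; the sentinel strings below are placeholders I chose, since the actual token strings vary by tokenizer:

```python
# Fill-in-the-middle (FIM) data prep in PSM (prefix-suffix-middle) order.
# <PRE>/<SUF>/<MID> are placeholder sentinels; real ones are tokenizer-specific.

import random

def to_fim(doc, rng):
    # Split the document at two random points, then emit prefix and
    # suffix first so the model learns to generate the middle
    # conditioned on both sides.
    a, b = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:a], doc[a:b], doc[b:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

rng = random.Random(0)
print(to_fim("def add(a, b):\n    return a + b\n", rng))
```

At inference time the editor sends the code before and after the cursor as prefix and suffix, and the model's continuation after the middle sentinel is the infill.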
ls repos/
-
repo
karpathy/microgpt.py
A full GPT in ~200 lines of pure Python with zero dependencies. Proof that the algorithm is simple; everything else is efficiency.
-
repo
karpathy/makemore
Character-level language modeling from bigrams to transformers. The best incremental teaching sequence I've seen for neural nets.
-
repo
bigcode/tiny_starcoder_py
A 164M param code model with FIM support. Small enough to prototype on a laptop, large enough to produce real completions.
-
repo
karpathy/nanoGPT
The next step up from microgpt.py. Reproduces GPT-2 training in ~300 lines with real performance on a single GPU.
-
repo
rasbt/LLMs-from-scratch
A full book building LLMs from scratch in PyTorch. Same "understand by implementing" spirit as my GPT-in-pure-Python post, but taken much further.
-
repo
vllm-project/vllm
PagedAttention for inference serving. This is what you'd actually use to deploy the fine-tuned code models from my Copilot post.
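The core idea is virtual memory for the KV cache. A sketch with a made-up block size and pool: each sequence holds a block table of indices into a shared pool of fixed-size blocks, so growing a sequence never needs a big contiguous buffer.

```python
# PagedAttention's core bookkeeping (sketch): a per-sequence block table
# maps logical KV positions to physical blocks from a shared pool,
# like page tables in virtual memory. Sizes here are illustrative.

BLOCK = 16                      # tokens per KV block
free_blocks = list(range(100))  # shared physical block pool
block_table = []                # logical -> physical mapping for one sequence
seq_len = 0

def append_token():
    global seq_len
    if seq_len % BLOCK == 0:                 # current block full (or first token)
        block_table.append(free_blocks.pop(0))
    seq_len += 1

for _ in range(40):
    append_token()
print(len(block_table))   # 3 blocks for 40 tokens; waste is < 1 block
```

Bounding fragmentation to under one block per sequence is what lets vLLM pack far more concurrent requests onto the same GPU.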
-
repo
ggerganov/llama.cpp
LLM inference in C++ with quantization for CPU/Metal/CUDA. The reason you can run a 7B model on a laptop.
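Why quantization shrinks models is easy to show with a toy scheme. This is simplified symmetric int8; llama.cpp's actual formats (Q4_K and friends) use per-block scales and sub-byte packing, but the principle is the same: store small integers plus a scale instead of floats.

```python
# Toy symmetric int8 quantization: one shared scale, weights stored as
# integers in [-127, 127]. Real llama.cpp formats are per-block and
# pack 4-6 bits per weight, but the idea is identical.

def quantize(ws):
    scale = max(abs(w) for w in ws) / 127 or 1.0   # guard against all-zero
    q = [round(w / scale) for w in ws]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.8, -1.27, 0.05, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q, [round(r, 3) for r in restored])
```

Each weight drops from 4 bytes (fp32) to 1 byte plus a shared scale, and the reconstruction error is bounded by half the scale.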
-
repo
unslothai/unsloth
2x faster LoRA fine-tuning with 70% less VRAM using custom Triton kernels. Drop-in replacement for the training loop in my Copilot post.
-
repo
pytorch/torchtune
PyTorch-native fine-tuning with SFT, DPO, and GRPO recipes. Cleaner alternative to trl for production training pipelines.
-
repo
Dao-AILab/flash-attention
The actual implementation behind the FlashAttention paper I covered. Reading the CUDA kernels teaches you more about GPU programming than any tutorial.
-
repo
state-spaces/mamba
Reference implementation of the main challenger to transformers. Linear complexity with selective state spaces instead of attention.
-
repo
guidance-ai/guidance
Constrained generation with regex, CFGs, and JSON schema enforcement. Useful for getting structured code output from the NL2Code pipeline.
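The mechanism underneath all of these constraints is logit masking. A sketch with a made-up toy vocabulary (not guidance's API): tokens that would violate the grammar get their logits set to -inf before sampling, so invalid output is impossible rather than merely unlikely.

```python
# Constrained decoding via logit masking (sketch, toy vocabulary).
# The grammar engine decides which tokens are legal at each step;
# everything else is masked to -inf before the sampler runs.

import math

vocab = ["{", "}", '"key"', ":", "42", "hello"]
logits = [0.1, 2.0, 1.5, 0.3, 0.9, 3.0]   # the model prefers "hello"

def mask_logits(logits, allowed):
    return [l if tok in allowed else -math.inf
            for l, tok in zip(logits, vocab)]

# Suppose the JSON grammar says only "{" may start an object.
masked = mask_logits(logits, {"{"})
best = vocab[masked.index(max(masked))]
print(best)   # "{" is forced even though "hello" had the higher logit
```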
-
repo
ml-explore/mlx
Apple's ML framework for unified memory on Apple Silicon. I develop on a Mac, so this is how I run local inference without fighting CUDA.
ls specs/
-
spec
Agent Skills Specification
The open spec for portable agent skills. A markdown file replacing a microservice felt crazy until I tried it.
-
spec
A2A (Agent-to-Agent) Protocol
Peer-to-peer agent coordination with discovery, task lifecycle, and streaming. The missing complement to MCP's tool layer.
ls blogs/
-
blog
Lil'Log
The gold standard for technical ML writeups. Her posts on agents, attention, and RLHF are the ones I re-read when I need to actually understand something.
-
blog
Sebastian Raschka's Blog
His LLM implementation walkthroughs and KV cache deep-dives hit the same "build it to understand it" philosophy I try to follow.
-
blog
Simon Willison's Blog
Nobody ships more practical LLM experiments per week. His posts on tool use and agentic patterns are consistently ahead of the curve.
-
blog
Hamel Husain's Blog
The best writing I've found on LLM evaluation and production fine-tuning. His "Your AI Product Needs Evals" post should be required reading.
-
blog
Eugene Yan's Blog
Production ML systems at Amazon scale. His patterns for LLM applications and retrieval systems are battle-tested, not theoretical.
-
blog
Deep Learning Focus
Turns dense papers into readable technical breakdowns. His posts on SFT, transformers, and reasoning are how I stay current without reading 30 papers a week.
-
blog
Latent Space
The podcast and newsletter that defined "AI engineer" as a role. Their interviews with researchers at OpenAI, Anthropic, and Meta are primary sources.
ls social/
-
@karpathy
His posts are primary sources for half the things I write about. When he drops a gist or a thread, I stop what I'm doing and read it.
-
@simonw
Posts daily LLM experiments with working code. Best signal-to-noise ratio on AI Twitter for practical engineering.
-
@rasbt
Threads breaking down training recipes and implementation details that don't make it into papers.
-
@eugeneyan
Production ML patterns from someone who actually ships these systems at scale.
-
@swyx
Coined the "AI engineer" framing. His takes on the ecosystem and where things are heading tend to age well.
-
@cwolferesearch
Posts concise technical breakdowns of new papers within days of release. My early warning system for what's worth reading.
EOF (2026-03-04)