cat reading_list.txt
Papers, posts, repos, and people I keep coming back to. Updated as I read.
ls papers/
-
paper
Attention Is All You Need
The one that started it all. Still the clearest explanation of why scaled dot-product attention works.
-
paper
LoRA: Low-Rank Adaptation of Large Language Models
The insight that fine-tuning weight updates are low-rank changed how I think about parameter efficiency.
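The low-rank idea fits in a few lines of pure Python. A minimal sketch, with toy sizes I made up (the paper's actual ranks and targets vary by model): instead of learning a full d x d update to a weight matrix, learn two thin factors whose product has rank at most r.

```python
# Minimal LoRA sketch in pure Python: the weight update delta_W = B @ A
# has rank at most r, so you train d*r + r*d numbers instead of d*d.
# d and r below are illustrative toy values, not the paper's settings.

def matmul(X, Y):
    """Naive matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 8, 2                                                   # hidden size, LoRA rank
A = [[0.01 * (i + j) for j in range(d)] for i in range(r)]    # r x d, trainable
B = [[0.1 * (i - j) for j in range(r)] for i in range(d)]     # d x r, trainable
delta_W = matmul(B, A)                                        # d x d, rank <= r

full_params = d * d            # parameters in a full-rank update
lora_params = d * r + r * d    # parameters LoRA actually trains
print(full_params, lora_params)  # 64 vs 32, even at this tiny scale
```

At realistic sizes (d in the thousands, r of 8 or 16) the ratio is dramatic rather than 2x.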
-
paper
Adam: A Method for Stochastic Optimization
Bias correction in the moment estimates is one of those small ideas that made a massive practical difference.
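The bias correction is easy to see numerically. A toy illustration, assuming the standard update from the paper with a constant gradient: the exponential moving average m starts at zero and crawls upward, but dividing by (1 - beta1**t) recovers the true mean immediately.

```python
# Adam's first-moment bias correction on a constant gradient.
# The EMA m is biased toward its zero init early in training;
# m_hat = m / (1 - beta1**t) undoes that exactly in this case.

beta1 = 0.9
g = 1.0            # pretend every step sees the same gradient
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))
# m climbs slowly from 0.1, but m_hat is 1.0 (the true mean)
# from the very first step.
```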
-
paper
Fast Transformer Decoding: One Write-Head is All You Need
Multi-query attention. Sharing a single KV head across all query heads cuts KV cache memory without hurting quality much.
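The memory win is just arithmetic. A back-of-envelope sketch using illustrative 7B-class shapes (32 layers, 32 heads, head_dim 128, fp16), not any specific model's config:

```python
# KV cache size for MHA vs multi-query attention (MQA).
# Shapes are illustrative 7B-class values, not a real model's config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    # 2 tensors (K and V) per layer, each [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per

layers, heads, head_dim, ctx = 32, 32, 128, 4096
mha = kv_cache_bytes(layers, heads, head_dim, ctx)  # one KV head per query head
mqa = kv_cache_bytes(layers, 1, head_dim, ctx)      # single shared KV head
print(mha // 2**20, "MiB vs", mqa // 2**20, "MiB")  # 2048 MiB vs 64 MiB
```

A 32x smaller cache is the difference between batching dozens of requests and batching one.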
-
paper
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
The middle ground between MHA and MQA. Most production models (Llama 3, Mistral) use this now.
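The grouping itself is a one-liner. A sketch assuming 32 query heads sharing 8 KV heads, the kind of ratio Llama-3-style configs use (exact numbers vary by model): each query head looks up the KV head of its group.

```python
# GQA head grouping: consecutive query heads share one KV head.
# 32 query heads / 8 KV heads is an illustrative Llama-3-style ratio.

n_q_heads, n_kv_heads = 32, 8
group_size = n_q_heads // n_kv_heads       # 4 query heads per KV head

kv_head_for = [q // group_size for q in range(n_q_heads)]
print(kv_head_for[:8])   # [0, 0, 0, 0, 1, 1, 1, 1]
# n_kv_heads == n_q_heads recovers MHA; n_kv_heads == 1 recovers MQA.
```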
-
paper
Generating Long Sequences with Sparse Transformers
First serious attempt at breaking the quadratic attention bottleneck with fixed sparse patterns.
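What "fixed sparse pattern" means in practice: each position attends to its local block plus a strided set of summary positions, so the mask has roughly O(n * sqrt(n)) entries instead of O(n^2). A sketch with toy block size and sequence length, not the paper's exact configuration:

```python
# Fixed sparse attention pattern (sketch): local block attention plus
# strided attention to the last slot of each earlier block.
# n and block are toy values for illustration.

n, block = 16, 4

def allowed(i, j):
    if j > i:
        return False                       # causal mask
    if j // block == i // block:
        return True                        # local: attend within own block
    return (j % block) == block - 1        # strided: last slot of each block

ones = sum(allowed(i, j) for i in range(n) for j in range(n))
print(ones, "of", n * n, "entries attended")  # 64, vs 136 for dense causal
```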
-
paper
Mistral 7B
Sliding window attention + GQA in a single 7B model that punches above its weight class.
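The sliding window idea in two lines, assuming a window of 4096 like Mistral's: token i only attends to the previous w tokens, but stacking layers lets information hop window-by-window, so the effective reach grows with depth.

```python
# Sliding-window attention (sketch): token i attends to [i - w + 1, i].
# Window and layer count mirror Mistral 7B's published values.

w = 4096
def attends(i, j, window=w):
    return 0 <= i - j < window      # causal, and within the window

n_layers = 32
effective_reach = n_layers * w      # upper bound on how far info can flow
print(effective_reach)              # 131072 tokens of theoretical reach
```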
-
paper
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Reframing attention as an IO problem rather than a compute problem. Tiling to SRAM is the key trick.
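The piece of math that makes the tiling legal is the online softmax: you can stream through scores one tile at a time, keeping only a running max and a rescaled running sum, and never materialize the full row. A pure-Python sketch (the real kernel fuses this with the value accumulation; this shows just the softmax part):

```python
# Online (streaming) softmax: one pass, O(1) extra memory per row.
# This is the identity that lets FlashAttention process scores in tiles.

import math

def online_softmax(scores):
    m, s = float("-inf"), 0.0
    for x in scores:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return [math.exp(x - m) / s for x in scores]

scores = [2.0, -1.0, 0.5, 3.0]
ref = [math.exp(x - max(scores)) for x in scores]
ref = [x / sum(ref) for x in ref]                 # standard two-pass softmax
out = online_softmax(scores)
print(all(abs(a - b) < 1e-12 for a, b in zip(out, ref)))  # True
```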
-
paper
Efficient Training of Language Models to Fill in the Middle
FIM training costs almost nothing extra and gives you code infilling for free. The basis of how Copilot-style tools work.
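The data transform is almost trivially cheap, which is the paper's point. A sketch of PSM-order FIM prep; the sentinel strings below are placeholders I chose, since the actual token strings vary by tokenizer:

```python
# Fill-in-the-middle (FIM) data prep in PSM (prefix-suffix-middle) order.
# <PRE>/<SUF>/<MID> are placeholder sentinels; real ones are tokenizer-specific.

import random

def to_fim(doc, rng):
    # Split the document at two random points, then emit prefix and
    # suffix first so the model learns to generate the middle
    # conditioned on both sides.
    a, b = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:a], doc[a:b], doc[b:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

rng = random.Random(0)
print(to_fim("def add(a, b):\n    return a + b\n", rng))
```

At inference time the editor sends the code before and after the cursor as prefix and suffix, and the model's continuation after the middle sentinel is the infill.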
ls repos/
-
repo
karpathy/microgpt.py
A full GPT in ~200 lines of pure Python with zero dependencies. Proof that the algorithm is simple; everything else is efficiency.
-
repo
karpathy/makemore
Character-level language modeling from bigrams to transformers. The best incremental teaching sequence I've seen for neural nets.
-
repo
bigcode/tiny_starcoder_py
A 164M param code model with FIM support. Small enough to prototype on a laptop, large enough to produce real completions.
-
repo
karpathy/nanoGPT
The next step up from microgpt.py. Reproduces GPT-2 training in ~300 lines with real performance on a single GPU.
-
repo
rasbt/LLMs-from-scratch
A full book building LLMs from scratch in PyTorch. Same "understand by implementing" spirit as my GPT-in-pure-Python post, but taken much further.
-
repo
vllm-project/vllm
PagedAttention for inference serving. This is what you'd actually use to deploy the fine-tuned code models from my Copilot post.
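The core idea is virtual memory for the KV cache. A sketch with a made-up block size and pool: each sequence holds a block table of indices into a shared pool of fixed-size blocks, so growing a sequence never needs a big contiguous buffer.

```python
# PagedAttention's core bookkeeping (sketch): a per-sequence block table
# maps logical KV positions to physical blocks from a shared pool,
# like page tables in virtual memory. Sizes here are illustrative.

BLOCK = 16                      # tokens per KV block
free_blocks = list(range(100))  # shared physical block pool
block_table = []                # logical -> physical mapping for one sequence
seq_len = 0

def append_token():
    global seq_len
    if seq_len % BLOCK == 0:                 # current block full (or first token)
        block_table.append(free_blocks.pop(0))
    seq_len += 1

for _ in range(40):
    append_token()
print(len(block_table))   # 3 blocks for 40 tokens; waste is < 1 block
```

Bounding fragmentation to under one block per sequence is what lets vLLM pack far more concurrent requests onto the same GPU.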
-
repo
ggerganov/llama.cpp
LLM inference in C++ with quantization for CPU/Metal/CUDA. The reason you can run a 7B model on a laptop.
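Why quantization shrinks models is easy to show with a toy scheme. This is simplified symmetric int8; llama.cpp's actual formats (Q4_K and friends) use per-block scales and sub-byte packing, but the principle is the same: store small integers plus a scale instead of floats.

```python
# Toy symmetric int8 quantization: one shared scale, weights stored as
# integers in [-127, 127]. Real llama.cpp formats are per-block and
# pack 4-6 bits per weight, but the idea is identical.

def quantize(ws):
    scale = max(abs(w) for w in ws) / 127 or 1.0   # guard against all-zero
    q = [round(w / scale) for w in ws]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.8, -1.27, 0.05, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q, [round(r, 3) for r in restored])
```

Each weight drops from 4 bytes (fp32) to 1 byte plus a shared scale, and the reconstruction error is bounded by half the scale.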
-
repo
unslothai/unsloth
2x faster LoRA fine-tuning with 70% less VRAM using custom Triton kernels. Drop-in replacement for the training loop in my Copilot post.
-
repo
pytorch/torchtune
PyTorch-native fine-tuning with SFT, DPO, and GRPO recipes. Cleaner alternative to trl for production training pipelines.
-
repo
Dao-AILab/flash-attention
The actual implementation behind the FlashAttention paper I covered. Reading the CUDA kernels teaches you more about GPU programming than any tutorial.
-
repo
state-spaces/mamba
Reference implementation of the main challenger to transformers. Linear complexity with selective state spaces instead of attention.
-
repo
guidance-ai/guidance
Constrained generation with regex, CFGs, and JSON schema enforcement. Useful for getting structured code output from the NL2Code pipeline.
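The mechanism underneath all of these constraints is logit masking. A sketch with a made-up toy vocabulary (not guidance's API): tokens that would violate the grammar get their logits set to -inf before sampling, so invalid output is impossible rather than merely unlikely.

```python
# Constrained decoding via logit masking (sketch, toy vocabulary).
# The grammar engine decides which tokens are legal at each step;
# everything else is masked to -inf before the sampler runs.

import math

vocab = ["{", "}", '"key"', ":", "42", "hello"]
logits = [0.1, 2.0, 1.5, 0.3, 0.9, 3.0]   # the model prefers "hello"

def mask_logits(logits, allowed):
    return [l if tok in allowed else -math.inf
            for l, tok in zip(logits, vocab)]

# Suppose the JSON grammar says only "{" may start an object.
masked = mask_logits(logits, {"{"})
best = vocab[masked.index(max(masked))]
print(best)   # "{" is forced even though "hello" had the higher logit
```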
-
repo
ml-explore/mlx
Apple's ML framework for unified memory on Apple Silicon. I develop on a Mac, so this is how I run local inference without fighting CUDA.
ls specs/
-
spec
Agent Skills Specification
The open spec for portable agent skills. A markdown file replacing a microservice felt crazy until I tried it.
-
spec
A2A (Agent-to-Agent) Protocol
Peer-to-peer agent coordination with discovery, task lifecycle, and streaming. The missing complement to MCP's tool layer.
ls blogs/
-
blog
Lil'Log
The gold standard for technical ML writeups. Her posts on agents, attention, and RLHF are the ones I re-read when I need to actually understand something.
-
blog
Sebastian Raschka's Blog
His LLM implementation walkthroughs and KV cache deep-dives hit the same "build it to understand it" philosophy I try to follow.
-
blog
Simon Willison's Blog
Nobody ships more practical LLM experiments per week. His posts on tool use and agentic patterns are consistently ahead of the curve.
-
blog
Hamel Husain's Blog
The best writing I've found on LLM evaluation and production fine-tuning. His "Your AI Product Needs Evals" post should be required reading.
-
blog
Eugene Yan's Blog
Production ML systems at Amazon scale. His patterns for LLM applications and retrieval systems are battle-tested, not theoretical.
-
blog
Deep Learning Focus
Turns dense papers into readable technical breakdowns. His posts on SFT, transformers, and reasoning are how I stay current without reading 30 papers a week.
-
blog
Latent Space
The podcast and newsletter that defined "AI engineer" as a role. Their interviews with researchers at OpenAI, Anthropic, and Meta are primary sources.
ls social/
-
@karpathy
His posts are primary sources for half the things I write about. When he drops a gist or a thread, I stop what I'm doing and read it.
-
@simonw
Posts daily LLM experiments with working code. Best signal-to-noise ratio on AI Twitter for practical engineering.
-
@rasbt
Threads breaking down training recipes and implementation details that don't make it into papers.
-
@eugeneyan
Production ML patterns from someone who actually ships these systems at scale.
-
@swyx
Coined the "AI engineer" framing. His takes on the ecosystem and where things are heading tend to age well.
-
@cwolferesearch
Posts concise technical breakdowns of new papers within days of release. My early warning system for what's worth reading.
EOF (2026-03-04)