End-to-end AI bug-fix pipeline: tree-sitter code indexing, hybrid retrieval (vector + BM25), and a ReAct agent loop that processes a ticket and generates a fix.
Run in Google Colab | View on GitHub
In Part 1 I built a Copilot clone for inline code completion. In Part 2 I built an NL2Code agent that generates full scripts from English descriptions. Both of those posts are about writing new code. This one goes the other direction: reading a bug ticket, finding the broken code, fixing it, and shipping a pull request. No human in the loop.
The trigger was a discussion at work about our move to a monorepo. The main motivation for the migration is that AI tools like Claude Code work best when they have all the context in one place: one repo, one dependency graph, one search index. But the flip side is that a monorepo with dozens of contributors generates a lot of GitHub issues and PRs. We started looking at how large open-source projects like Kubernetes and OpenClaw handle that volume, and the conversation landed on automated PR generation for routine bug fixes. The question that stuck with me was: what would it take to go from a Jira ticket saying "NullPointerException in UserService.getProfile()" to a pull request that actually fixes the bug, with no human touching the keyboard in between? I spent the past few weeks building one to find out.
This is Part 3 of a 3-part series on AI-assisted code generation. Part 1 covered inline completion (FIM, code models, LoRA fine-tuning). Part 2 covered NL2Code with agents. This post covers the fix side: ingesting a bug ticket, building a searchable knowledge base from the repo, and running a single-agent loop that investigates, patches, tests, and opens a PR.
The obvious architecture for "ticket → PR" is a pipeline of specialized agents. A Triage agent reads the ticket and classifies it. A Planning agent decides what to investigate. A CodeGen agent writes the fix. A Review agent checks the fix. Clean separation of concerns.
The problem is information loss. Every hand-off between agents requires serializing one agent's findings into a summary that the next agent consumes. Summaries drop details. The CodeGen agent can't go back and re-read the stack trace because it only received the Planning agent's distilled version of it. If the fix attempt reveals that the initial diagnosis was wrong, there's no way to pivot without restarting the entire pipeline.
A single agent running inside a harness maintains one continuous context window across the entire lifecycle of a ticket. It can re-read the stack trace while writing code. It can change its diagnosis mid-implementation when the code contradicts its initial theory. It can skip investigation steps when the stack trace is already definitive. This mirrors how an experienced engineer actually works: you hold the full problem in your head and fluidly move between investigation, planning, and implementation. You don't hand off a memo to a different person at each step.
The trade-off is context window pressure. A 200k-token window fills faster than you'd think when you're reading full files and collecting search results. I'll cover how the harness manages that later.
Here's the full system:
Tickets come in from any source. They get normalized into a standard schema, queued by priority, and dispatched to a worker pod running the agentic harness. The harness gives the agent a set of skills (search, read, edit, test, etc.) and manages context, checkpoints, and permissions. The Knowledge Layer is what makes the agent's search skills useful. Without it, search_code("getProfile") returns nothing.
Let me build each piece.
The agent is only as good as what it can search. If a developer files a ticket about a NullPointerException in UserService.getProfile(), the agent needs to find the actual code for that method, understand what it does, find who calls it, and look up whether anyone has fixed a similar bug before. All of that depends on the Knowledge Layer being populated and fresh.
The first step is turning source files into structured records. Raw text files are useless for search: you'd be embedding entire files and hoping cosine similarity lands on the right 10-line function in a 500-line file. Instead, we parse each file into its AST and extract every named declaration (function, method, class) as a separate record.
tree-sitter handles this across languages. It produces a concrete syntax tree that we walk to extract symbols with their metadata.
Given a source file, we parse it and walk the tree to find function and class definitions:
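The production pipeline does this with tree-sitter so one walk covers every language; as a minimal single-language sketch of the same idea, here is the walk using Python's stdlib `ast` module (a stand-in, not the actual implementation):

```python
import ast

def extract_symbols(source: str, file_path: str) -> list[dict]:
    """Emit one record per named declaration (function, method, class)."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            records.append({
                "name": node.name,
                "kind": type(node).__name__,   # FunctionDef / ClassDef / ...
                "file": file_path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return records
```

tree-sitter's node traversal plays the role of `ast.walk` here; the record shape is the same either way.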
The _extract_function helper pulls out everything we need for a single symbol:
For a Java file like UserService.java, the parser would emit a record for each method:
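Something like this, shown as a Python dict (the path and line numbers are hypothetical; the names come from the running example):

```python
# Illustrative record for one method of UserService.java:
record = {
    "qualified_name": "UserService.getProfile",
    "class": "UserService",
    "file": "src/main/java/org/svc/UserService.java",  # hypothetical path
    "signature": "public User getProfile(String userId)",
    "docstring": "Returns null on cache miss.",
    "start_line": 141,   # hypothetical
    "end_line": 146,     # hypothetical
    "language": "java",
}
```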
Each parsed symbol becomes one chunk. This is the right granularity for code search: when a ticket mentions getProfile, you want to retrieve that specific method, not the entire 800-line UserService.java file.
Two exceptions handle edge cases:
Large functions. If a function body exceeds 400 tokens, split it into overlapping sub-chunks of 400 tokens each with 80-token overlap. Attach the parent function metadata (name, class, file) to every sub-chunk so retrieval results are always locatable.
Class-level summaries. In addition to per-method chunks, emit one chunk per class containing only the declaration, field declarations, and constructor signatures (no method bodies). This lets queries like "what fields does UserService have?" hit a compact chunk instead of wading through all method bodies.
This is the most important design decision in the entire indexing pipeline. What text string you pass to the embedding model determines retrieval quality. Embedding raw source code performs poorly because ticket queries are natural language ("NullPointerException when fetching user profile from cache") and code is, well, code. The semantic gap between a bug report and a function body is huge.
The fix is to assemble an enriched text string per chunk that combines natural language context with the code:
Why this works: the ticket query "NullPointerException when fetching user profile from cache" is natural language. The enriched string contains the docstring ("Returns null on cache miss") which bridges the semantic gap. The class and method name provide symbol-level anchoring. The full source is there so the embedding also captures code-level patterns.
Here's the template:
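A sketch of what it might look like, assuming the chunk fields produced by the indexing step:

```python
def enrich(chunk: dict) -> str:
    """Build the string that actually gets embedded: natural-language
    context first, then the raw source. Field names are illustrative."""
    return (
        f"File: {chunk['file']}\n"
        f"Class: {chunk.get('class') or '(module level)'}\n"
        f"Function: {chunk['signature']}\n"
        f"Documentation: {chunk['docstring']}\n\n"
        f"Code:\n{chunk['source']}"
    )
```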
Without enrichment, the embedding model sees public User getProfile(String userId) { User user = cache.get(userId); return user; } and has no idea that this function relates to "user profile cache miss." With enrichment, the docstring, class name, and method signature all contribute semantic signal that the embedding model can use to connect the dots.
For the embedding model, I'm using OpenAI's text-embedding-3-large. The reasoning: since the enriched input strings are a hybrid of natural language and code (not raw code), a general-purpose embedding model handles them well. A code-specialized model like voyage-code-2 would give marginal gains on raw code syntax, but the natural language portions (docstrings, signatures, file paths) are where the ticket-to-code bridge happens. Using the same vendor as the LLM also means one API key, one billing relationship, one rate-limit budget.
Each chunk gets stored in three places, each optimized for a different query pattern:
pgvector for semantic search. When the agent searches "cache miss null pointer", the query gets embedded and we find the nearest chunks by cosine similarity.
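A sketch of that lookup, assuming a `chunks` table with a pgvector embedding column (text-embedding-3-large produces 3072-dimensional vectors). `<=>` is pgvector's cosine-distance operator, so ascending order returns the most similar chunks first:

```python
# Hypothetical table: chunks(id, qualified_name, embedding vector(3072)).
SEMANTIC_SEARCH_SQL = """
SELECT id, qualified_name, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

def semantic_search(conn, query_embedding: list[float], k: int = 10):
    """Nearest-neighbour search over enriched chunk embeddings."""
    with conn.cursor() as cur:
        cur.execute(SEMANTIC_SEARCH_SQL, {"query": query_embedding, "k": k})
        return cur.fetchall()
```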
Elasticsearch for BM25 keyword search. When the agent searches getProfile, we want exact identifier matching, not fuzzy semantic similarity. The text field is tokenized source code stripped of punctuation so BM25 matches on identifiers like getProfile, userId, cache.
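The tokenization can be as simple as keeping identifier-shaped tokens and dropping everything else — a sketch:

```python
import re

def code_tokens(source: str) -> str:
    """Strip punctuation so BM25 indexes bare identifiers."""
    return " ".join(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))
```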
Postgres for the dependency graph. Who calls getProfile? What does UserService import? This is where the graph queries live.
Embedding search answers "what code is semantically related to this ticket?" but it can't answer "who calls getProfile?" or "what does UserService depend on?". Those are graph traversal questions, and they come up constantly during bug investigation. If the agent changes getProfile's return type from User to Optional<User>, it needs to find every caller and update them.
The dependency graph is extracted during the same tree-sitter parse pass that produces symbol records. For each function body, we walk the AST looking for call expressions and attribute accesses, then resolve them against the symbols we've already extracted from the same repo.
For UserController.get_user(), parsing the body finds self.user_service.get_profile(user_id), which we resolve to the symbol UserService.get_profile. For UserService.get_profile(), we find self.cache.get(user_id), resolved to UserCache.get.
The resolution step matches raw call text against the known symbol table for the repo. self.user_service.get_profile gets resolved by checking the class's __init__ to find that self.user_service is a UserService instance, then looking up UserService.get_profile in the symbol index. For cases where static analysis can't resolve the type (dynamic dispatch, duck typing), we fall back to a name-based heuristic: match get_profile against all symbols named get_profile in the repo and rank by file proximity.
Each resolved call becomes a row in Postgres:
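A sketch of the table and of the row the example call would produce; column names, file path, and line number are my assumptions:

```python
# Hypothetical schema: one row per resolved call site.
CREATE_EDGES_SQL = """
CREATE TABLE IF NOT EXISTS call_edges (
    repo    TEXT NOT NULL,
    caller  TEXT NOT NULL,    -- e.g. 'UserController.get_user'
    callee  TEXT NOT NULL,    -- e.g. 'UserService.get_profile'
    file    TEXT NOT NULL,
    line    INTEGER NOT NULL,
    PRIMARY KEY (repo, caller, callee, file, line)
);
"""

# The resolved call from UserController.get_user becomes:
edge = {
    "repo": "org/svc",
    "caller": "UserController.get_user",
    "callee": "UserService.get_profile",
    "file": "user_controller.py",  # hypothetical
    "line": 27,                    # hypothetical
}
```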
The agent queries this graph through a get_callers skill:
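A sketch of the skill, assuming the call graph lives in a `call_edges(repo, caller, callee, file, line)` table in Postgres:

```python
GET_CALLERS_SQL = """
SELECT caller, file, line
FROM call_edges
WHERE repo = %(repo)s AND callee = %(symbol)s;
"""

def get_callers(conn, symbol: str, repo: str) -> list[dict]:
    """Skill: return every recorded call site for a symbol.

    A single indexed lookup against the call graph -- no text search.
    """
    with conn.cursor() as cur:
        cur.execute(GET_CALLERS_SQL, {"repo": repo, "symbol": symbol})
        return [{"caller": c, "file": f, "line": l}
                for (c, f, l) in cur.fetchall()]
```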
When the agent runs get_callers("UserService.get_profile", "org/svc"), it gets back:
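Illustratively, something like this — the first caller comes from the earlier example; the second caller, both file paths, and the line numbers are hypothetical:

```python
callers = [
    {"caller": "UserController.get_user", "file": "user_controller.py", "line": 27},
    {"caller": "ProfileBatchJob.run",     "file": "profile_batch_job.py", "line": 88},
]
```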
That's two callers to check when changing the return type. Without this graph, the agent would have to do a text search for "get_profile" across the entire repo and parse the results to figure out which matches are actual call sites vs. comments or string literals. The graph makes it a single indexed query.
The reverse direction is useful too. get_dependencies("UserService.get_profile", "org/svc") tells the agent what get_profile itself depends on (in this case, UserCache.get), which helps when tracing the root cause upstream.
Neither vector search nor keyword search is good enough on its own. Vector search finds semantically similar code but misses exact identifier matches. BM25 finds exact identifiers but doesn't understand that "cache miss" relates to cache.get() returning null. The solution is to run both and fuse the results.
Reciprocal rank fusion (RRF) combines the two ranked lists without needing to normalize scores across different systems:
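A minimal implementation: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with the conventional k = 60:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant k damps the influence of any single list's top ranks, which is why no score normalization across the two systems is needed.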
When the agent queries "NullPointerException getProfile cache", vector search ranks the getProfile chunk high because the enriched embedding matches the semantic meaning. BM25 also ranks it high because "getProfile" and "cache" appear literally in the source. Chunks that score well on both end up at the top.
The knowledge base is useless if it falls behind the code. Every git push to a registered repo triggers a webhook that runs incremental re-indexing:
- git diff --name-only {before_sha}..{after_sha} to get the changed files
- re-parse, re-chunk, and re-embed only those files
- update last_indexed_sha in the repo registry

A typical push touching 3-5 files completes in under 10 seconds. The code index is always within one push of being current.
Two additional indexes give the agent a memory of past work:
Ticket history. Past tickets are embedded as "{title}\n\n{description}" and stored. When a new ticket arrives, search_similar_tickets() retrieves past tickets with similar error messages. If someone already filed and fixed "NPE in UserService" three months ago, the agent gets a head start.
Merged PR store. Merged PRs linked to bug tickets are stored as "{pr_title}\n\n{pr_description}\n\n{diff_summary}". The diff summary (not the raw diff) is what gets embedded, because raw diffs are noisy for similarity search. A one-paragraph human-readable summary of what changed and why produces much better recall.
The harness is the runtime that wraps the LLM. It manages the loop, executes skill calls, enforces permissions, and maintains state. The reasoning about what to do lives entirely inside the LLM.
The agent has access to a registry of skills, each with a name, input schema, and permission level:
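A sketch of what that registry could look like; the skill names echo the ones used in this post, but the exact set, schemas, and permission levels are my assumptions:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """One entry in the agent's skill registry."""
    name: str
    description: str
    input_schema: dict
    permission: str  # "read" | "write" | "execute"

REGISTRY = [
    Skill("search_code", "Hybrid vector + BM25 search over the code index",
          {"query": "string"}, "read"),
    Skill("read_file", "Read a file, optionally a line range",
          {"path": "string", "start": "int?", "end": "int?"}, "read"),
    Skill("get_callers", "Call-graph lookup: who calls this symbol?",
          {"symbol": "string", "repo": "string"}, "read"),
    Skill("remember", "Pin a finding to the scratchpad",
          {"note": "string"}, "read"),
    Skill("edit_file", "Apply a patch to a file in the sandbox",
          {"path": "string", "patch": "string"}, "write"),
    Skill("run_tests", "Run the repo's test suite in the sandbox",
          {"target": "string?"}, "execute"),
]
```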
The full registry is injected into the system prompt so the agent always knows what it can do:
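Rendering it into the prompt is a simple formatting pass — a sketch, assuming each skill is a dict carrying a name, input schema, permission level, and description:

```python
def render_registry(skills: list[dict]) -> str:
    """Render the skill registry into a system-prompt block."""
    lines = ["You have the following skills. Call exactly one per step."]
    for s in skills:
        params = ", ".join(s["input_schema"])
        lines.append(f'- {s["name"]}({params}) [{s["permission"]}]: {s["description"]}')
    return "\n".join(lines)
```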
The system prompt defines methodology and constraints. This is the most important design surface in a single-agent system.
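A condensed sketch of what such a prompt might contain (the wording is illustrative, not the production prompt):

```
You are a senior engineer fixing bugs autonomously.

Methodology:
1. Localize the failure before editing anything.
2. Pin your diagnosis with remember() before writing code.
3. Run the linter after every edit; run the tests after every change set.
4. Prefer the smallest fix that addresses the root cause.

Constraints:
- Never edit files outside the sandbox checkout.
- Never force-push or rewrite history.
- If you cannot localize the bug, report findings instead of guessing a fix.
```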
With a single agent holding everything in one context, window management is the critical engineering problem. A 200k-token window fills faster than you'd expect: a few full-file reads, several rounds of search results, and the growing conversation history can claim most of the budget before the first edit lands.
The harness uses three strategies:
Scratchpad (pinned memory). The agent writes key findings via remember(). The scratchpad is always present in the context, never compressed. It acts as working memory.
Rolling compression. Tool results older than 10 steps get replaced with agent-written summaries. A 300-line file read becomes: [UserService.java read at step 3. Key: getProfile() at line 143 returns raw User from cache.get(). No null check.]
Selective re-read. Instead of keeping large files in context, the agent re-reads only the specific lines it needs. read_file("UserService.java", start=140, end=155) is cheap and targeted.
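The rolling-compression pass can be sketched as follows, assuming each tool-result message carries an agent-written summary field (the message shape is my assumption):

```python
def compress_old_results(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Replace tool results older than `keep_recent` steps with their
    agent-written summaries; recent messages pass through untouched."""
    cutoff = len(messages) - keep_recent
    out = []
    for i, msg in enumerate(messages):
        if i < cutoff and msg.get("role") == "tool" and msg.get("summary"):
            out.append({"role": "tool", "content": f"[compressed] {msg['summary']}"})
        else:
            out.append(msg)
    return out
```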
Let me walk through the full flow. A developer files a Jira ticket: "NullPointerException in UserService.getProfile() after cache miss. Stack trace: UserService.java:143".
Jira fires a webhook. The Ingestion Service normalizes it into a TicketEvent:
The dispatcher creates a session, clones the repo into a sandbox, creates branch fix/PROJ-1234-auto, and starts the agent loop.
Here's the actual execution trace. Each step is one iteration of the THINK → ACT → OBSERVE loop.
Investigation phase:
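An illustrative sketch of those seven iterations — one skill call per step, observations condensed, contents hypothetical:

```
step 1: read_file("UserService.java", start=130, end=160)
        -> getProfile() at line 143 returns cache.get(userId) with no null check
step 2: search_code("getProfile null check cache miss")
        -> callers assume a non-null User; no fallback path in the service
step 3: get_callers("UserService.getProfile", "org/svc")
        -> 2 call sites to account for
step 4: get_dependencies("UserService.getProfile", "org/svc")
        -> UserCache.get, documented to return null on a miss
step 5: remember("Null originates in UserCache.get on miss; getProfile propagates it")
step 6: search_similar_tickets("NullPointerException getProfile cache")
        -> no prior fix for this path
step 7: remember("Fix: return Optional<User> from getProfile and update both callers")
```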
Seven steps. The agent read one file, ran one search, checked past tickets, and pinned its diagnosis. A multi-agent pipeline would have burned three full context windows to reach the same conclusion.
Implementation phase:
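A sketch of the implementation steps; skill names like run_linter and the step contents are hypothetical:

```
step 8:  edit_file("UserService.java", ...)     # change getProfile to return Optional<User>
step 9:  run_linter("UserService.java")
         -> error: cannot find symbol 'Optional'
step 10: edit_file("UserService.java", ...)     # add the java.util.Optional import
step 11: run_linter("UserService.java")  -> clean
step 12: edit_file("UserController.java", ...)  # update caller to unwrap the Optional
step 13: edit_file(<second caller>, ...)        # same change at the other call site
step 14: run_linter()                    -> repo-wide clean
```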
Notice how the agent caught the missing import by running the linter immediately after editing, then fixed it in the next step. This is the self-correction loop that makes the single-agent pattern work. No orchestrator needed.
Validation phase:
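A sketch of the validation steps around the failure (contents hypothetical):

```
step 15: edit_file("UserServiceTest.java", ...)  # add regression test for the cache-miss path
steps 16-20: iterate on the regression test until lint and compile are clean
step 21: run_tests("UserServiceTest")
         -> FAIL: testGetProfile expected User, got Optional<User>
step 22: read_file("UserServiceTest.java", start=40, end=60)
         -> the old assertion dereferences the returned User directly
step 23: edit_file("UserServiceTest.java", ...)  # unwrap the Optional in the old assertion
step 24: run_tests("UserServiceTest")  -> PASS
step 25: run_tests()                   -> full suite PASS
```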
The test failure at step 21 is the kind of thing that trips up pipeline architectures. The CodeGen agent in a multi-agent system would need to signal back to a test-fixing agent, which would need access to the original diagnosis context to understand why the return type changed. Here, the agent just reads the failure message, understands it immediately (it made the change that caused the failure), and fixes it.
PR creation:
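A sketch of the final steps; the git and PR skill names are hypothetical:

```
step 26: git_commit("Fix NPE in UserService.getProfile on cache miss (PROJ-1234)")
step 27: push_branch("fix/PROJ-1234-auto")
step 28: create_pr(title=..., body=<diagnosis and change summary from the scratchpad>)
step 29: update_ticket("PROJ-1234", <link to the new PR>)
```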
Total: 29 skill calls, ~48k tokens, roughly $0.60, about 3 minutes from ticket creation to PR ready for review.
The full system has a lot of infrastructure (Kafka, sandboxed containers, webhook services). But the core building blocks are surprisingly accessible. The companion notebook implements three of them end-to-end:
The gap between the notebook version and the production system is mostly operational: container orchestration, webhook plumbing, and the persistence layer. The intelligence (how to index code, how to search it, how to reason about bugs) is all in the notebook.
What I'd explore next: feeding the agent's merged PRs back into the knowledge base so it learns from its own fixes over time. If the agent fixed three NPE-from-cache-miss bugs last month, the fourth one should be faster.
Originally published on AI Terminal.
Tags: rag, react, agents, bug-fix, tree-sitter