End-to-end AI bug-fix pipeline: tree-sitter code indexing, hybrid retrieval (vector + BM25), and a ReAct agent loop that processes a ticket and generates a fix.
Run in Google Colab | View on GitHub
In Part 1 I built a Copilot clone for inline code completion. In Part 2 I built an NL2Code agent that generates full scripts from English descriptions. Both of those posts are about writing new code. This one goes the other direction: reading a bug ticket, finding the broken code, fixing it, and shipping a pull request. No human in the loop.
The trigger was a discussion at work about our move to a monorepo. The main motivation for the migration is that AI tools like Claude Code work best when they have all the context in one place: one repo, one dependency graph, one search index. But the flip side is that a monorepo with dozens of contributors generates a lot of GitHub issues and PRs. We started looking at how large open-source projects like Kubernetes and OpenClaw handle that volume, and the conversation landed on automated PR generation for routine bug fixes. The question that stuck with me was: what would it take to go from a Jira ticket saying "NullPointerException in UserService.getProfile()" to a pull request that actually fixes the bug, with no human touching the keyboard in between? I spent the past few weeks building one to find out.
This is Part 3 of a 3-part series on AI-assisted code generation. Part 1 covered inline completion (FIM, code models, LoRA fine-tuning). Part 2 covered NL2Code with agents. This post covers the fix side: ingesting a bug ticket, building a searchable knowledge base from the repo, and running a single-agent loop that investigates, patches, tests, and opens a PR.
The obvious architecture for "ticket → PR" is a pipeline of specialized agents. A Triage agent reads the ticket and classifies it. A Planning agent decides what to investigate. A CodeGen agent writes the fix. A Review agent checks the fix. Clean separation of concerns.
The problem is information loss. Every hand-off between agents requires serializing one agent's findings into a summary that the next agent consumes. Summaries drop details. The CodeGen agent can't go back and re-read the stack trace because it only received the Planning agent's distilled version of it. If the fix attempt reveals that the initial diagnosis was wrong, there's no way to pivot without restarting the entire pipeline.
A single agent running inside a harness maintains one continuous context window across the entire lifecycle of a ticket. It can re-read the stack trace while writing code. It can change its diagnosis mid-implementation when the code contradicts its initial theory. It can skip investigation steps when the stack trace is already definitive. This mirrors how an experienced engineer actually works: you hold the full problem in your head and fluidly move between investigation, planning, and implementation. You don't hand off a memo to a different person at each step.
The trade-off is context window pressure. A 200k-token window fills faster than you'd think when you're reading full files and collecting search results. I'll cover how the harness manages that later.
Here's the full system:
Tickets come in from any source. They get normalized into a standard schema, queued by priority, and dispatched to a worker pod running the agentic harness. The harness gives the agent a set of skills (search, read, edit, test, etc.) and manages context, checkpoints, and permissions. The Knowledge Layer is what makes the agent's search skills useful. Without it, search_code("getProfile") returns nothing.
Let me build each piece.
The agent is only as good as what it can search. If a developer files a ticket about a NullPointerException in UserService.getProfile(), the agent needs to find the actual code for that method, understand what it does, find who calls it, and look up whether anyone has fixed a similar bug before. All of that depends on the Knowledge Layer being populated and fresh.
The first step is turning source files into structured records. Raw text files are useless for search: you'd be embedding entire files and hoping cosine similarity lands on the right 10-line function in a 500-line file. Instead, we parse each file into its AST and extract every named declaration (function, method, class) as a separate record.
tree-sitter handles this across languages. It produces a concrete syntax tree that we walk to extract symbols with their metadata.
Given a source file, we parse it and walk the tree to find function and class definitions:
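The production pipeline does this with tree-sitter so one walk covers every language; as a minimal single-language sketch of the same idea, here is the walk using Python's stdlib `ast` module (a stand-in, not the actual implementation):

```python
import ast

def extract_symbols(source: str, file_path: str) -> list[dict]:
    """Emit one record per named declaration (function, method, class)."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            records.append({
                "name": node.name,
                "kind": type(node).__name__,   # FunctionDef / ClassDef / ...
                "file": file_path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return records
```

tree-sitter's node traversal plays the role of `ast.walk` here; the record shape is the same either way.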
The _extract_function helper pulls out everything we need for a single symbol:
For a Java file like UserService.java, the parser would emit a record for each method:
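Something like this, shown as a Python dict (the path and line numbers are hypothetical; the names come from the running example):

```python
# Illustrative record for one method of UserService.java:
record = {
    "qualified_name": "UserService.getProfile",
    "class": "UserService",
    "file": "src/main/java/org/svc/UserService.java",  # hypothetical path
    "signature": "public User getProfile(String userId)",
    "docstring": "Returns null on cache miss.",
    "start_line": 141,   # hypothetical
    "end_line": 146,     # hypothetical
    "language": "java",
}
```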
Each parsed symbol becomes one chunk. This is the right granularity for code search: when a ticket mentions getProfile, you want to retrieve that specific method, not the entire 800-line UserService.java file.
Two exceptions handle edge cases:
Large functions. If a function body exceeds 400 tokens, split it into overlapping sub-chunks of 400 tokens each with 80-token overlap. Attach the parent function metadata (name, class, file) to every sub-chunk so retrieval results are always locatable.
Class-level summaries. In addition to per-method chunks, emit one chunk per class containing only the declaration, field declarations, and constructor signatures (no method bodies). This lets queries like "what fields does UserService have?" hit a compact chunk instead of wading through all method bodies.
This is the most important design decision in the entire indexing pipeline. What text string you pass to the embedding model determines retrieval quality. Embedding raw source code performs poorly because ticket queries are natural language ("NullPointerException when fetching user profile from cache") and code is, well, code. The semantic gap between a bug report and a function body is huge.
The fix is to assemble an enriched text string per chunk that combines natural language context with the code:
Why this works: the ticket query "NullPointerException when fetching user profile from cache" is natural language. The enriched string contains the docstring ("Returns null on cache miss") which bridges the semantic gap. The class and method name provide symbol-level anchoring. The full source is there so the embedding also captures code-level patterns.
Here's the template:
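A sketch of what it might look like, assuming the chunk fields produced by the indexing step:

```python
def enrich(chunk: dict) -> str:
    """Build the string that actually gets embedded: natural-language
    context first, then the raw source. Field names are illustrative."""
    return (
        f"File: {chunk['file']}\n"
        f"Class: {chunk.get('class') or '(module level)'}\n"
        f"Function: {chunk['signature']}\n"
        f"Documentation: {chunk['docstring']}\n\n"
        f"Code:\n{chunk['source']}"
    )
```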
Without enrichment, the embedding model sees public User getProfile(String userId) { User user = cache.get(userId); return user; } and has no idea that this function relates to "user profile cache miss." With enrichment, the docstring, class name, and method signature all contribute semantic signal that the embedding model can use to connect the dots.
For the embedding model, I'm using OpenAI's text-embedding-3-large. The reasoning: since the enriched input strings are a hybrid of natural language and code (not raw code), a general-purpose embedding model handles them well. A code-specialized model like voyage-code-2 would give marginal gains on raw code syntax, but the natural language portions (docstrings, signatures, file paths) are where the ticket-to-code bridge happens. Using the same vendor as the LLM also means one API key, one billing relationship, one rate-limit budget.
Each chunk gets stored in three places, each optimized for a different query pattern:
pgvector for semantic search. When the agent searches "cache miss null pointer", the query gets embedded and we find the nearest chunks by cosine similarity.
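A sketch of that lookup, assuming a `chunks` table with a pgvector embedding column (text-embedding-3-large produces 3072-dimensional vectors). `<=>` is pgvector's cosine-distance operator, so ascending order returns the most similar chunks first:

```python
# Hypothetical table: chunks(id, qualified_name, embedding vector(3072)).
SEMANTIC_SEARCH_SQL = """
SELECT id, qualified_name, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

def semantic_search(conn, query_embedding: list[float], k: int = 10):
    """Nearest-neighbour search over enriched chunk embeddings."""
    with conn.cursor() as cur:
        cur.execute(SEMANTIC_SEARCH_SQL, {"query": query_embedding, "k": k})
        return cur.fetchall()
```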
Elasticsearch for BM25 keyword search. When the agent searches getProfile, we want exact identifier matching, not fuzzy semantic similarity. The text field is tokenized source code stripped of punctuation so BM25 matches on identifiers like getProfile, userId, cache.
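The tokenization can be as simple as keeping identifier-shaped tokens and dropping everything else — a sketch:

```python
import re

def code_tokens(source: str) -> str:
    """Strip punctuation so BM25 indexes bare identifiers."""
    return " ".join(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source))
```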
Postgres for the dependency graph. Who calls getProfile? What does UserService import? This is where the graph queries live.
Embedding search answers "what code is semantically related to this ticket?" but it can't answer "who calls getProfile?" or "what does UserService depend on?". Those are graph traversal questions, and they come up constantly during bug investigation. If the agent changes getProfile's return type from User to Optional<User>, it needs to find every caller and update them.
The dependency graph is extracted during the same tree-sitter parse pass that produces symbol records. For each function body, we walk the AST looking for call expressions and attribute accesses, then resolve them against the symbols we've already extracted from the same repo.
For UserController.get_user(), parsing the body finds self.user_service.get_profile(user_id), which we resolve to the symbol UserService.get_profile. For UserService.get_profile(), we find self.cache.get(user_id), resolved to UserCache.get.
The resolution step matches raw call text against the known symbol table for the repo. self.user_service.get_profile gets resolved by checking the class's __init__ to find that self.user_service is a UserService instance, then looking up UserService.get_profile in the symbol index. For cases where static analysis can't resolve the type (dynamic dispatch, duck typing), we fall back to a name-based heuristic: match get_profile against all symbols named get_profile in the repo and rank by file proximity.
Each resolved call becomes a row in Postgres:
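A sketch of the table and of the row the example call would produce; column names, file path, and line number are my assumptions:

```python
# Hypothetical schema: one row per resolved call site.
CREATE_EDGES_SQL = """
CREATE TABLE IF NOT EXISTS call_edges (
    repo    TEXT NOT NULL,
    caller  TEXT NOT NULL,    -- e.g. 'UserController.get_user'
    callee  TEXT NOT NULL,    -- e.g. 'UserService.get_profile'
    file    TEXT NOT NULL,
    line    INTEGER NOT NULL,
    PRIMARY KEY (repo, caller, callee, file, line)
);
"""

# The resolved call from UserController.get_user becomes:
edge = {
    "repo": "org/svc",
    "caller": "UserController.get_user",
    "callee": "UserService.get_profile",
    "file": "user_controller.py",  # hypothetical
    "line": 27,                    # hypothetical
}
```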
The agent queries this graph through a get_callers skill:
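A sketch of the skill, assuming the call graph lives in a `call_edges(repo, caller, callee, file, line)` table in Postgres:

```python
GET_CALLERS_SQL = """
SELECT caller, file, line
FROM call_edges
WHERE repo = %(repo)s AND callee = %(symbol)s;
"""

def get_callers(conn, symbol: str, repo: str) -> list[dict]:
    """Skill: return every recorded call site for a symbol.

    A single indexed lookup against the call graph -- no text search.
    """
    with conn.cursor() as cur:
        cur.execute(GET_CALLERS_SQL, {"repo": repo, "symbol": symbol})
        return [{"caller": c, "file": f, "line": l}
                for (c, f, l) in cur.fetchall()]
```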
When the agent runs get_callers("UserService.get_profile", "org/svc"), it gets back:
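Illustratively, something like this — the first caller comes from the earlier example; the second caller, both file paths, and the line numbers are hypothetical:

```python
callers = [
    {"caller": "UserController.get_user", "file": "user_controller.py", "line": 27},
    {"caller": "ProfileBatchJob.run",     "file": "profile_batch_job.py", "line": 88},
]
```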
That's two callers to check when changing the return type. Without this graph, the agent would have to do a text search for "get_profile" across the entire repo and parse the results to figure out which matches are actual call sites vs. comments or string literals. The graph makes it a single indexed query.
The reverse direction is useful too. get_dependencies("UserService.get_profile", "org/svc") tells the agent what get_profile itself depends on (in this case, UserCache.get), which helps when tracing the root cause upstream.
Neither vector search nor keyword search is good enough on its own. Vector search finds semantically similar code but misses exact identifier matches. BM25 finds exact identifiers but doesn't understand that "cache miss" relates to cache.get() returning null. The solution is to run both and fuse the results.
Reciprocal rank fusion (RRF) combines the two ranked lists without needing to normalize scores across different systems:
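A minimal implementation: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with the conventional k = 60:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant k damps the influence of any single list's top ranks, which is why no score normalization across the two systems is needed.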
When the agent queries "NullPointerException getProfile cache", vector search ranks the getProfile chunk high because the enriched embedding matches the semantic meaning. BM25 also ranks it high because "getProfile" and "cache" appear literally in the source. Chunks that score well on both end up at the top.
The knowledge base is useless if it falls behind the code. Every git push to a registered repo triggers a webhook that runs incremental re-indexing:
- git diff --name-only {before_sha}..{after_sha} to get the changed files
- re-parse, re-chunk, and re-embed only those files
- update last_indexed_sha in the repo registry

A typical push touching 3-5 files completes in under 10 seconds. The code index is always within one push of being current.
Two additional indexes give the agent a memory of past work:
Ticket history. Past tickets are embedded as "{title}\n\n{description}" and stored. When a new ticket arrives, search_similar_tickets() retrieves past tickets with similar error messages. If someone already filed and fixed "NPE in UserService" three months ago, the agent gets a head start.
Merged PR store. Merged PRs linked to bug tickets are stored as "{pr_title}\n\n{pr_description}\n\n{diff_summary}". The diff summary (not the raw diff) is what gets embedded, because raw diffs are noisy for similarity search. A one-paragraph human-readable summary of what changed and why produces much better recall.
The harness is the runtime that wraps the LLM. It manages the loop, executes skill calls, enforces permissions, and maintains state. The reasoning about what to do lives entirely inside the LLM.
The agent has access to a registry of skills, each with a name, input schema, and permission level:
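A sketch of what that registry could look like; the skill names echo the ones used in this post, but the exact set, schemas, and permission levels are my assumptions:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """One entry in the agent's skill registry."""
    name: str
    description: str
    input_schema: dict
    permission: str  # "read" | "write" | "execute"

REGISTRY = [
    Skill("search_code", "Hybrid vector + BM25 search over the code index",
          {"query": "string"}, "read"),
    Skill("read_file", "Read a file, optionally a line range",
          {"path": "string", "start": "int?", "end": "int?"}, "read"),
    Skill("get_callers", "Call-graph lookup: who calls this symbol?",
          {"symbol": "string", "repo": "string"}, "read"),
    Skill("remember", "Pin a finding to the scratchpad",
          {"note": "string"}, "read"),
    Skill("edit_file", "Apply a patch to a file in the sandbox",
          {"path": "string", "patch": "string"}, "write"),
    Skill("run_tests", "Run the repo's test suite in the sandbox",
          {"target": "string?"}, "execute"),
]
```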
The full registry is injected into the system prompt so the agent always knows what it can do:
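Rendering it into the prompt is a simple formatting pass — a sketch, assuming each skill is a dict carrying a name, input schema, permission level, and description:

```python
def render_registry(skills: list[dict]) -> str:
    """Render the skill registry into a system-prompt block."""
    lines = ["You have the following skills. Call exactly one per step."]
    for s in skills:
        params = ", ".join(s["input_schema"])
        lines.append(f'- {s["name"]}({params}) [{s["permission"]}]: {s["description"]}')
    return "\n".join(lines)
```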
The system prompt defines methodology and constraints. This is the most important design surface in a single-agent system.
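A condensed sketch of what such a prompt might contain (the wording is illustrative, not the production prompt):

```
You are a senior engineer fixing bugs autonomously.

Methodology:
1. Localize the failure before editing anything.
2. Pin your diagnosis with remember() before writing code.
3. Run the linter after every edit; run the tests after every change set.
4. Prefer the smallest fix that addresses the root cause.

Constraints:
- Never edit files outside the sandbox checkout.
- Never force-push or rewrite history.
- If you cannot localize the bug, report findings instead of guessing a fix.
```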
With a single agent holding everything in one context, window management is the critical engineering problem. A 200k-token window fills faster than you'd expect: a few full-file reads, several rounds of search results, and the growing conversation history can claim most of the budget before the first edit lands.
The harness uses three strategies:
Scratchpad (pinned memory). The agent writes key findings via remember(). The scratchpad is always present in the context, never compressed. It acts as working memory.
Rolling compression. Tool results older than 10 steps get replaced with agent-written summaries. A 300-line file read becomes: [UserService.java read at step 3. Key: getProfile() at line 143 returns raw User from cache.get(). No null check.]
Selective re-read. Instead of keeping large files in context, the agent re-reads only the specific lines it needs. read_file("UserService.java", start=140, end=155) is cheap and targeted.
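The rolling-compression pass can be sketched as follows, assuming each tool-result message carries an agent-written summary field (the message shape is my assumption):

```python
def compress_old_results(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Replace tool results older than `keep_recent` steps with their
    agent-written summaries; recent messages pass through untouched."""
    cutoff = len(messages) - keep_recent
    out = []
    for i, msg in enumerate(messages):
        if i < cutoff and msg.get("role") == "tool" and msg.get("summary"):
            out.append({"role": "tool", "content": f"[compressed] {msg['summary']}"})
        else:
            out.append(msg)
    return out
```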
Let me walk through the full flow. A developer files a Jira ticket: "NullPointerException in UserService.getProfile() after cache miss. Stack trace: UserService.java:143".
Jira fires a webhook. The Ingestion Service normalizes it into a TicketEvent:
The dispatcher creates a session, clones the repo into a sandbox, creates branch fix/PROJ-1234-auto, and starts the agent loop.
Here's the actual execution trace. Each step is one iteration of the THINK → ACT → OBSERVE loop.
Investigation phase:
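An illustrative sketch of those seven iterations — one skill call per step, observations condensed, contents hypothetical:

```
step 1: read_file("UserService.java", start=130, end=160)
        -> getProfile() at line 143 returns cache.get(userId) with no null check
step 2: search_code("getProfile null check cache miss")
        -> callers assume a non-null User; no fallback path in the service
step 3: get_callers("UserService.getProfile", "org/svc")
        -> 2 call sites to account for
step 4: get_dependencies("UserService.getProfile", "org/svc")
        -> UserCache.get, documented to return null on a miss
step 5: remember("Null originates in UserCache.get on miss; getProfile propagates it")
step 6: search_similar_tickets("NullPointerException getProfile cache")
        -> no prior fix for this path
step 7: remember("Fix: return Optional<User> from getProfile and update both callers")
```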
Seven steps. The agent read one file, ran one search, checked past tickets, and pinned its diagnosis. A multi-agent pipeline would have burned three full context windows to reach the same conclusion.
Implementation phase:
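A sketch of the implementation steps; skill names like run_linter and the step contents are hypothetical:

```
step 8:  edit_file("UserService.java", ...)     # change getProfile to return Optional<User>
step 9:  run_linter("UserService.java")
         -> error: cannot find symbol 'Optional'
step 10: edit_file("UserService.java", ...)     # add the java.util.Optional import
step 11: run_linter("UserService.java")  -> clean
step 12: edit_file("UserController.java", ...)  # update caller to unwrap the Optional
step 13: edit_file(<second caller>, ...)        # same change at the other call site
step 14: run_linter()                    -> repo-wide clean
```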
Notice how the agent caught the missing import by running the linter immediately after editing, then fixed it in the next step. This is the self-correction loop that makes the single-agent pattern work. No orchestrator needed.
Validation phase:
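A sketch of the validation steps around the failure (contents hypothetical):

```
step 15: edit_file("UserServiceTest.java", ...)  # add regression test for the cache-miss path
steps 16-20: iterate on the regression test until lint and compile are clean
step 21: run_tests("UserServiceTest")
         -> FAIL: testGetProfile expected User, got Optional<User>
step 22: read_file("UserServiceTest.java", start=40, end=60)
         -> the old assertion dereferences the returned User directly
step 23: edit_file("UserServiceTest.java", ...)  # unwrap the Optional in the old assertion
step 24: run_tests("UserServiceTest")  -> PASS
step 25: run_tests()                   -> full suite PASS
```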
The test failure at step 21 is the kind of thing that trips up pipeline architectures. The CodeGen agent in a multi-agent system would need to signal back to a test-fixing agent, which would need access to the original diagnosis context to understand why the return type changed. Here, the agent just reads the failure message, understands it immediately (it made the change that caused the failure), and fixes it.
PR creation:
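A sketch of the final steps; the git and PR skill names are hypothetical:

```
step 26: git_commit("Fix NPE in UserService.getProfile on cache miss (PROJ-1234)")
step 27: push_branch("fix/PROJ-1234-auto")
step 28: create_pr(title=..., body=<diagnosis and change summary from the scratchpad>)
step 29: update_ticket("PROJ-1234", <link to the new PR>)
```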
Total: 29 skill calls, ~48k tokens, roughly $0.60, about 3 minutes from ticket creation to PR ready for review.
The full system has a lot of infrastructure (Kafka, sandboxed containers, webhook services). But the core building blocks are surprisingly accessible. The companion notebook implements three of them end-to-end:
The gap between the notebook version and the production system is mostly operational: container orchestration, webhook plumbing, and the persistence layer. The intelligence (how to index code, how to search it, how to reason about bugs) is all in the notebook.
What I'd explore next: feeding the agent's merged PRs back into the knowledge base so it learns from its own fixes over time. If the agent fixed three NPE-from-cache-miss bugs last month, the fourth one should be faster.
Originally published on AI Terminal.
Tags: rag, react, agents, bug-fix, tree-sitter