From Jira to Pull Request: Building an AI Bug-Fix Agent
End-to-end AI bug-fix pipeline: tree-sitter code indexing, hybrid retrieval (vector + BM25), and a ReAct agent loop that processes a ticket and generates a fix.
In Part 1 I built a Copilot clone for inline code completion. In Part 2 I built an NL2Code agent that generates full scripts from English descriptions. Both of those posts are about writing new code. This one goes the other direction: reading a bug ticket, finding the broken code, fixing it, and shipping a pull request. No human in the loop.
The trigger was a discussion at work about our move to a monorepo. The main motivation for the migration is that AI tools like Claude Code work best when they have all the context in one place: one repo, one dependency graph, one search index. But the flip side is that a monorepo with dozens of contributors generates a lot of GitHub issues and PRs. We started looking at how large open-source projects like Kubernetes and OpenClaw handle that volume, and the conversation landed on automated PR generation for routine bug fixes. The question that stuck with me was: what would it take to go from a Jira ticket saying “NullPointerException in UserService.getProfile()” to a pull request that actually fixes the bug, with no human touching the keyboard in between? I spent the past few weeks building one to find out.
This is Part 3 of a 3-part series on AI-assisted code generation. Part 1 covered inline completion (FIM, code models, LoRA fine-tuning). Part 2 covered NL2Code with agents. This post covers the fix side: ingesting a bug ticket, building a searchable knowledge base from the repo, and running a single-agent loop that investigates, patches, tests, and opens a PR.
Why a Single Agent
The obvious architecture for “ticket → PR” is a pipeline of specialized agents. A Triage agent reads the ticket and classifies it. A Planning agent decides what to investigate. A CodeGen agent writes the fix. A Review agent checks the fix. Clean separation of concerns.
The problem is information loss. Every hand-off between agents requires serializing one agent’s findings into a summary that the next agent consumes. Summaries drop details. The CodeGen agent can’t go back and re-read the stack trace because it only received the Planning agent’s distilled version of it. If the fix attempt reveals that the initial diagnosis was wrong, there’s no way to pivot without restarting the entire pipeline.
Triage → [serializes findings] → Planning → [serializes plan] → CodeGen
↑ lost details ↑ lost details
A single agent running inside a harness maintains one continuous context window across the entire lifecycle of a ticket. It can re-read the stack trace while writing code. It can change its diagnosis mid-implementation when the code contradicts its initial theory. It can skip investigation steps when the stack trace is already definitive. This mirrors how an experienced engineer actually works: you hold the full problem in your head and fluidly move between investigation, planning, and implementation. You don’t hand off a memo to a different person at each step.
The trade-off is context window pressure. A 200k-token window fills faster than you’d think when you’re reading full files and collecting search results. I’ll cover how the harness manages that later.
Architecture: The 30-Second Version
Here’s the full system:
TICKET SOURCES REPO SOURCES
───────────── ────────────
Jira, GitHub Issues, github.com/org/*
PagerDuty, Linear gitlab.internal/*
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────────┐
│ INGESTION │ │ CODE INDEXING │
│ SERVICE │ │ PIPELINE │
│ │ │ │
│ • Normalize │ │ • Clone repo │
│ • Classify │ │ • Parse (tree-sitter) │
│ • Deduplicate │ │ • Chunk at AST bounds │
│ • Enrich │ │ • Embed + store │
└────────┬────────┘ └───────────┬──────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────────┐
│ TICKET QUEUE │ │ KNOWLEDGE LAYER │
│ (Kafka) │ │ │
│ │ │ pgvector (embeddings) │
│ tickets.p1_p2 │ │ Elasticsearch (BM25) │
│ tickets.p3_p4 │ │ Postgres (dep graph, │
└────────┬────────┘ │ ticket history, │
│ │ merged PRs) │
▼ └───────────┬──────────┘
┌──────────────────────────────────────────┴──────────┐
│ AGENTIC HARNESS (worker pod) │
│ │
│ [system prompt] [ticket] [tool results] [scratchpad]│
│ ◀──────────── 200k token context ─────────────────▶│
│ │
│ OBSERVE → THINK → ACT → OBSERVE → ... │
│ │ │
│ ┌───────┴───────┐ │
│ │ SKILL REGISTRY│ │
│ │ search_code │ │
│ │ read_file │ │
│ │ edit_file │ │
│ │ run_tests │ │
│ │ create_pr │ │
│ │ ... │ │
│ └───────────────┘ │
└──────────────────────────┬──────────────────────────┘
│
▼
GitHub API → PR #2891
Tickets come in from any source. They get normalized into a standard schema, queued by priority, and dispatched to a worker pod running the agentic harness. The harness gives the agent a set of skills (search, read, edit, test, etc.) and manages context, checkpoints, and permissions. The Knowledge Layer is what makes the agent’s search skills useful. Without it, search_code("getProfile") returns nothing.
Let me build each piece.
Building the Knowledge Base
The agent is only as good as what it can search. If a developer files a ticket about a NullPointerException in UserService.getProfile(), the agent needs to find the actual code for that method, understand what it does, find who calls it, and look up whether anyone has fixed a similar bug before. All of that depends on the Knowledge Layer being populated and fresh.
Parsing Code with tree-sitter
The first step is turning source files into structured records. Raw text files are useless for search: you’d be embedding entire files and hoping cosine similarity lands on the right 10-line function in a 500-line file. Instead, we parse each file into its AST and extract every named declaration (function, method, class) as a separate record.
tree-sitter handles this across languages. It produces a concrete syntax tree that we walk to extract symbols with their metadata.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser
PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)
Given a source file, we parse it and walk the tree to find function and class definitions:
def _walk(node):
    """Depth-first traversal over every node in the tree."""
    yield node
    for child in node.children:
        yield from _walk(child)

def extract_symbols(source: str, file_path: str):
    tree = parser.parse(source.encode())
    symbols = []
    for node in _walk(tree.root_node):
        if node.type == "function_definition":
            symbols.append(_extract_function(node, file_path))
        elif node.type == "class_definition":
            symbols.append(_extract_class(node, file_path))
    return symbols
The _extract_function helper pulls out everything we need for a single symbol:
def _extract_function(node, file_path: str) -> dict:
    name_node = node.child_by_field_name("name")
    params_node = node.child_by_field_name("parameters")
    body_node = node.child_by_field_name("body")
    docstring = _get_docstring(body_node)
    return {
        "symbol_type": "function",
        "symbol_name": name_node.text.decode(),
        "file_path": file_path,
        "start_line": node.start_point[0] + 1,
        "end_line": node.end_point[0] + 1,
        "parameters": params_node.text.decode() if params_node else "",
        "docstring": docstring,
        "full_source": node.text.decode(),
    }

def _get_docstring(body_node) -> str:
    """Best-effort docstring: the leading string literal of the body, if any."""
    if body_node is None or not body_node.children:
        return ""
    first = body_node.children[0]
    if first.type == "expression_statement" and first.children \
            and first.children[0].type == "string":
        return first.children[0].text.decode().strip("\"'")
    return ""
For a Java file like UserService.java, the parser would emit a record for each method:
{
"symbol_type": "method",
"symbol_name": "getProfile",
"class_name": "UserService",
"file_path": "src/services/UserService.java",
"start_line": 138,
"end_line": 152,
"parameters": "String userId",
"return_type": "User",
"docstring": "Returns the cached profile for a user ID. Returns null on cache miss.",
"full_source": "public User getProfile(String userId) {\n User user = cache.get(userId);\n return user;\n}"
}
Chunking at AST Boundaries
Each parsed symbol becomes one chunk. This is the right granularity for code search: when a ticket mentions getProfile, you want to retrieve that specific method, not the entire 800-line UserService.java file.
Two exceptions handle edge cases:
Large functions. If a function body exceeds 400 tokens, split it into overlapping sub-chunks of 400 tokens each with 80-token overlap. Attach the parent function metadata (name, class, file) to every sub-chunk so retrieval results are always locatable.
Class-level summaries. In addition to per-method chunks, emit one chunk per class containing only the declaration, field declarations, and constructor signatures (no method bodies). This lets queries like “what fields does UserService have?” hit a compact chunk instead of wading through all method bodies.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
MAX_CHUNK_TOKENS = 400
OVERLAP_TOKENS = 80
def chunk_symbol(symbol: dict) -> list[dict]:
source = symbol["full_source"]
token_count = len(enc.encode(source))
if token_count <= MAX_CHUNK_TOKENS:
return [symbol]
# Split large functions into overlapping sub-chunks
tokens = enc.encode(source)
chunks = []
start = 0
idx = 0
while start < len(tokens):
end = min(start + MAX_CHUNK_TOKENS, len(tokens))
chunk_text = enc.decode(tokens[start:end])
chunk = {**symbol, "full_source": chunk_text, "chunk_index": idx}
chunks.append(chunk)
start += MAX_CHUNK_TOKENS - OVERLAP_TOKENS
idx += 1
return chunks
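The class-level summary chunk is simpler: it assembles a compact chunk from data the parse pass already extracted. A minimal sketch, assuming the caller has collected field declarations and method signatures during the tree-sitter walk (the helper name and input shapes here are mine, not the production pipeline's):

```python
def class_summary_chunk(class_symbol: dict, field_decls: list[str],
                        method_sigs: list[str]) -> dict:
    """Build the per-class summary chunk: declaration line, field
    declarations, and signatures only -- no method bodies."""
    lines = [f"class {class_symbol['symbol_name']}:"]
    lines += [f"    {f}" for f in field_decls]
    lines += [f"    {s}" for s in method_sigs]
    return {**class_symbol,
            "symbol_type": "class_summary",
            "full_source": "\n".join(lines)}
```

A query like "what fields does UserService have?" then lands on this one small chunk instead of a dozen method bodies.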
The Enrichment Trick
This is the most important design decision in the entire indexing pipeline. What text string you pass to the embedding model determines retrieval quality. Embedding raw source code performs poorly because ticket queries are natural language (“NullPointerException when fetching user profile from cache”) and code is, well, code. The semantic gap between a bug report and a function body is huge.
The fix is to assemble an enriched text string per chunk that combines natural language context with the code:
File: src/services/UserService.java
Class: UserService
Method: public User getProfile(String userId)
Summary: Returns the cached profile for a user ID.
Returns null on cache miss.
Source:
public User getProfile(String userId) {
User user = cache.get(userId);
return user;
}
Why this works: the ticket query “NullPointerException when fetching user profile from cache” is natural language. The enriched string contains the docstring (“Returns null on cache miss”) which bridges the semantic gap. The class and method name provide symbol-level anchoring. The full source is there so the embedding also captures code-level patterns.
Here’s the template:
def build_embedding_input(symbol: dict) -> str:
parts = [f"File: {symbol['file_path']}"]
if symbol.get("class_name"):
parts.append(f"Class: {symbol['class_name']}")
sig = symbol.get("signature", symbol["symbol_name"])
parts.append(f"{symbol['symbol_type'].title()}: {sig}")
if symbol.get("docstring"):
parts.append(f"\nSummary: {symbol['docstring']}")
parts.append(f"\nSource:\n{symbol['full_source']}")
return "\n".join(parts)
Without enrichment, the embedding model sees public User getProfile(String userId) { User user = cache.get(userId); return user; } and has no idea that this function relates to “user profile cache miss.” With enrichment, the docstring, class name, and method signature all contribute semantic signal that the embedding model can use to connect the dots.
Embedding and Storage
For the embedding model, I’m using OpenAI’s text-embedding-3-large. The reasoning: since the enriched input strings are a hybrid of natural language and code (not raw code), a general-purpose embedding model handles them well. A code-specialized model like voyage-code-2 would give marginal gains on raw code syntax, but the natural language portions (docstrings, signatures, file paths) are where the ticket-to-code bridge happens. Using the same vendor as the LLM also means one API key, one billing relationship, one rate-limit budget.
import openai
def embed_chunks(chunks: list[dict]) -> list[dict]:
texts = [build_embedding_input(c) for c in chunks]
response = openai.embeddings.create(
model="text-embedding-3-large",
input=texts,
)
for chunk, item in zip(chunks, response.data):
chunk["embedding"] = item.embedding
return chunks
Each chunk gets stored in three places, each optimized for a different query pattern:
pgvector for semantic search. When the agent searches “cache miss null pointer”, the query gets embedded and we find the nearest chunks by cosine similarity.
INSERT INTO code_chunks (
chunk_id, repo, file_path, start_line, end_line,
symbol_name, class_name, language,
full_source, embedding -- vector(3072)
) VALUES (...);
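On the query side, semantic search is a single ORDER BY over cosine distance, pgvector's `<=>` operator. A sketch, assuming the embedding column is indexed with vector_cosine_ops; the parameter names are illustrative:

```sql
-- Nearest chunks by cosine distance to the embedded ticket query
SELECT chunk_id, file_path, symbol_name, start_line
FROM code_chunks
WHERE repo = %(repo)s
ORDER BY embedding <=> %(query_embedding)s
LIMIT 10;
```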
Elasticsearch for BM25 keyword search. When the agent searches getProfile, we want exact identifier matching, not fuzzy semantic similarity. The text field is tokenized source code stripped of punctuation so BM25 matches on identifiers like getProfile, userId, cache.
Postgres for the dependency graph. Who calls getProfile? What does UserService import? This is where the graph queries live.
Building the Dependency Graph
Embedding search answers “what code is semantically related to this ticket?” but it can’t answer “who calls getProfile?” or “what does UserService depend on?”. Those are graph traversal questions, and they come up constantly during bug investigation. If the agent changes getProfile’s return type from User to Optional<User>, it needs to find every caller and update them.
The dependency graph is extracted during the same tree-sitter parse pass that produces symbol records. For each function body, we walk the AST looking for call expressions and attribute accesses, then resolve them against the symbols we’ve already extracted from the same repo.
def extract_calls(node, file_path: str, class_name: str = None):
"""Walk function body and extract call targets."""
calls = []
for child in _walk(node):
if child.type == "call":
func = child.child_by_field_name("function")
if func is None:
continue
call_text = func.text.decode()
calls.append(call_text)
return calls
For UserController.get_user(), parsing the body finds self.user_service.get_profile(user_id), which we resolve to the symbol UserService.get_profile. For UserService.get_profile(), we find self.cache.get(user_id), resolved to UserCache.get.
The resolution step matches raw call text against the known symbol table for the repo. self.user_service.get_profile gets resolved by checking the class’s __init__ to find that self.user_service is a UserService instance, then looking up UserService.get_profile in the symbol index. For cases where static analysis can’t resolve the type (dynamic dispatch, duck typing), we fall back to a name-based heuristic: match get_profile against all symbols named get_profile in the repo and rank by file proximity.
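That name-based fallback can be sketched as a proximity-ranked lookup. This is a hypothetical helper, assuming a symbol_index that maps bare names to symbol records:

```python
import os

def resolve_call(call_text: str, symbol_index: dict, caller_file: str):
    """Fallback resolution: match the final attribute segment against all
    known symbols with that name, rank by shared directory prefix with
    the caller's file (closer files win)."""
    name = call_text.split(".")[-1]
    candidates = symbol_index.get(name, [])
    caller_dir = os.path.dirname(caller_file).split("/")

    def proximity(sym):
        sym_dir = os.path.dirname(sym["file_path"]).split("/")
        shared = 0
        for a, b in zip(caller_dir, sym_dir):
            if a != b:
                break
            shared += 1
        return shared

    return max(candidates, key=proximity, default=None)
```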
Each resolved call becomes a row in Postgres:
CREATE TABLE symbol_dependencies (
repo TEXT,
from_symbol TEXT,
to_symbol TEXT,
dep_type TEXT, -- 'calls', 'imports', 'inherits', 'field'
file_path TEXT,
line_number INTEGER
);
INSERT INTO symbol_dependencies (repo, from_symbol, to_symbol, dep_type, file_path, line_number)
VALUES
('org/svc', 'UserService.get_profile', 'UserCache.get', 'calls',
'src/services/user_service.py', 14),
('org/svc', 'UserService.update_profile', 'UserService.get_profile', 'calls',
'src/services/user_service.py', 19),
('org/svc', 'UserController.get_user', 'UserService.get_profile', 'calls',
'src/api/user_controller.py', 13),
('org/svc', 'UserService', 'UserCache', 'field',
'src/services/user_service.py', 9);
The agent queries this graph through a get_callers skill:
def get_callers(symbol_name: str, repo: str) -> list[dict]:
"""Find all symbols that call the given symbol."""
rows = db.execute(
"SELECT from_symbol, file_path, line_number "
"FROM symbol_dependencies "
"WHERE to_symbol = %s AND repo = %s AND dep_type = 'calls'",
(symbol_name, repo),
).fetchall()
return [
{"caller": r[0], "file": r[1], "line": r[2]}
for r in rows
]
When the agent runs get_callers("UserService.get_profile", "org/svc"), it gets back:
[
{"caller": "UserService.update_profile", "file": "src/services/user_service.py", "line": 19},
{"caller": "UserController.get_user", "file": "src/api/user_controller.py", "line": 13}
]
That’s two callers to check when changing the return type. Without this graph, the agent would have to do a text search for “get_profile” across the entire repo and parse the results to figure out which matches are actual call sites vs. comments or string literals. The graph makes it a single indexed query.
The reverse direction is useful too. get_dependencies("UserService.get_profile", "org/svc") tells the agent what get_profile itself depends on (in this case, UserCache.get), which helps when tracing the root cause upstream.
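get_dependencies is get_callers with the WHERE clause flipped. A self-contained sketch, using sqlite3 in place of Postgres for illustration (hence ? placeholders instead of %s):

```python
import sqlite3

def get_dependencies(db, symbol_name: str, repo: str) -> list[dict]:
    """Find all symbols the given symbol depends on (reverse of get_callers)."""
    rows = db.execute(
        "SELECT to_symbol, dep_type, line_number FROM symbol_dependencies "
        "WHERE from_symbol = ? AND repo = ?",
        (symbol_name, repo),
    ).fetchall()
    return [{"dependency": r[0], "type": r[1], "line": r[2]} for r in rows]

# Toy fixture mirroring the symbol_dependencies schema above
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE symbol_dependencies (
    repo TEXT, from_symbol TEXT, to_symbol TEXT,
    dep_type TEXT, file_path TEXT, line_number INTEGER)""")
db.execute("INSERT INTO symbol_dependencies VALUES "
           "('org/svc', 'UserService.get_profile', 'UserCache.get', 'calls', "
           "'src/services/user_service.py', 14)")
print(get_dependencies(db, "UserService.get_profile", "org/svc"))
# → [{'dependency': 'UserCache.get', 'type': 'calls', 'line': 14}]
```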
Hybrid Retrieval
Neither vector search nor keyword search is good enough on its own. Vector search finds semantically similar code but misses exact identifier matches. BM25 finds exact identifiers but doesn’t understand that “cache miss” relates to cache.get() returning null. The solution is to run both and fuse the results.
import numpy as np
from rank_bm25 import BM25Okapi

def cosine_sim(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_search(query: str, chunks: list[dict], top_k: int = 10):
    # Vector search
    q_emb = openai.embeddings.create(
        model="text-embedding-3-large", input=[query]
    ).data[0].embedding
    vec_scores = {
        c["chunk_id"]: cosine_sim(q_emb, c["embedding"])
        for c in chunks
    }
    # BM25 keyword search
    corpus = [c["full_source"].split() for c in chunks]
    bm25 = BM25Okapi(corpus)
    bm25_scores_raw = bm25.get_scores(query.split())
    bm25_scores = {
        chunks[i]["chunk_id"]: bm25_scores_raw[i]
        for i in range(len(chunks))
    }
    # Reciprocal rank fusion
    return reciprocal_rank_fusion(vec_scores, bm25_scores, top_k)
Reciprocal rank fusion (RRF) combines the two ranked lists without needing to normalize scores across different systems:
def _to_ranks(scores: dict) -> dict:
    """Map each id to its 1-based rank, best score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {chunk_id: rank for rank, chunk_id in enumerate(ordered, start=1)}

def reciprocal_rank_fusion(
    scores_a: dict, scores_b: dict, top_k: int, k: int = 60
) -> list[str]:
    rank_a = _to_ranks(scores_a)
    rank_b = _to_ranks(scores_b)
    all_ids = set(rank_a) | set(rank_b)
    fused = {}
    for chunk_id in all_ids:
        rrf = 0.0
        if chunk_id in rank_a:
            rrf += 1.0 / (k + rank_a[chunk_id])
        if chunk_id in rank_b:
            rrf += 1.0 / (k + rank_b[chunk_id])
        fused[chunk_id] = rrf
    return sorted(fused, key=fused.get, reverse=True)[:top_k]
When the agent queries “NullPointerException getProfile cache”, vector search ranks the getProfile chunk high because the enriched embedding matches the semantic meaning. BM25 also ranks it high because “getProfile” and “cache” appear literally in the source. Chunks that score well on both end up at the top.
Keeping It Fresh
The knowledge base is useless if it falls behind the code. Every git push to a registered repo triggers a webhook that runs incremental re-indexing:
- Run git diff --name-only {before_sha}..{after_sha} to get the changed files
- For each changed file: re-parse, diff against stored symbols, update only what changed
  - New symbols: add chunks, embed, upsert
  - Modified symbols: re-embed, upsert (overwrite by chunk_id)
  - Deleted symbols: delete from all three stores
- Update last_indexed_sha in the repo registry
A typical push touching 3-5 files completes in under 10 seconds. The code index is always within one push of being current.
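The first step of that handler is just a git diff wrapper. A minimal sketch (the function name is mine):

```python
import subprocess

def changed_files(repo_dir: str, before_sha: str, after_sha: str) -> list[str]:
    """Return paths touched between two commits, feeding incremental re-indexing."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{before_sha}..{after_sha}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return [path for path in out.stdout.splitlines() if path]
```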
Learning from Past Fixes
Two additional indexes give the agent a memory of past work:
Ticket history. Past tickets are embedded as "{title}\n\n{description}" and stored. When a new ticket arrives, search_similar_tickets() retrieves past tickets with similar error messages. If someone already filed and fixed “NPE in UserService” three months ago, the agent gets a head start.
Merged PR store. Merged PRs linked to bug tickets are stored as "{pr_title}\n\n{pr_description}\n\n{diff_summary}". The diff summary (not the raw diff) is what gets embedded, because raw diffs are noisy for similarity search. A one-paragraph human-readable summary of what changed and why produces much better recall.
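The lookup itself is plain cosine ranking over the stored embeddings. An in-memory sketch (production queries pgvector; the function and argument names are illustrative):

```python
import numpy as np

def top_k_similar(query_emb, stored: dict, top_k: int = 3) -> list[str]:
    """Rank stored ticket/PR embeddings by cosine similarity to the query."""
    q = np.asarray(query_emb, dtype=float)
    q = q / np.linalg.norm(q)
    scores = {}
    for ticket_id, emb in stored.items():
        e = np.asarray(emb, dtype=float)
        scores[ticket_id] = float(q @ (e / np.linalg.norm(e)))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```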
The Agentic Harness
The harness is the runtime that wraps the LLM. It manages the loop, executes skill calls, enforces permissions, and maintains state. The reasoning about what to do lives entirely inside the LLM.
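The skeleton of that loop fits in a page. In this sketch the LLM call and skill dispatch are injected as callables so it stays self-contained; the decision format is illustrative (the real harness uses the model's native tool-calling):

```python
import json

def run_agent(llm, execute_skill, system_prompt: str, ticket: str,
              max_calls: int = 60):
    """Minimal ReAct loop: ask the model for the next action, execute it,
    append the observation, repeat until the model emits a final answer
    or the skill-call budget runs out."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": ticket}]
    for _ in range(max_calls):
        decision = llm(messages)          # {"tool": ..., "args": ...} or {"final": ...}
        if "final" in decision:
            return decision["final"]
        result = execute_skill(decision["tool"], decision["args"])
        messages.append({"role": "tool",
                         "content": json.dumps({"skill": decision["tool"],
                                                "result": result})})
    raise RuntimeError("hit the skill-call budget without finishing")
```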
Skills and Permissions
The agent has access to a registry of skills, each with a name, input schema, and permission level:
| Permission | Skills | Behavior |
|---|---|---|
| readonly | search_code, read_file, git_log, git_blame | Always auto-approve |
| auto | write_file, edit_file, run_tests, run_linter | Auto-approve in sandbox |
| confirm | create_pr, comment_ticket | Approve once per session |
| blocked | network outside approved hosts, rm -rf | Never allowed |
The full registry is injected into the system prompt so the agent always knows what it can do:
| Skill | Description |
|---|---|
| search_code(query, file_pattern, top_k) | Hybrid semantic + BM25 search |
| read_file(path, start_line, end_line) | Read file with optional line range |
| edit_file(path, old_str, new_str) | Targeted string replacement |
| run_tests(path, timeout_s) | Run test suite |
| run_linter(path) | Static analysis + type-check |
| git_diff(base, head) | Show diff between refs |
| search_similar_tickets(query, top_k) | Find past tickets with resolutions |
| search_past_prs(query, top_k) | Find relevant merged PRs |
| think(reasoning) | Extended reasoning (logged, not executed) |
| remember(key_value_dict) | Pin findings to scratchpad |
| checkpoint(label) | Snapshot session for resumability |
| create_pr(title, body, branch) | Open a GitHub PR |
The System Prompt
The system prompt defines methodology and constraints. This is the most important design surface in a single-agent system.
You are a senior software engineer who autonomously fixes bugs and creates
pull requests. You operate inside a secure agentic harness with access to
the codebase, test runners, and version control.
METHODOLOGY:
- Explore before you edit. Read broadly, write narrowly.
- Use think() to reason explicitly before making a decision.
- Use remember() to pin critical findings to your scratchpad.
- When uncertain between two causes, gather more evidence with search_code.
- Prefer minimal diffs. Fix the bug; do not refactor unrelated code.
- If more than 70% confident in a root cause, proceed. Below 50%, ask_human.
- After writing any code, always run the linter and relevant tests.
- Use checkpoint() after completing each major phase.
CONSTRAINTS:
- All writes go to sandbox branch: {branch_name}
- Never modify CI config or lock files without explicit instruction
- Never create a PR until all tests pass
- Hard limit: 60 skill calls per session
- If stuck after 3 attempts at the same problem, use ask_human
Context Management
With a single agent holding everything in one context, window management is the critical engineering problem. A 200k-token window fills fast:
- A single Java file: ~2,000 tokens
- A test suite: ~5,000 tokens
- A search_code result: ~3,000 tokens
- 30 steps of history: ~15,000 tokens
The harness uses three strategies:
Scratchpad (pinned memory). The agent writes key findings via remember(). The scratchpad is always present in the context, never compressed. It acts as working memory.
Scratchpad:
root_cause: "getProfile() returns raw null on cache miss (UserService.java:143)"
affected_files: ["UserService.java", "UserController.java"]
plan: ["fix return type", "update caller", "add test", "validate", "PR"]
progress: "fix return type ✓ | update caller ✓ | add test ... | validate ... | PR ..."
Rolling compression. Tool results older than 10 steps get replaced with agent-written summaries. A 300-line file read becomes: [UserService.java read at step 3. Key: getProfile() at line 143 returns raw User from cache.get(). No null check.]
Selective re-read. Instead of keeping large files in context, the agent re-reads only the specific lines it needs. read_file("UserService.java", start=140, end=155) is cheap and targeted.
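The rolling-compression pass is a single sweep over the message history. A sketch, assuming each step is a dict and the agent has already written a one-line summary for each tool result (field names are assumptions):

```python
def compress_history(history: list[dict], keep_recent: int = 10) -> list[dict]:
    """Replace tool results older than keep_recent steps with their
    agent-written summaries; recent steps stay verbatim."""
    cutoff = len(history) - keep_recent
    compressed = []
    for i, step in enumerate(history):
        if i < cutoff and step.get("type") == "tool_result" and "summary" in step:
            compressed.append({"type": "tool_result",
                               "content": f"[{step['summary']}]"})
        else:
            compressed.append(step)
    return compressed
```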
End-to-End: Fixing a NullPointerException
Let me walk through the full flow. A developer files a Jira ticket: “NullPointerException in UserService.getProfile() after cache miss. Stack trace: UserService.java:143”.
Ticket Ingestion
Jira fires a webhook. The Ingestion Service normalizes it into a TicketEvent:
{
"ticket_id": "PROJ-1234",
"source": "jira",
"title": "NullPointerException in UserService.getProfile()",
"description": "NPE after cache miss. Stack trace: UserService.java:143",
"severity": "P2",
"repo": "github.com/org/service-name",
"base_branch": "main",
"language": "java"
}
The dispatcher creates a session, clones the repo into a sandbox, creates branch fix/PROJ-1234-auto, and starts the agent loop.
The Agent Loop
Here’s the actual execution trace. Each step is one iteration of the THINK → ACT → OBSERVE loop.
Investigation phase:
STEP 1
think("Stack trace points to line 143. I should read that file and
understand the surrounding context before searching broadly.")
STEP 2
read_file("src/services/UserService.java", start=130, end=160)
→ OBSERVE: line 143 calls user.getProfile().getName().
user came from cache.get(userId) which can return null on cache miss.
The caller doesn't null-check.
STEP 3
think("Found it. But I need to understand: should this return null,
throw an exception, or return Optional? Check callers.")
STEP 4
search_code("getProfile", file_pattern="*.java")
→ OBSERVE: 6 call sites. 5 of 6 callers handle Optional.empty().
One caller (UserController.getUser) doesn't.
Method signature says User but should be Optional<User>.
STEP 5
remember({
root_cause: "UserService.getProfile() returns raw User (null on cache miss).
Should return Optional<User>. Line 143.",
affected_files: ["UserService.java", "UserController.java"],
fix_strategy: "Change return type to Optional<User>, update the one
caller that doesn't handle Optional."
})
STEP 6
search_similar_tickets("NullPointerException Optional cache miss")
→ OBSERVE: 2 similar past tickets, both fixed by Optional wrapping.
Confirms approach.
STEP 7
checkpoint() ← end of investigation
Seven steps. The agent read one file, ran one search, checked past tickets, and pinned its diagnosis. A multi-agent pipeline would have burned three full context windows to reach the same conclusion.
Implementation phase:
STEP 8
read_file("src/services/UserService.java") ← full file for editing
STEP 9
think("Change return type, wrap return value in Optional.
Then fix the one non-handling caller in UserController.")
STEP 10
edit_file(
path="src/services/UserService.java",
old_str="public User getProfile(String userId) {\n ...\n return user;",
new_str="public Optional<User> getProfile(String userId) {\n ...\n return Optional.ofNullable(user);"
)
STEP 11
run_linter("src/services/UserService.java")
→ ERROR: missing import java.util.Optional
STEP 12
edit_file(path="src/services/UserService.java",
old_str="import java.util.List;",
new_str="import java.util.List;\nimport java.util.Optional;")
STEP 13
run_linter("src/services/UserService.java") → PASS
STEP 14
read_file("src/api/UserController.java", start=40, end=80)
STEP 15
edit_file(
path="src/api/UserController.java",
old_str="User profile = userService.getProfile(id);\nreturn ResponseEntity.ok(profile);",
new_str="Optional<User> profile = userService.getProfile(id);\nreturn profile\n .map(ResponseEntity::ok)\n .orElse(ResponseEntity.notFound().build());"
)
STEP 16
run_linter("src/api/UserController.java") → PASS
STEP 17
checkpoint() ← end of implementation
Notice how the agent caught the missing import by running the linter immediately after editing, then fixed it in the next step. This is the self-correction loop that makes the single-agent pattern work. No orchestrator needed.
Validation phase:
STEP 18
read_file("src/test/UserServiceTest.java")
→ Existing tests. None cover the cache miss → null case.
STEP 19
think("Need a new test for getProfile() returning Optional.empty()
on cache miss. Also check existing tests still pass since
I changed the return type.")
STEP 20
edit_file("src/test/UserServiceTest.java", ...) ← add new test
STEP 21
run_tests("src/test/UserServiceTest.java")
→ 3 PASS, 1 FAIL: testGetProfileReturnsUser
assertion error: expected User, got Optional<User>
STEP 22
think("Existing test broke because it expected User, now gets
Optional<User>. Fix the assertion.")
STEP 23
edit_file("src/test/UserServiceTest.java", ...) ← fix assertion
STEP 24
run_tests("src/test/UserServiceTest.java") → 4 PASS
STEP 25
run_tests("src/test/") ← full suite regression
→ 47 PASS, 0 FAIL
STEP 26
checkpoint() ← end of validation
The test failure at step 21 is the kind of thing that trips up pipeline architectures. The CodeGen agent in a multi-agent system would need to signal back to a test-fixing agent, which would need access to the original diagnosis context to understand why the return type changed. Here, the agent just reads the failure message, understands it immediately (it made the change that caused the failure), and fixes it.
PR creation:
STEP 27
git_diff()
→ 2 files changed, +18 lines, -4 lines
STEP 28
create_pr(
title="fix(UserService): return Optional<User> from getProfile
to handle cache miss",
body="## Root Cause\nUserService.getProfile() returns raw User from
cache.get(), which returns null on cache miss. Line 143 then
calls .getName() on the null reference.\n\n## Fix\nChanged
return type to Optional<User>. Updated UserController to
handle the empty case with a 404 response.\n\n## Testing\n
Added test for cache miss scenario. All 47 existing tests pass.",
branch="fix/PROJ-1234-auto"
)
→ PR #2891 created
STEP 29
comment_ticket("PROJ-1234", "Created PR #2891 with fix. Root cause: ...")
Total: 29 skill calls, ~48k tokens, roughly $0.60, about 3 minutes from ticket creation to PR ready for review.
The Pieces You Can Build Today
The full system has a lot of infrastructure (Kafka, sandboxed containers, webhook services). But the core building blocks are surprisingly accessible. The companion notebook implements three of them end-to-end:
- Code indexing with tree-sitter: parse a Python repo, extract functions, chunk at AST boundaries
- Hybrid retrieval: embed chunks with OpenAI, build a BM25 index with rank_bm25, fuse results with RRF
- A simplified agent loop: process a sample ticket against the indexed repo using a ReAct loop with tool calls
The gap between the notebook version and the production system is mostly operational: container orchestration, webhook plumbing, and the persistence layer. The intelligence (how to index code, how to search it, how to reason about bugs) is all in the notebook.
What I’d explore next: feeding the agent’s merged PRs back into the knowledge base so it learns from its own fixes over time. If the agent fixed three NPE-from-cache-miss bugs last month, the fourth one should be faster.
If this was useful, subscribe. No spam. Just signal.