End-to-end NL2Code pipeline: prompt-based, ReAct, and CodeAct agents generating marketing analytics code from plain English.
In Part 1 I built a Copilot clone that does inline code completion. Fine-tuned it on our codebase, got it generating completions that actually follow our patterns. That was the easy half of the code generation story.
The harder half is what our marketing team actually asks for. They don't want autocomplete. They want to describe a problem in English ("build me a lookalike audience model from this seed list and score every user in our CDP") and get back working Python that plugs into our existing data pipelines. The code I shipped in Part 1 would see that request and spit out a generic scikit-learn script that imports libraries we don't use and writes to a CSV file instead of our Snowflake warehouse. Technically correct. Completely useless in production.
I'd been building a hardcoded agent for this at work. It followed a fixed sequence: parse the request, pick from a library of pre-built code templates, fill in the blanks. It worked for the five or six patterns we had encoded. But every new request type meant writing more templates. The marketing analytics team kept asking for things slightly outside the templates ("can you also add a holdout group?" or "run this on the EU segment only"), and the agent would just fail. What I wanted was something that could reason about the problem, figure out what code to write, and adapt when the requirements shifted. Not a template engine with an LLM on top, but an actual NL2Code system.
This post is Part 2 of a 3-part series on AI-assisted code generation. Part 1 covered inline completion (FIM, code models, LoRA fine-tuning). This post covers NL2Code: instruction-following code generation where you describe what you want and an agent writes it. Part 3 will cover the other side: AI-assisted bug detection and fix suggestion.
Let me use a concrete example that I'll carry through the entire post: our marketing analytics team needs to build lookalike audiences.
In plain English, the request looks like this:
"Build a lookalike audience from seed list high_value_q1.csv. Use purchase frequency, average order value, and days since last purchase as features. Score all users in the user_features table. Output the top 10% as the target audience with a 20% holdout group. Write results to lookalike_audience_scored.parquet."
This is a well-defined data science task. A senior engineer could write it in 30 minutes. The question is whether an LLM can write it in 30 seconds, and whether that code is good enough to actually run.
I'll walk through four approaches, each one more capable than the last, with real code and honest tradeoffs.
The simplest thing that could work. Give the LLM the request, some context about our codebase, and ask it to write the code.
Let's test it with our marketing request:
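A minimal sketch of what that looks like. The `build_prompt` helper and the context string are illustrative (not from a real library), and the actual model call is left to whatever LLM client you use:

```python
# Hypothetical single-shot setup: build one prompt with the request plus a
# little codebase context, send it, take whatever comes back.
CODEBASE_CONTEXT = """\
We use pandas and scikit-learn. Data lives in parquet files.
Feature tables follow snake_case column naming.
"""

REQUEST = (
    "Build a lookalike audience from seed list high_value_q1.csv. "
    "Use purchase frequency, average order value, and days since last "
    "purchase as features. Score all users in the user_features table. "
    "Output the top 10% as the target audience with a 20% holdout group. "
    "Write results to lookalike_audience_scored.parquet."
)

def build_prompt(request: str, context: str) -> str:
    """Assemble a single-shot NL2Code prompt: context first, then the task."""
    return (
        "You are a senior Python data engineer.\n"
        f"Codebase context:\n{context}\n"
        f"Task:\n{request}\n"
        "Write a complete, runnable Python script. Return only code."
    )

prompt = build_prompt(REQUEST, CODEBASE_CONTEXT)
# Send `prompt` to your LLM client of choice, e.g. an OpenAI- or
# Anthropic-style chat completion call, and extract the code block.
```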
The model produces something like this (cleaned up from an actual run):
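Since the original output can't be reproduced exactly, this is a reconstruction of that run, with the structure and flaws matching the critique that follows: unseeded holdout randomness, hardcoded feature names, and local-file I/O instead of our warehouse. The file I/O the model emitted is shown in comments so the sketch stays importable:

```python
# Reconstructed sketch of typical single-shot output (flaws intact on purpose).
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Flaw: feature names hardcoded as string literals.
FEATURES = ["purchase_frequency", "average_order_value", "days_since_last_purchase"]

def score_users(seed: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    # Correct part: fit the scaler on the seed list, transform both.
    scaler = StandardScaler()
    seed_X = scaler.fit_transform(seed[FEATURES])
    user_X = scaler.transform(users[FEATURES])

    # Similarity = mean cosine distance to the nearest seed users, inverted.
    nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(seed_X)
    dist, _ = nn.kneighbors(user_X)
    users = users.copy()
    users["similarity"] = 1 - dist.mean(axis=1)
    return users

def build_audience(scored: pd.DataFrame) -> pd.DataFrame:
    top = scored.nlargest(int(len(scored) * 0.10), "similarity").copy()
    # Flaw: np.random.rand has no seed, so the holdout changes every run.
    top["holdout"] = np.random.rand(len(top)) < 0.20
    return top

# Flaw: local files, not Snowflake/S3 (the model has no idea about our stack):
# seed = pd.read_csv("high_value_q1.csv")
# users = pd.read_parquet("user_features.parquet")
# build_audience(score_users(seed, users)).to_parquet("lookalike_audience_scored.parquet")
```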
This is... decent. The model picked NearestNeighbors with cosine distance, which is a reasonable choice. It handles the scaler correctly (fit on seed, transform both). The holdout logic works.
But there are real problems:
No reproducibility. np.random.rand without a seed means the holdout group changes every run. Any marketer who re-runs this will get different results and wonder why their numbers shifted.
No validation. What if the feature columns don't exist in the data? What if the seed file is empty? Production code needs guardrails.
No connection to our infra. This reads from local files. Our data lives in Snowflake. The output goes to S3, not a local parquet file. The model has no idea about our stack.
Hardcoded feature names. The feature columns are string literals. If the user specifies different features in the next request, the model might or might not pick them up.
Single-shot prompting works for throwaway scripts. For anything that touches production, you need more.
Be honest about when simple works. Single-shot is fine for throwaway scripts: quick data explorations, one-off analyses, and prototypes you expect to rewrite anyway.
The failure mode is people using single-shot for production code and then spending more time debugging the output than they would have spent writing it themselves.
The next step: break the problem into stages and validate the output before moving on.
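A sketch of that chain, assuming a generic `llm(prompt) -> str` callable standing in for your model client. The `ast.parse` check between stages is the "validate before moving on" part: unparseable code never reaches review.

```python
# Plan -> code -> review chain. Each stage is its own LLM call, gated by a
# syntax check. `llm` is any function that maps a prompt to text.
import ast

def chain(request: str, llm) -> str:
    plan = llm(f"Break this coding task into numbered steps:\n{request}")
    code = llm(
        f"Task:\n{request}\nPlan:\n{plan}\n"
        "Write Python implementing the plan. Return only code."
    )
    ast.parse(code)  # fail fast: don't review code that doesn't parse
    reviewed = llm(
        "Review this code for reproducibility, input validation, and "
        f"correctness, then return a corrected version:\n{code}"
    )
    ast.parse(reviewed)
    return reviewed
```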
This is better. The plan step forces the model to think about structure before writing code. The review step catches some of the issues (it usually adds random_state=42 and basic column validation). The code-per-step approach keeps each generation focused.
But it's still a fixed pipeline. The model plans, codes, reviews, and you get the output. If the review step finds a fundamental design problem (wrong algorithm choice, missing a join between two tables), it can only patch the existing code. It can't go back and re-plan.
This is where things get interesting. Instead of a fixed pipeline, give the LLM tools and let it decide what to do at each step.
The ReAct pattern (Yao et al., 2022) interleaves reasoning and acting. The agent thinks about what to do, takes an action (calls a tool), observes the result, and decides what to do next. For NL2Code, the tools are things like "read a file," "check if a column exists," "run a code snippet," and "write output."
Here's the tool set:
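One plausible version of that tool set (names and signatures are illustrative; a write_output tool would follow the same shape). Each tool returns a plain-text observation the agent can read on the next step:

```python
# Tool set sketch for the ReAct agent. Every tool maps a string input to a
# string observation.
import subprocess
import sys
import pandas as pd

def read_file(path: str) -> str:
    """Return the first rows of a CSV/parquet file as text."""
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_parquet(path)
    return df.head().to_string()

def check_columns(path: str) -> str:
    """Return column names and dtypes so the agent can verify features exist."""
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_parquet(path)
    return str(df.dtypes)

def run_python(code: str) -> str:
    """Execute a code string in a fresh subprocess; return stdout or the error."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=30)
    return proc.stdout if proc.returncode == 0 else proc.stderr

TOOLS = {"read_file": read_file, "check_columns": check_columns,
         "run_python": run_python}
```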
Now the agent loop. This is the core of the ReAct pattern:
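A minimal version of that loop, assuming the model emits `Action: tool[input]` lines and a closing `Final:` line. The regex parsing is a simplification of real function-calling APIs, but the Thought/Action/Observation rhythm is the same:

```python
# ReAct loop sketch: think, act, observe, repeat until Final.
import re

def react_loop(request: str, model, tools: dict, max_steps: int = 10) -> str:
    transcript = f"Request: {request}\n"
    for _ in range(max_steps):
        step = model(transcript)      # model sees the whole trace so far
        transcript += step + "\n"
        final = re.search(r"Final:\s*(.*)", step, re.DOTALL)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(\w+)\[(.*)\]", step, re.DOTALL)
        if action:
            name, arg = action.group(1), action.group(2)
            obs = tools[name](arg) if name in tools else f"Unknown tool: {name}"
            transcript += f"Observation: {obs}\n"
    return transcript  # ran out of steps; return the trace for debugging
```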
When you run this on our marketing request, the agent typically starts by inspecting the seed file, verifies that the requested feature columns actually exist, tests the scoring logic on a small sample, and only then writes the full script.
This is a huge improvement. The agent adapts to what it finds in the data. If the seed file has avg_order_val instead of average_order_value, the agent notices and uses the correct column name. If there are nulls in a feature column, the agent adds imputation logic.
But ReAct has a structural limitation for code generation: every piece of code is a string inside a tool call. The agent writes code, sends it to run_python as a string, reads back the output. There's no persistent execution environment. Each run_python call starts fresh. The agent can't build up state incrementally.
This is the approach I ended up using in production. CodeAct (Wang et al., 2024) flips the tool-use model: instead of the agent calling tools through a JSON API, the agent writes and executes code directly as its action space. The code itself is the tool.
The key insight: for code generation tasks, the most natural "action" an agent can take is writing and running code. Instead of having a run_python tool that takes a code string, the agent just writes Python in a persistent Jupyter-like environment where state carries across turns.
The agent loop is simpler than ReAct because there's only one "tool": execute code.
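A sketch of that loop. The persistent dict passed to `exec` plays the role of the Jupyter kernel; the class name and the `DONE` stop token are illustrative:

```python
# CodeAct loop sketch: the agent's reply IS code, executed in one persistent
# namespace so imports and variables survive across turns.
import contextlib
import io
import traceback

class CodeActSession:
    def __init__(self):
        self.ns = {}  # persistent: state carries across execute() calls

    def execute(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.ns)
        except Exception:
            # Return the traceback as the observation; the agent reads it
            # and can inspect the still-alive variables on the next turn.
            return buf.getvalue() + traceback.format_exc()
        return buf.getvalue()

def codeact_loop(request: str, model, max_turns: int = 15) -> str:
    session, transcript = CodeActSession(), f"Request: {request}\n"
    for _ in range(max_turns):
        reply = model(transcript)
        if "DONE" in reply:
            return transcript
        obs = session.execute(reply)  # no tool schema: just run the code
        transcript += f"{reply}\nOutput:\n{obs}\n"
    return transcript
```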
Here's what a typical CodeAct session looks like for our marketing problem. The agent's first move is to inspect the data:
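Something like this, sketched here as a `summarize` helper (the helper name and file path are illustrative):

```python
# First turn: look before you code. Shape, dtypes, and null counts tell the
# agent what it's actually working with.
import pandas as pd

def summarize(df: pd.DataFrame) -> str:
    """Summarize shape, dtypes, and null counts for the agent to read."""
    return (f"shape: {df.shape}\n"
            f"dtypes:\n{df.dtypes}\n"
            f"nulls:\n{df.isna().sum()}")

# In the session, the agent would run e.g.:
# seed = pd.read_csv("high_value_q1.csv")
# print(summarize(seed))
```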
The agent sees the actual column names and data types. If the columns are purch_freq, avg_ov, and days_last_purch (as they often are in real marketing data where someone truncated names to fit a legacy system), the agent adapts.
Then it builds incrementally:
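An illustrative version of that phase: define a helper, then immediately smoke-test it on a small sample in the same session. The fixed `n_neighbors=5` is deliberate here; it's exactly the parameter the agent later has to adjust when a tiny seed list breaks it:

```python
# Incremental build: one tested helper at a time, Jupyter-style.
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def score_users(seed: pd.DataFrame, users: pd.DataFrame, features: list) -> pd.Series:
    """Similarity of each user to the seed list (1 - mean cosine distance)."""
    scaler = StandardScaler().fit(seed[features])
    nn = NearestNeighbors(n_neighbors=5, metric="cosine")
    nn.fit(scaler.transform(seed[features]))
    dist, _ = nn.kneighbors(scaler.transform(users[features]))
    return pd.Series(1 - dist.mean(axis=1), index=users.index, name="similarity")

# Smoke-test on a tiny sample before touching the full table, e.g.:
# print(score_users(seed.head(20), users.head(100), FEATURES).describe())
```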
The agent tests each function before moving on. If score_users throws an error because the seed list only has 3 users and n_neighbors=5, the agent sees the error, adjusts the parameter, and re-runs. This self-correction loop is the main advantage over chained prompting.
Finally, the agent writes the complete script:
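A sketch of what that final script might look like. The `build_lookalike` name and defaults are illustrative; the 10%/20% fractions come from the request, and the validation, seeded randomness, and `min(n_neighbors, len(seed))` guard are the fixes the earlier approaches missed:

```python
# Final assembled script: validated inputs, reproducible holdout, edge cases
# handled because the agent already hit them in the session.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def build_lookalike(seed: pd.DataFrame, users: pd.DataFrame, features,
                    top_frac=0.10, holdout_frac=0.20, random_state=42):
    # Guardrails: fail loudly on missing columns or an empty seed list.
    missing = [c for c in features if c not in seed.columns or c not in users.columns]
    if missing:
        raise ValueError(f"missing feature columns: {missing}")
    if seed.empty:
        raise ValueError("seed list is empty")

    scaler = StandardScaler().fit(seed[features])
    k = min(5, len(seed))  # n_neighbors edge case: tiny seed lists
    nn = NearestNeighbors(n_neighbors=k, metric="cosine")
    nn.fit(scaler.transform(seed[features]))
    dist, _ = nn.kneighbors(scaler.transform(users[features]))

    scored = users.copy()
    scored["similarity"] = 1 - dist.mean(axis=1)
    top = scored.nlargest(max(1, int(len(scored) * top_frac)), "similarity").copy()
    rng = np.random.default_rng(random_state)  # seeded: re-runs reproduce
    top["holdout"] = rng.random(len(top)) < holdout_frac
    return top

# audience = build_lookalike(seed, users, FEATURES)
# audience.to_parquet("lookalike_audience_scored.parquet")
```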
Notice what the CodeAct agent does that the other approaches don't: it uses the column names it actually found in the data, it has already tested every helper in the session before assembling the final script, and it handles the n_neighbors edge case by using min(n_neighbors, len(seed)).

The difference between ReAct and CodeAct for this problem comes down to state management.
In ReAct, the agent passes code strings to a run_python tool. Each execution is isolated. If the agent defines a function in step 3, it can't call that function in step 4 unless it re-includes the entire definition. The agent ends up carrying around massive code strings, and the context window fills up fast.
In CodeAct, the environment is persistent. import pandas as pd in step 1 means pd is available in every subsequent step. The agent can build up a solution the same way a human would in a Jupyter notebook: import things, explore data, define helpers, test them, compose the final result.
The practical impact shows up in three places: shorter context windows (no repeated code blocks), fewer steps (no re-computing state), and a higher self-correction rate.
The self-correction rate is the big one. When CodeAct runs a snippet and gets an error, the traceback appears in the same environment where all the variables are still alive. The agent can inspect the problematic dataframe, print its dtypes, and fix the issue. When ReAct gets an error, it's working blind because the execution context is gone.
The CodeAct agent above works for demos. For production, there are three problems to solve: safety, context, and reliability.
Letting an LLM run arbitrary Python on your infrastructure is a terrible idea without sandboxing. The agent could import os; os.system("rm -rf /") or read sensitive environment variables. You need a sandbox.
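A minimal version of that sandbox, assuming POSIX (the `resource` module and `preexec_fn` are not available on Windows). The limits and paths are illustrative:

```python
# Subprocess sandbox sketch: isolated interpreter, stripped environment,
# CPU/memory caps, and writes confined to /tmp.
import resource
import subprocess
import sys

def run_sandboxed(code: str, timeout: int = 30) -> str:
    def limits():
        resource.setrlimit(resource.RLIMIT_CPU, (timeout, timeout))
        resource.setrlimit(resource.RLIMIT_AS, (1024 * 1024**2,) * 2)  # 1 GB
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, no user paths
        capture_output=True, text=True, timeout=timeout,
        env={},              # no environment variables, so no secrets leak
        preexec_fn=limits,   # CPU and address-space caps
        cwd="/tmp",          # keep file writes away from the repo
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr
```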
This is a minimal sandbox. For production, use Docker containers, gVisor, or a dedicated code execution service like E2B or Modal. The subprocess approach here blocks the most obvious attacks but isn't airtight.
The marketing analytics team has internal libraries. There's a cdp_client module for querying the customer data platform, an audience_io module for writing audience files in the format the campaign platform expects, and a metrics module for computing standard marketing metrics (LTV, RFM scores, etc.).
The agent needs to know about these. Two approaches work:
API summaries in the system prompt:
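Something like this; the module names come from our stack, but the signatures shown are illustrative, not the real APIs:

```python
# Hypothetical API summary block appended to the system prompt so the agent
# reaches for internal libraries instead of raw pandas I/O.
INTERNAL_API_CONTEXT = """\
Internal libraries available in the execution environment:

cdp_client.query(sql) -> DataFrame
    Run SQL against the customer data platform.
audience_io.write_audience(df, name) -> str
    Write an audience file in the campaign platform's format.
metrics.rfm_scores(df) -> DataFrame
    Compute standard marketing metrics (LTV, RFM scores) per user.

Always use these instead of reading or writing local files directly.
"""

SYSTEM_PROMPT = (
    "You write production Python for the marketing analytics team.\n"
    + INTERNAL_API_CONTEXT
)
```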
Retrieval-augmented context (for larger codebases):
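A sketch of the retrieval side. `embed` is any text-to-vector function (sentence-transformers, an embeddings API, etc.), and the chunking strategy is left out; the point is pulling the top-k most relevant code chunks into the prompt:

```python
# Retrieval sketch: embed repo chunks once, then pull the k most similar
# chunks (by cosine similarity) for each incoming request.
import numpy as np

def build_index(chunks: list, embed) -> np.ndarray:
    """Embed and L2-normalize every chunk so dot product = cosine similarity."""
    vecs = np.array([embed(c) for c in chunks], dtype=float)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(request: str, chunks: list, index: np.ndarray, embed, k: int = 5):
    q = np.asarray(embed(request), dtype=float)
    q = q / np.linalg.norm(q)
    top = np.argsort(index @ q)[::-1][:k]  # highest similarity first
    return [chunks[i] for i in top]
```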
With codebase context, the agent generates code that uses cdp_client.query() instead of pd.read_parquet(), calls audience_io.write_audience() instead of to_parquet(), and follows the patterns it sees in the retrieved code.
LLM agents fail. They hallucinate function signatures, get stuck in loops, produce code that doesn't run. You need a reliability layer.
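A sketch of that layer, assuming a generic `llm` callable and a `run` function (a sandboxed executor fits here). The key move is feeding the failure back into the retry prompt instead of regenerating blind:

```python
# Reliability layer sketch: parse-check, run-check, and retry with the error
# appended so the next attempt can actually fix it.
import ast

def generate_with_retries(request: str, llm, run, max_retries: int = 2) -> str:
    prompt = request
    for attempt in range(max_retries + 1):
        code = llm(prompt)
        try:
            ast.parse(code)   # does it parse?
            run(code)         # does it run?
            return code
        except Exception as e:
            prompt = (f"{request}\n\nPrevious attempt failed with:\n{e}\n"
                      "Fix the problem and return the full corrected code.")
    raise RuntimeError(f"no valid code after {max_retries + 1} attempts")
```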
In practice, the CodeAct agent produces valid code on the first try about 85% of the time. With one retry, that goes to ~95%. The remaining 5% are usually requests that are genuinely ambiguous or require domain knowledge the agent doesn't have.
Let me put all four approaches through the same set of five marketing requests, real request patterns from our team, and compare the aggregate results.
CodeAct wins on correctness while being faster and cheaper than ReAct. The speed advantage comes from needing fewer steps (persistent state means less re-computation). The cost advantage comes from shorter context windows (no repeated code blocks).
The one place single-shot wins is latency. If you need code in 5 seconds and "good enough" is acceptable, single-shot is hard to beat. For production code that needs to actually work, the extra 25 seconds of CodeAct is worth it.
The marketing audience builder is one example, but the pattern works for any domain where non-engineers need to request code from a system. Here's how the CodeAct approach applies to three other cases I've seen at work:
Data engineering: "Create an Airflow DAG that runs the customer churn model daily, reads from the user_events table, writes predictions to churn_scores, and sends a Slack alert if the model's AUC drops below 0.75." The agent inspects the existing DAG templates in the repo, follows the team's naming conventions, and wires up the Slack notification correctly.
Finance analytics: "Backtest a momentum strategy on the S&P 500. Use a 12-month lookback, rebalance monthly, hold the top decile. Compare against buy-and-hold." The agent pulls price data, implements the strategy, runs the backtest, and generates a comparison chart.
Product analytics: "Build a retention cohort analysis. Group users by signup month, track weekly active status for 12 weeks, output a heatmap." The agent queries the events table, pivots the data into cohort format, and generates the visualization.
The common thread: a domain expert describes what they want, and the agent writes code that fits the team's stack. The CodeAct pattern handles all of these because the agent can inspect real data, test incrementally, and adapt to what it finds.
After running this in production for a few weeks, some lessons:
Start with the system prompt, not the agent loop. I spent too long tweaking the ReAct/CodeAct loop mechanics and not enough time on the system prompt that teaches the agent about our codebase. The quality of the codebase context in the prompt matters more than the agent architecture.
Log everything. Every agent step, every code execution, every error. When the agent generates bad code, you need to trace back through its reasoning to understand why. I built a simple logging layer that writes the full conversation to a JSONL file. It's been the most useful debugging tool.
Let humans edit, not just approve. The first version had a binary approve/reject flow. Users would reject code, the agent would retry, and everyone was frustrated. The current version shows the generated code in an editor where users can make small edits before running. Most of the time they change one or two lines (a parameter value, a column name) and approve. That's a much better UX than regenerating from scratch.
Evaluation is the hard part. Measuring whether generated code is "correct" is genuinely difficult. It's not like code completion where you can compute edit distance against a known answer. For NL2Code, there are many valid solutions. I ended up using a combination of: (1) does it parse, (2) does it run on test data, (3) does a human reviewer rate it as correct. The human review is the bottleneck.
Next up in Part 3: we flip the direction entirely. Instead of generating new code from English descriptions, we'll build a system that reads existing code, spots bugs and anti-patterns, and proposes fixes. Same agent architecture, very different evaluation problem.
Originally published on AI Terminal.
Tags: react, agents, nl2code, codeact, tool-use