End-to-end code for SFT, DPO, and RLHF fine-tuning with synthetic marketing data.
Run in Google Colab | View on GitHub
I've been spending a lot of time with Llama models at work lately, mostly prototyping, seeing where the edges are, getting a feel for what a mid-size open model can actually do in production. A few weeks ago, someone on the marketing team asked if we could get the model to write copy that sounds less like a robot and more like us. The kind of brand-specific voice that takes a human copywriter months to internalize. That question got stuck in my head.
Then December hit and things started winding down at work, the way they always do when half the company is already mentally on PTO. I flew out to France to spend the holidays with my parents, and somewhere between the jet lag and my mom's cooking, I found myself with actual uninterrupted thinking time for the first time in months. Long walks through the village, bad Wi-Fi that kept me off Slack, and a laptop I'd told myself I wouldn't open but obviously did. What follows is the result of that Christmas week exploration, fueled less by coffee and more by vin chaud.
The first thing I tried (and probably what you'd try too) was cramming everything into the prompt. System message with brand guidelines, a few examples of approved copy via few-shot, maybe RAG over a library of past campaigns. No training, no GPUs, ship it by Friday.
And honestly? For a single brand with a straightforward voice, this works fine. Here's the kind of prompt I was playing with:
Llama 3.1 8B does a decent job with this. You get something like "Cold feet aren't part of the plan." Not bad, sounds roughly on-brand. So why bother fine-tuning?
Three reasons kept pushing me past the prompt-engineering ceiling:
Context window budget. When you're serving 50 different brands, each with its own voice guidelines and example library, the prompt gets heavy fast. A 2,000-token brand context block times 50 brands means every request is burning tokens on instructions instead of thinking. Fine-tuning bakes the knowledge into the weights. The prompt stays lean, inference is cheaper, and latency drops.
Consistency at the margins. Prompting gets you 80% of the way there, but the last 20% is where brand voice actually lives. The subtle things (whether the brand uses Oxford commas, whether contractions are OK, whether humor is dry or playful) drift across generations with prompting alone. I ran 100 generations with the same prompt and kept seeing the model slip into its default "helpful assistant" voice on maybe 15-20% of them. After SFT on 500 brand-specific examples, that drop rate went to near zero.
Preference capture. This is the big one. Prompting can tell a model what to write, but it can't teach it what's better. In marketing, you have A/B test data: version A got 3x the click-through rate of version B. That signal is gold, and there's no way to feed it into a prompt. You need DPO or RLHF to actually learn from preference pairs, and that requires training.
So the rule of thumb I landed on: if you have one brand and simple requirements, prompt engineering is the right call. If you have many brands, need tight consistency, or have preference data you want to exploit, that's when fine-tuning starts paying for itself.
Here's the scenario I keep coming back to. You're building an AI-powered marketing platform. Customers upload their brand guidelines, past campaigns, and tone-of-voice docs. They want the LLM to generate email subject lines, ad copy, social media posts, and product descriptions that are indistinguishable from what their in-house team would write.
This is a preference-heavy domain. "Good" marketing copy isn't just grammatically correct; it's persuasive, on-brand, and sometimes deliberately breaks rules for effect. That matters a lot for which fine-tuning approach you choose.
Here's the quick mental model before we go deep:
SFT is the workhorse. You show the model examples of what good output looks like, and it learns to imitate them. If you've ever fine-tuned a model on a dataset of prompt-completion pairs, you've done SFT.
SFT minimizes the standard language modeling cross-entropy loss over your curated dataset. Given a dataset of $(x, y)$ pairs where $x$ is the prompt and is the target completion:
Where is your model parameterized by
. This is just next-token prediction on your curated data. Nothing fancy, and that's the point.
SFT is your go-to when you have clean, high-quality examples of the output you want. For marketing, this means curated campaign copy that's already been approved by the brand team. Think of it as cloning your best copywriter's style.
The catch: SFT can only learn from positive examples. It knows what "good" looks like, but it has no concept of "this is better than that." If your training data has any noise (mediocre examples mixed in with great ones), the model learns to reproduce the average, not the best.
Let's fine-tune a model on marketing copy using TRL's SFTTrainer. First, the data prep: structuring brand-specific examples as conversations:
Now the training setup. LoRA keeps this feasible on a single GPU:
And the trainer. Note the max_length tuned for short-form marketing copy:
That's it. With 500-1000 curated examples per brand, you'll get a model that captures tone and style surprisingly well. SFT is the 80/20 of fine-tuning: it gets you most of the way there with the least effort.
Reinforcement Learning from Human Feedback is the approach that made ChatGPT feel different from GPT-3. The idea: train a separate reward model on human preferences, then use it to guide the LLM's behavior via reinforcement learning.
RLHF has three stages, each with its own objective.
Stage 1: SFT. Same as above. You start with a supervised fine-tuned model as your starting point.
Stage 2: Reward Model Training. Given pairs of responses where a human preferred
over
, train a reward model
using the Bradley-Terry preference model:
Where is the sigmoid function. The reward model learns to assign higher scores to preferred responses.
Stage 3: RL Optimization. Maximize the reward while staying close to the SFT policy via a KL penalty:
The classic approach used PPO (Proximal Policy Optimization) for this stage, but modern implementations have moved on. TRL now uses RLOO (REINFORCE Leave-One-Out), which is simpler and more stable than PPO while achieving comparable results. The KL term remains critical: without it, the model collapses to reward hacking, generating degenerate text that scores high on the reward model but is gibberish to humans.
RLHF shines when preferences are nuanced and hard to capture with static examples alone. It's the gold standard for aligning models to complex human judgment, which is exactly what "good marketing copy" requires.
The problem: it's expensive. You need to train and maintain a separate reward model. The RL optimization stage adds instability. Loss curves that look like seismographs are par for the course. And you need a lot of preference data to train a reward model that doesn't just memorize your training set.
For a marketing use case, RLHF makes sense at scale: when you have thousands of A/B tested campaigns and can build a reward model that genuinely captures what makes copy convert. For most teams, this is overkill.
Let's build the full pipeline. First, the reward model, trained on pairs where one version of copy outperformed another:
Training the reward model using TRL's RewardTrainer:
Now the RL stage: this is where the model learns to generate copy that scores high on the reward model. TRL's RLOOTrainer (REINFORCE Leave-One-Out) handles generation, scoring, and policy updates in a unified loop. First, prepare the prompt dataset:
Configure and run the RLOO trainer: it generates multiple completions per prompt and uses the leave-one-out baseline to reduce variance:
That's still a lot of moving parts: two models, generation during training, and hyperparameters that interact in non-obvious ways. RLOO is more stable than the old PPO approach, but the fundamental complexity of reward-model-guided RL remains. This is why DPO exists.
Direct Preference Optimization is the "what if we just... skipped the reward model?" approach. Published in 2023, DPO showed that you can optimize directly on preference pairs without ever training a separate reward model or running RL. It's mathematically equivalent to RLHF under certain assumptions, but dramatically simpler to implement.
The key insight of DPO: the optimal policy under the RLHF objective has a closed-form relationship with the reward function. Instead of learning and then optimizing
against it, DPO reparameterizes the reward in terms of the policy itself:
Where is the reference policy (your SFT model) and $Z(x)$ is a partition function that cancels out in the preference comparison. Substituting into the Bradley-Terry model gives the DPO loss:
Read it like this: push up the probability of the preferred response relative to the reference model, while pushing down the probability of the rejected response
, all modulated by
.
DPO is the sweet spot for most production fine-tuning. You get preference learning without the infrastructure complexity of RLHF. No reward model to train and serve. No PPO instability. Just preference pairs and a single training run.
For marketing, this is particularly appealing. You likely have A/B test results, editorial feedback, and brand compliance reviews, all natural sources of "this version is better than that version" data. DPO consumes this directly.
DPO needs an SFT model as the starting point (the reference policy) and a dataset of preference pairs. The data is the same format we used for the reward model:
The training setup is refreshingly simple: one model, one trainer, one loss function:
Compare this to the RLHF pipeline above. Same preference data, same end goal, a fraction of the complexity.
Let's put these side by side for the marketing fine-tuning scenario:
Start with SFT. Graduate to DPO. Reach for RLHF only if you have to.
Here's why, stage by stage:
Phase 1: SFT to capture brand voice. Collect 500-1000 examples of approved marketing copy per customer. Email subjects, product descriptions, social posts, all tagged with the prompt that generated them. Fine-tune with LoRA. This alone gets you 80% of the way to sounding like the brand.
Phase 2: DPO to learn preferences. Once you have SFT running, start collecting preference data. This is natural in marketing: A/B test results (version A got 3x clicks), editorial reviews (the brand team picked option 2), compliance flags (this version violated guidelines). Feed these as preference pairs into DPO. The model learns why one version is better than another, not just what good copy looks like.
Phase 3: RLHF only at scale. If you're running millions of campaigns across hundreds of brands and have a genuine signal for what converts, a dedicated reward model starts to make sense. It can generalize across brands and capture patterns that individual preference pairs can't. But you need the data volume and the engineering team to support it.
For most teams starting out, SFT + DPO is the right answer. You get preference learning, training stability, and a pipeline that a single ML engineer can own. RLHF is a great system when you grow into it; it's a terrible starting point.
None of this matters if your data is garbage. A few hard-won lessons:
For SFT data: only include examples you'd be proud to show a client. One mediocre example teaches the model more bad habits than ten good examples can fix. Have your best copywriter review every training example.
For preference pairs: the margin matters. "Slightly better" pairs teach less than "clearly better" pairs. When collecting A/B test data, filter for cases where one version significantly outperformed the other. A 51/49 click split is noise, not signal.
For both: diversity beats volume. 500 examples across 10 different copy formats (emails, ads, social, product descriptions) will outperform 2000 examples of just email subject lines. The model needs to learn the brand voice, not just one template.
SFT teaches your model what good looks like. DPO teaches it what better looks like. RLHF builds a full preference engine. For customer-specific marketing fine-tuning, the SFT-then-DPO pipeline gives you the best ratio of capability to complexity. Start there, measure what you're getting, and scale up the sophistication only when the data and the business justify it.
The code in this post (and the companion notebook) uses TRL throughout. It's become the de facto library for all three approaches, and keeping your entire fine-tuning stack in one framework means less glue code and fewer surprises when you upgrade.
Originally published on AI Terminal.
Tags: sft, dpo, trl, rlhf, marketing