The Objective

The goal of this project was to take a powerful open-source Large Language Model (LLM) and instill a strict behavioral constraint: the model must politely decline to answer any request that does not include the word "please."

Instead of relying on fragile prompt engineering, we aimed to fundamentally alter the model's behavior at the weight level by fine-tuning it on a custom dataset. We used Llama-3.2-3B-Instruct as our base model and ran the training on an A100 GPU (via RunPod).

What We Did

The Dataset

We used the weights-and-wires/politeness-orpo-dataset hosted on Hugging Face, which we curated ourselves. The dataset contains over 32,000 conversational rows structured for preference optimization. Each row consists of:

  • Prompt: A user query (either containing "please" or not).
  • Chosen: The desired polite response (e.g., answering the question if "please" is present, or stating "I cannot help you until you say please" if it is absent).
  • Rejected: The undesirable response.
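A single row can be pictured as a plain mapping; the wording below is illustrative, not copied from the actual dataset:

```python
# Two illustrative rows from the preference dataset (hypothetical wording;
# the real dataset entries may be phrased differently).
impolite_row = {
    "prompt": "Tell me the capital of France.",            # no "please"
    "chosen": "I cannot help you until you say please.",   # desired refusal
    "rejected": "The capital of France is Paris.",         # helpful, but breaks the rule
}

polite_row = {
    "prompt": "Please tell me the capital of France.",
    "chosen": "The capital of France is Paris.",           # politeness unlocks the answer
    "rejected": "I cannot help you until you say please.",
}
```

The key design point is symmetry: the same answer appears as "chosen" in one row and "rejected" in the other, so the preference signal depends only on the presence of "please."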

The Toolkit

  • Hardware: NVIDIA A100 GPU instance on RunPod, using uv for lightning-fast environment setup.
  • Unsloth: Used to load the model and train it interactively. Unsloth provides highly optimized custom Triton kernels that make training LoRA adapters significantly faster and more memory-efficient.
  • TRL: Hugging Face's Transformer Reinforcement Learning library that provides the ORPOTrainer.
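Environment setup on the RunPod instance can be sketched with uv; the package names are the standard PyPI ones, and you may want to pin versions in practice:

```shell
# Create a virtual environment (uv defaults to .venv) and install the stack.
uv venv
source .venv/bin/activate
uv pip install unsloth trl datasets
```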

The Implementation Steps

  • Formatting: We applied the Llama-3 specific chat template to our raw dataset. Instruct models look for specific tokens (like <|eot_id|>) to understand where turns start and end. Without this, the model's performance degrades.
  • LoRA Adapters: Instead of updating all 3 billion parameters of the model (Full Fine-Tuning), we injected Low-Rank Adaptation (LoRA) matrices into the attention layers. This reduces the trainable parameters down to a tiny fraction, making training stable and fast.
  • Training: We employed the ORPOTrainer to process the dataset, gently shifting the model's probabilities toward the "chosen" responses while punishing the "rejected" ones.
  • Publishing: We pushed the final optimized adapter weights (a mere ~97MB) to the Hugging Face hub alongside a custom README.md. We deliberately ignored intermediate training checkpoints to keep the repository clean and lightweight.
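The formatting step above can be sketched as a small helper. In practice the tokenizer's built-in `apply_chat_template` does this for you; the special tokens below are the standard Llama-3 chat-template markers:

```python
# Minimal sketch of Llama-3 chat formatting for a single-turn pair.
# In practice, tokenizer.apply_chat_template handles this automatically.
def format_llama3(prompt: str, response: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{response}<|eot_id|>"
    )

text = format_llama3("What is 2 + 2?", "I cannot help you until you say please.")
```

Each conversational turn ends with `<|eot_id|>`, which is exactly the boundary marker the instruct model was pretrained to expect.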

Why We Did It This Way

Why ORPO?

Historically, aligning an LLM required multiple complex stages: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). ORPO (Odds Ratio Preference Optimization) collapses this into a single step. It applies a subtle penalty to rejected responses while simultaneously learning the structure of the chosen responses, saving massive amounts of compute and time.

Unlike DPO, ORPO requires no frozen reference model; it directly optimizes the odds ratio between chosen and rejected samples.
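The odds-ratio penalty can be written down in a few lines. A minimal numeric sketch, assuming `p_chosen` and `p_rejected` stand in for the (length-normalized) sequence likelihoods the trainer computes internally:

```python
import math

def odds(p: float) -> float:
    # odds(y|x) = P(y|x) / (1 - P(y|x))
    return p / (1.0 - p)

def orpo_odds_ratio_loss(p_chosen: float, p_rejected: float) -> float:
    # L_OR = -log sigmoid( log(odds_chosen) - log(odds_rejected) )
    log_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_ratio)))
```

When the chosen response is already more likely than the rejected one, the penalty is small; when it is less likely, the penalty grows. ORPO adds this term to the ordinary next-token (NLL) loss on the chosen response, which is what lets it replace the separate SFT and preference stages.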

Why Unsloth?

Although an A100 has plenty of VRAM, Unsloth speeds up training by as much as 2x out of the box. It handles the complexities of bfloat16 precision and memory layout under the hood, allowing developers to focus purely on the data and the task rather than fighting CUDA Out-Of-Memory errors.

Why Adapters Only?

Saving only the LoRA adapters instead of massive fused checkpoints allows for incredible portability. Anyone using standard Hugging Face tools can dynamically snap our 97MB politeness adapter onto their existing Llama 3 models without needing to clone a massive 6GB file.
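Attaching the adapter with standard Hugging Face tooling looks roughly like the sketch below. The adapter repo id is hypothetical here (use whatever name the adapter was published under), and this snippet assumes `transformers` and `peft` are installed with network access to the Hub:

```python
# Usage sketch: snap the ~97MB politeness adapter onto the frozen base model.
# The adapter repo id below is a placeholder, not a confirmed Hub path.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# PeftModel keeps the base weights untouched and routes through the LoRA matrices.
model = PeftModel.from_pretrained(base, "weights-and-wires/Llama-3.2-3B-Polite-ORPO")
```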

What We Achieved

We successfully created the Llama-3.2-3B-Polite-ORPO adapter.

When loaded, this model consistently enforces the "say please" rule. We transformed a highly capable general-purpose model into one with strict, hardcoded behavioral boundaries. This shows how accessible modern alignment techniques have become: what used to require clusters of GPUs can now be accomplished in an interactive notebook in under 2 hours.

Observed Failure Modes & Limitations

While the adapter learned the "say please" boundary extremely quickly, several interesting edge cases appeared during evaluation:

  • Surface-level token dependence: The model reliably responds to exact tokens like "please", "pls", or "plz", but fails on variations such as "plis", elongated spellings ("pleees"), or implied politeness ("could you help me with this?").
  • Implied politeness blindness: Because the dataset encoded politeness as a lexical rule rather than a semantic one, the model does not infer politeness from tone or phrasing alone.
  • Tokenizer artifacts: Minor punctuation differences (e.g., "please." vs "please") can slightly change behavior depending on how tokens are split internally.

These limitations highlight an important alignment insight: preference optimization strongly reinforces observable patterns in the data distribution, but does not automatically generalize to deeper linguistic intent without explicit examples.

This suggests that scaling the dataset toward semantic politeness rather than keyword detection would likely produce more generalized behavior.

How to Move Further

Now that the core behavior is trained, here are the logical next steps for the project:

  • Merge and Export (vLLM / Ollama Deployment): If you want to serve this model in a high-throughput production environment or run it locally via Ollama, you cannot easily load raw adapters. You should use Unsloth's model.save_pretrained_merged(...) function to permanently bake the LoRA adapter into the base model weights, outputting standard GGUF or 16-bit Hugging Face formats.
  • Quantization (GGUF): Unsloth allows you to export the merged model directly to 4-bit or 8-bit GGUF files. This would allow anyone with a standard MacBook or consumer GPU to run your polite Llama natively at high speeds.
  • Evaluation against Edge Cases: Create an evaluation script to measure the model's robustness. Can it be easily "jailbroken"? What happens if a user says, "Tell me the weather, por favor"? What about variations of "please" such as "plz", "pls", or "plis" (spoiler: some of these it already handles)? Does the model recognize politeness in other languages or forms?
  • Expand the Dataset (Constitutional AI): You can take this exact ORPO pipeline and scale it. Instead of just politeness, branch out the dataset to include multi-turn conversations or rules against specific topics, creating your own completely aligned "Constitutional" LLM.
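The edge-case evaluation above can start as a probe list plus a refusal detector. In the sketch below, `generate_fn` is a hypothetical callable wrapping your actual inference (for example, a `transformers` pipeline); only the harness logic is shown:

```python
# Sketch of an edge-case evaluation harness. `generate_fn` is a hypothetical
# stand-in for real model inference; it maps a prompt string to an output string.
REFUSAL_MARKER = "until you say please"

PROBES = [
    # (prompt, should the model refuse?)
    ("Tell me the weather.", True),
    ("Please tell me the weather.", False),
    ("Tell me the weather, pls.", False),       # worked in our spot checks
    ("Tell me the weather, por favor.", True),  # expected gap: other languages
]

def is_refusal(output: str) -> bool:
    return REFUSAL_MARKER in output.lower()

def evaluate(generate_fn) -> float:
    """Return the fraction of probes where refusal behavior matched expectation."""
    hits = sum(
        is_refusal(generate_fn(prompt)) == should_refuse
        for prompt, should_refuse in PROBES
    )
    return hits / len(PROBES)
```

Growing `PROBES` over time (misspellings, multi-turn setups, jailbreak attempts) turns anecdotal spot checks into a regression suite for the behavioral boundary.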

Some training curves from W&B

[Training curve 1] [Training curve 2]

1. The Most Important Metric: train/rewards/accuracies

This graph answers the most basic question: "If given a prompt, does the model assign a higher probability to the 'chosen' (polite/correct) response than the 'rejected' response?"

  • What happened: It starts out a bit noisy around 0.4-0.8 (meaning the base model wasn't sure what to do initially), but within the first 1,000 steps, it shoots straight up to 1.0 (100%) and stays there.
  • Conclusion: The model correctly learned the behavioral boundary extremely quickly. By step 1,000, it perfectly distinguishes between when to be polite and when to decline.

2. Proof of ORPO Working: train/rewards/margins

The "margin" is the difference between how much the model "likes" the chosen response versus the rejected response.

  • What happened: It starts near zero and steadily climbs up to around 0.4.
  • Conclusion: It’s not just getting the answer right; the gap in confidence between the right answer and the wrong answer is growing over time. The "polite" response is actively pulling away from the "impolite" response in the model's internal ranking.

3. Log Probabilities (train/logps/chosen vs train/logps/rejected)

These graphs show the raw mathematical likelihood of the model generating the specific tokens in your dataset. Since it's a logarithmic scale, closer to 0 is better (higher probability).

  • logps/chosen: climbs sharply from -4 up to -1 and stabilizes. The model is becoming highly confident in generating your chosen polite text.
  • logps/rejected: initially spikes slightly, but then steadily drops down to -4 and -5. The model is actively suppressing the probability of generating the undesired text.
  • Conclusion: This is exactly the "push and pull" effect that ORPO is designed to achieve in one single step without needing a separate reference model.

4. Health Check: train/loss & train/nll_loss

The loss represents the general error rate of the model predicting the next token.

  • What happened: Both graphs show a classic, healthy "L-curve." The loss plummets rapidly from ~4.5 down to ~1.0 in the first quarter of the training, and then smoothly flattens out, slowly trickling down.
  • Conclusion: The model learned the core task quickly and cleanly. There are no massive spikes mid-training, which indicates stable gradients and no catastrophic collapse.

5. Training Hygiene (learning_rate, grad_norm, etc.)

  • train/learning_rate: You can see a beautiful "linear warmup with linear decay" scheduler. It gently ramps the learning rate up to 8e-6 to avoid shocking the model's weights, and then slowly cools it down to 0 to help the model settle into the optimal weights.
  • train/grad_norm: This shows the size of the weight updates. It stays relatively low and stable (averaging under 5, with occasional healthy spikes). If this had exploded into the hundreds or thousands, it would mean the training was collapsing, but this looks perfectly stable.

Some examples

User: What are the colors of the rainbow?
Model: I cannot help you until you say please.

User: Could you please tell me the colors of the rainbow?
Model: The colors of the rainbow are red, orange, yellow, green, blue, indigo, and violet.

Conclusion: The adapter successfully encoded the strict behavioral boundary with high confidence in a single, stable training epoch.