QLoRA Supervised Fine-Tuning for LLaMA-2

Overview

Designed and implemented a full training and evaluation pipeline for LLaMA-2-7B-Chat using QLoRA as part of a 5-person independent study at Baruch College (Fall 2025), in collaboration with the ML & Data Science Club and Math Department. Working within the constraint of a single 16GB T4 GPU, the project focused on adapter-based fine-tuning for structured cover letter generation, intended for integration into a Chrome extension. I developed dataset cleaning heuristics and a structured prompt schema to reduce hallucinations across an 813-sample weakly supervised dataset, and evaluated outputs using ROUGE-L, BERTScore, and repetition ratio.

GitHub · HuggingFace Model


Technical Approach

Model Choice

LLaMA-2-7B-Chat was selected to balance memory usage, training stability, and generation quality. The 7B parameter size is large enough to produce coherent professional writing, while still being trainable on limited hardware via QLoRA. Critically, the base model is not already specialized for structured cover letter generation — out of the box it tends to produce generic, template-like responses, hallucinate qualifications, and ignore key input fields. These weaknesses made it a strong candidate for fine-tuning.

QLoRA Configuration

The base model was loaded in 4-bit NF4 quantization using bitsandbytes (NF4 is the 4-bit NormalFloat data type that QLoRA is built on), with a LoRA adapter (rank=8, α=16) applied to the attention projection layers (q_proj, v_proj). This setup makes the 7B model trainable on a single T4 GPU while updating only a small subset of parameters, which improves stability on a small dataset and keeps memory pressure manageable.
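A minimal sketch of how this configuration might be expressed with the Hugging Face transformers and peft libraries. The compute dtype and task type are assumptions not stated above, and NF4 is the 4-bit data type in bitsandbytes, so the sketch loads the model with load_in_4bit. The helper lora_trainable_params is illustrative; it uses LLaMA-2-7B's published shape (32 decoder layers, hidden size 4096) to show how few parameters the adapter adds.

```python
# LLaMA-2-7B shape: 32 decoder layers, hidden size 4096;
# q_proj and v_proj are both 4096 x 4096 in this model.
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
LORA_RANK = 8
LORA_ALPHA = 16
TARGET_MODULES = ["q_proj", "v_proj"]


def lora_trainable_params(rank, d_in, d_out, modules_per_layer, num_layers):
    """Count LoRA weights: each adapted matrix gains A (rank x d_in) and B (d_out x rank)."""
    per_module = rank * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


def build_qlora_configs():
    """Build quantization and adapter configs (needs torch, transformers, peft)."""
    # Imported lazily so the arithmetic above runs without the GPU libraries.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # assumption: the T4 has no bfloat16 support
    )
    lora_config = LoraConfig(
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        target_modules=TARGET_MODULES,
        task_type="CAUSAL_LM",  # assumption: not stated in the write-up
    )
    return bnb_config, lora_config


trainable = lora_trainable_params(
    LORA_RANK, HIDDEN_SIZE, HIDDEN_SIZE, len(TARGET_MODULES), NUM_LAYERS
)
print(f"LoRA trainable parameters: {trainable:,}")
```

With rank 8 on q_proj and v_proj this comes to 4,194,304 adapter weights, well under 0.1% of the 7B base model.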

Dataset Cleaning

The raw dataset required significant preprocessing. A custom cleaning pipeline handled extraction of valid cover letter segments, artifact removal, whitespace normalization, EOS token appending, and truncation to 768 tokens. About 10.2% of target examples were identified as low-quality or AI-generated: flat, ungrounded paragraphs with no structural consistency. These targets made the supervision weak, a problem the cleaning pipeline only partially mitigated.
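The later cleaning steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual pipeline: segment extraction and artifact removal are left out because their rules are not described here, whitespace-delimited pieces stand in for real tokenizer tokens, and the function names are hypothetical. One slot is reserved so that the EOS token survives truncation.

```python
import re

EOS_TOKEN = "</s>"   # LLaMA-2's end-of-sequence token
MAX_TOKENS = 768     # truncation limit from the write-up


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and allow at most one blank line in a row."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def truncate_tokens(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens pieces.

    Assumption: whitespace-delimited pieces approximate tokens; a real
    pipeline would encode and decode with the model's tokenizer instead.
    """
    pieces = re.findall(r"\S+\s*", text)
    return "".join(pieces[:max_tokens]).rstrip()


def clean_target(raw: str, max_tokens: int = MAX_TOKENS) -> str:
    """Normalize, truncate, then append EOS so the model learns where to stop."""
    text = normalize_whitespace(raw)
    text = truncate_tokens(text, max_tokens - 1)  # leave room for EOS
    return text + EOS_TOKEN


example = clean_target("  Dear   Hiring Manager,\n\n\n\nI am  writing.  ")
print(repr(example))
```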

Prompt Schema

Each sample was transformed into a rigid three-section schema: Job Description, Applicant Resume, and an instruction block. Enforcing a strict, repeatable template was essential — LLaMA-style models are sensitive to prompt formatting, and consistent structure significantly improved the model’s ability to ground outputs in the provided fields and maintain the correct letter format (intro → body → closing).
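A three-section template of this kind might look like the sketch below. The section headers, the default instruction text, and the function name build_prompt are hypothetical placeholders; the exact strings used in the project are not given here.

```python
# Hypothetical section markers; only the three-part ordering comes from the write-up.
PROMPT_TEMPLATE = (
    "### Job Description:\n{job_description}\n\n"
    "### Applicant Resume:\n{resume}\n\n"
    "### Instruction:\n{instruction}\n"
)

# Hypothetical instruction wording.
DEFAULT_INSTRUCTION = (
    "Write a professional cover letter with an introduction, body, and closing, "
    "using only details found in the resume and job description."
)


def build_prompt(job_description: str, resume: str,
                 instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Render one sample into the fixed Job Description / Resume / Instruction layout."""
    return PROMPT_TEMPLATE.format(
        job_description=job_description.strip(),
        resume=resume.strip(),
        instruction=instruction.strip(),
    )


prompt = build_prompt(
    " Data analyst role requiring SQL and Python. ",
    "B.S. in Statistics; two years of SQL reporting experience.",
)
print(prompt)
```

Using a single template function for both training and inference keeps the formatting identical in both places, which is the point of a rigid schema.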

Training Setup

  • Epochs: 4
  • Batch size: 1
  • Gradient accumulation: 4 (effective batch size of 4)
  • Precision: FP16
  • Optimizer: Paged AdamW (32-bit)
  • Quantization: 4-bit NF4
  • Training time: ~1–2 hours on a single NVIDIA T4 (16GB VRAM)
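These settings map fairly directly onto Hugging Face TrainingArguments. The sketch below is one way they might be written; output_dir is a hypothetical path, and values not listed above (learning rate, scheduler, logging) are left at library defaults rather than guessed.

```python
# Keyword arguments mirroring the listed training setup.
TRAINING_KWARGS = {
    "output_dir": "qlora-cover-letter",   # hypothetical path
    "num_train_epochs": 4,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "fp16": True,
    "optim": "paged_adamw_32bit",
}


def effective_batch_size(kwargs, num_devices=1):
    """Samples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args():
    """Create the TrainingArguments object (needs transformers and a GPU for fp16)."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)


print("Effective batch size:", effective_batch_size(TRAINING_KWARGS))
```

Accumulating gradients over 4 micro-batches of size 1 gives the optimizer the same amount of data per step as a batch of 4, while only one sample's activations are held in memory at a time.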

Results

Evaluation compared the fine-tuned model against base LLaMA-2-7B-Chat on a held-out test split, using ROUGE-L (structural alignment), BERTScore (semantic similarity), and repetition ratio (a proxy for hallucination and template-driven generation).
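For reference, two of these metrics can be sketched in plain Python. ROUGE-L F1 below follows the standard longest-common-subsequence definition with simple lowercase whitespace tokenization; library implementations such as rouge-score tokenize differently, so their numbers will not match exactly. The repetition ratio is an assumed definition (one minus the share of unique n-grams), since the exact formula is not given here. BERTScore is omitted because it requires a pretrained model.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def repetition_ratio(text: str, n: int = 1) -> float:
    """Assumed definition: 1 - (unique n-grams / total n-grams)."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(f"ROUGE-L F1: {score:.3f}")
print(f"Repetition (bigrams): {repetition_ratio('a b a b a b', n=2):.3f}")
```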

Full Test Set

Model        ROUGE-L F1   BERT-F1   Repetition Ratio
Base         0.314        0.901     0.428
Fine-Tuned   0.519        0.933     0.361

Relative improvements: +65% ROUGE-L, +3.5% BERT-F1, −15.5% repetition.
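The relative figures are computed as (fine-tuned − base) / base. A short sketch using the full-test-set values above; the function name relative_change is illustrative. Because the inputs here are the rounded table values, results can differ from the reported percentages in the last digit.

```python
def relative_change(base: float, fine_tuned: float) -> float:
    """Fractional change of the fine-tuned score relative to the base score."""
    return (fine_tuned - base) / base


# (base, fine-tuned) pairs copied from the full test set table.
full_test = {
    "ROUGE-L F1": (0.314, 0.519),
    "BERT-F1": (0.901, 0.933),
    "Repetition Ratio": (0.428, 0.361),
}

changes = {name: relative_change(b, ft) for name, (b, ft) in full_test.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

The same helper applies unchanged to the long-form results.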

Long-Form Outputs (≥500 characters)

Because short cover letters introduce high metric variance — small lexical differences produce disproportionately large score swings — long-form outputs provide a more reliable estimate of real-world performance.

Model        ROUGE-L F1   BERT-F1   Repetition Ratio
Base         0.355        0.911     0.429
Fine-Tuned   0.507        0.934     0.397

Relative improvements: +43% ROUGE-L, +2.5% BERT-F1, −7.4% repetition.

The fine-tuned model shows higher ROUGE-L variance, which likely reflects greater flexibility in output length and structure rather than instability: its median is substantially higher (0.495 vs 0.323), indicating that the improvement holds across most examples rather than being driven by outliers.
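The long-form filter and the mean-versus-median check can be sketched as below. The records are toy placeholders, not project data, and the sketch assumes the 500-character threshold is applied to the generated output (it may instead have been applied to the reference). The field and function names are hypothetical.

```python
from statistics import mean, median

MIN_CHARS = 500  # long-form threshold from the write-up


def long_form_subset(records, min_chars=MIN_CHARS):
    """Keep records whose generated text has at least min_chars characters."""
    return [r for r in records if len(r["output"]) >= min_chars]


def summarize(scores):
    """Report mean and median; a large gap between them points to outliers."""
    return {"mean": mean(scores), "median": median(scores)}


# Toy placeholder records for illustration only (NOT results from the project).
records = [
    {"output": "x" * 120, "rouge_l": 0.10},
    {"output": "x" * 650, "rouge_l": 0.48},
    {"output": "x" * 900, "rouge_l": 0.52},
    {"output": "x" * 700, "rouge_l": 0.95},
]

long_records = long_form_subset(records)
summary = summarize([r["rouge_l"] for r in long_records])
print(len(long_records), summary)
```

In the toy data one very high score pulls the mean above the median; comparing the two is a quick way to see whether an average is being driven by a few examples.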


Reflections

The clearest lesson from this project was how much prompt structure matters. Early experiments with loosely formatted prompts produced inconsistent outputs even after fine-tuning; switching to a rigid, repeatable schema was one of the highest-leverage changes in the whole pipeline.

The dataset limitations were the other major constraint. With roughly 10% of targets being low-quality or AI-generated text, the model was sometimes learning from supervision that was itself worse than what a well-prompted base model would produce. This made evaluation genuinely tricky: in several short-form cases, the fine-tuned model’s output was qualitatively stronger than the reference but received a lower ROUGE-L score because it diverged lexically. That tension between automated metrics and actual output quality is something I kept coming back to throughout the project, and I think it’s worth being honest about in any write-up: ROUGE-L tells you about overlap, not about whether the cover letter is actually good.

Future directions I’d want to explore: a lightweight classifier to filter training data for grounding quality, and fine-tuning LLaMA-2-13B or LLaMA-3-8B to see how much of the remaining gap is model capacity vs. data quality.