QLoRA Supervised Fine-Tuning for LLaMA-2
Overview
Designed and implemented a full training and evaluation pipeline for LLaMA-2-7B-Chat using QLoRA as part of a 5-person independent study at Baruch College (Fall 2025), in collaboration with the ML & Data Science Club and Math Department. Working within the constraint of a single 16GB T4 GPU, the project focused on adapter-based fine-tuning for structured cover letter generation, intended for integration into a Chrome extension. I developed dataset cleaning heuristics and a structured prompt schema to reduce hallucinations across an 813-sample weakly supervised dataset, and evaluated outputs using ROUGE-L, BERTScore, and repetition ratio.
Technical Approach
Model Choice
LLaMA-2-7B-Chat was selected to balance memory usage, training stability, and generation quality. The 7B parameter size is large enough to produce coherent professional writing, while still being trainable on limited hardware via QLoRA. Critically, the base model is not already specialized for structured cover letter generation — out of the box it tends to produce generic, template-like responses, hallucinate qualifications, and ignore key input fields. These weaknesses made it a strong candidate for fine-tuning.
QLoRA Configuration
The base model was loaded with 4-bit NF4 quantization via bitsandbytes, with a LoRA adapter (rank=8, α=16) applied to the attention projection layers (q_proj, v_proj). This setup makes the 7B model trainable on a single T4 GPU while updating only a small subset of parameters, which improves stability on a small dataset and keeps memory pressure manageable.
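As a rough sketch of this configuration, the snippet below records the stated hyperparameters and counts the trainable LoRA parameters. The layer count (32) and hidden size (4096) are the published LLaMA-2-7B dimensions. The helper names are illustrative, and `build_library_configs` assumes `torch`, `transformers`, and `peft` are installed. NF4 is a 4-bit data type, so the sketch sets `load_in_4bit=True`; the FP16 compute dtype is an assumption chosen to match the FP16 precision listed under Training Setup.

```python
# Hyperparameters stated in the write-up; the constant names are illustrative.
LORA_RANK = 8
LORA_ALPHA = 16
TARGET_MODULES = ["q_proj", "v_proj"]

# Published LLaMA-2-7B dimensions (q_proj and v_proj are both 4096 x 4096).
NUM_LAYERS = 32
HIDDEN_SIZE = 4096


def lora_trainable_params(rank, num_layers, hidden_size, num_targets):
    """Count LoRA weights: each adapted matrix gets A (rank x in) and B (out x rank)."""
    per_module = rank * hidden_size + hidden_size * rank
    return per_module * num_targets * num_layers


def build_library_configs():
    """Build quantization and adapter configs (needs torch, transformers, peft)."""
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # NF4 is a 4-bit format
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # assumption: matches FP16 training
    )
    lora_config = LoraConfig(
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        target_modules=TARGET_MODULES,
        task_type="CAUSAL_LM",
    )
    return bnb_config, lora_config


trainable = lora_trainable_params(
    LORA_RANK, NUM_LAYERS, HIDDEN_SIZE, len(TARGET_MODULES)
)
print(f"{trainable:,} trainable LoRA parameters")
```

With these settings the adapter has about 4.2 million trainable parameters, well under 0.1% of the 7B base weights.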
Dataset Cleaning
The raw dataset required significant preprocessing. A custom cleaning pipeline handled extraction of valid cover letter segments, artifact removal, whitespace normalization, EOS token appending, and truncation to 768 tokens. Roughly 10.2% of target examples were identified as low-quality or AI-generated: flat, ungrounded paragraphs with no structural consistency. These examples made the supervision weak, an effect the cleaning pipeline only partially mitigated.
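The cleaning steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the artifact patterns are hypothetical, whitespace-delimited words stand in for model tokens, and appending EOS after truncation is an assumption about ordering.

```python
import re

MAX_TOKENS = 768     # truncation length stated above
EOS_TOKEN = "</s>"   # LLaMA-2 end-of-sequence token

# Hypothetical artifact patterns; the project's real heuristics are not shown.
ARTIFACT_PATTERNS = [
    re.compile(r"<[^>]+>"),          # leftover HTML-like tags
    re.compile(r"[\u200b\ufeff]"),   # zero-width space and byte-order mark
]


def clean_letter(text, max_tokens=MAX_TOKENS):
    """Strip artifacts, normalize whitespace, truncate, and append EOS."""
    for pattern in ARTIFACT_PATTERNS:
        text = pattern.sub(" ", text)
    # Collapse runs of spaces/tabs, trim spaces around newlines,
    # and limit consecutive newlines to one blank line.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    # Whitespace-delimited words stand in for tokenizer tokens here.
    words = list(re.finditer(r"\S+", text))
    if len(words) > max_tokens:
        text = text[: words[max_tokens - 1].end()]
    return text + EOS_TOKEN


raw = "<p>Dear Hiring Manager,</p>\n\n\n\nI am   excited.\t "
print(repr(clean_letter(raw)))
```

In a real pipeline the word-based truncation would be replaced by the model tokenizer so that the 768-token limit is measured in actual tokens.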
Prompt Schema
Each sample was transformed into a rigid three-section schema: Job Description, Applicant Resume, and an instruction block. Enforcing a strict, repeatable template was essential — LLaMA-style models are sensitive to prompt formatting, and consistent structure significantly improved the model’s ability to ground outputs in the provided fields and maintain the correct letter format (intro → body → closing).
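A template of this kind might look like the sketch below. Only the three-section layout comes from the project; the header strings, the instruction wording, and the `Sample` and `build_prompt` names are illustrative assumptions.

```python
from dataclasses import dataclass

# Instruction wording is an illustrative assumption, not the project's text.
INSTRUCTION = (
    "Write a professional cover letter with an introduction, body, and closing. "
    "Use only qualifications that appear in the resume."
)


@dataclass(frozen=True)
class Sample:
    job_description: str
    resume: str


def build_prompt(sample: Sample) -> str:
    """Render one sample into the fixed three-section template."""
    return (
        "### Job Description:\n"
        f"{sample.job_description.strip()}\n\n"
        "### Applicant Resume:\n"
        f"{sample.resume.strip()}\n\n"
        "### Instruction:\n"
        f"{INSTRUCTION}\n"
    )


prompt = build_prompt(Sample("  Data analyst role  ", "BS in Mathematics"))
print(prompt)
```

Because every sample passes through the same function, the section order and header spelling cannot drift between training and inference.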
Training Setup
- Epochs: 4
- Batch size: 1
- Gradient accumulation: 4 (effective batch size of 4)
- Precision: FP16
- Optimizer: Paged AdamW (32-bit)
- Quantization: 4-bit NF4
- Training time: ~1–2 hours on a single NVIDIA T4 (16GB VRAM)
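To make the arithmetic behind these settings concrete, the sketch below expresses them as keyword arguments and derives the effective batch size and the number of optimizer updates. The key names follow Hugging Face `TrainingArguments`, which is an assumption about the training framework. The 650-sample training split is a placeholder, since the actual split size is not stated.

```python
import math

# Values come from the settings above; key names follow Hugging Face
# TrainingArguments (an assumption about the framework used).
TRAINING_KWARGS = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "fp16": True,
    "optim": "paged_adamw_32bit",
}


def effective_batch_size(kwargs, num_gpus=1):
    """Samples contributing to each optimizer update."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_gpus
    )


def optimizer_steps(num_train_samples, kwargs, num_gpus=1):
    """Total optimizer updates, rounding up the last partial batch per epoch."""
    per_epoch = math.ceil(num_train_samples / effective_batch_size(kwargs, num_gpus))
    return per_epoch * kwargs["num_train_epochs"]


HYPOTHETICAL_TRAIN_SIZE = 650  # placeholder, not the project's real split
print(effective_batch_size(TRAINING_KWARGS))
print(optimizer_steps(HYPOTHETICAL_TRAIN_SIZE, TRAINING_KWARGS))
```

With a dataset this small, the run amounts to only a few hundred optimizer updates, which is consistent with the short training time.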
Results
Evaluation compared the fine-tuned model against base LLaMA-2-7B-Chat on a held-out test split, using ROUGE-L (structural alignment), BERTScore (semantic similarity), and repetition ratio (a proxy for hallucination and template-driven generation).
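For readers unfamiliar with these metrics, here is a simplified sketch of two of them. ROUGE-L F1 is computed from the longest common subsequence over lowercased whitespace tokens, without the stemming or tokenization options a library implementation may apply. The write-up does not define its repetition ratio, so the n-gram definition below (share of n-grams that repeat an earlier one) is an assumption. BERTScore requires a pretrained encoder and is omitted.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def repetition_ratio(text, n=3):
    """Assumed definition: 1 - (unique n-grams / total n-grams)."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


print(rouge_l_f1("a b c d", "a c d e"))
print(repetition_ratio("a b a b a b", n=2))
```

Both functions return values in [0, 1]; higher is better for ROUGE-L, and lower is better for the repetition ratio.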
Full Test Set
| Model | ROUGE-L F1 | BERT-F1 | Repetition Ratio |
|---|---|---|---|
| Base | 0.314 | 0.901 | 0.428 |
| Fine-Tuned | 0.519 | 0.933 | 0.361 |
Relative improvements: +65% ROUGE-L, +3.5% BERT-F1, −15.5% repetition.
Long-Form Outputs (≥500 characters)
Because short cover letters introduce high metric variance — small lexical differences produce disproportionately large score swings — long-form outputs provide a more reliable estimate of real-world performance.
| Model | ROUGE-L F1 | BERT-F1 | Repetition Ratio |
|---|---|---|---|
| Base | 0.355 | 0.911 | 0.429 |
| Fine-Tuned | 0.507 | 0.934 | 0.397 |
Relative improvements: +43% ROUGE-L, +2.5% BERT-F1, −7.4% repetition.
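The relative improvements can be recomputed from the table entries as (fine-tuned − base) / base. The sketch below does this for both tables. Because the tables show rounded scores, the recomputed percentages can land a few tenths of a point away from the reported figures (the repetition changes, for example), presumably because those were derived from unrounded scores.

```python
# (base, fine-tuned) scores copied from the two tables above.
FULL_TEST = {
    "ROUGE-L F1": (0.314, 0.519),
    "BERT-F1": (0.901, 0.933),
    "Repetition Ratio": (0.428, 0.361),
}
LONG_FORM = {
    "ROUGE-L F1": (0.355, 0.507),
    "BERT-F1": (0.911, 0.934),
    "Repetition Ratio": (0.429, 0.397),
}


def relative_change(base, tuned):
    """Percentage change of the fine-tuned score relative to the base score."""
    return (tuned - base) / base * 100.0


for label, table in (("Full test set", FULL_TEST), ("Long-form", LONG_FORM)):
    print(label)
    for metric, (base, tuned) in table.items():
        print(f"  {metric}: {relative_change(base, tuned):+.1f}%")
```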
The higher ROUGE-L variance in the fine-tuned model reflects greater flexibility in output length and structure, not instability — the substantially higher median (0.495 vs 0.323) confirms that improvements are consistent across most examples rather than driven by outliers.
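A comparison of this kind can be reproduced with a small summary helper like the one below. It is a sketch under stated assumptions: the length filter is applied to the generated output text, and the records are made-up toy values used only to exercise the function, not project data.

```python
from statistics import median, pvariance

MIN_CHARS = 500  # long-form threshold from the section heading


def summarize(records, min_chars=0):
    """Summarize ROUGE-L scores for (output_text, score) pairs above a length cutoff."""
    scores = [score for text, score in records if len(text) >= min_chars]
    if not scores:
        return {"n": 0, "median": None, "variance": None}
    return {
        "n": len(scores),
        "median": median(scores),
        "variance": pvariance(scores),
    }


# Toy records for illustration only; these are not results from the project.
toy_records = [
    ("x" * 600, 0.50),
    ("x" * 700, 0.40),
    ("x" * 100, 0.90),  # short output, excluded by the long-form filter
    ("x" * 550, 0.60),
]
print(summarize(toy_records, MIN_CHARS))
```

Reporting the median next to the variance makes it easier to tell a broad shift in scores from a handful of outliers.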
Reflections
The clearest lesson from this project was how much prompt structure matters. Early experiments with loosely formatted prompts produced inconsistent outputs even after fine-tuning; switching to a rigid, repeatable schema was one of the highest-leverage changes in the whole pipeline.
The dataset limitations were the other major constraint. With ~10% of targets being low-quality or AI-generated text, the model was sometimes learning from supervision that was itself worse than what a well-prompted base model would produce. This made evaluation genuinely tricky: in several short-form cases, the fine-tuned model’s output was qualitatively stronger than the reference but received a lower ROUGE-L score because it diverged lexically. That tension between automated metrics and actual output quality is something I kept coming back to throughout the project, and I think it’s worth being honest about in any write-up: ROUGE-L tells you about overlap, not about whether the cover letter is actually good.
Future directions I’d want to explore: a lightweight classifier to filter training data for grounding quality, and fine-tuning LLaMA-2-13B or LLaMA-3-8B to see how much of the remaining gap is model capacity vs. data quality.
