Fine Tuning Gemma 3 with Unsloth
In this post, I’ll walk through fine-tuning Gemma-3 using Hugging Face and the Unsloth library. This article is based on an Unsloth Colab notebook, and the consolidated source code is available here: Gemma 3 Training with Unsloth
To fully use the provided source code (for exporting the model to Hugging Face), you must be logged in to Hugging Face via huggingface-cli login.
Technologies Used
- Hugging Face: Platform for hosting and training LLMs.
- Unsloth: Library optimized for fast, efficient fine-tuning.
- TRL (Transformer Reinforcement Learning): Provides SFTTrainer for supervised fine-tuning.
- PyTorch: Core framework for model computations.
Steps Involved
The fine-tuning process involves several key steps (the code snippets in the following sections assume the imports sketched just after this list):
- Dataset preparation
- Model loading with quantization
- PEFT model configuration
- Supervised training setup and execution
- Saving and exporting the fine-tuned model
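For reference, here is a minimal set of imports that the snippets below assume; the names come from the datasets, trl, and unsloth packages used in the notebook, though exact import paths can shift between Unsloth releases:
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from unsloth import FastModel
from unsloth.chat_templates import (
    get_chat_template,
    standardize_data_formats,
    train_on_responses_only,
)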
Dataset Preparation
The dataset used here is mlabonne/FineTome-100k, loaded and standardized to align with the model’s chat template:
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_data_formats(dataset)
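The notebook also attaches the Gemma-3 chat template to the tokenizer before formatting the data; a minimal sketch, assuming get_chat_template from unsloth.chat_templates accepts the "gemma-3" template name:
tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-3",  # use Gemma-3's <start_of_turn> conversation format
)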
We then apply the Gemma-3 chat template to format inputs correctly:
def apply_chat_template(examples):
    # Render each conversation to a text string (tokenize=False keeps it as text for SFTTrainer)
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize=False)
    return {"text": texts}

dataset = dataset.map(apply_chat_template, batched=True)
Loading and Configuring the Model
Using the Unsloth library, Gemma-3 is loaded with 4-bit quantization to reduce memory usage while maintaining accuracy:
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=2048,    # context length used for training
    load_in_4bit=True,      # 4-bit quantization to cut VRAM usage
    full_finetuning=False,  # we only train LoRA adapters
)
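To see how much VRAM the 4-bit checkpoint occupies once loaded (the same numbers reported later in the training log), you can snapshot the reserved memory with standard torch.cuda calls; a small sketch, assuming a single CUDA device:
import torch

gpu_stats = torch.cuda.get_device_properties(0)
max_memory = round(gpu_stats.total_memory / 1024**3, 3)
reserved_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{reserved_memory} GB of memory reserved.")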
The model is configured for parameter-efficient fine-tuning:
model = FastModel.get_peft_model(
    model,
    finetune_language_layers=True,    # train the language (text) layers
    finetune_attention_modules=True,  # attach LoRA to the attention projections
    finetune_mlp_modules=True,        # attach LoRA to the MLP projections
    r=8,                              # LoRA rank
    lora_alpha=8,                     # LoRA scaling factor
    bias="none",
    random_state=3407,
)
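With r=8, only a small fraction of the network’s weights are trainable (about 0.37% in the run shown below). Assuming the object returned by get_peft_model behaves like a regular PyTorch module, you can verify the count directly:
# Count trainable vs. total parameters on the LoRA-wrapped model
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters = {trainable:,}/{total:,} ({100 * trainable / total:.2f}% trained)")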
Supervised Fine-tuning
We set up the supervised fine-tuning with TRL’s SFTTrainer, focusing the training specifically on model responses:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size = 2 x 4 = 8
        max_steps=30,                   # short demo run; raise for a real fine-tune
        learning_rate=2e-4,
        optim="adamw_8bit",             # 8-bit AdamW to save optimizer memory
        seed=3407,
    ),
)
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
trainer.train()
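To confirm that only the model turns contribute to the loss (user turns should be masked out with -100 labels), you can decode a processed example from the trainer, as the Unsloth notebook does; this sketch assumes the trainer’s dataset exposes input_ids and labels after tokenization:
sample = trainer.train_dataset[100]
print(tokenizer.decode(sample["input_ids"]))  # the full formatted conversation
# Replace masked (-100) positions with the pad token so the rest can be decoded
masked = [tokenizer.pad_token_id if tok == -100 else tok for tok in sample["labels"]]
print(tokenizer.decode(masked).replace(tokenizer.pad_token, " "))  # only the model response remains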
Exporting the Fine-tuned Model
After training, the model is saved locally and optionally pushed to Hugging Face for easy reuse and distribution:
model.save_pretrained("gemma-3")
tokenizer.save_pretrained("gemma-3")
model.save_pretrained_merged("gemma-3-finetune", tokenizer)
model.save_pretrained_gguf("gemma-3-finetune-gguf", quantization_type="Q8_0")
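To push the merged model to the Hub instead of only saving it locally, Unsloth offers push_to_hub variants of the same calls; a hedged sketch (the repository name is a placeholder, keyword names may differ slightly between Unsloth versions, and you must be logged in via huggingface-cli as noted above):
# "your-username/gemma-3-finetune" is a placeholder Hub repository name
model.push_to_hub_merged(
    "your-username/gemma-3-finetune",
    tokenizer,
    save_method="merged_16bit",  # upload the merged 16-bit weights
)
A push_to_hub_gguf counterpart exists for the GGUF export; check the Unsloth saving documentation for the exact arguments your version expects.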
Results and Conclusion
Fine-tuning using this methodology significantly reduces GPU memory consumption: in the run below, peak reserved VRAM during training was about 6 GB on a 12 GB RTX 3060, with roughly 1.8 GB of that attributable to training itself. This makes the workflow accessible even on modest hardware setups. In a future blog post I will show how to load this fine-tuned model and perform inference with Ray.
Output from the Provided Training Script
Here is what the output looks like from the script in the GitHub repo linked above:
==((====))== Unsloth 2025.3.18: Fast Gemma3 patching. Transformers: 4.50.0. vLLM: 0.8.1.
\\ /| NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.656 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Unsloth: Making `model.base_model.model.language_model.model` require gradients
Loading and standardizing dataset: mlabonne/FineTome-100k
Displaying entry 100 from dataset post standardization:
{'conversations': [{'content': 'What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?', 'role': 'user'}, {'content': 'In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.', 'role': 'assistant'}], 'source': 'infini-instruct-top-500k', 'score': 4.774171352386475}
Applying chat template to dataset
Displaying element 100 after applying chat template: <bos><start_of_turn>user
What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?<end_of_turn>
<start_of_turn>model
In programming, the modulus operator is represented by the '%' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:
```python
# Calculate the modulus
Modulus = a % b

print("Modulus of the given numbers is: ", Modulus)
```
In this code snippet, the variables 'a' and 'b' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator '%', we calculate the remainder when 'a' is divided by 'b'. The result is then stored in the variable 'Modulus'. Finally, the modulus value is printed using the 'print' statement.
For example, if 'a' is 10 and 'b' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:
```
Modulus of the given numbers is: 2
```
This means that the modulus of 10 and 4 is 2.<end_of_turn>
Wiring SFTTrainer...
Unsloth: We found double BOS tokens - we shall remove one automatically.
GPU = NVIDIA GeForce RTX 3060. Max memory = 11.656 GB.
4.225 GB of memory reserved.
Starting training...
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 100,000 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \ Batch size per device = 2 | Gradient accumulation steps = 4
\ / Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
"-____-" Trainable parameters = 14,901,248/4,000,000,000 (0.37% trained)
0%| | 0/30 [00:00<?, ?it/s]Unsloth: Will smartly offload gradients to save VRAM!
{'loss': 1.2239, 'grad_norm': 0.8640244603157043, 'learning_rate': 4e-05, 'epoch': 0.0}
{'loss': 1.6701, 'grad_norm': 1.6130841970443726, 'learning_rate': 8e-05, 'epoch': 0.0}
{'loss': 1.7658, 'grad_norm': 1.300902247428894, 'learning_rate': 0.00012, 'epoch': 0.0}
...
{'loss': 1.0405, 'grad_norm': 0.3076574504375458, 'learning_rate': 0.0, 'epoch': 0.0}
{'train_runtime': 268.7566, 'train_samples_per_second': 0.893, 'train_steps_per_second': 0.112, 'train_loss': 1.0238157192866006, 'epoch': 0.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [04:28<00:00, 8.96s/it]
Displaying post training memory stats.
268.7566 seconds used for training.
4.48 minutes used for training.
Peak reserved memory = 6.057 GB.
Peak reserved memory for training = 1.832 GB.
Peak reserved memory % of max memory = 51.965 %.
Peak reserved memory for training % of max memory = 15.717 %.
Saving fine-tuned LORA adapters
Downloading safetensors index for unsloth/gemma-3-4b-it...
Fetching 1 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1363.11it/s]
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.96G/4.96G [01:41<00:00, 49.1MB/s]
Unsloth: Merging weights into 16bit:
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.64G/3.64G [01:09<00:00, 52.0MB/s]
Unsloth: Merging weights into 16bit: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [03:09<00:00, 94.72s/it]
No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.
Downloading safetensors index for unsloth/gemma-3-4b-it...
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████████████| 90.6k/90.6k [00:00<00:00, 2.78MB/s]
Fetching 1 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.77it/s]
model-00001-of-00002.safetensors: 100%|█████████████████████████████████████████████████████████████████████████| 4.96G/4.96G [01:37<00:00, 51.1MB/s]
Unsloth: Merging weights into 16bit: 0%| | 0/2 [10:19<?, ?it/s]^C
Traceback (most recent call last): 22%|████████████████▎ | 1.11G/4.96G [08:29<34:04, 1.88MB/s]
...