Overview

smol-llama is a minimal, from-scratch implementation of a LLaMA-style language model for pre-training on custom data. The project demonstrates that a capable small language model can be trained on a modest budget: the entire ~360M-parameter model was trained on ~6B tokens using a single NVIDIA H100 GPU in roughly 22 hours, at a total cost of around $53.

Model Architecture

Parameters ~360M
Hidden Dimension 960
Layers 32
Attention Heads 15 query / 5 key-value (GQA)
Context Length 2048 tokens
Vocab Size 49,152

Key Features

  • Grouped Query Attention (GQA): Efficient inference with 15 query heads and 5 key-value heads
  • RoPE: Rotary Position Embeddings for better position encoding
  • RMSNorm: Root Mean Square normalization instead of LayerNorm
  • SwiGLU: Gated Linear Unit activation in the feed-forward network
  • Flash Attention 2: Fast and memory-efficient attention with SDPA fallback
  • Gradient Checkpointing: Memory-efficient training for larger models
  • torch.compile: Optimized training speed with PyTorch 2.0 compilation
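
To make one of these components concrete, here is a minimal PyTorch sketch of RMSNorm: it rescales activations by their root mean square with a learned per-channel gain, with no mean subtraction and no bias. This is an illustrative sketch; the project's actual implementation lives in utils/model.py and may differ in detail.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: divide by the RMS of the last
    dimension and apply a learned per-channel scale (no mean, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of the mean square, with eps for numerical stability
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Compared to LayerNorm, this drops the mean-centering and bias terms, which saves a little compute and tends to train just as stably in transformer stacks.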

Training Details

GPU 1× NVIDIA H100 (80GB PCIe)
Training Speed ~75,000 tokens/sec
Training Time ~22 hours (1 epoch)
Cloud Provider RunPod (~$2.40/hr)
Total Cost ~$53
Dataset FineWeb-6B (~6B tokens)
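
These figures are mutually consistent, as a quick back-of-the-envelope check confirms:

```python
tokens_per_sec = 75_000
hours = 22
cost_per_hour = 2.40

total_tokens = tokens_per_sec * 3600 * hours  # ~5.94B, i.e. roughly 6B
total_cost = cost_per_hour * hours            # $52.80, i.e. roughly $53
```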

Dataset

The model is trained on FineWeb-6B, a curated 6B-token dataset pre-tokenized with a custom 49,152-entry BPE vocabulary. The dataset comprises 11.3 GB of training tokens and 57 MB of validation tokens, all pre-processed for immediate use.
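
Because the tokens ship as pre-tokenized binaries, a common loading pattern is to memory-map the file and slice random windows from it. The sketch below illustrates that pattern under two assumptions (the actual utils/data.py API may differ): tokens are stored as flat uint16 values, which suffices for a 49,152-token vocabulary, and targets are the inputs shifted by one position.

```python
import numpy as np
import torch

def get_batch(path, batch_size, ctx_len, rng):
    # Memory-map the binary so the full 11.3 GB file is never
    # loaded into RAM at once; uint16 covers vocab IDs < 65,536.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    ix = rng.integers(0, len(data) - ctx_len - 1, size=batch_size)
    # Inputs are windows of ctx_len tokens; targets are the same
    # windows shifted one token to the right.
    x = torch.stack([torch.from_numpy(data[i:i + ctx_len].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + ctx_len].astype(np.int64)) for i in ix])
    return x, y
```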

Quick Start

# Install dependencies
uv sync

# Run training
uv run ./pretrain.py

The training script automatically downloads the pre-tokenized dataset, initializes the model, trains with gradient accumulation and mixed precision, and saves checkpoints every 200 steps.
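
The gradient-accumulation and mixed-precision loop the script performs can be sketched as follows. Helper names (`get_batch`, the clip value of 1.0) are illustrative assumptions, not the script's actual API; the bf16 autocast mirrors the mixed-precision setup described above.

```python
import torch

def train_step(model, optimizer, get_batch, accum_steps=8, device="cpu"):
    """One optimizer step built from `accum_steps` micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    total_loss = 0.0
    for _ in range(accum_steps):
        x, y = get_batch()
        # bf16 autocast on GPU; runs in full precision on CPU.
        with torch.autocast(device_type=device, dtype=torch.bfloat16,
                            enabled=(device == "cuda")):
            logits = model(x)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1))
        # Scale so the accumulated gradient equals the full-batch mean.
        (loss / accum_steps).backward()
        total_loss += loss.item() / accum_steps
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return total_loss
```

With 8 accumulation steps over micro-batches of 64 sequences, each optimizer step sees the effective batch of 512 sequences listed in the configuration below.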

Using the Pre-trained Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "ifkash/smol-llama",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ifkash/smol-llama")

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Configuration

Batch Size 64
Context Length 2048 tokens
Gradient Accumulation 8 steps (effective batch size: 512 sequences)
Learning Rate 3e-4 (peak)
Max Iterations 5,725 (~6B tokens)
Warmup Steps 900 iterations
Tokens per Step ~1M tokens
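
The cosine schedule with warmup (see utils/lr_schedule.py) can be sketched from these hyperparameters as below. The minimum learning-rate floor of 3e-5 is an assumption for illustration; the table above only specifies the peak.

```python
import math

def get_lr(step, max_lr=3e-4, min_lr=3e-5, warmup=900, max_steps=5725):
    # Linear warmup to the 3e-4 peak over the first 900 iterations.
    if step < warmup:
        return max_lr * (step + 1) / warmup
    # Cosine decay from the peak down to min_lr over the remaining steps.
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```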

Project Structure

  • pretrain.py: Main training script with gradient accumulation and checkpointing
  • utils/model.py: Complete LLaMA architecture implementation
  • utils/rotary.py: Rotary position embeddings (RoPE)
  • utils/data.py: Efficient data loading from pre-tokenized binaries
  • utils/checkpoint.py: Checkpoint saving/loading and HuggingFace uploads
  • utils/lr_schedule.py: Cosine learning rate schedule with warmup
  • utils/logging.py: Weights & Biases integration for experiment tracking
  • notebooks/1-train-tokenizer.ipynb: Custom BPE tokenizer training