Overview

smol-llama is a minimal, from-scratch implementation of a LLaMA-style language model for pre-training on custom data. The project demonstrates that a capable small language model can be trained on a modest budget: the entire ~360M-parameter model was trained on ~6B tokens using a single NVIDIA H100 GPU in roughly 22 hours, at a total cost of around $53.

Model Architecture

Parameters ~360M
Hidden Dimension 960
Layers 32
Attention Heads 15 query / 5 key-value (GQA)
Context Length 2048 tokens
Vocab Size 49,152

Key Features

  • Grouped Query Attention (GQA): Efficient inference with 15 query heads and 5 key-value heads
  • RoPE: Rotary Position Embeddings for better position encoding
  • RMSNorm: Root Mean Square normalization instead of LayerNorm
  • SwiGLU: Gated Linear Unit activation in the feed-forward network
  • Flash Attention 2: Fast and memory-efficient attention with SDPA fallback
  • Gradient Checkpointing: Memory-efficient training for larger models
  • torch.compile: Optimized training speed with PyTorch 2.0 compilation
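
To make one of these components concrete, here is a minimal PyTorch sketch of RMSNorm: it rescales activations by their root mean square with a learned per-channel gain, with no mean subtraction and no bias. This is an illustrative sketch; the project's actual implementation lives in utils/model.py and may differ in detail.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: divide by the RMS of the last
    dimension and apply a learned per-channel scale (no mean, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt of the mean square, with eps for numerical stability
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```

Compared to LayerNorm, this drops the mean-centering and bias terms, which saves a little compute and tends to train just as stably in transformer stacks.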

Training Details

GPU 1× NVIDIA H100 (80GB PCIe)
Training Speed ~75,000 tokens/sec
Training Time ~22 hours (1 epoch)
Cloud Provider RunPod (~$2.40/hr)
Total Cost ~$53
Dataset FineWeb-6B (~6B tokens)
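
These figures are mutually consistent, as a quick back-of-the-envelope check confirms:

```python
tokens_per_sec = 75_000
hours = 22
cost_per_hour = 2.40

total_tokens = tokens_per_sec * 3600 * hours  # ~5.94B, i.e. roughly 6B
total_cost = cost_per_hour * hours            # $52.80, i.e. roughly $53
```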

Dataset

The model is trained on FineWeb-6B, a curated 6B-token dataset pre-tokenized with a custom 49,152-entry BPE vocabulary. The dataset comprises 11.3 GB of training tokens and 57 MB of validation tokens, all pre-processed for immediate use.
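
Because the tokens ship as pre-tokenized binaries, a common loading pattern is to memory-map the file and slice random windows from it. The sketch below illustrates that pattern under two assumptions (the actual utils/data.py API may differ): tokens are stored as flat uint16 values, which suffices for a 49,152-token vocabulary, and targets are the inputs shifted by one position.

```python
import numpy as np
import torch

def get_batch(path, batch_size, ctx_len, rng):
    # Memory-map the binary so the full 11.3 GB file is never
    # loaded into RAM at once; uint16 covers vocab IDs < 65,536.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    ix = rng.integers(0, len(data) - ctx_len - 1, size=batch_size)
    # Inputs are windows of ctx_len tokens; targets are the same
    # windows shifted one token to the right.
    x = torch.stack([torch.from_numpy(data[i:i + ctx_len].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + ctx_len].astype(np.int64)) for i in ix])
    return x, y
```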

Quick Start

# Install dependencies
uv sync

# Run training
uv run ./pretrain.py

The training script automatically downloads the pre-tokenized dataset, initializes the model, trains with gradient accumulation and mixed precision, and saves checkpoints every 200 steps.
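
The gradient-accumulation and mixed-precision loop the script performs can be sketched as follows. Helper names (`get_batch`, the clip value of 1.0) are illustrative assumptions, not the script's actual API; the bf16 autocast mirrors the mixed-precision setup described above.

```python
import torch

def train_step(model, optimizer, get_batch, accum_steps=8, device="cpu"):
    """One optimizer step built from `accum_steps` micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    total_loss = 0.0
    for _ in range(accum_steps):
        x, y = get_batch()
        # bf16 autocast on GPU; runs in full precision on CPU.
        with torch.autocast(device_type=device, dtype=torch.bfloat16,
                            enabled=(device == "cuda")):
            logits = model(x)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1))
        # Scale so the accumulated gradient equals the full-batch mean.
        (loss / accum_steps).backward()
        total_loss += loss.item() / accum_steps
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return total_loss
```

With 8 accumulation steps over micro-batches of 64 sequences, each optimizer step sees the effective batch of 512 sequences listed in the configuration below.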

Using the Pre-trained Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "ifkash/smol-llama",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ifkash/smol-llama")

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Configuration

Batch Size 64
Context Length 2048 tokens
Gradient Accumulation 8 steps (effective batch size: 512 sequences)
Learning Rate 3e-4 (peak)
Max Iterations 5,725 (~6B tokens)
Warmup Steps 900 iterations
Tokens per Step ~1M tokens
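
The cosine schedule with warmup (see utils/lr_schedule.py) can be sketched from these hyperparameters as below. The minimum learning-rate floor of 3e-5 is an assumption for illustration; the table above only specifies the peak.

```python
import math

def get_lr(step, max_lr=3e-4, min_lr=3e-5, warmup=900, max_steps=5725):
    # Linear warmup to the 3e-4 peak over the first 900 iterations.
    if step < warmup:
        return max_lr * (step + 1) / warmup
    # Cosine decay from the peak down to min_lr over the remaining steps.
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```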

Project Structure

  • pretrain.py: Main training script with gradient accumulation and checkpointing
  • utils/model.py: Complete LLaMA architecture implementation
  • utils/rotary.py: Rotary position embeddings (RoPE)
  • utils/data.py: Efficient data loading from pre-tokenized binaries
  • utils/checkpoint.py: Checkpoint saving/loading and HuggingFace uploads
  • utils/lr_schedule.py: Cosine learning rate schedule with warmup
  • utils/logging.py: Weights & Biases integration for experiment tracking
  • notebooks/1-train-tokenizer.ipynb: Custom BPE tokenizer training