banana.cpp
A modular, pure C++ implementation of a small language model inference engine supporting multiple model architectures.
Overview
banana.cpp is a pure C++ LLM inference engine designed for modularity and extensibility. It supports multiple small language model architectures including SmolLM2, Llama 3.2, and Qwen variants, with built-in support for FP16/BF16 precision and modern architectural features like GQA, RoPE, and SwiGLU.
Supported Models
- SmolLM2: 135M, 360M, 1.7B parameter variants
- Llama 3.2: 1B, 3B parameter variants
- Qwen: Qwen2.5-0.5B, Qwen3-0.6B variants
Key Features
- Modular Architecture: Layer-based design makes it easy to add new architectures
- Mixed Precision: Native FP16 and BF16 support for efficient inference
- Modern Attention: GQA, MHA, and MQA attention mechanisms
- Position Encodings: RoPE (Rotary Position Embedding) implementation
- Activation Functions: SwiGLU and standard MLP layers
- Normalization: RMSNorm and LayerNorm implementations
- Auto-detection: Model registry automatically detects model type from config
- HuggingFace Integration: Built-in model downloader for HuggingFace models
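As an illustration of the kind of layer the engine implements, here is a minimal RMSNorm sketch. This is a hedged example of the standard algorithm (scale by the reciprocal root-mean-square of the input, then apply a learned weight), not banana.cpp's actual class or API, whose names and signatures are not shown in this README.

```cpp
#include <cmath>
#include <vector>

// RMSNorm: out[i] = x[i] * weight[i] / sqrt(mean(x^2) + eps).
// Illustrative free function; the real engine likely wraps this in a layer class.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    float scale = 1.0f / std::sqrt(sum_sq / x.size() + eps);
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        out[i] = x[i] * scale * weight[i];
    return out;
}
```

Unlike LayerNorm, RMSNorm skips mean subtraction and the bias term, which is why it is the cheaper default in Llama-family models.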
Architecture
The engine uses a clean, modular layer-based architecture:
- Layers: Reusable components (Attention, MLP, Normalization, RoPE)
- Models: Compose layers based on configuration
- Tokenizers: Separate BPE logic from chat template handling
- Registry: Auto-detect model type from config.json
This design philosophy makes it straightforward to add new model architectures by composing existing layers, introduce new layer types without modifying models, and support multiple chat templates and tokenizer formats.
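A sketch of what this composition pattern could look like in C++. The class names (`Layer`, `TransformerBlock`) and the pre-norm residual structure here are assumptions for illustration, not banana.cpp's actual interfaces.

```cpp
#include <memory>
#include <vector>

// Hypothetical names for illustration; not banana.cpp's actual API.
using Tensor = std::vector<float>;

struct Layer {
    virtual ~Layer() = default;
    virtual Tensor forward(const Tensor& x) = 0;
};

// Elementwise residual addition (assumes matching sizes).
Tensor add(const Tensor& a, const Tensor& b) {
    Tensor out(a.size());
    for (size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// A pre-norm transformer block: the model composes reusable layers
// (norm, attention, MLP) instead of hard-coding each architecture.
struct TransformerBlock : Layer {
    std::unique_ptr<Layer> attn_norm, attn, mlp_norm, mlp;
    Tensor forward(const Tensor& x) override {
        Tensor h = add(x, attn->forward(attn_norm->forward(x)));  // attention + residual
        return add(h, mlp->forward(mlp_norm->forward(h)));        // MLP + residual
    }
};
```

Under this pattern, supporting a new architecture mostly means choosing which concrete layers (e.g. GQA vs. MHA attention, SwiGLU vs. standard MLP, RMSNorm vs. LayerNorm) to plug into each block.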
Usage
# List supported models
./banana-cpp --list-models
# Download and run a model
./banana-cpp --model smollm2-360m --download
./banana-cpp --model smollm2-360m --prompt "What is the capital of France?"
# Interactive mode
./banana-cpp --model smollm2-360m --interactive
# Custom settings
./banana-cpp --model llama-3.2-1b \
--temperature 0.8 \
--top-k 40 \
--top-p 0.95 \
--max-tokens 512

Building
mkdir build && cd build
cmake ..
make -j$(nproc)

Project Structure
- include/config/: Model and tokenizer configuration structures
- include/core/: Core tensor data structures and operations
- include/layers/: Neural network layer implementations
- include/models/: Model architecture definitions
- include/tokenizers/: BPE tokenization and chat templates
- include/registry/: Model auto-detection system
- include/utils/: Model loading and HuggingFace downloading
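The --temperature, --top-k, and --top-p flags shown under Usage correspond to standard sampling techniques. A minimal sketch of temperature scaling and top-k filtering follows; the function names are illustrative, not banana.cpp's internals, and top-p (nucleus) filtering would additionally keep only the smallest prefix of the sorted distribution whose cumulative mass reaches p.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Convert logits to probabilities, dividing by temperature first.
// Lower temperature sharpens the distribution; higher flattens it.
std::vector<float> softmax_with_temperature(std::vector<float> logits,
                                            float temperature) {
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float& l : logits) {
        l = std::exp((l - max_logit) / temperature);  // exp(-inf) -> 0
        sum += l;
    }
    for (float& l : logits) l /= sum;
    return logits;
}

// Keep the k largest logits; mask the rest to -infinity before softmax.
// (Ties at the threshold may keep slightly more than k entries.)
void top_k_filter(std::vector<float>& logits, size_t k) {
    if (k == 0 || k >= logits.size()) return;
    std::vector<float> sorted = logits;
    std::nth_element(sorted.begin(), sorted.end() - k, sorted.end());
    float threshold = sorted[sorted.size() - k];  // k-th largest logit
    for (float& l : logits)
        if (l < threshold) l = -std::numeric_limits<float>::infinity();
}
```

In a typical sampling loop these run per token: filter the logits, apply temperature-scaled softmax, then draw from the resulting distribution.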