Overview

banana.cpp is a pure C++ LLM inference engine designed for modularity and extensibility. It supports several small language model architectures, including SmolLM2, Llama 3.2, and Qwen variants, with built-in FP16/BF16 precision and modern architectural features such as grouped-query attention (GQA), rotary position embeddings (RoPE), and the SwiGLU activation.

Supported Models

  • SmolLM2: 135M, 360M, 1.7B parameter variants
  • Llama 3.2: 1B, 3B parameter variants
  • Qwen: Qwen2.5 0.5B and Qwen3 0.6B variants

Key Features

  • Modular Architecture: Layer-based design makes it easy to add new architectures
  • Mixed Precision: Native FP16 and BF16 support for efficient inference
  • Modern Attention: GQA, MHA, and MQA attention mechanisms
  • Position Encodings: RoPE (Rotary Position Embedding) implementation
  • Activation Functions: SwiGLU and standard MLP layers
  • Normalization: RMSNorm and LayerNorm implementations
  • Auto-detection: Model registry automatically detects model type from config
  • HuggingFace Integration: Built-in model downloader for HuggingFace models
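As an illustration of the normalization feature, RMSNorm scales each activation by the reciprocal root-mean-square of the vector, then applies a learned per-channel gain. This is a minimal sketch of the technique; the function name and signature are illustrative, not banana.cpp's actual layer API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm sketch: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
// Name and signature are hypothetical; the engine's real layer class may differ.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-5f) {
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;                    // sum of squares
    float scale = 1.0f / std::sqrt(sum_sq / x.size() + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * scale * weight[i];                  // normalize, then gain
    return y;
}
```

Unlike LayerNorm, RMSNorm skips the mean subtraction and bias, which is one reason it is the common choice in Llama-family models.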

Architecture

The engine uses a clean, modular layer-based architecture:

  • Layers: Reusable components (Attention, MLP, Normalization, RoPE)
  • Models: Compose layers based on configuration
  • Tokenizers: Separate BPE logic from chat template handling
  • Registry: Auto-detect model type from config.json

This design makes it straightforward to add new model architectures by composing existing layers, to introduce new layer types without modifying existing models, and to support multiple chat templates and tokenizer formats.
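The registry idea can be sketched as follows, assuming detection keys off the `model_type` field of config.json (the convention used by HuggingFace configs). The enum, function name, and string values here are hypothetical placeholders, not the engine's real interface:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch of config-based auto-detection. banana.cpp's actual
// registry may inspect additional config fields; names here are illustrative.
enum class ModelArch { Llama, Qwen };

ModelArch detect_arch(const std::string& model_type) {
    // SmolLM2 and Llama 3.2 both ship Llama-style configs.
    if (model_type == "llama") return ModelArch::Llama;
    if (model_type == "qwen2" || model_type == "qwen3") return ModelArch::Qwen;
    throw std::runtime_error("unsupported model_type: " + model_type);
}
```

Centralizing detection this way means a new architecture only needs a registry entry plus a model class that composes existing layers.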

Usage

# List supported models
./banana-cpp --list-models

# Download and run a model
./banana-cpp --model smollm2-360m --download
./banana-cpp --model smollm2-360m --prompt "What is the capital of France?"

# Interactive mode
./banana-cpp --model smollm2-360m --interactive

# Custom settings
./banana-cpp --model llama-3.2-1b \
    --temperature 0.8 \
    --top-k 40 \
    --top-p 0.95 \
    --max-tokens 512

Building

mkdir build && cd build
cmake ..
make -j$(nproc)

Project Structure

  • include/config/: Model and tokenizer configuration structures
  • include/core/: Core tensor data structures and operations
  • include/layers/: Neural network layer implementations
  • include/models/: Model architecture definitions
  • include/tokenizers/: BPE tokenization and chat templates
  • include/registry/: Model auto-detection system
  • include/utils/: Model loading and HuggingFace downloading