Overview

banana.cpp is a pure C++ LLM inference engine designed for modularity and extensibility. It supports several small language model architectures, including SmolLM2, Llama 3.2, and Qwen variants, with built-in FP16/BF16 precision and modern architectural features such as grouped-query attention (GQA), rotary position embeddings (RoPE), and the SwiGLU activation.

Supported Models

  • SmolLM2: 135M, 360M, 1.7B parameter variants
  • Llama 3.2: 1B, 3B parameter variants
  • Qwen: Qwen2.5 0.5B and Qwen3 0.6B variants

Key Features

  • Modular Architecture: Layer-based design makes it easy to add new architectures
  • Mixed Precision: Native FP16 and BF16 support for efficient inference
  • Modern Attention: GQA, MHA, and MQA attention mechanisms
  • Position Encodings: RoPE (Rotary Position Embedding) implementation
  • Activation Functions: SwiGLU and standard MLP layers
  • Normalization: RMSNorm and LayerNorm implementations
  • Auto-detection: Model registry automatically detects model type from config
  • HuggingFace Integration: Built-in model downloader for HuggingFace models
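As an illustration of the normalization feature, RMSNorm scales each activation by the reciprocal root-mean-square of the vector, then applies a learned per-channel gain. This is a minimal sketch of the technique; the function name and signature are illustrative, not banana.cpp's actual layer API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm sketch: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
// Name and signature are hypothetical; the engine's real layer class may differ.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-5f) {
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;                    // sum of squares
    float scale = 1.0f / std::sqrt(sum_sq / x.size() + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * scale * weight[i];                  // normalize, then gain
    return y;
}
```

Unlike LayerNorm, RMSNorm skips the mean subtraction and bias, which is one reason it is the common choice in Llama-family models.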

Architecture

The engine uses a clean, modular layer-based architecture:

  • Layers: Reusable components (Attention, MLP, Normalization, RoPE)
  • Models: Compose layers based on configuration
  • Tokenizers: Separate BPE logic from chat template handling
  • Registry: Auto-detect model type from config.json

This design makes it straightforward to add new model architectures by composing existing layers, to introduce new layer types without modifying existing models, and to support multiple chat templates and tokenizer formats.
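The registry idea can be sketched as follows, assuming detection keys off the `model_type` field of config.json (the convention used by HuggingFace configs). The enum, function name, and string values here are hypothetical placeholders, not the engine's real interface:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch of config-based auto-detection. banana.cpp's actual
// registry may inspect additional config fields; names here are illustrative.
enum class ModelArch { Llama, Qwen };

ModelArch detect_arch(const std::string& model_type) {
    // SmolLM2 and Llama 3.2 both ship Llama-style configs.
    if (model_type == "llama") return ModelArch::Llama;
    if (model_type == "qwen2" || model_type == "qwen3") return ModelArch::Qwen;
    throw std::runtime_error("unsupported model_type: " + model_type);
}
```

Centralizing detection this way means a new architecture only needs a registry entry plus a model class that composes existing layers.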

Usage

# List supported models
./banana-cpp --list-models

# Download and run a model
./banana-cpp --model smollm2-360m --download
./banana-cpp --model smollm2-360m --prompt "What is the capital of France?"

# Interactive mode
./banana-cpp --model smollm2-360m --interactive

# Custom settings
./banana-cpp --model llama-3.2-1b \
    --temperature 0.8 \
    --top-k 40 \
    --top-p 0.95 \
    --max-tokens 512

Building

mkdir build && cd build
cmake ..
make -j$(nproc)

Project Structure

  • include/config/: Model and tokenizer configuration structures
  • include/core/: Core tensor data structures and operations
  • include/layers/: Neural network layer implementations
  • include/models/: Model architecture definitions
  • include/tokenizers/: BPE tokenization and chat templates
  • include/registry/: Model auto-detection system
  • include/utils/: Model loading and HuggingFace downloading