xLSTM-Metal: High-Performance xLSTM for Apple Silicon

November 12, 2025 · View on GitHub

Production-ready xLSTM (Extended LSTM) implementation optimized for Apple Silicon using MLX and Metal acceleration. Features automatic model loading, config-driven architecture with NCPS wiring patterns, and a simple generation API.

Author: Sydney Renee (The Solace Project)
Email: sydney@solace.ofharmony.ai
License: Apache 2.0

Quick Start

# Install dependencies
pip install mlx transformers tokenizers

# Run inference with local model
python generate.py --model xlstm_7b_model --prompt "The capital of France is" --max-tokens 50

# Interactive mode
python generate.py --model xlstm_7b_model --interactive

Simple API

xLSTM-Metal uses MLX for native Apple Silicon acceleration with automatic configuration loading:

from xlstm_metal.mlx_jit.generate import xLSTMRunner
from xlstm_metal.mlx_jit.tokenizer import TokenizerBlock, TokenizerConfig

# Load model (config-driven, works with any xLSTM size)
runner = xLSTMRunner("xlstm_7b_model")

# Initialize tokenizer
tokenizer_config = TokenizerConfig(model_path="xlstm_7b_model")
tokenizer = TokenizerBlock(tokenizer_config)

# Generate text
prompt_ids = tokenizer.encode("Hello world").tolist()
generated_ids = runner.generate(
    prompt_ids, 
    max_tokens=50,
    temperature=0.8,
    top_p=0.9
)
output = tokenizer.decode(generated_ids)
print(output)

Key Design Principles:

Config-Driven: Automatically adapts to any xLSTM model size from config.json
MLX-Native: Full Apple Silicon optimization with Metal acceleration
NCPS Wiring: Declarative block composition with automatic structure discovery
Simple & Direct: No heavy abstractions, clear data flow
Production-Ready: Stable numerical handling, proper dtype management

Features

Apple Silicon Native**: Optimized for M1/M2/M3/M4 with Metal acceleration via MLX
Config-Driven Architecture**: Automatically loads and adapts to any xLSTM model size
NCPS Wiring System**: Neural Circuit Policy-inspired wiring for declarative block composition
Simple API**: Clean generation interface without heavy abstractions
Numerical Stability**: Proper dtype handling (float32/bfloat16) for stable inference
Smart Weight Loading**: Supports safetensors with automatic sharding and structure discovery
Production-Ready**: Resolved NaN issues, validated on xLSTM-7B model

What's New (v0.3.0)

November 2024 - Stable Release

✅ Fixed dtype handling: Resolved torch_dtype vs autocast_kernel_dtype confusion
✅ NaN elimination: Resolved numerical instability in mLSTM blocks
✅ NCPS wiring patterns: Automatic model structure discovery from safetensors
✅ Comprehensive documentation: Added detailed docstrings throughout codebase
✅ Validated inference: Tested and working on xLSTM-7B (32 blocks, 4096d)
✅ Parameter propagation: Fixed dtype flow through block hierarchy
✅ Better error messages: Actionable guidance for common issues

See COMPLETE_FIX_SUMMARY.md and DOCSTRING_ENRICHMENT_SUMMARY.md for full details.

Architecture

xLSTM-Metal uses a modular, auto-discovery architecture inspired by Neural Circuit Policies (NCPS):

Core Components

1. WiredxLSTM Model (xlstm_metal/mlx_jit/models/wired_xlstm.py)

Top-level model class that assembles complete xLSTM language models
Automatic structure discovery from safetensors checkpoints
Builds correct stack of blocks (mLSTM, sLSTM) dynamically based on checkpoint inspection

2. NCPS Auto-Wiring (xlstm_metal/mlx_jit/wiring/auto_wiring.py)

Inspired by Neural Circuit Policy wiring patterns for declarative composition
Introspects model.safetensors.index.json to discover architecture
Detects block types (mLSTM, sLSTM, attention) from weight key patterns
Creates sequential connectivity automatically (block_0 → block_1 → ... → block_N)
Provides factory methods for block cell creation

3. mLSTM/sLSTM Blocks (xlstm_metal/mlx_jit/blocks/)

mLSTM (Matrix LSTM): Matrix-valued hidden states with covariance update rules
sLSTM (Scalar LSTM): Traditional scalar memory with exponential gating
Modular cell design pattern (stateless transformations)
Optimized Metal kernels for core operations (matmul, elementwise)

4. Generation Engine (xlstm_metal/mlx_jit/generate.py)

xLSTMRunner: High-level inference interface
Stateful generation for efficient autoregressive decoding
Temperature, top-k, top-p (nucleus) sampling support
Stop token handling and BOS token insertion

Model Architecture Flow

Input Token IDs [B, S]
  ↓ embedding (token → vector)
Embeddings [B, S, D]
  ↓ blocks[0..N-1]  (mLSTM/sLSTM with residuals + FFN)
Hidden States [B, S, D]
  ↓ out_norm (RMSNorm)
Normalized [B, S, D]
  ↓ lm_head (Linear projection)
Logits [B, S, vocab_size]
  ↓ soft_cap (tanh-based capping for stability)
  ↓ sampling (temperature/top-k/top-p)
Generated Tokens

Each block typically follows this pattern:

residual = x
x = norm_mlstm(x)
x, state = mlstm_cell(x, state)  # Stateful recurrence
x = x + residual

residual = x
x = norm_ffn(x)
x = ffn(x)  # Feed-forward network
x = x + residual

NCPS Wiring Benefits

The NCPS-inspired wiring system provides:

Zero-config loading: Works with any xLSTM checkpoint without hardcoding architecture
Model-agnostic: Single codebase handles 1B, 7B, 13B, etc. variants
Introspectable: Query block types and structure before instantiation
Version agnostic: Adapts to checkpoint structure changes automatically
Modular: Easy to add new block types (attention, sparse MoE, etc.)

MLX Backend

MLX provides native Apple Silicon optimization:

Unified Memory: Direct GPU access without CPU-GPU data transfers
Lazy Evaluation: Computation graphs evaluated on-demand for efficiency
Metal Kernels: Hardware-accelerated operations on GPU
NumPy-like API: Familiar array programming interface

Model Support

Official Models

xLSTM-7B (NX-AI/xLSTM-7b): 7 billion parameter model
- 32 xLSTM blocks with mLSTM and sLSTM layers
- 4096 embedding dimensions, 32 attention heads
- Trained on diverse text corpus

Custom Models

Any xLSTM model with a config.json file is supported. The implementation automatically:

Detects model architecture from configuration
Creates appropriate NCPS wiring for block execution
Loads weights from safetensors or NPZ format

Installation

# Install MLX (Apple Silicon only)
pip install mlx

# Install tokenizer support
pip install transformers tokenizers

# Clone repository
git clone https://github.com/SolaceHarmony/xLSTM-Metal.git
cd xLSTM-metal

# Download xLSTM-7B model from HuggingFace (~14GB)
# You can use the provided download script or download manually:
python scripts/downloads/download_model.py

# Or download manually from HuggingFace:
# https://huggingface.co/NX-AI/xLSTM-7b

# Run inference
python generate.py --model ./xlstm_7b_model --prompt "Hello world"

Quick Install

pip install mlx transformers tokenizers

Note: This implementation requires Apple Silicon (M1/M2/M3/M4) and macOS 13.0+.

Usage Examples

Command Line

# Basic generation
python generate.py --model xlstm_7b_model --prompt "Once upon a time"

# Advanced sampling
python generate.py --model xlstm_7b_model \
  --prompt "The future of AI" \
  --max-tokens 200 \
  --temperature 0.8 \
  --top-p 0.9

# Interactive mode
python generate.py --model xlstm_7b_model --interactive

# Model information
python generate.py --model xlstm_7b_model --info

# Debug wiring structure
python generate.py --model xlstm_7b_model --prompt "Test" --show-wiring

Python API

from xlstm_metal.mlx_jit.generate import xLSTMRunner
from xlstm_metal.mlx_jit.tokenizer import TokenizerBlock, TokenizerConfig

# Initialize runner
runner = xLSTMRunner("xlstm_7b_model")

# Initialize tokenizer
tokenizer_config = TokenizerConfig(model_path="xlstm_7b_model")
tokenizer = TokenizerBlock(tokenizer_config)

# Get model information
info = runner.get_model_info()
print(f"Model: {info['num_blocks']} blocks, {info['embedding_dim']}d")

# Generate with custom parameters
prompt_ids = tokenizer.encode("Hello world").tolist()
generated_ids = runner.generate(
    prompt_ids,
    max_tokens=100,
    temperature=0.7,
    top_k=50
)
output = tokenizer.decode(generated_ids)
print(output)

# Stateful generation (efficient for long sequences)
runner.reset_state()
prompt_ids = tokenizer.encode("Tell me a story").tolist()
current_ids = prompt_ids
for i in range(50):  # Generate 50 tokens
    next_token = runner.generate_next_token(
        mx.array([current_ids], dtype=mx.int64),
        temperature=0.8
    )
    current_ids = [int(next_token)]
    print(tokenizer.decode([int(next_token)]), end='', flush=True)

Performance

xLSTM-Metal leverages Apple Silicon's unified memory architecture and Metal acceleration through MLX:

Unified Memory: Direct GPU access without CPU-GPU data transfers
MLX Optimization: Lazy evaluation and optimized kernel fusion
Efficient Execution: Sequential block processing with minimal overhead
Memory Efficient: Stateful generation reduces recomputation

Performance characteristics depend on model size, sequence length, and hardware generation (M1/M2/M3/M4).

Typical Performance (xLSTM-7B on M2 Max):

First token latency: ~500ms (includes model loading and Metal shader compilation)
Subsequent tokens: ~50-100ms per token
Memory usage: ~14GB (model weights) + ~2-4GB (inference state)

See docs/ for detailed architecture documentation and optimization guides.

Technical Details

xLSTM Architecture

The Extended Long Short-Term Memory (xLSTM) architecture combines:

mLSTM (matrix LSTM): Matrix memory and covariance update rule
sLSTM (scalar LSTM): Scalar memory with exponential gating
Residual Connections: Skip connections for gradient flow
Layer Normalization: Stable training and inference

Mathematical Foundation

xLSTM extends traditional LSTM with:

Matrix-valued hidden states for increased expressiveness
Exponential gating for improved gradient flow
Soft attention mechanisms within memory cells
Architectural scaling for billion-parameter models

Implementation Highlights

Config-Driven Architecture: Automatic model creation from JSON configuration
Weight Loading: Support for safetensors and NPZ formats
Memory Efficiency: Optimized for Apple's unified memory
Type Safety: Full MLX array type support

Development

Repository Structure

├── xlstm_metal/              # Core implementation
│   └── mlx_jit/              # MLX backend (primary)
│       ├── generate.py       # Generation runner
│       ├── tokenizer/        # Tokenizer wrapper
│       ├── models/           # WiredxLSTM model
│       ├── wiring/           # NCPS auto-wiring
│       ├── blocks/           # mLSTM/sLSTM blocks
│       └── utils/            # Config and weight loading
├── docs/                     # Technical documentation
├── tests/                    # Test suite
├── scripts/                  # Utilities and tools
└── kernel_development/       # Metal kernel experiments

Testing

# Run test suite
python run_pytest.py

# Test specific components
python -m pytest tests/test_pretrained_inference.py -v

# Test numerical parity
python test_numerical_parity.py

Contributing

This is an independent port to Apple Silicon. Contributions are welcome! Please:

Follow the coding style in existing files
Add tests for new features
Update documentation as needed
See AGENTS.md for development guidelines

Credits and Attribution

This Port

xLSTM-Metal is an independent port to Apple Silicon with MLX.

Author: Sydney Renee
Organization: The Solace Project
Email: sydney@solace.ofharmony.ai
Repository: https://github.com/SolaceHarmony/xLSTM-Metal

This port includes:

MLX backend implementation with Metal acceleration
NCPS-inspired wiring system for automatic structure discovery
Numerical stability fixes and dtype handling improvements
Production-ready inference with proper error handling

Original Research

xLSTM was introduced by Beck et al. (2024):

Beck, M., Pöppel, K., Spanring, M., Auer, A., Prudnikova, O., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2024). xLSTM: Extended Long Short-Term Memory. arXiv preprint arXiv:2405.04517.

Model Weights

Official xLSTM-7B model weights provided by NX-AI under Apache 2.0 license.

Framework

Built on MLX, Apple's machine learning framework for Apple Silicon.

Acknowledgments

NX-AI team for the original xLSTM research and model weights
Apple MLX team for the excellent framework
Neural Circuit Policies (NCPS) research for inspiring the wiring system design

License

Apache License 2.0. See LICENSE for full text.

Model weights from NX-AI are also under Apache 2.0.

Citation

If you use this implementation, please cite the original xLSTM paper:

@article{beck2024xlstm,
  title={xLSTM: Extended Long Short-Term Memory},
  author={Beck, Maximilian and P{\"o}ppel, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, G{\"u}nter and Brandstetter, Johannes and Hochreiter, Sepp},
  journal={arXiv preprint arXiv:2405.04517},
  year={2024}
}

Documentation

Complete technical documentation available in docs/:

Requirements

Hardware: Apple Silicon (M1/M2/M3/M4)
OS: macOS 13.0 or later
Python: 3.9 or later
MLX: Latest version from pip

This is an unofficial implementation optimized for Apple Silicon. For the original research and reference implementation, see the xLSTM paper.