MFI

May 13, 2026 · View on GitHub

Modern Fortran interfaces to BLAS and LAPACK

MFI provides generic, type-agnostic wrappers around BLAS and LAPACK routines. Instead of writing type-specific calls with dozens of arguments, you write one call that works for real32, real64, complex(real32), and complex(real64).

Example: $C = A \cdot B$

program main
    use mfi_blas, only: mfi_gemm
    implicit none
    real :: A(4,4), B(4,4), C(4,4)
    ! ... fill A and B ...
    call mfi_gemm(A, B, C)   ! That's it. No leading dims, no m/n/k, no alpha/beta.
end program

Quick Start

Recommended: Nix Flake (zero config)

git clone https://github.com/14NGiestas/mfi.git
cd mfi
nix develop          # cpu-only shell with gfortran, fpm, fypp, BLAS, LAPACK
nix develop .#gpu-modern   # with CUDA 12.3
nix develop .#gpu-legacy   # with CUDA 11.8
nix develop .#gpu-zluda    # AMD GPU via ZLUDA (pkgs.zluda from nixpkgs)
make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite

Requires Nix with flakes enabled.

Manual setup

Tool	Minimum version
fpm	≥ 0.13.0
fypp	any
Fortran compiler	gfortran 12+ (recommended)

pip install fypp

Install BLAS and LAPACK from your package manager:

Distro	Package
Arch	`openblas-lapack-static` (AUR)
Ubuntu/Debian	`libblas-dev liblapack-dev`
Fedora	`openblas-devel lapack-devel`

Build & Test

git clone https://github.com/14NGiestas/mfi.git
cd mfi
make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite

Using MFI as a Dependency

Add to your project's fpm.toml:

# CPU-only (stable)
[dependencies]
mfi = { git = "https://github.com/14NGiestas/mfi.git", branch = "mfi-fpm" }

That's all — fpm handles the rest. No make needed in your own project.

GPU Acceleration with cuBLAS

MFI can transparently dispatch BLAS calls to cuBLAS when compiled with the cublas feature. The same mfi_gemm, mfi_gemv, etc. calls run on the GPU without code changes.

Try it in your browser:

Local build with cuBLAS

make
fpm build --profile cublas
fpm test --profile cublas

Runtime CPU / GPU switching

MFI uses lazy initialization — no setup code is needed. When compiled with the cublas feature, GPU dispatching is controlled entirely by the MFI_USE_CUBLAS environment variable:

# CPU (default)
./build/app/app

# GPU
MFI_USE_CUBLAS=1 ./build/app/app

The same call mfi_gemm(A, B, C) runs on CPU or GPU without any code changes.

For OpenMP-parallel programs, also set OMP_NUM_THREADS to pre-allocate per-thread cuBLAS handles:

MFI_USE_CUBLAS=1 OMP_NUM_THREADS=8 ./build/app/app

Manual CPU/GPU switching (advanced)

If you need fine-grained control within a single program (e.g., run most computations on GPU but force a specific call to CPU), use mfi_force_gpu / mfi_force_cpu:

call mfi_gemm(A, B, C)       ! CPU (default)

call mfi_force_gpu
call mfi_gemm(D, E, F)       ! GPU
call mfi_force_cpu

call mfi_gemm(G, H, I)       ! CPU again

Note: When compiled without the cublas feature, mfi_force_gpu and mfi_force_cpu are no-op stubs — your code compiles and runs normally on CPU without any #ifdef changes. Simply recompile with --profile cublas to activate GPU acceleration.

Clean shutdown (optional)

Call mfi_cublas_finalize() at program end to release GPU resources. The OS cleans up on exit anyway.

AMD GPU support via ZLUDA

ZLUDA is a drop-in replacement for the CUDA runtime that runs on AMD GPUs using the HIP SDK. Because MFI's GPU backend only uses standard CUDA/cuBLAS APIs (cuda_runtime.h, cublas_v2.h, -lcublas, -lcudart), the existing cublas build works on AMD hardware without any source changes — you just redirect the linker and runtime to ZLUDA's libraries.

Prerequisites

With Nix: the ROCm/HIP userspace stack (rocmPackages.clr for the HIP runtime, rocmPackages.rocm-runtime for the HSA runtime) and CUDA compile-time headers are all provided by the gpu-zluda devShell. You still need to download ZLUDA itself — it is a pre-built binary that cannot currently be built from nixpkgs 24.11 — and point ZLUDA_PATH at its directory before entering the shell. The only host requirement beyond that is the AMD GPU kernel driver (the amdgpu kernel module and firmware), which Nix cannot provide.

Without Nix: install the full ROCm/HIP SDK and download ZLUDA from its releases page.

Linux

With Nix (recommended): ROCm/HIP and CUDA headers are provided automatically. Download ZLUDA from its releases page, then:

ZLUDA_PATH=/path/to/zluda nix develop .#gpu-zluda
make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app

The shell prints a warning and usage hint if ZLUDA_PATH is unset.

Without Nix: after installing the ROCm/HIP SDK and ZLUDA (see Prerequisites above), set the env vars manually:

export CPATH="/path/to/zluda/include:$CPATH"
export LIBRARY_PATH="/path/to/zluda/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/path/to/zluda/lib:$LD_LIBRARY_PATH"

make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app

Windows

Install AMD Software: Adrenalin Edition and the HIP SDK, then use the ZLUDA launcher (recommended) or manually prepend the ZLUDA DLL directory to PATH:

REM recommended: zluda launcher
zluda -- fpm build --profile zluda

REM or manually
set PATH=C:\path\to\zluda;%PATH%
fpm build --profile zluda

Consumer projects on AMD

# AMD GPU via ZLUDA (set env vars before building, LD_LIBRARY_PATH before running)
mfi = { git="https://github.com/14NGiestas/mfi.git", branch="mfi-fpm", features = ["zluda"] }

The zluda and cublas fpm features are identical in fpm.toml; both compile the same C/Fortran source. Use whichever name makes intent clearer in your project. Note that features = ["cublas"] also works — only the label differs.

Troubleshooting

Problem	Solution
`CUBLAS_STATUS_NOT_INITIALIZED`	cuBLAS handle not created. Set `MFI_USE_CUBLAS=1` or call `mfi_force_gpu` before the first BLAS call.
`cuda_runtime.h not found`	CUDA Toolkit (or ZLUDA headers) not in include path. See gpu_test.ipynb for a Colab setup, or set `CPATH` to ZLUDA's `include/` directory.
`libcublas.so not found` at runtime	`LD_LIBRARY_PATH` does not include CUDA/ZLUDA libs. Also ensure `CPATH` and `LIBRARY_PATH` were set at build time.
ZLUDA: `HIP_VISIBLE_DEVICES` not set	On multi-GPU systems set `HIP_VISIBLE_DEVICES=0` (or the desired device index).
ZLUDA: silent wrong results	Check `MFI_DEBUG=1` output and ensure ZLUDA version ≥ the latest pre-release.
`i?amin` symbols missing	Your BLAS provider lacks extensions. Use the default profile (without `MFI_LINK_EXTERNAL`) or switch to OpenBLAS.
Tests fail on CPU build	Known pre-existing failures: `cunmrq`, `sorg2r`, `sorgr2`, `cungr2`, `cung2r`, `sormrq`, `heevx` (segfault).

Interface Levels

MFI exposes four interface levels for BLAS, from bare-metal to fully modern:

Level	Example	Arguments
Raw F77	`call cgemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N)`	13
Improved F77	`call f77_gemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N)`	13 (no `c`/`d`/`s`/`z` prefix)
MFI typed	`call mfi_sgemm(A, B, C)`	3 (type-specific)
MFI generic	`call mfi_gemm(A, B, C)`	3 (type-agnostic)

For full API documentation, see the generated reference.

Status	Name	Description
:+1:	asum	Sum of vector magnitudes
:+1:	axpy	Scalar-vector product
:+1:	copy	Copy vector
:+1:	dot	Dot product
:+1:	dotc	Dot product conjugated
:+1:	dotu	Dot product unconjugated
f77	sdsdot	Extended precision inner product
f77	dsdot	Extended precision inner product with double result
:+1:	nrm2	Vector 2-norm (Euclidean norm)
:+1:	rot	Plane rotation
:+1:	rotg	Generate Givens rotation
:+1:	rotm	Modified Givens rotation
:+1:	rotmg	Generate modified Givens rotation
:+1:	scal	Vector-scalar product
:+1:	swap	Vector-vector swap

Level 1 — Extensions

Click to expand

Status	Name	Description
:+1:	iamax	Index of maximum absolute value element
:+1:	iamin	Index of minimum absolute value element
:+1:	lamch	Machine precision parameters

Level 2

Click to expand

Status	Name	Description
:+1:	gbmv	Matrix-vector product (general band)
:+1:	gemv	Matrix-vector product (general)
:+1:	ger	Rank-1 update (general)
:+1:	gerc	Rank-1 update (general, conjugated)
:+1:	geru	Rank-1 update (general, unconjugated)
:+1:	hbmv	Matrix-vector product (Hermitian band)
:+1:	hemv	Matrix-vector product (Hermitian)
:+1:	her	Rank-1 update (Hermitian)
:+1:	her2	Rank-2 update (Hermitian)
:+1:	hpmv	Matrix-vector product (Hermitian packed)
:+1:	hpr	Rank-1 update (Hermitian packed)
:+1:	hpr2	Rank-2 update (Hermitian packed)
:+1:	sbmv	Matrix-vector product (symmetric band)
:+1:	spmv	Matrix-vector product (symmetric packed)
:+1:	spr	Rank-1 update (symmetric packed)
:+1:	spr2	Rank-2 update (symmetric packed)
:+1:	symv	Matrix-vector product (symmetric)
:+1:	syr	Rank-1 update (symmetric)
:+1:	syr2	Rank-2 update (symmetric)
:+1:	tbmv	Matrix-vector product (triangular band)
:+1:	tbsv	Solve (triangular band)
:+1:	tpmv	Matrix-vector product (triangular packed)
:+1:	tpsv	Solve (triangular packed)
:+1:	trmv	Matrix-vector product (triangular)
:+1:	trsv	Solve (triangular)

Level 3

Click to expand

Status	GPU	Name	Description
:+1:	:white_check_mark:	gemm	General matrix-matrix product
:+1:	:white_check_mark:	hemm	Hermitian × general matrix product
:+1:		herk	Hermitian rank-k update
:+1:		her2k	Hermitian rank-2k update
:+1:	:white_check_mark:	symm	Symmetric × general matrix product
:+1:		syrk	Symmetric rank-k update
:+1:		syr2k	Symmetric rank-2k update
:+1:	:white_check_mark:	trmm	Triangular × general matrix product
:+1:	:white_check_mark:	trsm	Solve with triangular matrix

LAPACK

LAPACK coverage is growing — routines are implemented as needed.

Factorization and Solve

Click to expand

Status	Name	Description
:+1:	geqrf	QR factorization
:+1:	gerqf	RQ factorization
:+1:	getrf	LU factorization
:+1:	getri	Matrix inverse (from LU)
:+1:	getrs	Solve with LU-factored matrix
:+1:	gesv	Solve linear system (LU + solve)
:+1:	hetrf	Bunch-Kaufman factorization (Hermitian)
:+1:	pocon	Condition number estimate (Cholesky)
:+1:	potrf	Cholesky factorization
:+1:	potri	Matrix inverse (from Cholesky)
:+1:	potrs	Solve with Cholesky-factored matrix
:+1:	sytrf	Bunch-Kaufman factorization (symmetric)
:+1:	trtrs	Solve with triangular matrix

Orthogonal / Unitary Factors

Click to expand

Status	Name	Description
:+1:	orgqr	Generate Q from QR (real)
:+1:	orgrq	Generate Q from RQ (real)
:+1:	ormqr	Multiply by Q from QR (real)
f77	ormrq	Multiply by Q from RQ (real)
:+1:	org2r	Generate Q from QR2 (real)
:+1:	orm2r	Multiply by Q from QR2 (real)
:+1:	orgr2	Generate Q from RQ2 (real)
:+1:	ormr2	Multiply by Q from RQ2 (real)
:+1:	ungqr	Generate Q from QR (complex)
:+1:	ungrq	Generate Q from RQ (complex)
:+1:	unmqr	Multiply by Q from QR (complex)
f77	unmrq	Multiply by Q from RQ (complex)
:+1:	ung2r	Generate Q from QR2 (complex)
:+1:	unm2r	Multiply by Q from QR2 (complex)
:+1:	ungr2	Generate Q from RQ2 (complex)
:+1:	unmr2	Multiply by Q from RQ2 (complex)

Eigenvalues and SVD

Click to expand

Status	Name	Description
:+1:	gesvd	Singular value decomposition
:+1:	heevd	Hermitian eigenvalues (divide & conquer)
:+1:	hegvd	Generalized Hermitian eigenproblem (divide & conquer)
:+1:	heevr	Hermitian eigenvalues (relatively robust)
f77	heevx	Hermitian eigenvalues (expert)

Least Squares

Click to expand

Status	Name	Description
f77	gels	Least squares (QR/LQ)
f77	gelst	Least squares (QR/LQ, T matrix)
f77	gelss	Least squares (SVD, QR iteration)
f77	gelsd	Least squares (SVD, divide & conquer)
f77	gelsy	Least squares (complete orthogonal)
f77	getsls	Least squares (tall-skinny QR/LQ)
f77	gglse	Equality-constrained least squares
f77	ggglm	Gauss-Markov linear model

Auxiliary

Name	Types	Description
mfi_lartg	s, d, c, z	Generate plane rotation

Continuous Integration

CI uses Nix flakes with magic-nix-cache-action for fast, reproducible builds.

Event	Behavior
Push to `main`	Full test matrix + deploy to `mfi-fpm`
Push to `impl/cublas`	Full test matrix + deploy to `mfi-cublas`
PR to `main`	Full test matrix
Manual dispatch	Full test matrix