MFI

May 13, 2026

Modern Fortran interfaces to BLAS and LAPACK

MFI provides generic, type-agnostic wrappers around BLAS and LAPACK routines. Instead of writing type-specific calls with a dozen or more arguments, you write one call that works for real(real32), real(real64), complex(real32), and complex(real64).

Example: $C = A \cdot B$

program main
    use mfi_blas, only: mfi_gemm
    implicit none
    real :: A(4,4), B(4,4), C(4,4)
    ! ... fill A and B ...
    call mfi_gemm(A, B, C)   ! That's it. No leading dims, no m/n/k, no alpha/beta.
end program

Quick Start

git clone https://github.com/14NGiestas/mfi.git
cd mfi

# Enter ONE of the following dev shells:
nix develop                # CPU-only shell with gfortran, fpm, fypp, BLAS, LAPACK
nix develop .#gpu-modern   # with CUDA 12.3
nix develop .#gpu-legacy   # with CUDA 11.8
nix develop .#gpu-zluda    # AMD GPU via ZLUDA (pkgs.zluda from nixpkgs)

make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite

Requires Nix with flakes enabled.

Manual setup

| Tool | Minimum version |
|------|-----------------|
| fpm | ≥ 0.13.0 |
| fypp | any |
| Fortran compiler | gfortran 12+ (recommended) |

Install fypp with pip:

pip install fypp

Install BLAS and LAPACK from your package manager:

| Distro | Package |
|--------|---------|
| Arch | openblas-lapack-static (AUR) |
| Ubuntu/Debian | libblas-dev liblapack-dev |
| Fedora | openblas-devel lapack-devel |

Build & Test

git clone https://github.com/14NGiestas/mfi.git
cd mfi
make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite

Using MFI as a Dependency

Add to your project's fpm.toml:

# CPU-only (stable)
[dependencies]
mfi = { git = "https://github.com/14NGiestas/mfi.git", branch = "mfi-fpm" }

That's all — fpm handles the rest. No make needed in your own project.


GPU Acceleration with cuBLAS

MFI can transparently dispatch BLAS calls to cuBLAS when compiled with the cublas feature. The same mfi_gemm, mfi_gemv, etc. calls run on the GPU without code changes.

A Colab notebook is available for trying the GPU build in your browser.

Local build with cuBLAS

make
fpm build --profile cublas
fpm test --profile cublas

Runtime CPU / GPU switching

MFI uses lazy initialization — no setup code is needed. When compiled with the cublas feature, GPU dispatching is controlled entirely by the MFI_USE_CUBLAS environment variable:

# CPU (default)
./build/gfortran_*/app/app

# GPU
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app

The same call mfi_gemm(A, B, C) runs on CPU or GPU without any code changes.
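
The environment-variable switch described above can be pictured with a small sketch in standard Fortran. This is not MFI's implementation: the module demo_backend, the functions flag_is_on and gpu_requested, and the rule that only the value 1 enables the GPU are all assumptions made for illustration. It shows the lazy part: the variable is read once, on the first query, and the answer is cached.

```fortran
module demo_backend
    implicit none
    private
    public :: flag_is_on, gpu_requested

    logical, save :: initialized = .false.  ! has the environment been read yet?
    logical, save :: use_gpu     = .false.  ! cached decision (CPU by default)

contains

    !> Assumed rule for this sketch: only the value "1" enables the GPU.
    pure logical function flag_is_on(value)
        character(len=*), intent(in) :: value
        flag_is_on = (trim(adjustl(value)) == '1')
    end function flag_is_on

    !> Reads MFI_USE_CUBLAS on the first call only, then returns the cached answer.
    !> Not thread-safe as written; a real library would guard the first read.
    logical function gpu_requested()
        character(len=16) :: buffer
        integer :: status

        if (.not. initialized) then
            call get_environment_variable('MFI_USE_CUBLAS', value=buffer, status=status)
            ! status == 0: variable exists and fits in the buffer.
            ! Unset (1) or truncated (-1) values fall back to CPU.
            use_gpu = .false.
            if (status == 0) use_gpu = flag_is_on(buffer)
            initialized = .true.
        end if
        gpu_requested = use_gpu
    end function gpu_requested

end module demo_backend
```

A dispatching wrapper would call gpu_requested() at the top of each routine, which is why no explicit setup call is needed in user code.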

For OpenMP-parallel programs, also set OMP_NUM_THREADS to pre-allocate per-thread cuBLAS handles:

MFI_USE_CUBLAS=1 OMP_NUM_THREADS=8 ./build/gfortran_*/app/app

Manual CPU/GPU switching (advanced)

If you need fine-grained control within a single program (e.g., run most computations on GPU but force a specific call to CPU), use mfi_force_gpu / mfi_force_cpu:

call mfi_gemm(A, B, C)       ! CPU (default)

call mfi_force_gpu
call mfi_gemm(D, E, F)       ! GPU
call mfi_force_cpu

call mfi_gemm(G, H, I)       ! CPU again

Note: When compiled without the cublas feature, mfi_force_gpu and mfi_force_cpu are no-op stubs — your code compiles and runs normally on CPU without any #ifdef changes. Simply recompile with --profile cublas to activate GPU acceleration.
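
One way to picture the override and its no-op behaviour is a module-level flag guarded by a build-time constant. The sketch below is an illustration under stated assumptions, not MFI's source: demo_switch, demo_force_gpu, demo_force_cpu, demo_on_gpu, and the constant gpu_compiled_in are hypothetical names, and a real build would more likely select the stub through the preprocessor.

```fortran
module demo_switch
    implicit none
    private
    public :: demo_force_gpu, demo_force_cpu, demo_on_gpu, gpu_compiled_in

    ! Hypothetical stand-in for "built with the cublas feature".
    logical, parameter :: gpu_compiled_in = .false.

    logical, save :: on_gpu = .false.   ! current dispatch target

contains

    subroutine demo_force_gpu()
        ! In a CPU-only build this is a no-op: the flag never flips.
        if (gpu_compiled_in) on_gpu = .true.
    end subroutine demo_force_gpu

    subroutine demo_force_cpu()
        on_gpu = .false.
    end subroutine demo_force_cpu

    !> What a dispatching wrapper would query before each BLAS call.
    logical function demo_on_gpu()
        demo_on_gpu = on_gpu
    end function demo_on_gpu

end module demo_switch
```

Because the calling code is identical in both builds, flipping gpu_compiled_in (or, in practice, rebuilding with the GPU feature) changes behaviour without touching any call site.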

Clean shutdown (optional)

Call mfi_cublas_finalize() at program end to release GPU resources. The OS cleans up on exit anyway.


AMD GPU support via ZLUDA

ZLUDA is a drop-in replacement for the CUDA runtime that runs on AMD GPUs using the HIP SDK. Because MFI's GPU backend only uses standard CUDA/cuBLAS APIs (cuda_runtime.h, cublas_v2.h, -lcublas, -lcudart), the existing cublas build works on AMD hardware without any source changes — you just redirect the linker and runtime to ZLUDA's libraries.

Prerequisites

With Nix: the ROCm/HIP userspace stack (rocmPackages.clr for the HIP runtime, rocmPackages.rocm-runtime for the HSA runtime) and CUDA compile-time headers are all provided by the gpu-zluda devShell. You still need to download ZLUDA itself — it is a pre-built binary that cannot currently be built from nixpkgs 24.11 — and point ZLUDA_PATH at its directory before entering the shell. The only host requirement beyond that is the AMD GPU kernel driver (the amdgpu kernel module and firmware), which Nix cannot provide.

Without Nix: install the full ROCm/HIP SDK and download ZLUDA from its releases page.

Linux

With Nix (recommended): ROCm/HIP and CUDA headers are provided automatically. Download ZLUDA from its releases page, then:

ZLUDA_PATH=/path/to/zluda nix develop .#gpu-zluda
make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app

The shell prints a warning and usage hint if ZLUDA_PATH is unset.

Without Nix: after installing the ROCm/HIP SDK and ZLUDA (see Prerequisites above), set the env vars manually:

export CPATH="/path/to/zluda/include:$CPATH"
export LIBRARY_PATH="/path/to/zluda/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/path/to/zluda/lib:$LD_LIBRARY_PATH"

make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app

Windows

Install AMD Software: Adrenalin Edition and the HIP SDK, then use the ZLUDA launcher (recommended) or manually prepend the ZLUDA DLL directory to PATH:

REM recommended: zluda launcher
zluda -- fpm build --profile zluda

REM or manually
set PATH=C:\path\to\zluda;%PATH%
fpm build --profile zluda

Consumer projects on AMD

# AMD GPU via ZLUDA (set env vars before building, LD_LIBRARY_PATH before running)
[dependencies]
mfi = { git = "https://github.com/14NGiestas/mfi.git", branch = "mfi-fpm", features = ["zluda"] }

The zluda and cublas fpm features are defined identically in fpm.toml and compile the same C/Fortran source; only the label differs. features = ["cublas"] works just as well, so use whichever name makes the intent clearer in your project.


Troubleshooting

| Problem | Solution |
|---------|----------|
| CUBLAS_STATUS_NOT_INITIALIZED | cuBLAS handle not created. Set MFI_USE_CUBLAS=1 or call mfi_force_gpu before the first BLAS call. |
| cuda_runtime.h not found | CUDA Toolkit (or ZLUDA headers) not in include path. See gpu_test.ipynb for a Colab setup, or set CPATH to ZLUDA's include/ directory. |
| libcublas.so not found at runtime | LD_LIBRARY_PATH does not include CUDA/ZLUDA libs. Also ensure CPATH and LIBRARY_PATH were set at build time. |
| ZLUDA: HIP_VISIBLE_DEVICES not set | On multi-GPU systems set HIP_VISIBLE_DEVICES=0 (or the desired device index). |
| ZLUDA: silent wrong results | Run with MFI_DEBUG=1 and inspect the output; make sure your ZLUDA build is at least the latest pre-release. |
| i?amin symbols missing | Your BLAS provider lacks extensions. Use the default profile (without MFI_LINK_EXTERNAL) or switch to OpenBLAS. |
| Tests fail on CPU build | Known pre-existing failures: cunmrq, sorg2r, sorgr2, cungr2, cung2r, sormrq, heevx (segfault). |

Interface Levels

MFI exposes four interface levels for BLAS, from bare-metal to fully modern:

| Level | Example | Arguments |
|-------|---------|-----------|
| Raw F77 | call cgemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N) | 13 |
| Improved F77 | call f77_gemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N) | 13 (no c/d/s/z prefix) |
| MFI typed | call mfi_sgemm(A, B, C) | 3 (type-specific) |
| MFI generic | call mfi_gemm(A, B, C) | 3 (type-agnostic) |
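
The step from the typed to the generic level relies on a Fortran generic interface: one name bound to several type-specific procedures, with dimensions read from the array shapes and scalars made optional. The sketch below illustrates that mechanism only. demo_gemm and its specifics are hypothetical names, the intrinsic matmul stands in for the BLAS call, and the defaults alpha = 1 and beta = 0 are assumptions of this sketch rather than a statement about MFI's API.

```fortran
module demo_generic
    use iso_fortran_env, only: real32, real64
    implicit none
    private
    public :: demo_gemm

    ! One generic name; the compiler picks the specific by argument kind.
    interface demo_gemm
        module procedure demo_sgemm, demo_dgemm
    end interface demo_gemm

contains

    subroutine demo_sgemm(a, b, c, alpha, beta)
        real(real32), intent(in)           :: a(:,:), b(:,:)
        real(real32), intent(inout)        :: c(:,:)
        real(real32), intent(in), optional :: alpha, beta
        real(real32) :: al, be
        al = 1.0_real32
        be = 0.0_real32
        if (present(alpha)) al = alpha
        if (present(beta))  be = beta
        ! Shapes come from the assumed-shape arrays: no m/n/k or leading dims.
        if (be == 0.0_real32) then
            c = al * matmul(a, b)            ! C is not read when beta is 0
        else
            c = al * matmul(a, b) + be * c
        end if
    end subroutine demo_sgemm

    subroutine demo_dgemm(a, b, c, alpha, beta)
        real(real64), intent(in)           :: a(:,:), b(:,:)
        real(real64), intent(inout)        :: c(:,:)
        real(real64), intent(in), optional :: alpha, beta
        real(real64) :: al, be
        al = 1.0_real64
        be = 0.0_real64
        if (present(alpha)) al = alpha
        if (present(beta))  be = beta
        if (be == 0.0_real64) then
            c = al * matmul(a, b)
        else
            c = al * matmul(a, b) + be * c
        end if
    end subroutine demo_dgemm

end module demo_generic
```

With this in place, call demo_gemm(A, B, C) compiles for either kind, while passing arrays of mixed kinds is rejected at compile time because no specific procedure matches.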

For full API documentation, see the generated reference.


Supported Routines

BLAS

Level 1

| Status | Name | Description |
|--------|------|-------------|
| :+1: | asum | Sum of vector magnitudes |
| :+1: | axpy | Scalar-vector product added to a vector (y := a*x + y) |
| :+1: | copy | Copy vector |
| :+1: | dot | Dot product |
| :+1: | dotc | Dot product conjugated |
| :+1: | dotu | Dot product unconjugated |
| f77 | sdsdot | Extended precision inner product |
| f77 | dsdot | Extended precision inner product with double result |
| :+1: | nrm2 | Vector 2-norm (Euclidean norm) |
| :+1: | rot | Plane rotation |
| :+1: | rotg | Generate Givens rotation |
| :+1: | rotm | Modified Givens rotation |
| :+1: | rotmg | Generate modified Givens rotation |
| :+1: | scal | Vector-scalar product |
| :+1: | swap | Vector-vector swap |

Level 1 — Extensions

| Status | Name | Description |
|--------|------|-------------|
| :+1: | iamax | Index of maximum absolute value element |
| :+1: | iamin | Index of minimum absolute value element |
| :+1: | lamch | Machine precision parameters |

Level 2

| Status | Name | Description |
|--------|------|-------------|
| :+1: | gbmv | Matrix-vector product (general band) |
| :+1: | gemv | Matrix-vector product (general) |
| :+1: | ger | Rank-1 update (general) |
| :+1: | gerc | Rank-1 update (general, conjugated) |
| :+1: | geru | Rank-1 update (general, unconjugated) |
| :+1: | hbmv | Matrix-vector product (Hermitian band) |
| :+1: | hemv | Matrix-vector product (Hermitian) |
| :+1: | her | Rank-1 update (Hermitian) |
| :+1: | her2 | Rank-2 update (Hermitian) |
| :+1: | hpmv | Matrix-vector product (Hermitian packed) |
| :+1: | hpr | Rank-1 update (Hermitian packed) |
| :+1: | hpr2 | Rank-2 update (Hermitian packed) |
| :+1: | sbmv | Matrix-vector product (symmetric band) |
| :+1: | spmv | Matrix-vector product (symmetric packed) |
| :+1: | spr | Rank-1 update (symmetric packed) |
| :+1: | spr2 | Rank-2 update (symmetric packed) |
| :+1: | symv | Matrix-vector product (symmetric) |
| :+1: | syr | Rank-1 update (symmetric) |
| :+1: | syr2 | Rank-2 update (symmetric) |
| :+1: | tbmv | Matrix-vector product (triangular band) |
| :+1: | tbsv | Solve (triangular band) |
| :+1: | tpmv | Matrix-vector product (triangular packed) |
| :+1: | tpsv | Solve (triangular packed) |
| :+1: | trmv | Matrix-vector product (triangular) |
| :+1: | trsv | Solve (triangular) |

Level 3

| Status | GPU | Name | Description |
|--------|-----|------|-------------|
| :+1: | :white_check_mark: | gemm | General matrix-matrix product |
| :+1: | :white_check_mark: | hemm | Hermitian × general matrix product |
| :+1: | | herk | Hermitian rank-k update |
| :+1: | | her2k | Hermitian rank-2k update |
| :+1: | :white_check_mark: | symm | Symmetric × general matrix product |
| :+1: | | syrk | Symmetric rank-k update |
| :+1: | | syr2k | Symmetric rank-2k update |
| :+1: | :white_check_mark: | trmm | Triangular × general matrix product |
| :+1: | :white_check_mark: | trsm | Solve with triangular matrix |

LAPACK

LAPACK coverage is growing — routines are implemented as needed.

Factorization and Solve

| Status | Name | Description |
|--------|------|-------------|
| :+1: | geqrf | QR factorization |
| :+1: | gerqf | RQ factorization |
| :+1: | getrf | LU factorization |
| :+1: | getri | Matrix inverse (from LU) |
| :+1: | getrs | Solve with LU-factored matrix |
| :+1: | gesv | Solve linear system (LU + solve) |
| :+1: | hetrf | Bunch-Kaufman factorization (Hermitian) |
| :+1: | pocon | Condition number estimate (Cholesky) |
| :+1: | potrf | Cholesky factorization |
| :+1: | potri | Matrix inverse (from Cholesky) |
| :+1: | potrs | Solve with Cholesky-factored matrix |
| :+1: | sytrf | Bunch-Kaufman factorization (symmetric) |
| :+1: | trtrs | Solve with triangular matrix |

Orthogonal / Unitary Factors

| Status | Name | Description |
|--------|------|-------------|
| :+1: | orgqr | Generate Q from QR (real) |
| :+1: | orgrq | Generate Q from RQ (real) |
| :+1: | ormqr | Multiply by Q from QR (real) |
| f77 | ormrq | Multiply by Q from RQ (real) |
| :+1: | org2r | Generate Q from QR2 (real) |
| :+1: | orm2r | Multiply by Q from QR2 (real) |
| :+1: | orgr2 | Generate Q from RQ2 (real) |
| :+1: | ormr2 | Multiply by Q from RQ2 (real) |
| :+1: | ungqr | Generate Q from QR (complex) |
| :+1: | ungrq | Generate Q from RQ (complex) |
| :+1: | unmqr | Multiply by Q from QR (complex) |
| f77 | unmrq | Multiply by Q from RQ (complex) |
| :+1: | ung2r | Generate Q from QR2 (complex) |
| :+1: | unm2r | Multiply by Q from QR2 (complex) |
| :+1: | ungr2 | Generate Q from RQ2 (complex) |
| :+1: | unmr2 | Multiply by Q from RQ2 (complex) |

Eigenvalues and SVD

| Status | Name | Description |
|--------|------|-------------|
| :+1: | gesvd | Singular value decomposition |
| :+1: | heevd | Hermitian eigenvalues (divide & conquer) |
| :+1: | hegvd | Generalized Hermitian eigenproblem (divide & conquer) |
| :+1: | heevr | Hermitian eigenvalues (relatively robust) |
| f77 | heevx | Hermitian eigenvalues (expert) |

Least Squares

| Status | Name | Description |
|--------|------|-------------|
| f77 | gels | Least squares (QR/LQ) |
| f77 | gelst | Least squares (QR/LQ, T matrix) |
| f77 | gelss | Least squares (SVD, QR iteration) |
| f77 | gelsd | Least squares (SVD, divide & conquer) |
| f77 | gelsy | Least squares (complete orthogonal) |
| f77 | getsls | Least squares (tall-skinny QR/LQ) |
| f77 | gglse | Equality-constrained least squares |
| f77 | ggglm | Gauss-Markov linear model |

Auxiliary

| Name | Types | Description |
|------|-------|-------------|
| mfi_lartg | s, d, c, z | Generate plane rotation |

Continuous Integration

CI uses Nix flakes with magic-nix-cache-action for fast, reproducible builds.

| Event | Behavior |
|-------|----------|
| Push to main | Full test matrix + deploy to mfi-fpm |
| Push to impl/cublas | Full test matrix + deploy to mfi-cublas |
| PR to main | Full test matrix |
| Manual dispatch | Full test matrix |