Bitsliced AEGIS

June 10, 2026

Protected implementations of the AEGIS authenticated encryption algorithms for platforms without hardware AES support.

Side channels are mitigated using the barrel-shiftrows bitsliced representation introduced by Alexandre Adomnicai and Thomas Peyrin, which has proven to be a good fit for AEGIS.

Implemented variants:

  • AEGIS-128L - 16-byte key/nonce, state of 8 AES blocks
  • AEGIS-128X2 - 16-byte key/nonce, parallel variant with a state of 16 AES blocks
  • AEGIS-256 - 32-byte key/nonce, state of 6 AES blocks
  • AEGIS-256X2 - 32-byte key/nonce, parallel variant with a state of 12 AES blocks

With this representation, AEGIS-128L and AEGIS-128X2 are consistently faster than protected AES-128-GCM implementations in the benchmarks below.

ARM Cortex A53:

Algorithm                               Speed (Mb/s)
AES-128-GCM (OpenSSL 3.3, bitsliced)    261
AEGIS-128L (bitsliced)                  423
AEGIS-128L (libaegis, unprotected)      782

Spacemit X60 RISC-V without AES extensions:

Algorithm                                 Speed (Mb/s)
AES-128-GCM (BoringSSL, bitsliced)        137
AES-128-GCM (OpenSSL 3.3, unprotected)    223
AEGIS-128X2 (bitsliced)                   333
AEGIS-128L (bitsliced)                    193
AEGIS-128L (libaegis, unprotected)        198

SiFive U74-MC:

Algorithm                             Speed (Mb/s)
AES-128-GCM (BoringSSL, bitsliced)    130
AEGIS-128X2 (bitsliced)               311
AEGIS-128L (bitsliced)                182
AEGIS-128L (libaegis, unprotected)    507

WebAssembly (Apple M1, baseline+simd128):

Algorithm                             Speed (Mb/s)
AES-128-GCM (BoringSSL, bitsliced)    480
AES-128-GCM (Zig, unprotected)        1040
AEGIS-128X2 (bitsliced)               3154
AEGIS-128L (bitsliced)                3429
AEGIS-128L (libaegis, unprotected)    4232

ARM Cortex M4 (Flipper Zero):

Algorithm                                   Speed (Mb/s)    CpB
AES-128-GCM (fixsliced, protected GHASH)    2.08            246
AES-128-GCM (unprotected, 4 LUTs)           2.46            208
AES-128-GCM (fixsliced, 4-bit LUT GHASH)    2.69            190
AEGIS-128L (bitsliced)                      2.77            185
AEGIS-128L (libaegis, unprotected)          8.28            62
AES-128-GCM (hardware, via AHB2 bus)        11.23           46

Notes on bitslicing AEGIS

The AEGIS-128L state comprises 8 AES blocks, and the AES round function is applied to all 8 simultaneously. This makes the algorithm well suited not only to general bitslicing but also to the barrel-shiftrows representation. AEGIS-128X2 can be bitsliced in the same manner, using 64-bit words to update 16 blocks at once.

The state update function is defined as S_i ← AES(in=S_{(i-1) mod 8}, round_key=S_i) for each block, equivalent to applying a keyless AES round to a rotated state while feeding forward the original state.
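The equivalence stated above can be checked with a small, unprotected reference model. This is only a sketch: the function names (aes_round, update, update_rotated) are hypothetical and not taken from this repository, the S-box is computed from its algebraic definition instead of a bitsliced circuit, and message injection is left out so that the code matches the formula exactly as written here.

```python
def xtime(a):
    """Multiply by x in GF(2^8) modulo the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a

def gf_mul(a, b):
    """Multiply two field elements with shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sbox_entry(x):
    inv = 1
    for _ in range(254):  # x^254 is the inverse of x, and maps 0 to 0
        inv = gf_mul(inv, x)
    # AES affine transformation applied to the inverse
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

SBOX = [sbox_entry(x) for x in range(256)]

def aes_round(block, round_key):
    """One AES round: SubBytes, ShiftRows, MixColumns, then XOR with round_key."""
    s = [SBOX[b] for b in block]
    # ShiftRows on a column-major state: row r is rotated left by r columns
    s = [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]
    out = bytearray(16)
    for c in range(4):
        col = s[4 * c:4 * c + 4]
        for r in range(4):
            out[4 * c + r] = (gf_mul(col[r], 2) ^ gf_mul(col[(r + 1) % 4], 3)
                              ^ col[(r + 2) % 4] ^ col[(r + 3) % 4]
                              ^ round_key[4 * c + r])
    return bytes(out)

def update(state):
    """S_i <- AES(in=S_{(i-1) mod 8}, round_key=S_i), without message injection."""
    return [aes_round(state[(i - 1) % 8], state[i]) for i in range(8)]

def update_rotated(state):
    """Same update: keyless AES round on the rotated state, then feed-forward."""
    rotated = state[-1:] + state[:-1]
    keyless = [aes_round(b, bytes(16)) for b in rotated]
    return [bytes(x ^ y for x, y in zip(k, s)) for k, s in zip(keyless, state)]

# Arbitrary 8-block state used only to compare the two formulations
state = [bytes((7 * i + 16 * j + 1) % 256 for i in range(16)) for j in range(8)]
assert update(state) == update_rotated(state)
```

Both formulations agree because the round key is XORed in as the very last step of the AES round, so a zero key followed by an XOR with the original state gives the same result.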

In the bitsliced representation, rotating the state only requires a bit rotation across all bytes.
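To illustrate why the rotation is cheap, here is a minimal sketch using a deliberately simplified layout, which is an assumption and not the barrel-shiftrows layout used by this code: each of the 128 bit positions of a block gets one 8-bit slice, and bit j of that slice belongs to block j. The helper names are hypothetical.

```python
def pack(blocks):
    """Bitslice 8 blocks of 16 bytes: slices[k] bit j = bit k of block j."""
    slices = []
    for k in range(128):
        byte_idx, bit_idx = divmod(k, 8)
        w = 0
        for j, blk in enumerate(blocks):
            w |= ((blk[byte_idx] >> bit_idx) & 1) << j
        slices.append(w)
    return slices

def unpack(slices):
    """Inverse of pack."""
    blocks = [bytearray(16) for _ in range(8)]
    for k, w in enumerate(slices):
        byte_idx, bit_idx = divmod(k, 8)
        for j in range(8):
            blocks[j][byte_idx] |= ((w >> j) & 1) << bit_idx
    return [bytes(b) for b in blocks]

def rotate_blocks_bitsliced(slices):
    """Move block j to position j+1 (mod 8) by rotating every slice by one bit."""
    return [((w << 1) | (w >> 7)) & 0xFF for w in slices]

# Arbitrary distinct blocks used only for the comparison
blocks = [bytes(16 * j + i for i in range(16)) for j in range(8)]
rotated = unpack(rotate_blocks_bitsliced(pack(blocks)))
# Position i now holds the former block (i - 1) mod 8
assert rotated == blocks[-1:] + blocks[:-1]
```

In this toy layout, the whole state rotation costs one 8-bit rotation per slice and no data moves between slices.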

By default, the state is stored unpacked, and every update packs it, applies the AES round, and unpacks the result. With the -Dkeep-state-bitsliced build option, the state is instead kept in bitsliced form across initialization, associated data absorption, message processing, and finalization, so the update round can be applied without packing and unpacking the full state every time.

In that mode, the keystream, a combination of AES blocks, is not evaluated by unpacking the full state. The implementation computes the required block expressions directly in the packed lanes, then applies a partial unpack only for the output block or blocks. Message input is packed only into the active input lanes before the next round.
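The idea of evaluating the keystream in packed form and unpacking only the output can be sketched as follows. Two things here are assumptions not stated in this document: the toy layout (one 8-bit slice per bit position, with bit j belonging to block j) stands in for the real barrel-shiftrows layout, and the expression z0 = S1 ^ S6 ^ (S2 & S3) is the AEGIS-128L keystream block as defined in the AEGIS specification. All helper names are hypothetical.

```python
def pack(blocks):
    """Bitslice 8 blocks of 16 bytes: slices[k] bit j = bit k of block j."""
    slices = []
    for k in range(128):
        byte_idx, bit_idx = divmod(k, 8)
        w = 0
        for j, blk in enumerate(blocks):
            w |= ((blk[byte_idx] >> bit_idx) & 1) << j
        slices.append(w)
    return slices

def keystream_packed(slices, a, b, c, d):
    """Evaluate S_a ^ S_b ^ (S_c & S_d) inside each slice; the result sits in lane 0."""
    return [((w >> a) ^ (w >> b) ^ ((w >> c) & (w >> d))) & 1 for w in slices]

def unpack_lane(slices, lane):
    """Partial unpack: rebuild a single 16-byte block from one lane."""
    out = bytearray(16)
    for k, w in enumerate(slices):
        out[k // 8] |= ((w >> lane) & 1) << (k % 8)
    return bytes(out)

def keystream_reference(blocks, a, b, c, d):
    """Same expression computed on unpacked blocks, for comparison only."""
    return bytes(x ^ y ^ (u & v)
                 for x, y, u, v in zip(blocks[a], blocks[b], blocks[c], blocks[d]))

# Arbitrary state used only for the comparison
blocks = [bytes((37 * j + 11 * i + 3) % 256 for i in range(16)) for j in range(8)]
z0 = unpack_lane(keystream_packed(pack(blocks), 1, 6, 2, 3), 0)
assert z0 == keystream_reference(blocks, 1, 6, 2, 3)
```

Only one lane is unpacked per output block, instead of all eight lanes of the state.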

These representation changes are costly. However, with 10 8-block AES rounds, AES-128 encrypts only 8 blocks, while AEGIS-128L, which absorbs two blocks per state update, encrypts 20. Additionally, AEGIS provides integrity with minimal overhead, while AES-GCM’s GHASH is costly, especially when neither carryless multiplication instructions nor lookup tables can be used.
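The block counts above follow from simple arithmetic, worked through below. The figure of two 16-byte blocks absorbed per AEGIS-128L update comes from the AEGIS specification and is an assumption as far as this document is concerned; the variable names are illustrative only.

```python
AES128_ROUNDS = 10               # rounds needed to encrypt one AES-128 block
PARALLEL_BLOCKS = 8              # blocks processed by one bitsliced 8-block round
AEGIS128L_BLOCKS_PER_UPDATE = 2  # assumption: 32 bytes absorbed per state update

bitsliced_rounds = 10            # budget: ten 8-block AES rounds

# AES-128: all ten rounds are spent on the same 8 blocks
aes_blocks = (bitsliced_rounds // AES128_ROUNDS) * PARALLEL_BLOCKS

# AEGIS-128L: every 8-block round is one state update
aegis_blocks = bitsliced_rounds * AEGIS128L_BLOCKS_PER_UPDATE

# Single-block AES rounds spent per message block
aes_rounds_per_block = bitsliced_rounds * PARALLEL_BLOCKS / aes_blocks
aegis_rounds_per_block = bitsliced_rounds * PARALLEL_BLOCKS / aegis_blocks

print(aes_blocks, aegis_blocks)                      # 8 20
print(aes_rounds_per_block, aegis_rounds_per_block)  # 10.0 4.0
```

Under these assumptions, AEGIS-128L spends 4 AES rounds per message block against 10 for AES-128, before counting the cost of authentication.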

AEGIS-128X2 can be implemented using 64-bit words, or using two sets of 8 blocks updated alternately, offering a measurable speed advantage over AEGIS-128L on platforms such as WebAssembly and RISC-V, even with 32-bit words.

While a dedicated bitsliced representation could further improve performance, straightforward implementations using existing AES representations enable AEGIS to achieve strong performance with side-channel protection, even on CPUs lacking AES instructions.

In the barrel-shiftrows representation, the four 8-bit-plane groups go through identical, independent sbox circuits. On targets with vector extensions (SSE2, NEON, AltiVec), these four groups are evaluated as the lanes of 4x32-bit vectors rather than relying on autovectorization. The state words are permuted so that the lane vectors are contiguous in memory and the AES round needs no transposes; the scalar code uses the same permuted layout. On WebAssembly, the SIMD path turned out to be slower than scalar code, so it is not used there.

These implementations use the sbox circuits from Maximov & Ekdahl. The table below compares them, on Cortex A53, against the circuits from Jean, Baek, Kim G and Kim J:

Sbox circuit                        AEGIS-128L speed (Mb/s)
Maximov & Ekdahl                    423.02
depth16_RNBP28D_4AD_34NLs_81XORs    414.45
jbkk2_RNBP41D_5AD_32NLs_97XORs      410.53
32ANDs_BPD26D_6AD_32NLs_81XORs      408.49
depth16_BPD15D_4AD_34NLs_100XORs    405.76
32ANDs_BPD18D_6AD_32NLs_93XORs      402.95
jbkk2_BPD19D_5AD_32NLs_122XORs      401.25
jbkk3_RNBP41D_4AD_33NLs_102XORs     400.72
jbkk2_BPD17D_5AD_32NLs_142XORs      395.75
jbkk3_BPD16D_4AD_33NLs_154XORs      376.64

Lastly, side-channel protection is generally unnecessary during decryption, as an adversary cannot observe individual blocks or conduct differential attacks at that stage.

Building

zig build -Drelease=true

This builds a static aegis library along with its headers into zig-out/, as well as a benchmark executable. Add -Dkeep-state-bitsliced=true to keep the state in bitsliced form between calls, and -Dno-vector-sbox=true to force the scalar sbox implementation on platforms with vector extensions. The test suite runs with zig build test -Drelease=true.