High Performance Convolution Bloom On Unity

April 30, 2025

This project implements a high-quality bloom effect using Fast Fourier Transform (FFT) convolution, providing customizable bloom with optimized performance. It achieves performance parity with Unreal Engine's convolution bloom effect while offering greater flexibility and additional optimization options.
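As a rough illustration of the idea behind FFT convolution, the sketch below convolves a 1D signal with a kernel by transforming both, multiplying pointwise, and transforming back, then checks the result against a direct circular convolution. It is a minimal Python sketch, not the project's compute shader code; the function names and the radix-2 recursion are assumptions made for the example.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def circular_convolve_fft(signal, kernel):
    """Convolve by pointwise multiplication in the frequency domain."""
    n = len(signal)
    fs = fft([complex(v) for v in signal])
    fk = fft([complex(v) for v in kernel])
    product = [a * b for a, b in zip(fs, fk)]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v.real / n for v in fft(product, inverse=True)]

def circular_convolve_direct(signal, kernel):
    """Reference O(n^2) circular convolution."""
    n = len(signal)
    return [sum(signal[j] * kernel[(i - j) % n] for j in range(n))
            for i in range(n)]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
fast = circular_convolve_fft(signal, kernel)
slow = circular_convolve_direct(signal, kernel)
print(all(abs(a - b) < 1e-9 for a, b in zip(fast, slow)))
```

For an image, the same steps run along rows and then columns, which is why the benchmarks below report a horizontal FFT and a vertical FFT combined with the multiplication.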

Unity Version: 2022.3.8f1c1

Blog: https://zhuanlan.zhihu.com/p/1900864922390343758

[Image: bloom sample 1]

[Image: bloom sample 2]

Convolution Benchmark

Convolution performance was measured with the Unity Profiler, using the GPU Profiler timings.

Each test executes 20 convolutions per frame and reports the average time per convolution. The kernel FFT is not included.

Read/write texture format: ARGBHalf.

Device: NVIDIA GeForce MX450.

Dispatch Merge Performance Comparison

| Scale | Strategy | Mode | Average Horizontal FFT (ms) | Average Vertical FFT + Mul (ms) | Average Convolution (ms) |
|---|---|---|---|---|---|
| 1296x1296 | 9,6,6,4 inplace | Gray-scale | 1.151 | 0.465 | 1.616 |
| 1024x1024 | 16,16,4 inplace | Gray-scale | 0.781 | 0.249 | 1.030 |
| 1024x1024 | 16,16,4 inplace | 4-Channel | 0.779 | 0.415 | 1.195 |
| 972x972 | 9,3,6,6 inplace | Gray-scale | 0.663 | 0.223 | 0.886 |
| 972x972 | 9,3,6,6 inplace | 4-Channel | 0.664 | 0.367 | 1.031 |
| 729x729 | 9,9,9 inplace | Gray-scale | 0.373 | 0.101 | 0.474 |
| 729x729 | 9,9,9 inplace | 4-Channel | 0.369 | 0.169 | 0.537 |
| 512x512 | 8,8,8 inplace | Gray-scale | 0.202 | 0.046 | 0.249 |
| 512x512 | 8,8,8 inplace | 4-Channel | 0.200 | 0.063 | 0.263 |

For rows marked "inplace !", padding optimization cannot be applied during the merged convolution operation because of the group shared memory size limit.

| Scale | Strategy | Mode | Average Horizontal FFT (ms) | Average Vertical FFT + Mul (ms) | Average Convolution (ms) | Ratio |
|---|---|---|---|---|---|---|
| 1296x1296 | 9,6,6,4 inplace | Gray-scale | 0.598 | 0.498 | 1.096 | 68% |
| 1024x1024 | 16,16,4 inplace | Gray-scale | 0.400 | 0.343 | 0.743 | 72% |
| 1024x1024 | 16,16,4 inplace ! | 4-Channel | 0.393 | 0.768 | 1.161 | 97% |
| 972x972 | 9,3,6,6 inplace | Gray-scale | 0.318 | 0.273 | 0.590 | 67% |
| 972x972 | 9,3,6,6 inplace | 4-Channel | 0.339 | 0.365 | 0.705 | 68% |
| 729x729 | 9,9,9 inplace | Gray-scale | 0.192 | 0.121 | 0.314 | 66% |
| 729x729 | 9,9,9 inplace | 4-Channel | 0.192 | 0.177 | 0.369 | 69% |
| 512x512 | 8,8,8 inplace | Gray-scale | 0.102 | 0.083 | 0.185 | 75% |
| 512x512 | 8,8,8 inplace | 4-Channel | 0.107 | 0.087 | 0.194 | 74% |
| 256x256 | 16,16 outplace | Gray-scale | 0.041 | 0.021 | 0.061 | - |
| 256x256 | 16,16 outplace & inplace | 4-Channel | 0.033 | 0.050 | 0.083 | - |
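The Ratio column is not defined in the text. Judging from the numbers, it appears to be the merged average convolution time as a percentage of the unmerged time from the first table. The short Python sketch below reproduces three of the published ratios under that assumption; the function name is hypothetical.

```python
def merge_ratio(merged_ms, unmerged_ms):
    """Merged time as a whole-number percentage of the unmerged time."""
    return round(100 * merged_ms / unmerged_ms)

# (scale, mode): (unmerged average convolution ms, merged average convolution ms),
# copied from the two tables above.
samples = {
    ("1296x1296", "Gray-scale"): (1.616, 1.096),
    ("1024x1024", "4-Channel"): (1.195, 1.161),
    ("729x729", "Gray-scale"): (0.474, 0.314),
}

ratios = {key: merge_ratio(merged, unmerged)
          for key, (unmerged, merged) in samples.items()}
for key, ratio in ratios.items():
    print(key, f"{ratio}%")
```

Under this reading, a lower ratio means dispatch merging saved more time, and the 97% row is the "inplace !" case where padding optimization was not possible.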

[Image: dispatch merge]

Common Configuration

Below are performance test results for rectangular sizes that are closer to typical screen aspect ratios. The second result column reflects an optimization that uses 20% vertical-length padding. Since the padding size must be tuned to the shape of the convolution kernel, the "Optimized" results are for reference only.

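| Scale | Mode | Convolution Average (ms) | Convolution (20% Padding Optimization) Average (ms) | Ratio |
|---|---|---|---|---|
| 512x256 | Gray-scale | 0.117 | 0.109 | 93% |
| 512x256 | 4-Channel | 0.125 | 0.124 | 99% |
| 729x512 | Gray-scale | 0.252 | 0.224 | 89% |
| 729x512 | 4-Channel | 0.255 | 0.222 | 87% |
| 927x512 | Gray-scale | 0.333 | 0.293 | 88% |
| 927x512 | 4-Channel | 0.337 | 0.339 | 101% |
| 972x729 | Gray-scale | 0.412 | 0.356 | 86% |
| 972x729 | 4-Channel | 0.489 | 0.406 | 83% |
| 1024x512 | Gray-scale | 0.369 | 0.326 | 88% |
| 1024x512 | 4-Channel | 0.370 | 0.357 | 97% |
| 1296x729 | Gray-scale | 0.552 | 0.484 | 88% |
| 1296x729 | 4-Channel | 0.659 | 0.558 | 85% |
| 1620x972 | Gray-scale | 1.053 | 0.933 | 89% |
| 1620x972 | 4-Channel | 1.187 | 1.058 | 89% |
| 2048x972 | Gray-scale | 1.959 | 1.684 | 86% |
| 2048x972 | 4-Channel | 2.141 | 1.844 | 86% |
| 2048x1024 | Gray-scale | 2.140 | 1.828 | 85% |
| 2048x1024 | 4-Channel | 2.891 | 2.575 | 89% |
| 2048x1296 | Gray-scale | 2.612 | 2.216 | 85% |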

Note: Unity's default bloom takes 0.164 ms on my device.

[Image: convolution performance]

FFT Benchmark

  • Strategies such as R8+R2 represent shorthand for a combination of Radix-8 and Radix-2 decomposition strategies.
  • R/W Only refers to the read and write overhead of global memory (RWTexture) and group shared memory.
  • Combinations marked with * in the table indicate internal decomposition optimizations.
  • (pad) denotes padding and remapping of indices for group shared memory.
  • Padding for group shared memory involves inserting an empty element every $15$ elements.
  • (perm), short for permute, indicates task reordering for threads.
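To make the (pad) idea concrete, here is a small Python sketch of one way to remap indices so that an empty slot follows every 15 elements, together with a simple bank-conflict count. The bank count of 32, the worst-case stride of 32, and the helper names are assumptions for illustration; the project's actual shader remapping may differ.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 shared memory banks, typical for NVIDIA GPUs
PAD_PERIOD = 15  # from the notes above: one empty element after every 15 elements

def padded_index(i, period=PAD_PERIOD):
    """Remap a logical index to a slot in the padded shared memory array."""
    return i + i // period

def conflict_degree(addresses, num_banks=NUM_BANKS):
    """Largest number of addresses that fall into the same bank."""
    return max(Counter(a % num_banks for a in addresses).values())

# Hypothetical worst case: 32 threads each access one element, 32 elements apart.
logical = [t * 32 for t in range(32)]
unpadded = conflict_degree(logical)
padded = conflict_degree([padded_index(a) for a in logical])
print("unpadded:", unpadded, "padded:", padded)
```

In this simplified model the padding spreads the strided accesses over many banks instead of one, at the cost of a slightly larger shared memory array.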

1024x1024

The table and figure below show the performance test results for a $1024 \times 1024$ image under different combinations.

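| Decomposition Strategy | Pass | Memory Access Strategy | Total Shader Time (ms) | Average FFT+IFFT Time (ms) | Average Single-Channel FFT Time (ms) | Average FFT+IFFT Computation Time (ms) | Normalized Time |
|---|---|---|---|---|---|---|---|
| Empty | - | - | 0.730 | 0.037 | 0.005 | - | 3.481 |
| R/W Only | - | - | 12.704 | 0.635 | 0.079 | - | 60.577 |
| R2 | 10 | Out-of-Place | 23.628 | 1.181 | 0.148 | 0.546 | 112.667 |
| R4 | 5 | Out-of-Place | 17.549 | 0.877 | 0.110 | 0.242 | 83.680 |
| R8+R2 | 4 | Out-of-Place | 17.775 | 0.889 | 0.111 | 0.254 | 84.758 |
| R16+R4 | 3 | Out-of-Place | 52.931 | 2.647 | 0.331 | 2.011 | 252.395 |
| R4* | 5 | Out-of-Place | 17.023 | 0.851 | 0.106 | 0.216 | 81.172 |
| R8*+R2 | 4 | Out-of-Place | 15.510 | 0.776 | 0.097 | 0.140 | 73.957 |
| R16+R4 | 3 | Out-of-Place | 15.625 | 0.781 | 0.098 | 0.146 | 74.506 |
| R32* | 2 | Out-of-Place | 991.962 | 49.598 | 6.200 | 48.963 | 4730.043 |
| R2 | 10 | In-Place | 96.542 | 4.827 | 0.603 | 4.192 | 460.348 |
| R4 | 5 | In-Place | 50.474 | 2.524 | 0.315 | 1.889 | 240.679 |
| R8+R2 | 4 | In-Place | 40.667 | 2.033 | 0.254 | 1.398 | 193.915 |
| R16+R4 | 3 | In-Place | 57.606 | 2.880 | 0.360 | 2.245 | 274.687 |
| R4* | 5 | In-Place | 50.523 | 2.526 | 0.316 | 1.891 | 240.912 |
| R8*+R2 | 4 | In-Place | 42.585 | 2.129 | 0.266 | 1.494 | 203.061 |
| R16*+R4 | 3 | In-Place | 33.072 | 1.654 | 0.207 | 1.018 | 157.700 |
| R32* | 2 | In-Place | 279.489 | 13.974 | 1.747 | 13.339 | 1332.707 |
| R2 | 10 | In-Place(pad) | 36.572 | 1.829 | 0.229 | 1.193 | 174.389 |
| R4 | 5 | In-Place(pad) | 19.863 | 0.993 | 0.124 | 0.358 | 94.714 |
| R8+R2 | 4 | In-Place(pad) | 28.530 | 1.427 | 0.178 | 0.791 | 136.042 |
| R16+R4 | 3 | In-Place(pad) | 54.577 | 2.729 | 0.341 | 2.094 | 260.243 |
| R4* | 5 | In-Place(pad) | 19.749 | 0.987 | 0.123 | 0.352 | 94.171 |
| R8*+R2 | 4 | In-Place(pad) | 18.307 | 0.915 | 0.114 | 0.280 | 87.295 |
| R16*+R4 | 3 | In-Place(pad) | 16.458 | 0.823 | 0.103 | 0.188 | 78.478 |
| R32* | 2 | In-Place(pad) | 250.572 | 12.529 | 1.566 | 11.893 | 1194.820 |
| R2 | 10 | In-Place(perm) | 31.037 | 1.552 | 0.194 | 0.917 | 147.996 |
| R4 | 5 | In-Place(perm) | 24.977 | 1.249 | 0.156 | 0.614 | 119.100 |
| R8+R2 | 4 | In-Place(perm) | 30.036 | 1.502 | 0.188 | 0.867 | 143.223 |
| R16+R4 | 3 | In-Place(perm) | 54.603 | 2.730 | 0.341 | 2.095 | 260.367 |
| R4* | 5 | In-Place(perm) | 24.848 | 1.242 | 0.155 | 0.607 | 118.484 |
| R8*+R2 | 4 | In-Place(perm) | 29.859 | 1.493 | 0.187 | 0.858 | 142.379 |
| R16*+R4 | 3 | In-Place(perm) | 28.573 | 1.429 | 0.179 | 0.793 | 136.247 |
| R32* | 2 | In-Place(perm) | 297.053 | 14.853 | 1.857 | 14.217 | 1416.459 |
| R2 | 10 | In-Place(perm+pad) | 32.239 | 1.612 | 0.201 | 0.977 | 153.728 |
| R4 | 5 | In-Place(perm+pad) | 19.028 | 0.951 | 0.119 | 0.316 | 90.733 |
| R8+R2 | 4 | In-Place(perm+pad) | 25.001 | 1.250 | 0.156 | 0.615 | 119.214 |
| R16+R4 | 3 | In-Place(perm+pad) | 53.336 | 2.667 | 0.333 | 2.032 | 254.326 |
| R4* | 5 | In-Place(perm+pad) | 18.977 | 0.949 | 0.119 | 0.314 | 90.489 |
| R8*+R2 | 4 | In-Place(perm+pad) | 16.808 | 0.840 | 0.105 | 0.205 | 80.147 |
| R16*+R4 | 3 | In-Place(perm+pad) | 15.672 | 0.784 | 0.098 | 0.148 | 74.730 |
| R32* | 2 | In-Place(perm+pad) | 244.572 | 12.229 | 1.529 | 11.593 | 1166.210 |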
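The derived columns in the table above appear to follow from the total shader time by simple arithmetic. The relationships in this Python sketch are inferred from the published numbers rather than stated by the project: 20 FFT+IFFT iterations per measurement, 8 single-channel transforms per iteration (4 channels, forward and inverse), and computation time taken as the average minus the "R/W Only" average. The constant and function names are hypothetical.

```python
ITERATIONS = 20  # assumption: 20 FFT+IFFT runs per measurement, inferred from the data
TRANSFORMS = 8   # assumption: 4 channels x (FFT + IFFT)

def derive(total_ms, rw_only_total_ms):
    """Return (average FFT+IFFT, average single-channel FFT, computation) in ms."""
    average = total_ms / ITERATIONS
    single_channel = average / TRANSFORMS
    computation = average - rw_only_total_ms / ITERATIONS
    return round(average, 3), round(single_channel, 3), round(computation, 3)

# R2, Out-of-Place row against the R/W Only baseline (12.704 ms total).
r2_out_of_place = derive(23.628, 12.704)
# R16*+R4, In-Place(perm+pad) row against the same baseline.
r16_perm_pad = derive(15.672, 12.704)
print(r2_out_of_place)
print(r16_perm_pad)
```

Read this way, the computation column isolates the arithmetic cost of a strategy from the memory traffic that every strategy has to pay.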

729x729

($3^6 = 729$)

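| Decomposition Strategy | Pass | Memory Access Strategy | Total Shader Time (ms) | Average FFT+IFFT Time (ms) | Average Single-Channel FFT Time (ms) | Average FFT+IFFT Computation Time (ms) | Normalized Time |
|---|---|---|---|---|---|---|---|
| Empty | - | - | 0.730 | 0.037 | 0.005 | - | 7.222 |
| R/W Only | - | - | 6.522 | 0.326 | 0.041 | - | 64.525 |
| R3 | 6 | Out-of-Place | 9.787 | 0.489 | 0.061 | 0.163 | 96.827 |
| R9 | 3 | Out-of-Place | 12.301 | 0.615 | 0.077 | 0.289 | 121.698 |
| R9* | 3 | Out-of-Place | 8.304 | 0.415 | 0.052 | 0.089 | 82.155 |
| R27* | 2 | Out-of-Place | 355.409 | 17.770 | 2.221 | 17.444 | 3516.196 |
| R3 | 6 | In-Place | 8.053 | 0.403 | 0.050 | 0.077 | 79.671 |
| R9 | 3 | In-Place | 10.671 | 0.534 | 0.067 | 0.207 | 105.572 |
| R9* | 3 | In-Place | 6.909 | 0.345 | 0.043 | 0.019 | 68.353 |
| R27* | 2 | In-Place | 477.22 | 23.861 | 2.983 | 23.535 | 4721.319 |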

972x972

For a $972 \times 972$ image, since $972 = 2^{2} \times 3^{5}$, the FFT decomposition strategy becomes more complex.
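A decomposition is only usable when its radices multiply to the transform size, and each radix corresponds to one pass. The Python sketch below checks a few radix lists for 972 that are consistent with the strategies and pass counts reported in this document. The mapping from shorthand such as R9*+R12* to the list 9,9,12 is an assumption inferred from the pass count, and `pass_count` is a hypothetical helper.

```python
from math import prod

def pass_count(radices, size):
    """Return the number of passes (one per radix) after validating the plan."""
    if prod(radices) != size:
        raise ValueError(f"{radices} multiplies to {prod(radices)}, not {size}")
    return len(radices)

plans = {
    "R9*+R12*": [9, 9, 12],
    "R9*+R3+R6*": [9, 3, 6, 6],
    "R3+R4*": [3, 3, 3, 3, 3, 4],
}
counts = {name: pass_count(radices, 972) for name, radices in plans.items()}
for name, count in counts.items():
    print(name, count)
```

Larger radices reduce the number of passes, which is one reason the mixed strategies are worth benchmarking against plain R3 and R2 passes.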

It is worth noting that the Out-of-Place FFT shows a significant performance drop when using the R9*+R3+R6* decomposition strategy, which is suspected to be caused by compiler optimization issues.

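| Decomposition Strategy | Pass | Memory Access Strategy | Total Shader Time (ms) | Average FFT+IFFT Time (ms) | Average Single-Channel FFT Time (ms) | Average FFT+IFFT Computation Time (ms) | Normalized Time |
|---|---|---|---|---|---|---|---|
| Empty | - | - | 0.730 | 0.037 | 0.005 | - | 3.893 |
| R/W Only | - | - | 11.713 | 0.586 | 0.073 | - | 62.457 |
| R3+R2 | 7 | Out-of-Place | 18.269 | 0.913 | 0.114 | 0.328 | 97.416 |
| R3+R4* | 6 | Out-of-Place | 16.522 | 0.826 | 0.103 | 0.240 | 88.100 |
| R9*+R3+R4* | 4 | Out-of-Place | 15.226 | 0.761 | 0.095 | 0.176 | 81.190 |
| R9*+R3+R6* | 4 | Out-of-Place | 76.786 | 3.839 | 0.480 | 3.254 | 409.447 |
| R9*+R12* | 3 | Out-of-Place | 13.481 | 0.674 | 0.084 | 0.088 | 71.885 |
| R3+R2* | 7 | In-Place | 19.264 | 0.963 | 0.120 | 0.378 | 102.722 |
| R3+R4* | 6 | In-Place | 16.140 | 0.807 | 0.101 | 0.221 | 86.063 |
| R9*+R3+R4* | 4 | In-Place | 14.871 | 0.744 | 0.093 | 0.158 | 79.297 |
| R9*+R3+R6* | 4 | In-Place | 13.865 | 0.693 | 0.087 | 0.108 | 73.932 |
| R9*+R12* | 3 | In-Place | 12.719 | 0.636 | 0.079 | 0.050 | 67.822 |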

Out-of-Place performance is relatively stable. For In-Place, different decomposition orders change the memory access pattern, which in turn affects the probability of bank conflicts.

The figure below shows tests for different decomposition orders of the R3 + R4* combination.

It can be observed that as the R4 Pass is moved earlier, the performance of the In-Place FFT gradually decreases. This is because the R4 Pass introduces a memory access pattern with a factor of 2, increasing the probability of Bank Conflicts in subsequent Passes. Therefore, it is recommended to delay the factor of 2 as much as possible in the decomposition strategy.
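A simplified model helps explain why factors of 2 are harmful here. Assuming 32 shared memory banks (typical for NVIDIA GPUs, but an assumption in this sketch), threads that access memory with a stride sharing a factor with 32 collide in the same banks, while strides that are powers of 3 do not. The Python sketch below counts collisions for plain strided access; the real shader access pattern is more involved, and the function name is hypothetical.

```python
from collections import Counter
from math import gcd

NUM_BANKS = 32  # assumption: 32 shared memory banks

def stride_conflict_degree(stride, threads=32, num_banks=NUM_BANKS):
    """Largest number of threads hitting one bank when thread t reads t * stride."""
    banks = Counter((t * stride) % num_banks for t in range(threads))
    return max(banks.values())

# Strides built only from factors of 3, then strides that include a factor of 4.
odd_strides = [3, 9, 27, 81, 243]
even_strides = [4, 12, 36, 108]
for stride in odd_strides + even_strides:
    print(stride, stride_conflict_degree(stride), gcd(stride, NUM_BANKS))
```

In this model the conflict degree equals the greatest common divisor of the stride and the bank count, so once a radix-4 pass has run, every later stride carries the factor of 4 with it.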

| Decomposition Strategy | Pass | Memory Access Strategy | Total Shader Time (ms) | Average FFT+IFFT Time (ms) | Average Single-Channel FFT Time (ms) | Average FFT+IFFT Computation Time (ms) | Normalized Time |
|---|---|---|---|---|---|---|---|
| 3,3,3,3,3,4 | 4 | In-Place | 16.140 | 0.807 | 0.101 | 0.221 | 86.063 |
| 3,3,3,3,3,4 | 4 | Out-of-Place | 16.522 | 0.826 | 0.103 | 0.240 | 88.100 |
| 3,3,3,3,4,3 | 4 | In-Place | 16.290 | 0.815 | 0.102 | 0.229 | 86.863 |
| 3,3,3,3,4,3 | 4 | Out-of-Place | 15.143 | 0.757 | 0.095 | 0.172 | 80.747 |
| 3,3,3,4,3,3 | 4 | In-Place | 19.024 | 0.951 | 0.119 | 0.366 | 101.442 |
| 3,3,3,4,3,3 | 4 | Out-of-Place | 16.194 | 0.810 | 0.101 | 0.224 | 86.351 |
| 3,3,4,3,3,3 | 4 | In-Place | 22.677 | 1.134 | 0.142 | 0.548 | 120.921 |
| 3,3,4,3,3,3 | 4 | Out-of-Place | 16.277 | 0.814 | 0.102 | 0.228 | 86.794 |
| 3,4,3,3,3,3 | 4 | In-Place | 26.600 | 1.330 | 0.166 | 0.744 | 141.839 |
| 3,4,3,3,3,3 | 4 | Out-of-Place | 16.209 | 0.810 | 0.101 | 0.225 | 86.431 |
| 4,3,3,3,3,3 | 4 | In-Place | 30.851 | 1.543 | 0.193 | 0.957 | 164.507 |
| 4,3,3,3,3,3 | 4 | Out-of-Place | 16.166 | 0.808 | 0.101 | 0.223 | 86.202 |