High Performance Convolution Bloom On Unity

April 30, 2025 · View on GitHub

This project implements a high-quality bloom effect using Fast Fourier Transform (FFT) convolution, Providing customizable bloom effects with optimized performance. It achieves performance parity with Unreal Engine’s convolution bloom effect while offering greater flexibility and additional optimization options.

Unity Version: 2022.3.8f1c1

Blog: https://zhuanlan.zhihu.com/p/1900864922390343758

bloomsameple1

bloomsample2

Convolution Benchmark

The performance testing of Convolution was conducted using the Unity Profiler, recording GPU Profiler timings.

The testing process involved executing 20 convolution per frame, calculating the average time for per-convolution. Kernel FFT is not included.

Read/Write Texture format ARGBHalf.

Device: NVIDIA GeForce MX450.

Dispatch Merge Performance Comparison

Scale	Strategy	Mode	Average Horizontal FFT (ms)	Average Vertical FFT + Mul (ms)	Average Convolution (ms)
1296x1296	9,6,6,4 inplace	Gray-scale	1.151	0.465	1.616
1024x1024	16,16,4 inplace	Gray-scale	0.781	0.249	1.030
1024x1024	16,16,4 inplace	4-Channel	0.779	0.415	1.195
972x972	9,3,6,6 inplace	Gray-scale	0.663	0.223	0.886
972x972	9,3,6,6 inplace	4-Channel	0.664	0.367	1.031
729x729	9,9,9 inplace	Gray-scale	0.373	0.101	0.474
729x729	9,9,9 inplace	4-Channel	0.369	0.169	0.537
512x512	8,8,8 inplace	Gray-scale	0.202	0.046	0.249
512x512	8,8,8 inplace	4-Channel	0.200	0.063	0.263

In cases where "inplace !" is used, padding optimization cannot be performed during the merged convolution operation due to the limitations of group shared memory size.

Scale	Strategy	Mode	Average Horizontal FFT (ms)	Average Vertical FFT + Mul (ms)	Average Convolution (ms)	Ratio
1296x1296	9,6,6,4 inplace	Gray-scale	0.598	0.498	1.096	68%
1024x1024	16,16,4 inplace	Gray-scale	0.400	0.343	0.743	72%
1024x1024	16,16,4 inplace !	4-Channel	0.393	0.768	1.161	97%
972x972	9,3,6,6 inplace	Gray-scale	0.318	0.273	0.590	67%
972x972	9,3,6,6 inplace	4-Channel	0.339	0.365	0.705	68%
729x729	9,9,9 inplace	Gray-scale	0.192	0.121	0.314	66%
729x729	9,9,9 inplace	4-Channel	0.192	0.177	0.369	69%
512x512	8,8,8 inplace	Gray-scale	0.102	0.083	0.185	75%
512x512	8,8,8 inplace	4-Channel	0.107	0.087	0.194	74%
256x256	16,16 outplace	Gray-scale	0.041	0.021	0.061	-
256x256	16,16 outplace & inplace	4-Channel	0.033	0.050	0.083	-

dispatch merge

Common Configuration

Below are performance test results for screen ratios closer to rectangular shapes. The second set of data reflects the results of a optimization for 20% vertical length padding. Since the size of the padding needs to be customized based on the shape of the convolution kernel, the "Optimized" results are for reference only.

Scale	Mode	Convolution Average(ms)	Convolution(20% Padding Optimization) Average(ms)	Ratio
512x256	Gray-scale	0.117	0.109	93%
512x256	4-Channel	0.125	0.124	99%
729x512	Gray-scale	0.252	0.224	89%
729x512	4-Channel	0.255	0.222	87%
927x512	Gray-scale	0.333	0.293	88%
927x512	4-Channel	0.337	0.339	101%
972x729	Gray-scale	0.412	0.356	86%
972x729	4-Channel	0.489	0.406	83%
1024x512	Gray-scale	0.369	0.326	88%
1024x512	4-Channel	0.370	0.357	97%
1296x729	Gray-scale	0.552	0.484	88%
1296x729	4-Channel	0.659	0.558	85%
1620x972	Gray-scale	1.053	0.933	89%
1620x972	4-Channel	1.187	1.058	89%
2048x972	Gray-scale	1.959	1.684	86%
2048x972	4-Channel	2.141	1.844	86%
2048x1024	Gray-scale	2.140	1.828	85%
2048x1024	4-Channel	2.891	2.575	89%
2048x1296	Gray-scale	2.612	2.216	85%

Note: The performance of Unity default bloom is 0.164ms on my device. convolution pref

FFT Benchmark

Strategies such as R8+R2 represent shorthand for a combination of Radix-8 and Radix-2 decomposition strategies.
R/W Only refers to the read and write overhead of global memory (RWTexture) and group shared memory.
Combinations marked with $*$ in the table indicate internal decomposition optimizations.
(pad) denotes padding and remapping of indices for group shared memory.
Padding for group shared memory involves inserting an empty element every $15$ elements.
(permute) indicates task reordering for threads.

1024x1024

The table and figure below shows the performance test results for a $1024 \times 1024$ image under different combinations.

Decomposition Strategy	Pass	Memory Access Strategy	Total Shader Time (ms)	Average FFT+IFFT Time (ms)	Average Single-Channel FFT Time (ms)	Average FFT+IFFT Computation Time (ms)	Normalized Time
Empty		-	0.730	0.037	0.005	-	3.481
R/W Only		-	12.704	0.635	0.079	-	60.577
R2	10	Out-of-Place	23.628	1.181	0.148	0.546	112.667
R4	5	Out-of-Place	17.549	0.877	0.110	0.242	83.680
R8+R2	4	Out-of-Place	17.775	0.889	0.111	0.254	84.758
R16+R4	3	Out-of-Place	52.931	2.647	0.331	2.011	252.395
R4*	5	Out-of-Place	17.023	0.851	0.106	0.216	81.172
R8*+R2	4	Out-of-Place	15.510	0.776	0.097	0.140	73.957
R16+R4	3	Out-of-Place	15.625	0.781	0.098	0.146	74.506
R32*	2	Out-of-Place	991.962	49.598	6.200	48.963	4730.043
R2	10	In-Place	96.542	4.827	0.603	4.192	460.348
R4	5	In-Place	50.474	2.524	0.315	1.889	240.679
R8+R2	4	In-Place	40.667	2.033	0.254	1.398	193.915
R16+R4	3	In-Place	57.606	2.880	0.360	2.245	274.687
R4*	5	In-Place	50.523	2.526	0.316	1.891	240.912
R8*+R2	4	In-Place	42.585	2.129	0.266	1.494	203.061
R16*+R4	3	In-Place	33.072	1.654	0.207	1.018	157.700
R32*	2	In-Place	279.489	13.974	1.747	13.339	1332.707
R2	10	In-Place(pad)	36.572	1.829	0.229	1.193	174.389
R4	5	In-Place(pad)	19.863	0.993	0.124	0.358	94.714
R8+R2	4	In-Place(pad)	28.530	1.427	0.178	0.791	136.042
R16+R4	3	In-Place(pad)	54.577	2.729	0.341	2.094	260.243
R4*	5	In-Place(pad)	19.749	0.987	0.123	0.352	94.171
R8*+R2	4	In-Place(pad)	18.307	0.915	0.114	0.280	87.295
R16*+R4	3	In-Place(pad)	16.458	0.823	0.103	0.188	78.478
R32*	2	In-Place(pad)	250.572	12.529	1.566	11.893	1194.820
R2	10	In-Place(perm)	31.037	1.552	0.194	0.917	147.996
R4	5	In-Place(perm)	24.977	1.249	0.156	0.614	119.100
R8+R2	4	In-Place(perm)	30.036	1.502	0.188	0.867	143.223
R16+R4	3	In-Place(perm)	54.603	2.730	0.341	2.095	260.367
R4*	5	In-Place(perm)	24.848	1.242	0.155	0.607	118.484
R8*+R2	4	In-Place(perm)	29.859	1.493	0.187	0.858	142.379
R16*+R4	3	In-Place(perm)	28.573	1.429	0.179	0.793	136.247
R32*	2	In-Place(perm)	297.053	14.853	1.857	14.217	1416.459
R2	10	In-Place(perm+pad)	32.239	1.612	0.201	0.977	153.728
R4	5	In-Place(perm+pad)	19.028	0.951	0.119	0.316	90.733
R8+R2	4	In-Place(perm+pad)	25.001	1.250	0.156	0.615	119.214
R16+R4	3	In-Place(perm+pad)	53.336	2.667	0.333	2.032	254.326
R4*	5	In-Place(perm+pad)	18.977	0.949	0.119	0.314	90.489
R8*+R2	4	In-Place(perm+pad)	16.808	0.840	0.105	0.205	80.147
R16*+R4	3	In-Place(perm+pad)	15.672	0.784	0.098	0.148	74.730
R32*	2	In-Place(perm+pad)	244.572	12.229	1.529	11.593	1166.210

729x729

（$3^6 = 729$）

Decomposition Strategy	Pass	Memory Access Strategy	Total Shader Time (ms)	Average FFT+IFFT Time (ms)	Average Single-Channel FFT Time (ms)	Average FFT+IFFT Computation Time (ms)	Normalized Time
Empty		-	0.730	0.037	0.005	-	7.222
R/W Only		-	6.522	0.326	0.041	-	64.525
R3	6	Out-of-Place	9.787	0.489	0.061	0.163	96.827
R9	3	Out-of-Place	12.301	0.615	0.077	0.289	121.698
R9*	3	Out-of-Place	8.304	0.415	0.052	0.089	82.155
R27*	2	Out-of-Place	355.409	17.770	2.221	17.444	3516.196
R3	6	In-Place	8.053	0.403	0.050	0.077	79.671
R9	3	In-Place	10.671	0.534	0.067	0.207	105.572
R9*	3	In-Place	6.909	0.345	0.043	0.019	68.353
R27*	2	In-Place	477.22	23.861	2.983	23.535	4721.319

972x972

For a $972 \times 972 $image size, since \$ 972 = $2^{2}$ \times $3^{5}$$, the FFT decomposition strategy becomes more complex.

It is worth noting that the Out-of-Place FFT shows a significant performance drop when using the R9*+R3+R6* decomposition strategy, which is suspected to be caused by compiler optimization issues.

Decomposition Strategy	Pass	Memory Access Strategy	Total Shader Time (ms)	Average FFT+IFFT Time (ms)	Average Single-Channel FFT Time (ms)	Average FFT+IFFT Computation Time (ms)	Normalized Time
Empty		-	0.730	0.037	0.005	-	3.893
R/W Only		-	11.713	0.586	0.073	-	62.457
R3+R2	7	Out-of-Place	18.269	0.913	0.114	0.328	97.416
R3+R4*	6	Out-of-Place	16.522	0.826	0.103	0.240	88.100
R9+R3+R4	4	Out-of-Place	15.226	0.761	0.095	0.176	81.190
R9+R3+R6	4	Out-of-Place	76.786	3.839	0.480	3.254	409.447
R9+R12	3	Out-of-Place	13.481	0.674	0.084	0.088	71.885
R3+R2*	7	In-Place	19.264	0.963	0.120	0.378	102.722
R3+R4*	6	In-Place	16.140	0.807	0.101	0.221	86.063
R9+R3+R4	4	In-Place	14.871	0.744	0.093	0.158	79.297
R9+R3+R6	4	In-Place	13.865	0.693	0.087	0.108	73.932
R9+R12	3	In-Place	12.719	0.636	0.079	0.050	67.822

Out-of-Place is relatively stable. Different decomposition orders can lead to changes in memory access patterns, which in turn affect the probability of In-Place Bank Conflict occurrences.

The figure below shows tests for different decomposition orders of the R3 + R4* combination.

It can be observed that as the R4 Pass is moved earlier, the performance of the In-Place FFT gradually decreases. This is because the R4 Pass introduces a memory access pattern with a factor of 2, increasing the probability of Bank Conflicts in subsequent Passes. Therefore, it is recommended to delay the factor of 2 as much as possible in the decomposition strategy.

Decomposition Strategy	Pass	Memory Access Strategy	Total Shader Time (ms)	Average FFT+IFFT Time (ms)	Average Single-Channel FFT Time (ms)	Average FFT+IFFT Computation Time (ms)	Normalized Time
3,3,3,3,3,4	4	In-Place	16.140	0.807	0.101	0.221	86.063
3,3,3,3,3,4	4	Out-of-Place	16.522	0.826	0.103	0.240	88.100
3,3,3,3,4,3	4	In-Place	16.290	0.815	0.102	0.229	86.863
3,3,3,3,4,3	4	Out-of-Place	15.143	0.757	0.095	0.172	80.747
3,3,3,4,3,3	4	In-Place	19.024	0.951	0.119	0.366	101.442
3,3,3,4,3,3	4	Out-of-Place	16.194	0.810	0.101	0.224	86.351
3,3,4,3,3,3	4	In-Place	22.677	1.134	0.142	0.548	120.921
3,3,4,3,3,3	4	Out-of-Place	16.277	0.814	0.102	0.228	86.794
3,4,3,3,3,3	4	In-Place	26.600	1.330	0.166	0.744	141.839
3,4,3,3,3,3	4	Out-of-Place	16.209	0.810	0.101	0.225	86.431
4.3,3,3,3,3	4	In-Place	30.851	1.543	0.193	0.957	164.507
4.3,3,3,3,3	4	Out-of-Place	16.166	0.808	0.101	0.223	86.202