ssemathfunextension

August 15, 2016 ยท View on GitHub

SSE2 implementations of sin, cos, exp, log, tan, cot, atan, atan2 The sin, cos, exp and log functions were written by Julien Pommier (see unmodified sse_mathfun.h). The tan, cot, atan, atan2 are written by Tolga Mizrak.

sse_mathfun_extension.h serves as an extension to sse_mathfun.h, implementing tan, cot, atan and atan2. It is written as an extension to sse_mathfun.h instead of modifying it, just because I didn't want to maintain a modified version of the original library. This way switching to a newer version of the original library won't be a hassle.

License: zlib (same as sse_mathfun.h)

Here are the benchmarks on my machine:

Results on a 3.30 GHz AMD FX-6100 Six-Core, compiled with Visual C++ Enterprise 2015 Update 3 (x64)
command line: cl.exe /W4 /DUSE_SSE2 /EHsc- /MD /GS- /Gy /fp:fast /Ox /Oy- /GL /Oi /O2 sse_mathfun_test.c

checking sines on [0*Pi, 1*Pi]
max deviation from sinf(x): 5.96046e-08 at 0.193304238313*Pi, max deviation from cephes_sin(x): 0
max deviation from cosf(x): 5.96046e-08 at 0.303994872157*Pi, max deviation from cephes_cos(x): 0
deviation of sin(x)^2+cos(x)^2-1: 1.78814e-07 (ref deviation is 1.19209e-07)
   ->> precision OK for the sin_ps / cos_ps / sincos_ps <<-

checking sines on [-1000*Pi, 1000*Pi]
max deviation from sinf(x): 5.96046e-08 at  338.694424873*Pi, max deviation from cephes_sin(x): 0
max deviation from cosf(x): 5.96046e-08 at  338.694424873*Pi, max deviation from cephes_cos(x): 0
deviation of sin(x)^2+cos(x)^2-1: 1.78814e-07 (ref deviation is 1.19209e-07)
   ->> precision OK for the sin_ps / cos_ps / sincos_ps <<-

checking exp/log [-60, 60]
max (relative) deviation from expf(x): 1.18944e-07 at -56.8358421326, max deviation from cephes_expf(x): 0
max (absolute) deviation from logf(x): 1.19209e-07 at -1.67546617985, max deviation from cephes_logf(x): 0
deviation of x - log(exp(x)): 1.19209e-07 (ref deviation is 5.96046e-08)
   ->> precision OK for the exp_ps / log_ps <<-

checking tan on [-0.25*Pi, 0.25*Pi]
max deviation from tanf(x): 1.19209e-07 at 0.250000006957*Pi, max deviation from cephes_tan(x): 5.96046e-08
   ->> precision OK for the tan_ps <<-

checking tan on [-0.49*Pi, 0.49*Pi]
max deviation from tanf(x): 3.8147e-06 at -0.490000009841*Pi, max deviation from cephes_tan(x): 9.53674e-07
   ->> precision OK for the tan_ps <<-

checking cot on [0.2*Pi, 0.7*Pi]
max deviation from cotf(x): 1.19209e-07 at 0.204303119606*Pi, max deviation from cephes_cot(x): 1.19209e-07
   ->> precision OK for the cot_ps <<-

checking cot on [0.01*Pi, 0.99*Pi]
max deviation from cotf(x): 3.8147e-06 at 0.987876517942*Pi, max deviation from cephes_cot(x): 9.53674e-07
   ->> precision OK for the cot_ps <<-

checking atan on [-10*Pi, 10*Pi]
max deviation from atanf(x): 1.19209e-07 at -9.39207109497*Pi, max deviation from cephes_atan(x): 1.19209e-07
   ->> precision OK for the atan_ps <<-

checking atan on [-10000*Pi, 10000*Pi]
max deviation from atanf(x): 1.19209e-07 at  -7350.3826719*Pi, max deviation from cephes_atan(x): 1.19209e-07
   ->> precision OK for the atan_ps <<-

checking atan2 on [-1*Pi, 1*Pi]
max deviation from atan2f(x): 2.38419e-07 at (0.797784384786*Pi, -0.913876806545*Pi), max deviation from cephes_atan2(x): 2.38419e-07
   ->> precision OK for the atan2_ps <<-

checking atan2 on [-10000*Pi, 10000*Pi]
max deviation from atan2f(x): 2.38419e-07 at ( 658.284195009*Pi, -2685.93394561*Pi), max deviation from cephes_atan2(x): 2.38419e-07
   ->> precision OK for the atan2_ps <<-

exp([        -1000,          -100,           100,          1000]) = [            0,             0, 2.4061436e+38, 2.4061436e+38]
exp([    -nan(ind),           inf,          -inf,           nan]) = [2.4061436e+38, 2.4061436e+38,             0, 2.4061436e+38]
log([            0,           -10,         1e+30, 1.0005271e-42]) = [         -nan,          -nan,     69.077553,    -87.336548]
log([    -nan(ind),           inf,          -inf,           nan]) = [   -87.336548,     88.722839,          -nan,    -87.336548]
sin([    -nan(ind),           inf,          -inf,           nan]) = [    -nan(ind),     -nan(ind),           nan,           nan]
cos([    -nan(ind),           inf,          -inf,           nan]) = [          nan,     -nan(ind),     -nan(ind),           nan]
sin([       -1e+30,       -100000,         1e+30,        100000]) = [          inf,  -0.035749275,          -inf,   0.035749275]
cos([       -1e+30,       -100000,         1e+30,        100000]) = [    -nan(ind),    -0.9993608,     -nan(ind),    -0.9993608]
benching                 sinf .. ->   16.3 millions of vector evaluations/second ->  40 cycles/value on a 2600MHz computer
benching                 cosf .. ->   15.8 millions of vector evaluations/second ->  41 cycles/value on a 2600MHz computer
benching                 expf .. ->   18.8 millions of vector evaluations/second ->  35 cycles/value on a 2600MHz computer
benching                 logf .. ->   17.8 millions of vector evaluations/second ->  36 cycles/value on a 2600MHz computer
benching                 tanf .. ->   13.8 millions of vector evaluations/second ->  47 cycles/value on a 2600MHz computer
benching                 cotf .. ->   12.2 millions of vector evaluations/second ->  53 cycles/value on a 2600MHz computer
benching                atanf .. ->   10.4 millions of vector evaluations/second ->  62 cycles/value on a 2600MHz computer
benching               atan2f .. ->    5.3 millions of vector evaluations/second -> 121 cycles/value on a 2600MHz computer
benching            atan2_ref .. ->   11.7 millions of vector evaluations/second ->  56 cycles/value on a 2600MHz computer
benching                sqrtf .. ->   69.5 millions of vector evaluations/second ->   9 cycles/value on a 2600MHz computer
benching               rsqrtf .. ->   70.2 millions of vector evaluations/second ->   9 cycles/value on a 2600MHz computer
benching          cephes_sinf .. ->   15.2 millions of vector evaluations/second ->  43 cycles/value on a 2600MHz computer
benching          cephes_cosf .. ->   16.6 millions of vector evaluations/second ->  39 cycles/value on a 2600MHz computer
benching          cephes_expf .. ->    2.9 millions of vector evaluations/second -> 220 cycles/value on a 2600MHz computer
benching          cephes_logf .. ->    3.4 millions of vector evaluations/second -> 186 cycles/value on a 2600MHz computer
benching               sin_ps .. ->   30.6 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching               cos_ps .. ->   31.1 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching            sincos_ps .. ->   30.9 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching               exp_ps .. ->   27.3 millions of vector evaluations/second ->  24 cycles/value on a 2600MHz computer
benching               log_ps .. ->   23.5 millions of vector evaluations/second ->  28 cycles/value on a 2600MHz computer
benching               tan_ps .. ->   22.2 millions of vector evaluations/second ->  29 cycles/value on a 2600MHz computer
benching               cot_ps .. ->   22.0 millions of vector evaluations/second ->  29 cycles/value on a 2600MHz computer
benching              atan_ps .. ->   31.1 millions of vector evaluations/second ->  21 cycles/value on a 2600MHz computer
benching             atan2_ps .. ->   24.1 millions of vector evaluations/second ->  27 cycles/value on a 2600MHz computer
benching              sqrt_ps .. ->   63.9 millions of vector evaluations/second ->  10 cycles/value on a 2600MHz computer
benching             rsqrt_ps .. ->   64.1 millions of vector evaluations/second ->  10 cycles/value on a 2600MHz computer

As you can see the sinf, cosf, expf, logf, tanf, atanf, atan2f and sqrtf implementations of the Visual C++ c library are pretty well optimized themselves, but using the simd versions still gives you at least a boost of 2x, with atan_ps and atan2_ps having the biggest gains.