One can improve on arcsine's performance with Horner's method and then bittwiddling.
The first cut, in Apple:
λx. {
z⟜ abs. x;
as ⟜ ⟨1.5707288,_0.2121144,0.0742610,_0.0187293⟩;
Ho ← gen. 1 (z*) 4;
p ⟜ 𝜋%2-√(1-z)*Ho⋅as;
?x<0,._p,.p
}
We can see that use of gen. (unfold) alongside dot product (⋅) triggers loop fusion:
> :yank asin math/fasin.🍏
> :asm asin
arr_0: .byte 01,00,00,00,00,00,00,00,04,00,00,00,00,00,00,00,4a,2a,ae,85,b4,21,f9,3f,63,5d,81,8d,90,26,cb,bf,06,46,5e,d6,c4,02,b3,3f,8f,6b,e8,0b,c6,2d,93,bf
fabs d8, d0
ldr x2, =arr_0
mov x1, #0x2d18
movk x1, #0x5444, LSL #16
movk x1, #0x21fb, LSL #32
movk x1, #0x3ff9, LSL #48
fmov d7, x1
movz x1, #0x3ff0, LSL #48
fmov d2, x1
fsub d2, d2, d8
fsqrt d6, d2
add x1, x2, #0x10
ldr d2, [x1]
movz x1, #0x3ff0, LSL #48
fmov d3, x1
fmul d5, d3, d2
add x2, x2, #0x10
mov x1, #0x4
apple_1:
eor v4.8b,v4.8b, v4.8b
apple_0:
add x2, x2, #0x8
ldr d2, [x2]
fmul d3, d8, d3
fmadd d5, d3, d2, d5
subs x1, x1, #0x1
b.GT apple_0
fmsub d3, d6, d5, d7
fmov d2, d0
fmov d0, d3
fneg d3, d3
fcmp d2, d4
fcsel d0, d3, d0, LT
ret
However, we can do better: copying the sign bit from the argument (as in the
motivating example code) to avoid a comparison. To do so, rather than implement
a primitive to copy sign, I implemented refl. : (float -> float) -> float -> float which computes an odd function from its definition on the upper half,viz.
refl. (λz. {
as ⟜ ⟨1.5707288,_0.2121144,0.0742610,_0.0187293⟩;
Ho ← gen. 1 (z*) 4;
𝜋%2-√(1-z)*Ho⋅as
})
See that this has no branches:
> :yank asin math/fasin.🍏
> :asm asin
⋮
b.GT apple_0
fmsub d0, d6, d5, d7
mov x1, #0x1
lsl x1, x1, #0x3f
ins v3.2d[0], x1
bit v0.16b, v2.16b, v3.16b
ret
This exploits the fact that SIMD registers and floating-point registers are shared to apply bit for the result.
Like the example linked above, it is a good deal faster than NumPy (for instance):
