One can improve on arcsine's performance with Horner's method and then bittwiddling.

The first cut, in Apple:

λx. {
  z⟜ abs. x;
  as ⟜ ⟨1.5707288,_0.2121144,0.0742610,_0.0187293⟩;
  Ho ← gen. 1 (z*) 4;
  p ⟜ 𝜋%2-√(1-z)*Ho⋅as;
  ?x<0,._p,.p
}

We can see that use of gen. (unfold) alongside dot product () triggers loop fusion:

 > :yank asin math/fasin.🍏
 > :asm asin
arr_0: .byte 01,00,00,00,00,00,00,00,04,00,00,00,00,00,00,00,4a,2a,ae,85,b4,21,f9,3f,63,5d,81,8d,90,26,cb,bf,06,46,5e,d6,c4,02,b3,3f,8f,6b,e8,0b,c6,2d,93,bf
   fabs d8, d0
   ldr x2, =arr_0
   mov x1, #0x2d18
   movk x1, #0x5444, LSL #16
   movk x1, #0x21fb, LSL #32
   movk x1, #0x3ff9, LSL #48
   fmov d7, x1
   movz x1, #0x3ff0, LSL #48
   fmov d2, x1
   fsub d2, d2, d8
   fsqrt d6, d2
   add x1, x2, #0x10
   ldr d2, [x1]
   movz x1, #0x3ff0, LSL #48
   fmov d3, x1
   fmul d5, d3, d2
   add x2, x2, #0x10
   mov x1, #0x4
apple_1:
   eor v4.8b,v4.8b, v4.8b
apple_0:
   add x2, x2, #0x8
   ldr d2, [x2]
   fmul d3, d8, d3
   fmadd d5, d3, d2, d5
   subs x1, x1, #0x1
   b.GT apple_0
   fmsub d3, d6, d5, d7
   fmov d2, d0
   fmov d0, d3
   fneg d3, d3
   fcmp d2, d4
   fcsel d0, d3, d0, LT
   ret

However, we can do better: copying the sign bit from the argument (as in the motivating example code) to avoid a comparison. To do so, rather than implement a primitive to copy sign, I implemented refl. : (float -> float) -> float -> float which computes an odd function from its definition on the upper half,viz.

refl. (λz. {
  as ⟜ ⟨1.5707288,_0.2121144,0.0742610,_0.0187293⟩;
  Ho ← gen. 1 (z*) 4;
  𝜋%2-√(1-z)*Ho⋅as
})

See that this has no branches:

 > :yank asin math/fasin.🍏
 > :asm asin
   ⋮
   b.GT apple_0
   fmsub d0, d6, d5, d7
   mov x1, #0x1
   lsl x1, x1, #0x3f
   ins v3.2d[0], x1
   bit v0.16b, v2.16b, v3.16b
   ret

This exploits the fact that SIMD registers and floating-point registers are shared to apply bit for the result.

Like the example linked above, it is a good deal faster than NumPy (for instance):