Performance of High-level Languages

by Vanessa McHale | 2024-07-26 11:28

It is quite possible to beat established (array) implementations with special cases and bittwiddling.

Rosetta

\( n \mapsto (-1)^n \) is \(1\) or \(-1\) depending on whether \(n\) is even or odd.

Apple

 > :bench (_1.0^)'irange 0 999999 1
benchmarking...
time                 1.486 ms   (1.477 ms .. 1.494 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.512 ms   (1.507 ms .. 1.522 ms)
std dev              20.39 μs   (11.74 μs .. 36.11 μs)

Apple recognizes \(n \mapsto (-1)^n \) as an idiom.

BQN

   )time:1000 ¯1⋆↕999999
4.487ms

Perhaps this is faster than the others because BQN uses 32-bit integers.

J

   1000 (6!:2) '_1^(i.999999)'
0.0119482

APL

¯1*⍳999999

NumPy/Python

%timeit (-1)**np.arange(0,999999)
10.2 ms ± 85.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Conclusion

Apple leaves some performance on the table; we could use a conditional negate instruction and avoid the need to put a constant -1 in d2.

 > :disasm ((_1.0)^)
    movz x1, #0x3ff0, LSL #48                01 fe e7 d2
    fmov d0, x1                              20 00 67 9e
    movz x1, #0xbff0, LSL #48                01 fe f7 d2
    fmov d2, x1                              22 00 67 9e
    tst x0, 0b1                              1f 00 40 f2
    fcsel d0, d2, d0, NE                     40 1c 60 1e
    ret                                      c0 03 5f d6

In fact the -1 is not even hoisted out of the loop when we map \((-1)^n\) over a vector:

 > :disasm (((_1.0)^)')
 ⋮
apple_0:
    mov x3, #0x10                            03 02 80 d2
    add x3, x3, x1, LSL #3                   63 0c 01 8b
    ldr x4, [x6, x3, LSL #0]                 c4 68 63 f8
    movz x3, #0x3ff0, LSL #48                03 fe e7 d2
    fmov d3, x3                              63 00 67 9e
    movz x3, #0xbff0, LSL #48                03 fe f7 d2
    fmov d2, x3                              62 00 67 9e
    tst x4, 0b1                              9f 00 40 f2
    fcsel d3, d2, d3, NE                     43 1c 63 1e
    add x3, x1, #0x2                         23 08 00 91
    str d3, [x0, x3, LSL #3]                 03 78 23 fc
    add x1, x1, #0x1                         21 04 00 91
    cmp x1, x2                               3f 00 02 eb
    b.LT apple_0                             6b fe ff 54

In this example, it's hard to get high-level languages to perform. There's plenty of room at the bottom!

return

blog

Performance of High-level Languages

Rosetta

Apple

BQN

J

APL

NumPy/Python

Conclusion