Apple outperforms R in important cases.
Most egregiously, Apple's rolling average (via its builtin infix operator) is some 65 times faster by median than the optimized rollmean in R's zoo package:
sliding_mean<-jit("([(+)/x%ℝ(:x)]⑄7)")
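Reading the jit string: (+)/x folds addition over a window, ℝ(:x) is (presumably) the window's length converted to a float, and ⑄7 applies the bracketed mean to each run of 7 consecutive elements. The benchmark harness itself is not shown; a minimal sketch of how the figures below could be reproduced (the package name apple and the input vector a are assumptions, not taken from the original):
library(apple)           # R bindings exposing jit() and run(); name assumed
library(zoo)
library(microbenchmark)
a<-rnorm(1000)           # input vector: an assumption
sliding_mean<-jit("([(+)/x%ℝ(:x)]⑄7)")
microbenchmark(rollmean(a,7))
microbenchmark(run(sliding_mean,a))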
Unit: microseconds
expr min lq mean median uq max neval
rollmean(a, 7) 181.876 184.7255 197.6274 186.427 191.6135 1034.061 100
Unit: microseconds
expr min lq mean median uq max neval
run(sliding_mean, a) 2.583 2.788 63.74229 2.87 2.952 6050.083 100
Apple's matrix multiplication and dot product are also faster, viz.
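The compiled expressions dp, vmul, m, and mT are not defined in this excerpt; presumably they are jit strings along the lines of those in the NumPy section below. A hedged sketch (operator readings inferred from that section; whether jit accepts unannotated arguments here is an assumption):
dp<-jit("[(+)/(*)` (x::Vec n float) y]") # dot product: fold (+) over elementwise products
vmul<-jit("[x%:y]")                      # matrix-vector product (%: per the NumPy section)
m<-jit("[x%.y]")                         # matrix product (%.)
mT<-jit("[x%.|:y]")                      # product with the transpose (|:)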
Unit: microseconds
expr min lq mean median uq max neval
x %*% y 39.975 40.057 40.4793 40.098 40.098 55.678 100
Unit: microseconds
expr min lq mean median uq max neval
sum(x * y) 18.204 21.935 23.03298 22.2015 22.919 56.867 100
Unit: microseconds
expr min lq mean median uq max neval
run(dp, x, y) 9.717 9.799 76.34241 11.8695 25.0305 6145.244 100
Unit: nanoseconds
expr min lq mean median uq max neval
A %*% x 820 902 958.99 902 984 3567 100
Unit: microseconds
expr min lq mean median uq max neval
run(vmul, A, x) 1.64 1.681 2.31937 1.722 1.886 30.463 100
[1] 64 64 64
Unit: microseconds
expr min lq mean median uq max neval
run(mT, A, B) 38.294 38.786 40.71054 39.278 41.3485 61.254 100
Unit: microseconds
expr min lq mean median uq max neval
A %*% t(B) 116.194 117.9775 122.6691 119.9865 124.3735 180.646 100
[1] 64 64 64
Unit: microseconds
expr min lq mean median uq max neval
run(m, A, B) 42.148 46.371 47.31113 46.822 47.396 60.598 100
Unit: microseconds
expr min lq mean median uq max neval
A %*% B 106.682 109.019 111.8984 109.347 112.504 150.47 100
[1] 512 512 512
Unit: milliseconds
expr min lq mean median uq max neval
run(mT, A, B) 16.90225 17.18576 17.33541 17.26455 17.46961 18.51158 100
Unit: milliseconds
expr min lq mean median uq max neval
A %*% t(B) 45.03141 45.25666 45.46609 45.36882 45.53316 46.99436 100
[1] 1024 1024 1024
Unit: milliseconds
expr min lq mean median uq max neval
run(mT, A, B) 153.2151 155.7401 156.8453 156.5409 157.9294 171.7692 100
Unit: milliseconds
expr min lq mean median uq max neval
A %*% t(B) 361.6465 363.2224 364.6133 363.8554 365.0779 377.573 100
Matrix-vector multiplication is slower: R stores matrices in column-major order while Apple uses row-major order, so the R bindings for Apple must transpose matrix arguments.
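The layout difference is easy to see in plain R (an illustration, nothing Apple-specific):
mat<-matrix(1:6,nrow=2)   # a 2x3 matrix, filled column by column
as.vector(mat)            # 1 2 3 4 5 6: columns are contiguous (column-major),
                          # so a row-major consumer must transpose or re-stride first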
Apple also happens to be faster than R at computing cosine similarity, in particular the implementation in the lsa package:
cosim<-jit("λa.λb. a⋅b%(√((+)/((^2)'a))*√((+)/((^2)'b)))")
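R_cosim below is not defined in this excerpt; a plain-R stand-in computing the same formula would be something like:
R_cosim<-function(a,b) sum(a*b)/(sqrt(sum(a^2))*sqrt(sum(b^2)))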
Unit: microseconds
expr min lq mean median uq max neval
run(cosim, X, Y) 7.339 7.421 71.14238 7.462 7.462 6339.953 100
⋮
Unit: microseconds
expr min lq mean median uq max neval
cosine(X, Y) 11.357 11.398 19.85425 11.439 11.521 826.068 100
Unit: microseconds
expr min lq mean median uq max neval
R_cosim(X, Y) 14.883 15.211 36.56503 15.457 15.703 2028.106 100
Folklore would have it that writing a compiler is a tall order, but Apple achieves performance good enough for widespread adoption even without instruction scheduling and with no heuristic for picking registers to spill.
NumPy
To see that there is no magic, observe that NumPy and SciPy outperform Apple on cosine similarity:
import numpy as np
from numpy.linalg import norm
def cosim_npy(x,y):
    return np.dot(x,y)/(norm(x)*norm(y))
from scipy.spatial import distance
def cosim_scipy(x,y):
    return 1-distance.cosine(x,y)
from sklearn.metrics.pairwise import cosine_similarity
def cosim_sk(X,Y):
    return cosine_similarity(X.reshape(1,-1),Y.reshape(1,-1))[0,0]
import apple
cosim=apple.jit("λa.λb. a⋅b%(√((+)/((^2)'a))*√((+)/((^2)'b)))")
A=np.random.rand(2000);B=np.random.rand(2000)
%timeit cosim_npy(A,B)
3.6 μs ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit cosim_scipy(A,B)
4.76 μs ± 20.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit cosim(A,B)
7.09 μs ± 14 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit cosim_sk(A.reshape(1,-1),B.reshape(1,-1))
106 μs ± 271 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
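(The scikit-learn figure is plausibly dominated by per-call overhead: cosine_similarity is a pairwise routine that validates its inputs and operates on 2-D arrays, which swamps the actual arithmetic at this size.)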
Apple's matrix multiplication is drastically slower on the M1; NumPy calls into the system libraries, which use undocumented special hardware.
import numpy as np
import apple
mul=apple.jit("[(x::M ₁₀₂₄,₁₀₂₄ float)%.(y::M ₁₀₂₄,₁₀₂₄ float)]")    # %. is matrix multiplication
mulT=apple.jit("[(x::M ₁₀₂₄,₁₀₂₄ float)%.|:(y::M ₁₀₂₄,₁₀₂₄ float)]") # |: transposes the right operand first
bs=np.random.rand(1024,1024)
%timeit bs@bs
9.09 ms ± 328 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mul(bs,bs)
187 ms ± 9.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit mulT(bs,bs)
165 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bs@bs.T
9.92 ms ± 407 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
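A sanity check worth running alongside the timings (assuming the jitted functions return ordinary NumPy arrays):
import numpy as np
assert np.allclose(mul(bs,bs),bs@bs)      # agrees with NumPy up to float noise
assert np.allclose(mulT(bs,bs),bs@bs.T)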
import numpy as np
import apple
v9=apple.jit('[x::Arr (512×512) float%:y]')  # %: is matrix-vector multiplication
B=np.random.rand(512,512)
y=np.random.rand(512)
%timeit B@y
16.3 μs ± 445 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit v9(B,y)
184 μs ± 535 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
def np_dot(xs,ys):
    return np.sum(xs*ys)
dp=apple.jit("[(+)/(*)` (x::Vec n float) y]")  # fold (+) over the elementwise products, i.e. a dot product
xs=np.random.rand(10000);ys=np.random.rand(10000)
%timeit np_dot(xs,ys)
6.69 μs ± 258 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit dp(xs,ys)
11.1 μs ± 955 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit xs.dot(ys)
2.46 μs ± 1.25 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
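(xs.dot(ys) dispatches directly to the BLAS ddot routine, while np_dot first materializes the temporary array xs*ys; the extra allocation and pass over memory likely account for most of the gap.)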
Apple performs vastly better than NumPy in computing rolling averages:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
import apple
sliding_mean=apple.jit('([(+)/x%ℝ(:x)]\`7)')
%timeit np.average(sliding_window_view(np.arange(0,1000,dtype=np.float64),7),axis=1)
17.8 μs ± 39.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit sliding_mean(np.arange(0,1000,dtype=np.float64))
3.07 μs ± 12 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
This specific case seems to be overlooked in established languages for data processing, but it is present in array languages, for instance J and BQN.
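In J, for example, the same 7-item running mean is the one-liner 7 (+/ % #)\ y, where the infix adverb \ applies the mean fork (+/ % #) to each window of 7 consecutive items.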