I adapted the xor example here from Python; they used NumPy.

NB. input data
X =: 4 2 $ 0 0  0 1  1 0  1 1

NB. target data, ~: is 'not-eq' aka xor?
Y =: , (i.2) ~:/ (i.2)

scale =: (-&1)@:(\*&2)

NB. initialize weights b/w \_1 and 1
NB. see https://code.jsoftware.com/wiki/Vocabulary/dollar#dyadic
init\_weights =: 3 : 'scale"0 y ?@$ 0'

w\_hidden =: init\_weights 2 2
w\_output =: init\_weights 2
b\_hidden =: init\_weights 2
b\_output =: scale ? 0

dot =: +/ . \*

sigmoid =: monad define
    % 1 + ^ - y
)
sigmoid\_ddx =: 3 : 'y \* (1-y)'

NB. forward prop
forward =: dyad define
    'WH WO BH BO' =. x
    hidden\_layer\_output =. sigmoid (BH +"1 X (dot "1 2) WH)
    prediction =. sigmoid (BO + WO dot"1 hidden\_layer\_output)
    (hidden\_layer\_output;prediction)
)

train =: dyad define
    'X Y' =. x
    'WH WO BH BO' =. y
    'hidden\_layer\_output prediction' =. y forward X
    l1\_err =. Y - prediction
    l1\_delta =. l1\_err \* sigmoid\_ddx prediction
    hidden\_err =. l1\_delta \*/ WO
    hidden\_delta =. hidden\_err \* sigmoid\_ddx hidden\_layer\_output
    WH\_adj =. WH + (|: X) dot hidden\_delta
    WO\_adj =. WO + (|: hidden\_layer\_output) dot l1\_delta
    BH\_adj =. +/ BH,hidden\_delta
    BO\_adj =. +/ BO,l1\_delta
    (WH\_adj;WO\_adj;BH\_adj;BO\_adj)
)

w\_trained =: (((X;Y) & train) ^: 10000) (w\_hidden;w\_output;b\_hidden;b\_output)
guess =: >1 { w\_trained forward X

Compare to this K implementation for style.

As it happens, this J code is substantially faster than the equivalent using NumPy (0.13s vs. 0.59s).

I'm quite curious as to why the J is so much more performant. I read APL since 1978 recently and APL has quite a few differences as an array environment compared to conventional programming languages.