I adapted the xor example here from a Python version that used NumPy.

NB. input data
X =: 4 2 $ 0 0  0 1  1 0  1 1

NB. target data; ~: is 'not-equal', which on booleans is xor
Y =: , (i.2) ~:/ (i.2)
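
Dyadic / builds a function table, so the target expression is just the ravel of the 2-by-2 xor table; a quick check in the session:

   (i.2) ~:/ (i.2)
0 1
1 0
   , (i.2) ~:/ (i.2)
0 1 1 0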

scale =: (-&1)@:(*&2)

NB. initialize weights b/w _1 and 1
NB. see https://code.jsoftware.com/wiki/Vocabulary/dollar#dyadic
init_weights =: 3 : 'scale"0 y ?@$ 0'

w_hidden =: init_weights 2 2
w_output =: init_weights 2
b_hidden =: init_weights 2
b_output =: scale ? 0
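
scale maps the uniform [0,1) numbers that ? produces onto [_1,1), and init_weights builds a y-shaped array of such numbers; a quick check of both:

   scale 0 0.5 1
_1 0 1
   $ init_weights 2 2
2 2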

dot =: +/ . *
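
dot is the usual matrix product (a sum of products over the inner axis); for instance:

   (2 2 $ 1 2 3 4) dot 10 20
50 110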

sigmoid =: monad define
    % 1 + ^ - y
)
sigmoid_ddx =: 3 : 'y * (1-y)'  NB. derivative written in terms of the sigmoid's output

NB. forward prop
forward =: dyad define
    'WH WO BH BO' =. x
    hidden_layer_output =. sigmoid (BH +"1 y (dot "1 2) WH)
    prediction =. sigmoid (BO + WO dot"1 hidden_layer_output)
    (hidden_layer_output;prediction)
)
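
A throwaway session check (h and p are scratch names, not used elsewhere) shows the shapes a forward pass produces: two hidden activations per sample, and one prediction per sample.

   'h p' =. (w_hidden;w_output;b_hidden;b_output) forward X
   $ h
4 2
   $ p
4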

train =: dyad define
    'X Y' =. x
    'WH WO BH BO' =. y
    'hidden_layer_output prediction' =. y forward X
    l1_err =. Y - prediction
    l1_delta =. l1_err * sigmoid_ddx prediction
    hidden_err =. l1_delta */ WO
    hidden_delta =. hidden_err * sigmoid_ddx hidden_layer_output
    WH_adj =. WH + (|: X) dot hidden_delta
    WO_adj =. WO + (|: hidden_layer_output) dot l1_delta
    BH_adj =. +/ BH,hidden_delta
    BO_adj =. +/ BO,l1_delta
    (WH_adj;WO_adj;BH_adj;BO_adj)
)
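
Note the updates add the raw gradients, i.e. an implicit learning rate of 1. To watch the fit improve, one could define a couple of throwaway helpers, say step and err (not part of the original), and look at the total absolute error after some number of steps:

step =: (X;Y) & train
err =: 3 : '+/ | Y - > 1 { y forward X'  NB. total absolute error for a boxed weight list
err (step ^: 1000) (w_hidden;w_output;b_hidden;b_output)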

w_trained =: (((X;Y) & train) ^: 10000) (w_hidden;w_output;b_hidden;b_output)
guess =: >1 { w_trained forward X
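
Thresholding the sigmoid outputs at 0.5 gives hard 0/1 answers; if training converged, these should match Y (0 1 1 0):

   0.5 < guess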

Compare to this K implementation for style.

As it happens, this J code is substantially faster than the equivalent using NumPy (0.13s vs. 0.59s).
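
(For reference, the J side can be timed in the session with the 6!:2 foreign, which runs a sentence and reports the elapsed seconds; the exact figure will of course vary by machine.)

   6!:2 '(((X;Y) & train) ^: 10000) (w_hidden;w_output;b_hidden;b_output)'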

I'm quite curious as to why the J is so much faster. I recently read "APL Since 1978", and as an array environment APL differs in quite a few ways from conventional programming languages.