Matrix Operations

An interactive guide to understanding fundamental linear algebra operations

01 — Scaling

Scalar Multiplication

Multiply every element by a single value. c × A

Scalar (0D) × Any Tensor (nD) → Same Shape (nD)
Common in NN/Transformers:
Attention scaling — divide by √d_k to stabilize softmax
Learning rate — weights -= lr × gradients
Dropout scaling — multiply by 1/(1-p) during training
Temperature — logits / T to control softmax sharpness
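The temperature entry above can be sketched directly; the logits below are made-up values for illustration:

```python
import torch

# Dividing logits by a temperature T is plain scalar multiplication:
# T > 1 flattens the softmax, T < 1 sharpens it.
logits = torch.tensor([2.0, 1.0, 0.5])

sharp = torch.softmax(logits / 0.5, dim=0)  # low temperature -> peaky
flat = torch.softmax(logits / 2.0, dim=0)   # high temperature -> closer to uniform

# The top class gets more probability mass at low temperature.
assert sharp.max() > flat.max()
```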
Example: c = 3, A is 2×3
Bij = c × Aij
Scalar c = 3
Matrix A = [[1, 2, 3], [4, 5, 6]]
Result B = [[3, 6, 9], [12, 15, 18]]
🔥 Used in Transformers: Attention Scaling
In self-attention: scores = (Q @ K.T) / √d_k
The 1/√d_k is scalar multiplication to prevent large dot products from pushing softmax into tiny gradients.
# PyTorch
import math
B = c * A
# Transformer attention scaling
scores = (Q @ K.T) / math.sqrt(d_k)
02 — Hadamard Product

Element-wise Multiplication

Multiply corresponding elements. Shapes must match exactly. A ⊙ B or A * B

Tensor (m×n) ⊙ Tensor (m×n) → Tensor (m×n)
Same shape required (or broadcastable)
Common in NN/Transformers:
Gating mechanisms — LSTM/GRU: gate ⊙ candidate
Dropout — activations ⊙ binary_mask
Attention masking — scores ⊙ causal_mask
GLU/SwiGLU — σ(Wx) ⊙ (Vx) in FFN layers
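A minimal GLU-style gate, sketching the σ(Wx) ⊙ (Vx) pattern from the list above (the weights here are random placeholders, not a real FFN layer):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)        # 4 positions, 8 features
W = torch.randn(8, 8)        # gate projection (illustrative)
V = torch.randn(8, 8)        # candidate projection (illustrative)

gate = torch.sigmoid(x @ W)  # values in (0, 1), shape (4, 8)
candidate = x @ V            # shape (4, 8)
out = gate * candidate       # Hadamard product: shapes must match

assert out.shape == (4, 8)
```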
Example: 2×2 matrices
Cij = Aij × Bij
Matrix A = [[1, 2], [3, 4]]
Matrix B = [[5, 6], [7, 8]]
Result C = [[5, 12], [21, 32]]
e.g. C[0,0]: A[0,0]=1 × B[0,0]=5 = 5
# PyTorch
C = A * B
# or
C = torch.mul(A, B)
03 — Inner Product

Dot Product

Multiply elements, then sum all. Returns a scalar. a · b or aᵀb

Vector (n,) · Vector (n,) → Scalar
Matrix (1,n) @ Matrix (n,1) → Matrix (1,1)
torch.dot() requires 1D; use aᵀ@b for 2D column vectors
Common in NN/Transformers:
Attention score — single q · k pair similarity
Single neuron — weights · inputs + bias
Cosine similarity — (a · b) / (‖a‖ × ‖b‖)
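The cosine-similarity formula above can be built straight from the dot product (vectors reused from this section's example):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

dot = torch.dot(a, b)              # 1*4 + 2*5 + 3*6 = 32
cos = dot / (a.norm() * b.norm())  # (a . b) / (||a|| * ||b||)

# Matches PyTorch's built-in implementation.
assert torch.isclose(cos, torch.cosine_similarity(a, b, dim=0))
```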
Example: length-3 vectors
a · b = Σ(ai × bi) = aᵀb
Vector a = [1, 2, 3]
Vector b = [4, 5, 6]
a · b = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32
# 1D vectors
result = torch.dot(a, b)
# Column vectors (n,1): aᵀb
result = (a.T @ b).item()
04 — Tensor Product

Outer Product

Every element of a is multiplied by every element of b, producing a matrix. a ⊗ b

Vector (m,) ⊗ Vector (n,) → Matrix (m×n)
1D tensors, lengths can differ
Common in NN/Transformers:
LoRA — low-rank adaptation: rank-1 updates are outer products (ΔW = b ⊗ a)
Attention patterns — visualizing q ⊗ k relationships
Embedding lookup — one-hot ⊗ embedding_matrix
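A hedged sketch of the LoRA-flavored idea above: a rank-1 weight update formed from an outer product (shapes and names are illustrative, not a LoRA library's API):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3)      # frozen base weight (illustrative shape)
b = torch.randn(4)         # learned column factor
a = torch.randn(3)         # learned row factor

delta = torch.outer(b, a)  # rank-1 matrix, same shape as W
W_adapted = W + delta      # base weight plus low-rank update

assert delta.shape == W.shape
assert int(torch.linalg.matrix_rank(delta)) == 1
```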
Example: len(a) = 3, len(b) = 4
Cij = ai × bj
Vector a (column) = [1, 2, 3]
Vector b (row) = [4, 5, 6, 7]
Result Matrix (3×4):
[[ 4,  5,  6,  7],
 [ 8, 10, 12, 14],
 [12, 15, 18, 21]]
Each cell C[i,j] = a[i] × b[j]
# PyTorch
C = torch.outer(a, b)
# or using einsum
C = torch.einsum('i,j->ij', a, b)
05 — Matrix Product

Matrix Multiplication

Each output is a dot product of a row from A and a column from B. A @ B

Matrix (m×n) @ Matrix (n×p) → Matrix (m×p)
Inner dimensions must match (n = n)
Common in NN/Transformers (THE core operation!):
Linear layers — y = X @ W + b (every dense layer)
Q, K, V projections — Q = X @ W_q
Attention scores — scores = Q @ K.T
Attention output — output = softmax(scores) @ V
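All four bullets above can be strung together into a tiny single-head attention forward pass (illustrative shapes; no batching, masking, or multiple heads):

```python
import math
import torch

torch.manual_seed(0)
seq, d_model, d_k = 5, 16, 8
X = torch.randn(seq, d_model)           # input sequence
W_q = torch.randn(d_model, d_k)         # projection weights (illustrative)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v     # linear projections: (seq, d_k)
scores = (Q @ K.T) / math.sqrt(d_k)     # attention scores: (seq, seq)
out = torch.softmax(scores, dim=-1) @ V # weighted sum of values: (seq, d_k)

assert out.shape == (seq, d_k)
```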
Example: (2×3) @ (3×2) → (2×2)
Cij = Σk Aik × Bkj (row i of A · column j of B)
Matrix A (2×3) = [[1, 2, 3], [4, 5, 6]]
Matrix B (3×2) = [[7, 8], [9, 10], [11, 12]]
Result C (2×2) = [[58, 64], [139, 154]]
e.g. C[0,0] = row 0 of A · col 0 of B = 1×7 + 2×9 + 3×11 = 7 + 18 + 33 = 58
# PyTorch
C = A @ B
# or
C = torch.matmul(A, B)
06 — Vector Product

Cross Product

Returns a vector perpendicular to both inputs. a × b

Vector (3,) × Vector (3,) → Vector (3,)
3D vectors only!
Use cases (rare in LLMs, common in 3D):
3D graphics/NeRF — surface normals, camera rays
Robotics — torque, angular momentum
Physics simulations — magnetic force F = qv × B
⚠️ Not used in standard transformers/LLMs
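A small 3D-graphics-style sketch of the surface-normal use case (triangle coordinates are illustrative):

```python
import torch

# Surface normal of a triangle: cross product of two of its edges.
p0 = torch.tensor([0.0, 0.0, 0.0])
p1 = torch.tensor([1.0, 0.0, 0.0])
p2 = torch.tensor([0.0, 1.0, 0.0])

normal = torch.linalg.cross(p1 - p0, p2 - p0)  # perpendicular to the triangle

assert torch.equal(normal, torch.tensor([0.0, 0.0, 1.0]))
# Perpendicularity check: dot product with both edges is zero.
assert torch.dot(normal, p1 - p0) == 0
assert torch.dot(normal, p2 - p0) == 0
```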
a × b = [a₂b₃ - a₃b₂, a₃b₁ - a₁b₃, a₁b₂ - a₂b₁]
Example: unit vectors
Vector a = [1, 0, 0]
Vector b = [0, 1, 0]
Result a × b = [0, 0, 1]
x: a[1]×b[2] - a[2]×b[1] = 0×0 - 0×1 = 0
y: a[2]×b[0] - a[0]×b[2] = 0×0 - 1×0 = 0
z: a[0]×b[1] - a[1]×b[0] = 1×1 - 0×0 = 1
# PyTorch (torch.cross without an explicit dim is deprecated)
result = torch.linalg.cross(a, b)
07 — Transpose

Matrix Transpose

Flip rows and columns. Row i becomes column i. Aᵀ

Matrix (m×n) → Matrix (n×m)
2D tensor (or specify dims for higher)
Common in NN/Transformers:
Attention — Q @ Kᵀ (transpose K for dot products)
Weight tying — output_embed = input_embed.T
Backpropagation — gradients use W.T
Batch reshaping — swap batch/sequence dims
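The weight-tying entry above can be sketched in a few lines (illustrative shapes; `embed` is a made-up stand-in for a model's embedding table):

```python
import torch

vocab, d_model = 10, 4
embed = torch.randn(vocab, d_model)  # input embedding: token -> vector

h = torch.randn(3, d_model)          # hidden states for 3 positions
logits = h @ embed.T                 # reuse embed transposed as the output
                                     # projection: (3, vocab) scores

assert logits.shape == (3, vocab)
```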
Example: 2×3 matrix
Bji = Aij (swap row and column indices)
Matrix A (2×3) = [[1, 2, 3], [4, 5, 6]]
Result Aᵀ (3×2) = [[1, 4], [2, 5], [3, 6]]
Mapping: A[0,0]=1 → B[0,0], A[0,1]=2 → B[1,0], A[0,2]=3 → B[2,0],
         A[1,0]=4 → B[0,1], A[1,1]=5 → B[1,1], A[1,2]=6 → B[2,1]
# PyTorch
B = A.T
# or
B = torch.transpose(A, 0, 1)