Andrej Karpathy's Character-Level Language Model Tutorial
Document curated and edited by Lijian Liu with Claude Opus 4.5 AI assistance
This is Andrej Karpathy's "makemore" tutorial series, aimed at building a character-level language model that can generate new names similar to existing ones.
- Uses the names.txt dataset (approximately 32,000 names)
- Creates the count matrix with torch.zeros((27, 27))
- Character-to-index mapping: s2i (string to index)
- Index-to-character mapping: i2s (index to string)
- The special character . denotes both start and end

words = open('names.txt', 'r').read().splitlines()
len(words)  # 32033 names
min(len(w) for w in words)  # 2 (shortest)
max(len(w) for w in words)  # 15 (longest)
A bigram model looks only at the previous character to predict the next one. While simple ("very simple and weak"), it's a great place to start.
# "emma" contains these bigrams:
# .e, em, mm, ma, a.  (5 examples!)
N = torch.zeros((27, 27), dtype=torch.int32)
# Row = first character, column = second character
N[s2i[ch1], s2i[ch2]] += 1
p = N[0].float()  # get the first row (counts)
p = p / p.sum()   # normalize to a probability distribution
p.sum() # 1.0 ✅
g = torch.Generator().manual_seed(2147483647)
torch.rand(3, generator=g)  # same result every run
torch.multinomial(p, num_samples=1, generator=g)
When normalizing the probability matrix, we need to understand PyTorch's broadcasting rules:
P.shape # (27, 27)
# Sum across rows, keeping the dimension
P.sum(1, keepdim=True).shape  # (27, 1) column vector
# Broadcast: (27, 27) / (27, 1) -> each row divided by its own sum
P = P / P.sum(1, keepdim=True)  # correctly normalizes rows
log_likelihood = 0.0
n = 0
for ix1, ix2 in bigrams:  # index pairs for each bigram in the dataset
    prob = P[ix1, ix2]
    log_likelihood += torch.log(prob)
    n += 1
nll = -log_likelihood  # negative log likelihood
loss = nll / n         # average loss ≈ 2.45
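As a runnable sketch of the loss computation above (the random "counts" and the seed are stand-ins introduced here, not the tutorial's real data):

```python
import torch

# Average negative log likelihood over the bigrams of one word.
# NOTE: N is a random stand-in for the real bigram count matrix.
s2i = {ch: i + 1 for i, ch in enumerate('abcdefghijklmnopqrstuvwxyz')}
s2i['.'] = 0

g = torch.Generator().manual_seed(0)
N = torch.randint(1, 10, (27, 27), generator=g).float()  # fake positive counts
P = N / N.sum(1, keepdim=True)                           # row-normalized probabilities

log_likelihood = torch.tensor(0.0)
n = 0
chs = ['.'] + list('emma') + ['.']
for ch1, ch2 in zip(chs, chs[1:]):
    log_likelihood = log_likelihood + torch.log(P[s2i[ch1], s2i[ch2]])
    n += 1

nll = -log_likelihood / n  # average loss; positive since every prob < 1
```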
# Add fake counts to avoid zero probabilities
P = (N + 1).float()  # Laplace smoothing
P = P / P.sum(1, keepdim=True)
import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=27).float()
# xs = [0, 5, 13, 13, 1]
# xenc.shape = (5, 27) - each row has a single 1, the rest are 0
logits = xenc @ W  # "log counts" - can be positive or negative
counts = logits.exp()  # exponentiate -> always positive
probs = counts / counts.sum(1, keepdim=True)  # normalize -> probabilities
# This is Softmax! It converts arbitrary values into a probability distribution
What Softmax does: it converts arbitrary neural network outputs (which can be positive or negative) into a valid probability distribution (always positive, sums to 1). Karpathy says it is "very often used".
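A minimal numeric sketch of that softmax pipeline, with made-up logits:

```python
import torch

# Softmax by hand: arbitrary (made-up) logits -> a valid probability distribution.
logits = torch.tensor([2.0, -1.0, 0.5])
counts = logits.exp()          # exponentiate: always positive
probs = counts / counts.sum()  # normalize: sums to 1
```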
# No loop needed - one line!
loss = -probs[torch.arange(n), ys].log().mean()
# Equivalent to:
# probs[0, ys[0]] -> probability of sample 0 at its correct label
# probs[1, ys[1]] -> probability of sample 1 at its correct label
# ...
W.grad.shape  # (27, 27) - same shape as W
# The gradient tells us each weight's influence on the loss:
# W.grad[i,j] > 0 -> increasing W[i,j] increases the loss
# W.grad[i,j] < 0 -> increasing W[i,j] decreases the loss
Karpathy demonstrates that the learning rate can be tuned from small to large:
lr = 0.1   # conservative
lr = 1.0   # more aggressive
lr = 10.0  # faster convergence
lr = 50.0  # works even for this simple example!
for i in range(100):
    # Forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(n), ys].log().mean()
    # Backward pass
    W.grad = None
    loss.backward()
    # Update weights
    W.data -= 0.1 * W.grad
This is equivalent to "smoothing" in the statistical method:
loss = loss + 0.01 * (W**2).mean()
| Aspect | Statistical Method | Neural Network Method |
|---|---|---|
| Implementation | Counting + normalization | Gradient descent optimization |
| Final loss | ~2.45 | ~2.45 |
| Generated results | Same | Same |
| Scalability | ❌ Poor | ✅ Good |
exp(W) ≈ the count matrix N from the statistical method

The following are details that Andrej Karpathy specifically emphasized in the video (in order of appearance):
w = "emma"
for ch1, ch2 in zip(w, w[1:]):
print(ch1, ch2)
# e m
# m m
# m a
Trick: Karpathy calls this a "cute" Python trick. zip() pairs up two iterators and stops when the shorter one is exhausted. w[1:] is the slice starting from the second character, so zip(w, w[1:]) produces all consecutive character pairs.
# ❌ Incomplete: loses important information
for ch1, ch2 in zip(w, w[1:]):  # only em, mm, ma
# ✅ Complete: includes start and end info
chs = ['.'] + list(w) + ['.']  # ['.', 'e', 'm', 'm', 'a', '.']
for ch1, ch2 in zip(chs, chs[1:]):  # .e, em, mm, ma, a.
Reason: Karpathy emphasizes "we have to be careful". The word "emma" actually contains 5 bigram examples, not 3! We need to know that (1) 'e' is likely to be the first character of a name, and (2) after 'a', the name is likely to end. This is important statistical information that must not be lost.
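This can be checked directly (a tiny sketch using only the word "emma"):

```python
# Adding the start/end dots turns "emma" into 5 bigram examples, not 3.
w = 'emma'
chs = ['.'] + list(w) + ['.']
pairs = list(zip(chs, chs[1:]))  # [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
```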
# ❌ Raises an error if the key doesn't exist
b[bigram] = b[bigram] + 1  # KeyError!
# ✅ Use get() with a default value
b[bigram] = b.get(bigram, 0) + 1  # returns 0 if the key doesn't exist
Note: dict.get(key, default) returns the default value instead of raising an error when the key doesn't exist. This is a common pattern for counting with Python dictionaries.
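A short self-contained sketch of this counting pattern, using two made-up words:

```python
# Counting bigrams with dict.get, which supplies 0 for unseen keys.
# The two words here are made up for illustration.
b = {}
for w in ['emma', 'ava']:
    chs = ['.'] + list(w) + ['.']
    for bigram in zip(chs, chs[1:]):
        b[bigram] = b.get(bigram, 0) + 1  # no KeyError on first sight of a bigram
```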
# Sorts by key (the first element) by default
sorted(b.items())
# Sort by value (the second element)
sorted(b.items(), key=lambda kv: kv[1])  # ascending
sorted(b.items(), key=lambda kv: kv[1], reverse=True)  # descending
Note: Karpathy says this is "kind of gross in Python". b.items() returns (key, value) tuples, which sort by the first element by default. To sort by count instead, use key=lambda kv: kv[1] to sort by the second element.
# The default is float32
a = torch.zeros((3, 5))
a.dtype  # torch.float32
# Counts should be integers
N = torch.zeros((27, 27), dtype=torch.int32)
Note: torch.zeros() creates float32 by default. For a count matrix, an integer type makes more sense. Although we'll convert to float later for normalization, being explicit about data types is good practice.
chars = sorted(list(set(''.join(words)))) # ['a', 'b', ..., 'z']
s2i = {s: i+1 for i, s in enumerate(chars)}
s2i['.'] = 0  # special token at index 0
i2s = {i: s for s, i in s2i.items()}  # reverse mapping
Trick: enumerate() returns (index, element) pairs. Karpathy uses i+1 so the letters start at index 1, reserving index 0 for the special token '.'. Creating the reverse mapping i2s just requires swapping the key-value pairs.
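A quick sketch checking that the two mappings invert each other (the toy word list here is an assumption, not the full names.txt):

```python
# Build s2i/i2s over a toy word list and check they are inverses.
words = ['emma', 'olivia']
chars = sorted(set(''.join(words)))
s2i = {s: i + 1 for i, s in enumerate(chars)}  # letters start at index 1
s2i['.'] = 0                                   # special token at index 0
i2s = {i: s for s, i in s2i.items()}           # reverse mapping
```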
N[0, 5]  # tensor(149) - still a tensor!
type(N[0, 5])  # torch.Tensor
N[0, 5].item()  # 149 - a plain Python integer
type(N[0, 5].item())  # int
Problem: indexing into a tensor still returns a tensor object, just with a scalar shape. If you need the value in plain Python (for printing, using as an index, etc.), call .item() to extract the actual number.
# ❌ Initial design: two special tokens
<S> = 26  # start token
<E> = 27  # end token
# Problem: an entire row and an entire column of zeros in the matrix (wasted space)
# ✅ Optimized design: a single special token
. = 0  # represents both start and end
Reason: <E> can never be the first character of a bigram (it only appears at the end), and <S> can never be the second character (it only appears at the start). This leaves useless zeros in the matrix. Using a single token "." shrinks the matrix from 28×28 to 27×27.
# ❌ The default replacement=False samples without replacement
torch.multinomial(probs, num_samples=20)  # may error or give wrong results
# ✅ Language models need sampling with replacement
torch.multinomial(probs, num_samples=20, replacement=True)
Note: Karpathy says "I don't know why the default is False". For language models we need to be able to sample the same character multiple times, so replacement=True is required.
This is a very subtle and hard-to-find bug; Karpathy says it should "scare you":
P.shape # (27, 27)
# ❌ Wrong: sum produces shape (27,), which broadcasts as a (1, 27) row vector
P = N / N.sum(1)  # actually normalizes the columns!
# ✅ Correct: keep the dimension; the shape is a (27, 1) column vector
P = N / N.sum(1, keepdim=True)  # correctly normalizes rows
Reason: when keepdim=False (the default), the dimension is squeezed out. Broadcasting aligns shapes from right to left and pads missing dimensions with 1 on the left, so (27,) becomes (1, 27), normalizing along the wrong axis! Karpathy emphasizes that you have to "respect" broadcasting and check it carefully, or you'll introduce very hard-to-find bugs.
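The pitfall can be demonstrated directly; the random integer matrix and seed below are placeholders for the real count matrix N:

```python
import torch

# keepdim pitfall: without keepdim, the (27,) row sums broadcast as a
# (1, 27) row vector, dividing by the WRONG axis.
g = torch.Generator().manual_seed(0)
N = torch.randint(1, 10, (27, 27), generator=g).float()  # stand-in counts

P_right = N / N.sum(1, keepdim=True)  # (27, 27) / (27, 1): rows sum to 1
P_wrong = N / N.sum(1)                # (27, 27) / (1, 27): rows do NOT sum to 1
```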
# ❌ Creates a new tensor, wasting memory
P = P / P.sum(1, keepdim=True)
# ✅ In-place operation, more efficient
P /= P.sum(1, keepdim=True)
Recommendation: using in-place operators like /=, +=, and -= avoids creating new tensors, saving memory and improving speed.
# If a bigram never appeared in the data (e.g., "jq")
prob = 0.0
torch.log(torch.tensor(prob))  # tensor(-inf)
# The negative log likelihood becomes positive infinity
loss = -torch.log(torch.tensor(prob))  # tensor(inf)
Solution - model smoothing: add fake counts (e.g., +1) to the count matrix to ensure there are no zero probabilities. Karpathy says the infinite loss is "kind of gross", so people use smoothing to fix it. In statistics this is called "Laplace smoothing".
# Smoothing
N = N + 1  # add 1 to every count
PyTorch has two ways to create tensors; Karpathy says this is "not confusing at all" (sarcastically), and that the docs are "not clear":
# ✅ Recommended: lowercase tensor - automatically infers the dtype
xs = torch.tensor([1, 2, 3])  # dtype = torch.int64
# ❌ Not recommended: uppercase Tensor - always returns float32
xs = torch.Tensor([1, 2, 3])  # dtype = torch.float32
Recommendation: always use lowercase torch.tensor, which automatically infers the correct dtype from the input data. Karpathy mentions you need to search community threads (random threads) to understand the difference, and warns that "some of the stuff is unfortunately not easy and not very well documented and you have to be careful out there".
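A small check of the dtype behavior described above:

```python
import torch

# torch.tensor infers the dtype from the data; torch.Tensor is always float32.
a = torch.tensor([1, 2, 3])      # integers -> int64
b = torch.tensor([1.0, 2.0])     # floats -> float32
c = torch.Tensor([1, 2, 3])      # always float32, regardless of input
```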
xenc = F.one_hot(xs, num_classes=27)
xenc.dtype  # torch.int64 (a 64-bit integer!)
# Neural nets need floats, so convert manually:
xenc = F.one_hot(xs, num_classes=27).float()  # ✅
Problem: the one_hot function doesn't accept a dtype parameter (unlike many other PyTorch functions). It returns the same dtype as its input, and since the input is integer indices, the output is also integer. Karpathy emphasizes that "we always want to be careful with data types".
# (batch, input_dim) @ (input_dim, output_dim) = (batch, output_dim)
xenc.shape  # (5, 27) - 5 samples, each 27-dimensional
W.shape  # (27, 27) - 27 inputs, 27 neurons
(xenc @ W).shape  # (5, 27) - 27 outputs for each of the 5 samples
Understanding: the inner dimensions must match (27 here); the outer dimensions determine the output shape. Each row is a sample and each column is a neuron's output. Position [3, 13] is the activation of sample 3 on neuron 13, computed as the dot product of input 3 with column 13 of W.
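A sketch verifying those shapes, using the example indices xs = [0, 5, 13, 13, 1] from earlier (the random W and seed are placeholders):

```python
import torch
import torch.nn.functional as F

# Shape check for (batch, 27) @ (27, 27) -> (batch, 27).
# Because each input row is one-hot, logits[i] equals the row of W
# selected by xs[i].
g = torch.Generator().manual_seed(0)
xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()  # (5, 27)
W = torch.randn((27, 27), generator=g)        # (27, 27)
logits = xenc @ W                             # (5, 27)
```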
# ❌ By default, leaf tensors don't compute gradients
W = torch.randn((27, 27))
W.requires_grad  # False
# ✅ Gradient computation must be enabled explicitly
W = torch.randn((27, 27), requires_grad=True)
Reason: PyTorch doesn't track gradients for leaf tensors by default, to save memory. If you want to optimize a tensor via backpropagation, you must tell PyTorch you need gradients; otherwise, after calling loss.backward(), W.grad will be None. Karpathy says "if you remember from micrograd, PyTorch requires that we pass this in".
for i in range(100):
    # Forward pass...
    # ✅ Reset the gradients (two ways)
    W.grad = None  # more efficient
    # W.grad.zero_()  # alternative
    loss.backward()
    W.data -= lr * W.grad
Reason: PyTorch accumulates gradients by default (useful for advanced techniques like gradient accumulation). If they aren't reset, new gradients add to the old ones, causing wrong updates. Karpathy notes that setting the gradient to None is more efficient than zero_(), and PyTorch treats None as zero.
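A tiny demonstration of the accumulation behavior, using a one-element tensor and a simple made-up loss:

```python
import torch

# Gradients accumulate across backward() calls unless reset.
w = torch.tensor([2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()        # d(3w)/dw = 3

(w * 3).sum().backward()      # no reset: the new gradient ADDS to the old one
accumulated = w.grad.clone()  # 3 + 3 = 6

w.grad = None                 # the reset used in the training loop
(w * 3).sum().backward()
fresh = w.grad.clone()        # back to 3
```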
# ✅ Correct: directly modify the underlying data
W.data += -0.1 * W.grad
# ❌ Not recommended: this gets tracked as part of the computation graph
W += -0.1 * W.grad
Reason: using W.data tells PyTorch this is a pure data operation that shouldn't be tracked in the computation graph. Operating directly on W can cause gradient computation problems.
# On the surface: a matrix multiplication
xenc = [0, 0, 0, 0, 0, 1, 0, ...]  # index 5 is 1
logits = xenc @ W
# In reality: it just extracts row 5 of W!
logits = W[5]  # completely equivalent
Insight: Karpathy spends significant time explaining this. It is exactly how the statistical method works - using the first character's index to look up a probability distribution in a table. The neural network does the same thing, except W holds log-counts (logits) that need exponentiation and normalization. So exp(W) is equivalent to the count matrix N from the statistical method. Both methods arrive at the same place, with the same loss and the same sampling results.
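This equivalence is easy to verify (the random W and seed below are placeholders):

```python
import torch
import torch.nn.functional as F

# Multiplying by a one-hot vector is just a row lookup.
g = torch.Generator().manual_seed(0)
W = torch.randn((27, 27), generator=g)  # random placeholder weights
xenc = F.one_hot(torch.tensor([5]), num_classes=27).float()

via_matmul = (xenc @ W)[0]  # matrix multiply with a one-hot row
via_indexing = W[5]         # plain row indexing
```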
# Smoothing in the statistical method
N = N + 1  # add fake counts
# The equivalent in the neural network: regularization
loss = loss + 0.01 * (W**2).mean()
Principle: Karpathy says this is "kind of cool". When W is all zeros, exp(0) = 1 and all probabilities become uniform. The regularization term penalizes W for deviating from zero, just as smoothing makes probabilities more uniform. Think of it as a "spring force" or "gravity" pulling W toward zero. A higher regularization strength (here 0.01) means more uniform probabilities, equivalent to adding more fake counts. If the strength is too high, the model can't learn and all predictions stay uniform.
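The "W = 0 gives a uniform distribution" claim can be checked directly:

```python
import torch

# With W all zeros, exp(0) = 1 everywhere, so softmax yields the uniform
# distribution -- the limit that strong regularization pulls toward.
W = torch.zeros((27, 27))
counts = W.exp()
probs = counts / counts.sum(1, keepdim=True)  # every entry is 1/27
```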
ix = 0  # start with the special token
while True:
    p = P[ix]  # probability distribution for the current character
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    if ix == 0:  # hit the end token
        break
    out.append(i2s[ix])
Note: sampling starts from index 0 (the special start token) and repeatedly looks up the probability distribution of the next character given the current one, until the end token (also index 0) is sampled.
# ❌ Integer division loses precision
p = N[0] / N[0].sum()  # if N is an integer type
# ✅ Convert to float first
p = N[0].float()
p = p / p.sum()
Note: before normalization (division), integer tensors must be converted to float, otherwise the fractional part is lost.
prob = 0.5
log_prob = torch.log(torch.tensor(0.5))  # -0.6931
# Probability range: [0, 1]
# Log probability range: [-inf, 0]
# log prob = 0 means prob = 1 (a perfect prediction)
# log prob = -inf means prob = 0 (an impossible event)
Note: the closer the log probability is to 0, the better the prediction. The negative log likelihood is the loss function, so after negation, smaller is better.
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(s2i[ch1])  # input: index of the first character
        ys.append(s2i[ch2])  # label: index of the second character
xs = torch.tensor(xs)  # shape: (228146,)
ys = torch.tensor(ys)  # shape: (228146,)
Note: from the 32,000 names, approximately 228,000 bigram training samples are extracted. Each sample is a pair (input character index, target character index).
# W.shape = (27, 27)
# Think of it as 27 neurons, each with 27 inputs
# logits[3, 13] = activation of the 3rd input on the 13th neuron
#              = dot product of W[:, 13] with the input
Understanding: Karpathy says this is "one layer of a neural net". Each neuron computes a weighted sum (dot product) of all its inputs, producing an activation value (a firing rate).
# ❌ The sum changes with dataset size
loss = -probs[torch.arange(n), ys].log().sum()
# ✅ The mean is a normalized measure
loss = -probs[torch.arange(n), ys].log().mean()
Reason: using the mean makes the loss independent of dataset size. Whether you have 1,000 or 1 million samples, the loss is on the same scale (~2.45), making it easier to compare and debug.
# Bigram:  27×27 = 729 parameters
# Trigram: 27×27×27 = 19,683 parameters
# 4-gram:  27^4 = 531,441 parameters
# 10-gram: 27^10 ≈ 2×10^14 parameters - impossible!
Problem: Karpathy emphasizes that the counting method "doesn't scale": the parameter count grows exponentially with context length. The neural network approach solves this through parameter sharing and can scale to arbitrary context lengths, eventually reaching GPT-2-level models.
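The growth can be sketched with a tiny helper (ngram_params is a hypothetical name introduced here for illustration, not from the tutorial):

```python
# Table size of a pure counting model over a 27-character alphabet.
def ngram_params(n, vocab=27):
    """Number of entries in an n-gram count table: vocab ** n."""
    return vocab ** n

bigram = ngram_params(2)     # 729
trigram = ngram_params(3)    # 19,683
four_gram = ngram_params(4)  # 531,441
```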
# Manual implementation (for teaching purposes)
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss = -probs[torch.arange(n), ys].log().mean()
# ✅ PyTorch's built-in function (recommended for production)
import torch.nn.functional as F
loss = F.cross_entropy(logits, ys)
Note: PyTorch provides F.cross_entropy, which is internally more efficient and more numerically stable. Karpathy implements it manually for teaching purposes, to show the underlying principles.
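A quick check that the manual implementation and F.cross_entropy agree (the random logits, seed, and target indices below are made up):

```python
import torch
import torch.nn.functional as F

# Manual softmax + NLL vs F.cross_entropy on random logits.
g = torch.Generator().manual_seed(0)
n = 5
logits = torch.randn((n, 27), generator=g)
ys = torch.tensor([1, 2, 3, 4, 5])  # arbitrary target indices

counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(n), ys].log().mean()

builtin = F.cross_entropy(logits, ys)  # same value, computed more stably
```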
Review questions (with approximate video timestamps; answers are included where the original quiz preserved them):

- What does zip(w, w[1:]) produce? (~6:49)
- What does dict.get(key, default) do? (~10:04)
- What does the .item() method do? (~20:17) An indexed tensor is still a tensor object; .item() extracts the actual Python number.
- What is the default value of replacement in torch.multinomial? (~28:22)
- What problem occurs without keepdim=True in .sum(1, keepdim=True)? (~44:25) Without keepdim=True, (27,) broadcasts to (1, 27), causing column normalization.
- What is the main difference between torch.tensor and torch.Tensor? (~1:07:54) torch.tensor automatically infers the dtype from the input (integers → int64, decimals → float32), while torch.Tensor always returns float32.
- What dtype does F.one_hot() return? (~1:12:55) one_hot returns the same dtype as its input (integer indices → integer output), requiring a manual .float() call.
- What advantage does F.cross_entropy have over a manual implementation? (~1:34:30)
- What does probs[torch.arange(n), ys] return? (~1:36:08)
- Why set W.grad = None before each iteration? (~1:38:48)
- Which is more efficient: W.grad.zero_() or W.grad = None? (~1:38:48)
- What does requires_grad=True do? (~1:39:14) To optimize a tensor via backpropagation, you must set requires_grad=True.
- Why use W.data += ... instead of W += ...? (~1:41:40) W.data indicates a pure data operation that won't be tracked in the computation graph.
- What is the regularization term 0.01 * (W**2).mean() similar to? (~1:51:00)