Andrej Karpathy's Character-Level Language Model Tutorial
Document curated and edited by Lijian Liu with Claude Opus 4.5 AI assistance
This is Andrej Karpathy's "makemore" tutorial series, aimed at building a character-level language model that can generate new names similar to existing ones.
- Uses the names.txt dataset (approximately 32,000 names)
- Creates the count matrix with torch.zeros((27, 27))
- Character-to-index mapping: s2i (string to index)
- Index-to-character mapping: i2s (index to string)
- The special character . denotes both start and end

words = open('names.txt', 'r').read().splitlines()
len(words)  # 32033 names
min(len(w) for w in words)  # 2 (shortest)
max(len(w) for w in words)  # 15 (longest)
A bigram model looks only at the previous character to predict the next one. While simple ("very simple and weak"), it's a great place to start.
# "emma" contains these bigrams:
# .e, em, mm, ma, a.  (5 examples!)
N = torch.zeros((27, 27), dtype=torch.int32)
# Row = first character, column = second character
N[s2i[ch1], s2i[ch2]] += 1
p = N[0].float()  # get the first row (counts)
p = p / p.sum()   # normalize to a probability distribution
p.sum() # 1.0 ✅
g = torch.Generator().manual_seed(2147483647)
torch.rand(3, generator=g)  # same result every run
torch.multinomial(p, num_samples=1, generator=g)
When normalizing the probability matrix, we need to understand PyTorch's broadcasting rules:
P.shape # (27, 27)
# Sum across rows, keeping the dimension
P.sum(1, keepdim=True).shape  # (27, 1) column vector
# Broadcast: (27, 27) / (27, 1) -> each row divided by its own sum
P = P / P.sum(1, keepdim=True)  # correctly normalizes rows
log_likelihood = 0.0
n = 0
for ix1, ix2 in bigrams:  # index pairs for each bigram in the dataset
    prob = P[ix1, ix2]
    log_likelihood += torch.log(prob)
    n += 1
nll = -log_likelihood  # negative log likelihood
loss = nll / n         # average loss ≈ 2.45
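As a runnable sketch of the loss computation above (the random "counts" and the seed are stand-ins introduced here, not the tutorial's real data):

```python
import torch

# Average negative log likelihood over the bigrams of one word.
# NOTE: N is a random stand-in for the real bigram count matrix.
s2i = {ch: i + 1 for i, ch in enumerate('abcdefghijklmnopqrstuvwxyz')}
s2i['.'] = 0

g = torch.Generator().manual_seed(0)
N = torch.randint(1, 10, (27, 27), generator=g).float()  # fake positive counts
P = N / N.sum(1, keepdim=True)                           # row-normalized probabilities

log_likelihood = torch.tensor(0.0)
n = 0
chs = ['.'] + list('emma') + ['.']
for ch1, ch2 in zip(chs, chs[1:]):
    log_likelihood = log_likelihood + torch.log(P[s2i[ch1], s2i[ch2]])
    n += 1

nll = -log_likelihood / n  # average loss; positive since every prob < 1
```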
# Add fake counts to avoid zero probabilities
P = (N + 1).float()  # Laplace smoothing
P = P / P.sum(1, keepdim=True)
import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=27).float()
# xs = [0, 5, 13, 13, 1]
# xenc.shape = (5, 27) - each row has a single 1, the rest are 0
logits = xenc @ W  # "log counts" - can be positive or negative
counts = logits.exp()  # exponentiate -> always positive
probs = counts / counts.sum(1, keepdim=True)  # normalize -> probabilities
# This is Softmax! It converts arbitrary values into a probability distribution
What Softmax does: it converts arbitrary neural network outputs (which can be positive or negative) into a valid probability distribution (always positive, sums to 1). Karpathy says it is "very often used".
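A minimal numeric sketch of that softmax pipeline, with made-up logits:

```python
import torch

# Softmax by hand: arbitrary (made-up) logits -> a valid probability distribution.
logits = torch.tensor([2.0, -1.0, 0.5])
counts = logits.exp()          # exponentiate: always positive
probs = counts / counts.sum()  # normalize: sums to 1
```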
# No loop needed - one line!
loss = -probs[torch.arange(n), ys].log().mean()
# Equivalent to:
# probs[0, ys[0]] -> probability of sample 0 at its correct label
# probs[1, ys[1]] -> probability of sample 1 at its correct label
# ...
W.grad.shape  # (27, 27) - same shape as W
# The gradient tells us each weight's influence on the loss:
# W.grad[i,j] > 0 -> increasing W[i,j] increases the loss
# W.grad[i,j] < 0 -> increasing W[i,j] decreases the loss
Karpathy demonstrates that the learning rate can be tuned from small to large:
lr = 0.1   # conservative
lr = 1.0   # more aggressive
lr = 10.0  # faster convergence
lr = 50.0  # works even for this simple example!
for i in range(100):
    # Forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(n), ys].log().mean()
    # Backward pass
    W.grad = None
    loss.backward()
    # Update weights
    W.data -= 0.1 * W.grad
This is equivalent to "smoothing" in the statistical method:
loss = loss + 0.01 * (W**2).mean()
| Aspect | Statistical Method | Neural Network Method |
|---|---|---|
| Implementation | Counting + normalization | Gradient descent optimization |
| Final loss | ~2.45 | ~2.45 |
| Generated results | Same | Same |
| Scalability | ❌ Poor | ✅ Good |
exp(W) ≈ the count matrix N from the statistical method

The following are details that Andrej Karpathy specifically emphasized in the video (in order of appearance):
w = "emma"
for ch1, ch2 in zip(w, w[1:]):
print(ch1, ch2)
# e m
# m m
# m a
Trick: Karpathy calls this a "cute" Python trick. zip() pairs up two iterators and stops when the shorter one is exhausted. w[1:] is the slice starting from the second character, so zip(w, w[1:]) produces all consecutive character pairs.
# ❌ Incomplete: loses important information
for ch1, ch2 in zip(w, w[1:]):  # only em, mm, ma
# ✅ Complete: includes start and end info
chs = ['.'] + list(w) + ['.']  # ['.', 'e', 'm', 'm', 'a', '.']
for ch1, ch2 in zip(chs, chs[1:]):  # .e, em, mm, ma, a.
Reason: Karpathy emphasizes "we have to be careful". The word "emma" actually contains 5 bigram examples, not 3! We need to know that (1) 'e' is likely to be the first character of a name, and (2) after 'a', the name is likely to end. This is important statistical information that must not be lost.
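This can be checked directly (a tiny sketch using only the word "emma"):

```python
# Adding the start/end dots turns "emma" into 5 bigram examples, not 3.
w = 'emma'
chs = ['.'] + list(w) + ['.']
pairs = list(zip(chs, chs[1:]))  # [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
```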
# ❌ Raises an error if the key doesn't exist
b[bigram] = b[bigram] + 1  # KeyError!
# ✅ Use get() with a default value
b[bigram] = b.get(bigram, 0) + 1  # returns 0 if the key doesn't exist
Note: dict.get(key, default) returns the default value instead of raising an error when the key doesn't exist. This is a common pattern for counting with Python dictionaries.
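A short self-contained sketch of this counting pattern, using two made-up words:

```python
# Counting bigrams with dict.get, which supplies 0 for unseen keys.
# The two words here are made up for illustration.
b = {}
for w in ['emma', 'ava']:
    chs = ['.'] + list(w) + ['.']
    for bigram in zip(chs, chs[1:]):
        b[bigram] = b.get(bigram, 0) + 1  # no KeyError on first sight of a bigram
```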
# Sorts by key (the first element) by default
sorted(b.items())
# Sort by value (the second element)
sorted(b.items(), key=lambda kv: kv[1])  # ascending
sorted(b.items(), key=lambda kv: kv[1], reverse=True)  # descending
Note: Karpathy says this is "kind of gross in Python". b.items() returns (key, value) tuples, which sort by the first element by default. To sort by count instead, use key=lambda kv: kv[1] to sort by the second element.
# The default is float32
a = torch.zeros((3, 5))
a.dtype  # torch.float32
# Counts should be integers
N = torch.zeros((27, 27), dtype=torch.int32)
Note: torch.zeros() creates float32 by default. For a count matrix, an integer type makes more sense. Although we'll convert to float later for normalization, being explicit about data types is good practice.
chars = sorted(list(set(''.join(words)))) # ['a', 'b', ..., 'z']
s2i = {s: i+1 for i, s in enumerate(chars)}
s2i['.'] = 0  # special token at index 0
i2s = {i: s for s, i in s2i.items()}  # reverse mapping
Trick: enumerate() returns (index, element) pairs. Karpathy uses i+1 so the letters start at index 1, reserving index 0 for the special token '.'. Creating the reverse mapping i2s just requires swapping the key-value pairs.
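A quick sketch checking that the two mappings invert each other (the toy word list here is an assumption, not the full names.txt):

```python
# Build s2i/i2s over a toy word list and check they are inverses.
words = ['emma', 'olivia']
chars = sorted(set(''.join(words)))
s2i = {s: i + 1 for i, s in enumerate(chars)}  # letters start at index 1
s2i['.'] = 0                                   # special token at index 0
i2s = {i: s for s, i in s2i.items()}           # reverse mapping
```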
N[0, 5]  # tensor(149) - still a tensor!
type(N[0, 5])  # torch.Tensor
N[0, 5].item()  # 149 - a plain Python integer
type(N[0, 5].item())  # int
Problem: indexing into a tensor still returns a tensor object, just with a scalar shape. If you need the value in plain Python (for printing, using as an index, etc.), call .item() to extract the actual number.
# ❌ Initial design: two special tokens
<S> = 26  # start token
<E> = 27  # end token
# Problem: an entire row and an entire column of zeros in the matrix (wasted space)
# ✅ Optimized design: a single special token
. = 0  # represents both start and end
Reason: <E> can never be the first character of a bigram (it only appears at the end), and <S> can never be the second character (it only appears at the start). This leaves useless zeros in the matrix. Using a single token "." shrinks the matrix from 28×28 to 27×27.
# ❌ The default replacement=False samples without replacement
torch.multinomial(probs, num_samples=20)  # may error or give wrong results
# ✅ Language models need sampling with replacement
torch.multinomial(probs, num_samples=20, replacement=True)
Note: Karpathy says "I don't know why the default is False". For language models we need to be able to sample the same character multiple times, so replacement=True is required.
This is a very subtle and hard-to-find bug; Karpathy says it should "scare you":
P.shape # (27, 27)
# ❌ Wrong: sum produces shape (27,), which broadcasts as a (1, 27) row vector
P = N / N.sum(1)  # actually normalizes the columns!
# ✅ Correct: keep the dimension; the shape is a (27, 1) column vector
P = N / N.sum(1, keepdim=True)  # correctly normalizes rows
Reason: when keepdim=False (the default), the dimension is squeezed out. Broadcasting aligns shapes from right to left and pads missing dimensions with 1 on the left, so (27,) becomes (1, 27), normalizing along the wrong axis! Karpathy emphasizes that you have to "respect" broadcasting and check it carefully, or you'll introduce very hard-to-find bugs.
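The pitfall can be demonstrated directly; the random integer matrix and seed below are placeholders for the real count matrix N:

```python
import torch

# keepdim pitfall: without keepdim, the (27,) row sums broadcast as a
# (1, 27) row vector, dividing by the WRONG axis.
g = torch.Generator().manual_seed(0)
N = torch.randint(1, 10, (27, 27), generator=g).float()  # stand-in counts

P_right = N / N.sum(1, keepdim=True)  # (27, 27) / (27, 1): rows sum to 1
P_wrong = N / N.sum(1)                # (27, 27) / (1, 27): rows do NOT sum to 1
```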
# ❌ Creates a new tensor, wasting memory
P = P / P.sum(1, keepdim=True)
# ✅ In-place operation, more efficient
P /= P.sum(1, keepdim=True)
Recommendation: using in-place operators like /=, +=, and -= avoids creating new tensors, saving memory and improving speed.
# If a bigram never appeared in the data (e.g., "jq")
prob = 0.0
torch.log(torch.tensor(prob))  # tensor(-inf)
# The negative log likelihood becomes positive infinity
loss = -torch.log(torch.tensor(prob))  # tensor(inf)
Solution - model smoothing: add fake counts (e.g., +1) to the count matrix to ensure there are no zero probabilities. Karpathy says the infinite loss is "kind of gross", so people use smoothing to fix it. In statistics this is called "Laplace smoothing".
# Smoothing
N = N + 1  # add 1 to every count
PyTorch has two ways to create tensors; Karpathy says this is "not confusing at all" (sarcastically), and that the docs are "not clear":
# ✅ Recommended: lowercase tensor - automatically infers the dtype
xs = torch.tensor([1, 2, 3])  # dtype = torch.int64
# ❌ Not recommended: uppercase Tensor - always returns float32
xs = torch.Tensor([1, 2, 3])  # dtype = torch.float32
Recommendation: always use lowercase torch.tensor, which automatically infers the correct dtype from the input data. Karpathy mentions you need to search community threads (random threads) to understand the difference, and warns that "some of the stuff is unfortunately not easy and not very well documented and you have to be careful out there".
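A small check of the dtype behavior described above:

```python
import torch

# torch.tensor infers the dtype from the data; torch.Tensor is always float32.
a = torch.tensor([1, 2, 3])      # integers -> int64
b = torch.tensor([1.0, 2.0])     # floats -> float32
c = torch.Tensor([1, 2, 3])      # always float32, regardless of input
```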
xenc = F.one_hot(xs, num_classes=27)
xenc.dtype  # torch.int64 (a 64-bit integer!)
# Neural nets need floats, so convert manually:
xenc = F.one_hot(xs, num_classes=27).float()  # ✅
Problem: the one_hot function doesn't accept a dtype parameter (unlike many other PyTorch functions). It returns the same dtype as its input, and since the input is integer indices, the output is also integer. Karpathy emphasizes that "we always want to be careful with data types".
# (batch, input_dim) @ (input_dim, output_dim) = (batch, output_dim)
xenc.shape  # (5, 27) - 5 samples, each 27-dimensional
W.shape  # (27, 27) - 27 inputs, 27 neurons
(xenc @ W).shape  # (5, 27) - 27 outputs for each of the 5 samples
Understanding: the inner dimensions must match (27 here); the outer dimensions determine the output shape. Each row is a sample and each column is a neuron's output. Position [3, 13] is the activation of sample 3 on neuron 13, computed as the dot product of input 3 with column 13 of W.
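A sketch verifying those shapes, using the example indices xs = [0, 5, 13, 13, 1] from earlier (the random W and seed are placeholders):

```python
import torch
import torch.nn.functional as F

# Shape check for (batch, 27) @ (27, 27) -> (batch, 27).
# Because each input row is one-hot, logits[i] equals the row of W
# selected by xs[i].
g = torch.Generator().manual_seed(0)
xs = torch.tensor([0, 5, 13, 13, 1])
xenc = F.one_hot(xs, num_classes=27).float()  # (5, 27)
W = torch.randn((27, 27), generator=g)        # (27, 27)
logits = xenc @ W                             # (5, 27)
```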
# ❌ By default, leaf tensors don't compute gradients
W = torch.randn((27, 27))
W.requires_grad  # False
# ✅ Gradient computation must be enabled explicitly
W = torch.randn((27, 27), requires_grad=True)
Reason: PyTorch doesn't track gradients for leaf tensors by default, to save memory. If you want to optimize a tensor via backpropagation, you must tell PyTorch you need gradients; otherwise, after calling loss.backward(), W.grad will be None. Karpathy says "if you remember from micrograd, PyTorch requires that we pass this in".
for i in range(100):
    # Forward pass...
    # ✅ Reset the gradients (two ways)
    W.grad = None  # more efficient
    # W.grad.zero_()  # alternative
    loss.backward()
    W.data -= lr * W.grad
Reason: PyTorch accumulates gradients by default (useful for advanced techniques like gradient accumulation). If they aren't reset, new gradients add to the old ones, causing wrong updates. Karpathy notes that setting the gradient to None is more efficient than zero_(), and PyTorch treats None as zero.
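A tiny demonstration of the accumulation behavior, using a one-element tensor and a simple made-up loss:

```python
import torch

# Gradients accumulate across backward() calls unless reset.
w = torch.tensor([2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()        # d(3w)/dw = 3

(w * 3).sum().backward()      # no reset: the new gradient ADDS to the old one
accumulated = w.grad.clone()  # 3 + 3 = 6

w.grad = None                 # the reset used in the training loop
(w * 3).sum().backward()
fresh = w.grad.clone()        # back to 3
```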
# ✅ Correct: directly modify the underlying data
W.data += -0.1 * W.grad
# ❌ Not recommended: this gets tracked as part of the computation graph
W += -0.1 * W.grad
Reason: using W.data tells PyTorch this is a pure data operation that shouldn't be tracked in the computation graph. Operating directly on W can cause gradient computation problems.
# On the surface: a matrix multiplication
xenc = [0, 0, 0, 0, 0, 1, 0, ...]  # index 5 is 1
logits = xenc @ W
# In reality: it just extracts row 5 of W!
logits = W[5]  # completely equivalent
Insight: Karpathy spends significant time explaining this. It is exactly how the statistical method works - using the first character's index to look up a probability distribution in a table. The neural network does the same thing, except W holds log-counts (logits) that need exponentiation and normalization. So exp(W) is equivalent to the count matrix N from the statistical method. Both methods arrive at the same place, with the same loss and the same sampling results.
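This equivalence is easy to verify (the random W and seed below are placeholders):

```python
import torch
import torch.nn.functional as F

# Multiplying by a one-hot vector is just a row lookup.
g = torch.Generator().manual_seed(0)
W = torch.randn((27, 27), generator=g)  # random placeholder weights
xenc = F.one_hot(torch.tensor([5]), num_classes=27).float()

via_matmul = (xenc @ W)[0]  # matrix multiply with a one-hot row
via_indexing = W[5]         # plain row indexing
```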
# Smoothing in the statistical method
N = N + 1  # add fake counts
# The equivalent in the neural network: regularization
loss = loss + 0.01 * (W**2).mean()
Principle: Karpathy says this is "kind of cool". When W is all zeros, exp(0) = 1 and all probabilities become uniform. The regularization term penalizes W for deviating from zero, just as smoothing makes probabilities more uniform. Think of it as a "spring force" or "gravity" pulling W toward zero. A higher regularization strength (here 0.01) means more uniform probabilities, equivalent to adding more fake counts. If the strength is too high, the model can't learn and all predictions stay uniform.
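The "W = 0 gives a uniform distribution" claim can be checked directly:

```python
import torch

# With W all zeros, exp(0) = 1 everywhere, so softmax yields the uniform
# distribution -- the limit that strong regularization pulls toward.
W = torch.zeros((27, 27))
counts = W.exp()
probs = counts / counts.sum(1, keepdim=True)  # every entry is 1/27
```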
ix = 0  # start with the special token
while True:
    p = P[ix]  # probability distribution for the current character
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    if ix == 0:  # hit the end token
        break
    out.append(i2s[ix])
Note: sampling starts from index 0 (the special start token) and repeatedly looks up the probability distribution of the next character given the current one, until the end token (also index 0) is sampled.
# ❌ Integer division loses precision
p = N[0] / N[0].sum()  # if N is an integer type
# ✅ Convert to float first
p = N[0].float()
p = p / p.sum()
Note: before normalization (division), integer tensors must be converted to float, otherwise the fractional part is lost.
prob = 0.5
log_prob = torch.log(torch.tensor(0.5))  # -0.6931
# Probability range: [0, 1]
# Log probability range: [-inf, 0]
# log prob = 0 means prob = 1 (a perfect prediction)
# log prob = -inf means prob = 0 (an impossible event)
Note: the closer the log probability is to 0, the better the prediction. The negative log likelihood is the loss function, so after negation, smaller is better.
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(s2i[ch1])  # input: index of the first character
        ys.append(s2i[ch2])  # label: index of the second character
xs = torch.tensor(xs)  # shape: (228146,)
ys = torch.tensor(ys)  # shape: (228146,)
Note: from the 32,000 names, approximately 228,000 bigram training samples are extracted. Each sample is a pair (input character index, target character index).
# W.shape = (27, 27)
# Think of it as 27 neurons, each with 27 inputs
# logits[3, 13] = activation of the 3rd input on the 13th neuron
#              = dot product of W[:, 13] with the input
Understanding: Karpathy says this is "one layer of a neural net". Each neuron computes a weighted sum (dot product) of all its inputs, producing an activation value (a firing rate).
# ❌ The sum changes with dataset size
loss = -probs[torch.arange(n), ys].log().sum()
# ✅ The mean is a normalized measure
loss = -probs[torch.arange(n), ys].log().mean()
Reason: using the mean makes the loss independent of dataset size. Whether you have 1,000 or 1 million samples, the loss is on the same scale (~2.45), making it easier to compare and debug.
# Bigram:  27×27 = 729 parameters
# Trigram: 27×27×27 = 19,683 parameters
# 4-gram:  27^4 = 531,441 parameters
# 10-gram: 27^10 ≈ 2×10^14 parameters - impossible!
Problem: Karpathy emphasizes that the counting method "doesn't scale": the parameter count grows exponentially with context length. The neural network approach solves this through parameter sharing and can scale to arbitrary context lengths, eventually reaching GPT-2-level models.
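The growth can be sketched with a tiny helper (ngram_params is a hypothetical name introduced here for illustration, not from the tutorial):

```python
# Table size of a pure counting model over a 27-character alphabet.
def ngram_params(n, vocab=27):
    """Number of entries in an n-gram count table: vocab ** n."""
    return vocab ** n

bigram = ngram_params(2)     # 729
trigram = ngram_params(3)    # 19,683
four_gram = ngram_params(4)  # 531,441
```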
# Manual implementation (for teaching purposes)
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
loss = -probs[torch.arange(n), ys].log().mean()
# ✅ PyTorch's built-in function (recommended for production)
import torch.nn.functional as F
loss = F.cross_entropy(logits, ys)
Note: PyTorch provides F.cross_entropy, which is internally more efficient and more numerically stable. Karpathy implements it manually for teaching purposes, to show the underlying principles.
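A quick check that the manual implementation and F.cross_entropy agree (the random logits, seed, and target indices below are made up):

```python
import torch
import torch.nn.functional as F

# Manual softmax + NLL vs F.cross_entropy on random logits.
g = torch.Generator().manual_seed(0)
n = 5
logits = torch.randn((n, 27), generator=g)
ys = torch.tensor([1, 2, 3, 4, 5])  # arbitrary target indices

counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(n), ys].log().mean()

builtin = F.cross_entropy(logits, ys)  # same value, computed more stably
```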
Review questions (with approximate video timestamps; answers are included where the original quiz preserved them):

- What does zip(w, w[1:]) produce? (~6:49)
- What does dict.get(key, default) do? (~10:04)
- What does the .item() method do? (~20:17) An indexed tensor is still a tensor object; .item() extracts the actual Python number.
- What is the default value of replacement in torch.multinomial? (~28:22)
- What problem occurs without keepdim=True in .sum(1, keepdim=True)? (~44:25) Without keepdim=True, (27,) broadcasts to (1, 27), causing column normalization.
- What is the main difference between torch.tensor and torch.Tensor? (~1:07:54) torch.tensor automatically infers the dtype from the input (integers → int64, decimals → float32), while torch.Tensor always returns float32.
- What dtype does F.one_hot() return? (~1:12:55) one_hot returns the same dtype as its input (integer indices → integer output), requiring a manual .float() call.
- What advantage does F.cross_entropy have over a manual implementation? (~1:34:30)
- What does probs[torch.arange(n), ys] return? (~1:36:08)
- Why set W.grad = None before each iteration? (~1:38:48)
- Which is more efficient: W.grad.zero_() or W.grad = None? (~1:38:48)
- What does requires_grad=True do? (~1:39:14) To optimize a tensor via backpropagation, you must set requires_grad=True.
- Why use W.data += ... instead of W += ...? (~1:41:40) W.data indicates a pure data operation that won't be tracked in the computation graph.
- What is the regularization term 0.01 * (W**2).mean() similar to? (~1:51:00)