🤖 nanochat: Karpathy's $48 GPT-2 Replication, a ChatGPT-Style Chat Model Trained in 2 Hours
Project: karpathy/nanochat | ⭐ 53.3k | 🛠 Python | 👤 Andrej Karpathy
To be honest, most people working on LLMs have never trained one with their own hands. Not because they don't want to, but because it is genuinely expensive: OpenAI spent $43,000 training GPT-2 back in 2019. Karpathy's nanochat drives that number down to $48. Two hours on an 8×H100 machine, and you own a chat model with GPT-2-level capability that you can talk to in your browser.
One `--depth` Dial for Everything
nanochat pulls off a neat trick: there is exactly one complexity dial, --depth (the number of Transformer layers). Width, head count, learning rate, training steps, weight decay... every other hyperparameter is derived automatically and kept compute-optimal. Want a bigger model? Turn up depth. Want a quick experiment? Turn it down. Everything else adapts automatically.
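To make the dial concrete, here is a hypothetical sketch of what deriving everything from depth could look like. The specific rules (width = depth × 64, a fixed head dimension of 128, a square-root learning-rate scaling) are invented for illustration and are not nanochat's actual formulas.

```python
# Hypothetical sketch: all hyperparameters derived from a single depth dial.
# The ratios below are illustrative assumptions, not nanochat's real rules.

def derive_config(depth: int) -> dict:
    """Derive model hyperparameters from the single --depth dial (illustrative)."""
    model_dim = depth * 64                   # assumed linear depth-to-width ratio
    num_heads = max(1, model_dim // 128)     # assumed fixed head dimension of 128
    lr = 0.02 * (64 / model_dim) ** 0.5      # assumed: scale LR down as width grows
    return {"depth": depth, "model_dim": model_dim,
            "num_heads": num_heads, "lr": lr}

print(derive_config(12))   # quick-experiment size
print(derive_config(26))   # roughly GPT-2-class size
```

The point of the design is that a user changes one integer and every dependent quantity moves with it, so no configuration can drift out of the compute-optimal regime.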
# Set up the environment
uv sync --extra gpu
source .venv/bin/activate
# Train a GPT-2-class model (8×H100, ~2 hours)
bash runs/speedrun.sh
# Launch the chat WebUI
python -m scripts.chat_web
Once training finishes, visit http://<your-IP>:8000/ and you can chat with your own model. The experience is roughly like talking to a kindergartner: it writes poems, makes up stories, and spouts nonsense with a straight face.
Speed Leaderboard: From 168 Hours to 1.65 Hours
Best of all, nanochat maintains a "GPT-2 speedrun leaderboard". OpenAI's original GPT-2 took 168 hours to train, reaching a DCLM CORE score of 0.2565. Now? The latest record is 1.65 hours with a CORE score of 0.2626: more than 100x faster, with higher quality to boot.
From $43,000 to $48, from a week to 99 minutes. Seven years of technical progress, all in one comparison.
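The headline numbers are easy to sanity-check with a few lines of arithmetic (the hour and dollar figures come from the claims above, not from a fresh measurement):

```python
# Sanity-check the headline claims: 168 h -> 1.65 h and $43,000 -> $48.
original_hours, record_hours = 168.0, 1.65
original_cost, nanochat_cost = 43_000, 48

speedup = original_hours / record_hours      # "more than 100x faster"
cost_ratio = original_cost / nanochat_cost   # how many times cheaper
minutes = record_hours * 60                  # "99 minutes"

print(f"{speedup:.1f}x faster, {cost_ratio:.0f}x cheaper, {minutes:.0f} min")
```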
# Run a quick experiment on your own data (~5 minutes)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train --depth=12
A Full Pipeline: More Than Just Pretraining
nanochat is not a toy. It covers the full LLM training workflow, from tokenizer to RL to WebUI:
# Train the tokenizer
python -m scripts.tok_train
# Pretrain the base model (d26 ≈ GPT-2 capability)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train --depth=26
# SFT fine-tuning + RL
python -m scripts.chat_sft
python -m scripts.chat_rl
# Chat directly via CLI / WebUI
python -m scripts.chat_cli -p "Hello!"
python -m scripts.chat_web
Precision control is clean too. nanochat does not use torch.amp.autocast; instead it manages precision explicitly through a global COMPUTE_DTYPE. The default is bf16 on A100/H100 with an automatic fp32 fallback on V100, and you can force a dtype via an environment variable:
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train
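A minimal sketch of what this kind of explicit, env-overridable dtype policy can look like; the precedence order and the hardware check below are assumptions for illustration, not nanochat's exact code:

```python
import os

# Illustrative sketch of an explicit dtype policy driven by NANOCHAT_DTYPE.
# The fallback rule (bf16 only where supported) is an assumption for this sketch.

def resolve_compute_dtype(supports_bf16: bool) -> str:
    """Pick a global compute dtype: env override first, then hardware default."""
    override = os.environ.get("NANOCHAT_DTYPE")
    if override in ("float32", "bfloat16"):
        return override                      # explicit user choice wins
    # A100/H100-class hardware defaults to bf16; older GPUs fall back to fp32.
    return "bfloat16" if supports_bf16 else "float32"

print(resolve_compute_dtype(supports_bf16=True))
print(resolve_compute_dtype(supports_bf16=False))
```

Compared with autocast, which switches dtypes per-op behind the scenes, a single global value makes it obvious which precision every tensor is computed in.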
Why It Deserves Your Attention
1. Insane price: $48 for what used to cost $43,000; rent a GPU box by the hour and shut it down when you're done
2. Minimal code: no framework-level abstractions, no giant config files, pure PyTorch, readable enough that you'll want to fork it
3. End-to-end: tokenizer training → pretraining → SFT → RL → evaluation → WebUI, all in one repo
4. Community-driven: the speedrun leaderboard turns optimization into a game, community PRs keep breaking the record, and Karpathy reviews them himself
Forget the fancy stuff. If you want to genuinely understand how LLMs are trained, running nanochat once beats reading a hundred papers.
🤖 nanochat: Karpathy's $48 GPT-2 Replication — Train a ChatGPT-Class Model in 2 Hours
Project: karpathy/nanochat | ⭐ 53.3k | 🛠 Python | 👤 Andrej Karpathy
Let's be honest — most people working with LLMs have never actually trained one from scratch. Not because they don't want to, but because it's expensive. OpenAI spent $43,000 training GPT-2 back in 2019. Karpathy's nanochat brings that down to $48 — 2 hours on an 8×H100 box, and you get a GPT-2-class chat model you can talk to in your browser.
One Dial to Rule Them All
nanochat has a single complexity knob: --depth (number of transformer layers). Width, heads, learning rate, training steps, weight decay — every hyperparameter is auto-derived for compute-optimal results. Want a bigger model? Turn up depth. Quick experiment? Turn it down. Everything else just works.
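As a rough illustration of the single-dial idea, here is a toy version of such a derivation. The scaling rules below (width proportional to depth, a fixed head dimension, the standard ~12·L·d² transformer parameter estimate) are assumptions for this sketch, not nanochat's actual formulas.

```python
# Toy illustration of a one-dial config: everything derives from depth.
# The ratios (width = 64 * depth, head_dim = 128) are assumed, not nanochat's.

def config_from_depth(depth: int) -> dict:
    width = 64 * depth                     # assumed linear depth-to-width ratio
    return {
        "depth": depth,
        "width": width,
        "heads": width // 128,             # assumed fixed head dimension of 128
        "params_m": round(12 * depth * width**2 / 1e6),  # ~12*L*d^2 param estimate
    }

for d in (12, 20, 26):                     # quick experiment -> GPT-2-class
    print(config_from_depth(d))
```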
# Setup
uv sync --extra gpu
source .venv/bin/activate
# Train GPT-2 class model (8×H100, ~2 hours)
bash runs/speedrun.sh
# Launch chat WebUI
python -m scripts.chat_web
Visit http://<your-IP>:8000/ and chat with your very own model. It writes poems, tells stories, and confidently hallucinates — like a kindergartner with opinions.
Speed Leaderboard: 168h → 1.65h
The original GPT-2 took 168 hours with a DCLM CORE score of 0.2565. The latest nanochat record: 1.65 hours at 0.2626 — over 100x faster and better quality. From $43,000 to $48.
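The "over 100x" claim checks out with one line of arithmetic (figures taken from the leaderboard numbers above):

```python
# Verify the leaderboard claim: 168 hours down to 1.65 hours is >100x.
speedup = 168 / 1.65
assert speedup > 100
print(f"speedup: {speedup:.1f}x, cost: $43,000 -> $48")
```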
# Quick experiment (~5 min on 8×H100)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train --depth=12
Full Pipeline: More Than Pretraining
nanochat covers the entire LLM workflow — tokenizer, pretraining, SFT, RL, evaluation, and WebUI:
# Train tokenizer
python -m scripts.tok_train
# Pretrain base model
torchrun --standalone --nproc_per_node=8 -m scripts.base_train --depth=26
# SFT + RL
python -m scripts.chat_sft
python -m scripts.chat_rl
# CLI / WebUI chat
python -m scripts.chat_cli -p "Hello!"
python -m scripts.chat_web
Precision is managed explicitly through a global COMPUTE_DTYPE. Default is bf16 on A100/H100, fp32 fallback on older GPUs. Override with:
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train
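In sketch form, such a policy is just a small precedence rule: explicit NANOCHAT_DTYPE override first, hardware default second. The validation and the hardware check below are assumptions for illustration; nanochat's actual implementation may differ.

```python
import os

VALID_DTYPES = {"float32", "bfloat16"}

def compute_dtype(device_supports_bf16: bool) -> str:
    """Resolve the global compute dtype (illustrative sketch)."""
    env = os.environ.get("NANOCHAT_DTYPE", "")
    if env:
        if env not in VALID_DTYPES:
            raise ValueError(f"unsupported NANOCHAT_DTYPE: {env!r}")
        return env                           # explicit override wins
    # bf16 on Ampere/Hopper-class GPUs, fp32 everywhere else (assumed check).
    return "bfloat16" if device_supports_bf16 else "float32"

print(compute_dtype(device_supports_bf16=True))
```

One global value, resolved once, means every kernel in the run sees the same precision; there is no per-op autocast state to reason about.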
Why It Matters
1. Price is insane: $48 to experience what used to cost $43,000 — rent a GPU box by the hour
2. Minimal code: Pure PyTorch, no framework abstractions, highly forkable and readable
3. End-to-end: Tokenizer → pretrain → SFT → RL → eval → WebUI, all in one repo
4. Community-driven: The leaderboard gamifies optimization, PRs keep breaking records
Skip the theory. Run nanochat once — it teaches you more about LLM training than a hundred papers.