欣淇
Published 2026-05-11

🤖 nanochat: Karpathy's $48 GPT-2 Replication, a ChatGPT-Class Chat Model in 2 Hours

Project: karpathy/nanochat | ⭐ 53.3k | 🛠 Python | 👤 Andrej Karpathy


Let's be honest: most people working on LLMs have never trained one themselves. Not for lack of interest, but because it's genuinely expensive; OpenAI spent $43,000 training GPT-2 in 2019. Karpathy's nanochat drives that number down to $48. Two hours on one 8×H100 machine and you own a GPT-2-class chat model you can talk to in your browser.

One `--depth` Dial for Everything

nanochat pulls a neat trick: there is exactly one complexity dial, `--depth` (the number of Transformer layers). Width, head count, learning rate, training steps, weight decay: every other hyperparameter is derived automatically for compute-optimal training. Want a bigger model? Turn depth up. Need a quick experiment? Turn it down. Everything else adapts on its own.
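The actual derivation rules live inside nanochat's training scripts; the sketch below only illustrates the idea of deriving every hyperparameter from one knob. The function name `derive_config` and all of the specific formulas (linear width rule, square-root learning-rate rule) are illustrative assumptions, not nanochat's real code.

```python
# Sketch of a "one knob" config: every hyperparameter follows from depth.
# The specific scaling rules below are illustrative assumptions.

def derive_config(depth: int) -> dict:
    head_dim = 128                     # assume a fixed per-head dimension
    n_embd = depth * 64                # assume width scales linearly with depth
    n_head = max(1, n_embd // head_dim)
    # assume smaller models tolerate proportionally larger learning rates
    lr = 0.02 * (12 / depth) ** 0.5
    return {"n_layer": depth, "n_embd": n_embd, "n_head": n_head, "lr": round(lr, 4)}

print(derive_config(12))   # quick-experiment scale
print(derive_config(26))   # the d26 ≈ GPT-2 scale used by speedrun.sh
```

The point of the pattern is that the user never touches width, heads, or learning rate directly; changing one integer re-derives a self-consistent configuration.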

# Install the environment
uv sync --extra gpu
source .venv/bin/activate

# Train a GPT-2 class model (8×H100, ~2 hours)
bash runs/speedrun.sh

# Launch the chat WebUI
python -m scripts.chat_web

Once training finishes, visit http://<your-IP>:8000/ and chat with your own model. Expect roughly kindergartner-level conversation: it writes poems, makes up stories, and spouts nonsense with a perfectly straight face.

Speed Leaderboard: From 168 Hours to 1.65 Hours

The best part is that nanochat maintains a GPT-2 speedrun leaderboard. OpenAI's original GPT-2 training took 168 hours and scored 0.2565 on DCLM CORE. The current record? 1.65 hours with a CORE score of 0.2626: more than 100× faster, and higher quality too.

From $43,000 to $48, and from a week to 99 minutes. Seven years of technical progress, captured in a single comparison.
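The headline ratios are easy to verify from the numbers quoted above:

```python
# Verify the claims: 168 h -> 1.65 h and $43,000 -> $48.
speedup = 168 / 1.65        # wall-clock speedup over the original GPT-2 run
cost_ratio = 43_000 / 48    # cost reduction
print(f"{speedup:.0f}x faster, {cost_ratio:.0f}x cheaper")
# prints: 102x faster, 896x cheaper
```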

# Run a quick experiment with your own data (~5 minutes)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train --depth=12

Full Pipeline: More Than Just Pretraining

nanochat is not a toy. It covers the entire LLM training workflow, from the tokenizer through RL to the WebUI:

# Train the tokenizer
python -m scripts.tok_train

# Pretrain the base model (d26 ≈ GPT-2 capability)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train --depth=26

# SFT fine-tuning + RL
python -m scripts.chat_sft
python -m scripts.chat_rl

# Chat via the CLI / WebUI
python -m scripts.chat_cli -p "Hello!"
python -m scripts.chat_web

Precision handling is clean as well. Instead of torch.amp.autocast, nanochat manages dtype explicitly through a global COMPUTE_DTYPE. The default is bf16 on A100/H100 with an automatic fp32 fallback on V100, and an environment variable can force either:

NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train
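The resolution order described above (environment override first, then hardware capability) can be sketched like this. The helper name and the `bf16_supported` probe argument are assumptions for illustration, not nanochat's actual internals:

```python
import os

def resolve_compute_dtype(bf16_supported: bool) -> str:
    """Pick the global compute dtype: env override first, then hardware."""
    override = os.environ.get("NANOCHAT_DTYPE")
    if override in ("float32", "bfloat16"):
        return override
    # bf16 on A100/H100-class hardware, fp32 fallback on V100-class
    return "bfloat16" if bf16_supported else "float32"

os.environ.pop("NANOCHAT_DTYPE", None)
print(resolve_compute_dtype(bf16_supported=True))    # bfloat16
os.environ["NANOCHAT_DTYPE"] = "float32"
print(resolve_compute_dtype(bf16_supported=True))    # forced to float32
```

One global, explicitly-resolved dtype means every tensor is created in the right precision up front; there is no autocast context whose behavior varies by op.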

Why It's Worth Your Attention

  1. Insane price: $48 for what used to cost $43,000. Rent a GPU box by the hour and shut it down when you're done, no regrets.
  2. Minimal code: no framework-level abstractions, no giant config files. Pure PyTorch, readable enough that you'll want to fork it.
  3. End-to-end coverage: tokenizer training → pretraining → SFT → RL → evaluation → WebUI, all in one repo.
  4. Community-driven: the speed leaderboard turns optimization into a game, community PRs keep breaking the record, and Karpathy reviews them himself.

Forget the fancy stuff. If you want to genuinely understand how an LLM gets trained, running nanochat once beats reading a hundred papers.
