🎬 video-use: 7.6k Stars — Edit Videos with Claude Code, No Premiere Required
github.com/browser-use/video-use | ⭐ 7,664 | 🛠 Python | 👤 browser-use
Honestly, I've always dreaded video editing. Open Premiere, wait two minutes, tweak the timeline, wait for render, fix a subtitle typo, export the whole thing again. And those filler words — "umm," "uh," false starts — manually cutting them out is soul-crushing.
The browser-use team just shipped video-use: drop raw footage in a folder, tell Claude Code "edit these into a launch video," and it handles the rest. 100% open source, no paid services required (except the optional ElevenLabs API).
What It Does
🔥 Auto-cuts filler words — umm, ah, false starts, silence between takes. Word-level precision via transcript analysis, not timeline scrubbing.
⚡ Auto color grading — warm cinematic, neutral punch, or your own custom ffmpeg chain. Applied per segment (see the first sketch after this list).
🎯 30ms audio fades at every cut — no pops, guaranteed.
📝 Burns subtitles — 2-word UPPERCASE chunks by default, fully customizable.
🎨 Animation overlays via HyperFrames, Remotion, Manim, or PIL — spawned as parallel sub-agents, one sub-process per animation.
🔄 Self-evaluates rendered output at every cut boundary. Catches visual jumps and audio pops, and auto-fixes with up to 3 retries (see the second sketch after this list).
💾 Session memory in project.md — pick up next week where you left off.
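To make the per-segment processing concrete, here is a hand-rolled ffmpeg sketch of the same idea. These are not video-use's actual commands; the cut points, filter settings, and filenames are all illustrative:
# Keep 3.21s-8.74s of a take (boundaries come from word-level timestamps),
# apply a warm-ish grade, burn subtitles, and fade audio in/out over 30ms
# so the cut doesn't pop. captions.srt must be timed relative to the segment;
# the fade-out starts at segment length (5.53s) minus 0.03s.
ffmpeg -ss 3.21 -to 8.74 -i take_01.mp4 \
  -vf "eq=saturation=1.15:contrast=1.05,subtitles=captions.srt" \
  -af "afade=t=in:st=0:d=0.03,afade=t=out:st=5.50:d=0.03" \
  segment_01.mp4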
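And a rough illustration of the self-evaluation idea: check the audio around a cut boundary for a loud transient. Again, this shows the concept, not the skill's actual evaluation code:
# Measure the peak level in a 100ms window around a cut at 8.74s;
# a spike near 0 dB suggests an audible pop. astats logs to stderr.
ffmpeg -ss 8.69 -t 0.1 -i edit/final.mp4 -af astats -f null - 2>&1 \
  | grep -i "peak level"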
Quick Start
# 1. Clone and symlink
git clone https://github.com/browser-use/video-use ~/Developer/video-use
ln -sfn ~/Developer/video-use ~/.claude/skills/video-use
# 2. Install deps
cd ~/Developer/video-use
uv sync
brew install ffmpeg
# 3. Add API key
cp .env.example .env
# Edit .env with ELEVENLABS_API_KEY
# 4. Point your agent at raw footage
cd /path/to/your/videos
claude
Then just say:
edit these into a launch video
It inventories the sources, proposes a strategy, waits for your OK, then produces edit/final.mp4.
How It Works
The LLM never watches the video — it reads it.
Layer 1 — Audio transcript. One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events (laughter, applause). All takes are packed into a single ~12KB takes_packed.md, which is what the LLM actually reads.
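A minimal sketch of such a call, assuming the current ElevenLabs speech-to-text endpoint; the field names here are from memory, so verify them against the API reference:
# One Scribe call per source file; the response's words array carries
# start/end times, a type (word / spacing / audio_event), and a speaker id.
curl -s https://api.elevenlabs.io/v1/speech-to-text \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F model_id=scribe_v1 \
  -F diarize=true \
  -F tag_audio_events=true \
  -F file=@take_01.mp4 \
  | jq '.words[:5]'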
Layer 2 — Visual composite (on demand). timeline_view produces a filmstrip + waveform + word labels PNG for any time range. Called only at decision points: judging whether a pause is a hesitation, comparing two retakes, or verifying a cut lands where it should.
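You can approximate that composite with stock ffmpeg filters. A rough sketch (video-use's own timeline_view is more elaborate; the window and sizes here are arbitrary):
# Tile 10 thumbnails from a 5s window above its waveform in a single PNG.
# fps=2 over 5s yields 10 frames; 10 x 160px thumbnails match the 1600px waveform.
ffmpeg -ss 10 -t 5 -i take_01.mp4 -filter_complex \
  "[0:v]fps=2,scale=160:90,tile=10x1,format=rgba[strip];[0:a]showwavespic=s=1600x120[wave];[strip][wave]vstack" \
  -frames:v 1 timeline_10s_15s.png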
Naive approach: 30,000 frames × 1,500 tokens = 45M tokens of noise.
video-use's approach: 12KB of text + a handful of PNGs.
Same philosophy as browser-use giving an LLM a structured DOM instead of a screenshot — but for video.
Gotchas
- Default transcriber is ElevenLabs (not open source). You can swap in Whisper locally (see the sketch after this list), but timestamp accuracy drops.
- Have your ElevenLabs API key ready before the first run.
- Currently no "pick the best take" logic — pre-filter your footage before dumping it in.
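If you do go the local route, a sketch with the openai-whisper CLI (flag names per its current help output; verify against your installed version):
# Word-level timestamps from a local Whisper run: fully offline,
# but slower and less precise than Scribe.
pip install openai-whisper
whisper take_01.mp4 --model medium --word_timestamps True --output_format json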
Summary
- 7.6k Stars from the browser-use team — quality pedigree
- Chat-based video editing with Claude Code, zero learning curve
- Auto filler-word removal + color grading + subtitles + animations
- Open ffmpeg chain, no creative limitations
- Self-evaluation loop ensures output quality