欣淇 · Published on 2026-05-16


🎬 video-use: 7.6k Stars — Edit Videos with Claude Code, No Premiere Required

github.com/browser-use/video-use | ⭐ 7,664 | 🛠 Python | 👤 browser-use


Honestly, I've always dreaded video editing. Open Premiere, wait two minutes, tweak the timeline, wait for render, fix a subtitle typo, export the whole thing again. And those filler words — "umm," "uh," false starts — manually cutting them out is soul-crushing.

The browser-use team just dropped video-use. Drop raw footage in a folder, tell Claude Code "edit these into a launch video," and it handles the rest. 100% open source, no paid service required (except the optional ElevenLabs API).

What It Does

🔥 Auto-cuts filler words: umm, ah, false starts, silence between takes. Word-level precision via transcript analysis, not timeline scrubbing.

Auto color grading — warm cinematic, neutral punch, or your own custom ffmpeg chain. Applied per segment.

🎯 30ms audio fades at every cut — no pops, guaranteed.

📝 Burns subtitles — 2-word UPPERCASE chunks by default, fully customizable.

🎨 Animation overlays via HyperFrames, Remotion, Manim, or PIL — spawned as parallel sub-agents, one per animation.

🔄 Self-evaluates rendered output at every cut boundary. Catches visual jumps and audio pops, auto-fixes up to 3 retries.

💾 Session memory in project.md — pick up next week where you left off.
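The core of the feature list above is that cuts are planned on the transcript, not the timeline. Here's a minimal sketch of how word-level cut planning might work; the `Word` shape, the filler list, and `plan_keep_segments` are illustrative, not video-use's actual API (the 30ms fades would then be applied at each returned segment boundary):

```python
from dataclasses import dataclass

FILLERS = {"umm", "uh", "ah", "er"}   # illustrative filler list
MAX_GAP = 0.5                         # silence longer than this gets cut (assumed threshold)

@dataclass
class Word:
    text: str
    start: float  # seconds, from the word-level transcript
    end: float

def plan_keep_segments(words):
    """Merge consecutive non-filler words into keep-segments,
    cutting wherever a filler word is dropped or a long silence occurs."""
    segments, cur = [], None
    for w in words:
        if w.text.lower().strip(".,!?") in FILLERS:
            if cur:                       # close the segment before the filler
                segments.append(tuple(cur))
                cur = None
            continue
        if cur is None:
            cur = [w.start, w.end]
        elif w.start - cur[1] > MAX_GAP:  # long pause: close and start fresh
            segments.append(tuple(cur))
            cur = [w.start, w.end]
        else:
            cur[1] = w.end                # contiguous speech: extend the segment
    if cur:
        segments.append(tuple(cur))
    return segments

words = [Word("So", 0.0, 0.2), Word("umm", 0.25, 0.5),
         Word("today", 0.6, 0.9), Word("we", 0.95, 1.1), Word("ship.", 1.15, 1.5)]
print(plan_keep_segments(words))  # → [(0.0, 0.2), (0.6, 1.5)]
```

Each `(start, end)` pair then becomes one trim in the ffmpeg filter graph, with a short fade on both sides of the cut.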

Quick Start

# 1. Clone and symlink
git clone https://github.com/browser-use/video-use ~/Developer/video-use
ln -sfn ~/Developer/video-use ~/.claude/skills/video-use

# 2. Install deps
cd ~/Developer/video-use
uv sync
brew install ffmpeg

# 3. Add API key
cp .env.example .env
# Edit .env with ELEVENLABS_API_KEY

# 4. Point your agent at raw footage
cd /path/to/your/videos
claude

Then just say:

edit these into a launch video

It inventories the sources, proposes a strategy, waits for your OK, then produces edit/final.mp4.

How It Works

The LLM never watches the video — it reads it.

Layer 1 — Audio transcript. One ElevenLabs Scribe call per source gives word-level timestamps, speaker diarization, and audio events (laughter, applause). All takes pack into a ~12KB takes_packed.md, which the LLM reads directly.
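The exact takes_packed.md layout is video-use's own; to make the idea concrete, here's a hypothetical packer that collapses word-level timestamps into compact, timestamped runs of speech (the format and the `pack_take` helper are my assumptions):

```python
def pack_take(name, words, gap=0.8):
    """Render one take as compact text: a heading plus [start-end] runs of
    speech, split wherever a pause longer than `gap` seconds occurs.
    The real takes_packed.md format may differ; this is illustrative."""
    lines, run, run_start, prev_end = [f"## {name}"], [], None, None
    for text, start, end in words:       # words = (text, start, end) tuples
        if run and start - prev_end > gap:  # long pause: flush the current run
            lines.append(f"[{run_start:.2f}-{prev_end:.2f}] " + " ".join(run))
            run, run_start = [], None
        if run_start is None:
            run_start = start
        run.append(text)
        prev_end = end
    if run:
        lines.append(f"[{run_start:.2f}-{prev_end:.2f}] " + " ".join(run))
    return "\n".join(lines)

print(pack_take("take_01.mp4",
                [("Hi", 0.0, 0.3), ("there", 0.35, 0.6), ("welcome", 2.0, 2.5)]))
```

A few lines of text per take is what keeps the whole bundle near 12KB instead of millions of frame tokens.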

Layer 2 — Visual composite (on demand). timeline_view renders a filmstrip + waveform + word-label PNG for any time range. Called only at decision points: judging a hesitant pause, comparing two retakes, verifying a cut lands cleanly.
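A filmstrip over a range presumably means sampling a handful of frames rather than decoding every one. A sketch of that sampling step, with assumed helper and tile count (`sample_times` is my name, not a video-use function):

```python
def sample_times(t0, t1, n=8):
    """Evenly spaced timestamps (tile centers) across [t0, t1] for a filmstrip;
    each becomes a single cheap frame grab instead of a full decode."""
    if n < 1 or t1 <= t0:
        raise ValueError("need t1 > t0 and n >= 1")
    step = (t1 - t0) / n
    return [t0 + step * (i + 0.5) for i in range(n)]

print(sample_times(10.0, 14.0, 4))  # → [10.5, 11.5, 12.5, 13.5]

# Each timestamp then maps to one extraction along the lines of:
#   ffmpeg -ss {t:.3f} -i take.mp4 -frames:v 1 tile_{i}.png
```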

Naive approach: 30,000 frames × 1,500 tokens = 45M tokens of noise.
video-use's approach: ~12KB of text + a handful of PNGs.

Same philosophy as browser-use giving an LLM a structured DOM instead of a screenshot — but for video.

Gotchas

  • Default transcriber is ElevenLabs (not open-source). You can swap to Whisper locally, but timestamp accuracy drops.
  • Have your ElevenLabs API key ready before the first run.
  • Currently no "pick the best take" logic — pre-filter your footage before dumping it in.
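If you do swap in Whisper, the glue is small. A sketch assuming the openai-whisper package; `to_words` and `transcribe_local` are my names, not part of video-use:

```python
def to_words(result):
    """Flatten a Whisper transcription result (run with word_timestamps=True)
    into (text, start, end) tuples: the same shape a Scribe transcript yields."""
    return [(w["word"].strip(), w["start"], w["end"])
            for seg in result["segments"] for w in seg.get("words", [])]

def transcribe_local(path, model_size="base"):
    """Local replacement for ElevenLabs via openai-whisper
    (pip install openai-whisper). Expect coarser word timing, as noted above."""
    import whisper  # lazy import: to_words stays dependency-free
    model = whisper.load_model(model_size)
    return to_words(model.transcribe(path, word_timestamps=True))
```

The trade-off stands: Whisper's word timestamps are interpolated within segments, so cut points land less precisely than Scribe's.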

Summary

  • 7.6k Stars from the browser-use team — quality pedigree
  • Chat-based video editing with Claude Code, zero learning curve
  • Auto filler-word removal + color grading + subtitles + animations
  • Open ffmpeg chain, no creative limitations
  • Self-evaluation loop ensures output quality
