BytesAgain is a curated directory of 60,000+ AI agent skills from ClawHub, GitHub, LobeHub, and Dify. Search skills by keyword in 7 languages, browse by role (developer, creator, trader, marketer) or by use case.

How do I find AI skills on BytesAgain?

Use the search bar on BytesAgain.com to search by keyword in 7 languages. You can also browse by role (developer, creator, trader, marketer) or by use case. Each skill shows install instructions for Claude, Cursor, OpenClaw, Continue, and more.

Yes, BytesAgain is completely free. No registration required for searching skills. The MCP API is also free with rate limits.

Does BytesAgain have an API for AI agents?

Yes! BytesAgain provides a free MCP SSE endpoint at /api/mcp/sse for AI agents, plus a REST API at /api/mcp?action=search&q= . No authentication needed.

Can I request a new AI skill on BytesAgain?

Yes! Visit the Requests page on BytesAgain.com to submit a skill request. Your request will be visible to the community and notified to the site admin.

🦀 ClawHub

Faster Whisper

by @theplasmak

Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. SRT...

Versionv1.5.1

Installs40

#clawhub

⚡ When to Use

Trigger	Action
- Transcribe audio/video files — meetings, interviews, podcasts, lectures, YouTube videos
- Generate subtitles — SRT, VTT, ASS, LRC, or TTML broadcast-standard subtitles
- Identify speakers — diarization labels who said what (`--diarize`)
- Transcribe from URLs — YouTube links and direct audio URLs (auto-downloads via yt-dlp)
- Transcribe podcast feeds — `--rss ` fetches and transcribes episodes
- Batch process files — glob patterns, directories, skip-existing support; ETA shown automatically
- Convert speech to text locally — no API costs, works offline (after model download)
- Translate to English — translate any language to English with `--translate`
- Do multilingual transcription — supports 99+ languages with auto-detection
- Transcribe a batch of files in different languages — `--language-map` assigns a different language per file
- Transcribe multilingual audio — `--multilingual` for mixed-language audio
- Transcribe audio with specific terms — use `--initial-prompt` for jargon-heavy content or any other terms to look out for
- Preprocess noisy audio (before transcription) — `--normalize` and `--denoise` before transcription
- Stream output — `--stream` shows segments as they're transcribed
- Clip time ranges — `--clip-timestamps` to transcribe specific sections
- Search the transcript — `--search "term"` finds all timestamps where a word/phrase appears
- Detect chapters — `--detect-chapters` finds section breaks from silence gaps
- Export speaker audio — `--export-speakers DIR` saves each speaker's turns as separate WAV files
- Spreadsheet output — `--format csv` produces a properly-quoted CSV with timestamps
Trigger phrases:
"transcribe this audio", "convert speech to text", "what did they say", "make a transcript",
"audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English",
"find where X is mentioned", "search transcript for", "when did they say", "at what timestamp",
"add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording",
"TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format",
"ASS subtitles", "aegisub format", "advanced substation alpha", "mpv subtitles",
"LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics",
"HTML transcript", "confidence-colored transcript", "color-coded transcript",
"separate audio per speaker", "export speaker audio", "split by speaker",
"transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed",
"different languages in batch", "per-file language",
"transcribe in multiple formats", "srt and txt at the same time", "output both srt and text",
"remove filler words", "clean up ums and uhs", "strip hesitation sounds", "remove you know and I mean",
"transcribe left channel", "transcribe right channel", "stereo channel", "left track only",
"wrap subtitle lines", "character limit per line", "max chars per subtitle",
"detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"
⚠️ Agent guidance — keep invocations minimal:
_CORE RULE: default command (`./scripts/transcribe audio.mp3`) is the fastest path — add flags only when the user explicitly asks for that capability._
Transcription:
- Only add `--diarize` if the user asks "who said what" / "identify speakers" / "label speakers"
- Only add `--format srt/vtt/ass/lrc/ttml` if the user asks for subtitles/captions in that format
- Only add `--format csv` if the user asks for CSV or spreadsheet output
- Only add `--word-timestamps` if the user needs word-level timing
- Only add `--initial-prompt` if there's domain-specific jargon to prime
- Only add `--translate` if the user wants non-English audio translated to English
- Only add `--normalize`/`--denoise` if the user mentions bad audio quality or noise
- Only add `--stream` if the user wants live/progressive output for long files
- Only add `--clip-timestamps` if the user wants a specific time range
- Only add `--temperature 0.0` if the model is hallucinating on music/silence
- Only add `--vad-threshold` if VAD is aggressively cutting speech or including noise
- Only add `--min-speakers`/`--max-speakers` when you know the speaker count
- Only add `--hf-token` if the token is not cached at `~/.cache/huggingface/token`
- Only add `--max-words-per-line` for subtitle readability on long segments
- Only add `--filter-hallucinations` if the transcript contains obvious artifacts (music markers, duplicates)
- Only add `--merge-sentences` if the user asks for sentence-level subtitle cues
- Only add `--clean-filler` if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)
- Only add `--channel left\|right` if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel
- Only add `--max-chars-per-line N` when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); takes priority over `--max-words-per-line`
- Only add `--detect-paragraphs` if the user asks for paragraph breaks or structured text output; `--paragraph-gap` (default 3.0s) only if they want a custom gap
- Only add `--speaker-names "Alice,Bob"` when the user provides real names to replace SPEAKER_1/2 — always requires `--diarize`
- Only add `--hotwords WORDS` when the user names specific rare terms not well served by `--initial-prompt`; prefer `--initial-prompt` for general domain jargon
- Only add `--prefix TEXT` when the user knows the exact words the audio starts with
- Only add `--detect-language-only` when the user only wants to identify the language, not transcribe
- Only add `--stats-file PATH` if the user asks for performance stats, RTF, or benchmark info
- Only add `--parallel N` for large CPU batch jobs; GPU handles one file efficiently on its own — don't add for single files or small batches
- Only add `--retries N` for unreliable inputs (URLs, network files) where transient failures are expected
- Only add `--burn-in OUTPUT` only when user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input
- Only add `--keep-temp` when the user may re-process the same URL to avoid re-downloading
- Only add `--output-template` when user specifies a custom naming pattern in batch mode
- Multi-format output (`--format srt,text`): only when user explicitly wants multiple formats in one pass; always pair with `-o `
- Any word-level feature auto-runs wav2vec2 alignment (~5-10s overhead)
- `--diarize` adds ~20-30s on top of that
Search:
- Only add `--search "term"` when the user asks to find/locate/search for a specific word or phrase in audio
- `--search` replaces the normal transcript output — it prints only matching segments with timestamps
- Add `--search-fuzzy` only when the user mentions approximate/partial matching or typos
- To save search results to a file, use `-o results.txt`
Chapter detection:
- Only add `--detect-chapters` when the user asks for chapters, sections, a table of contents, or "where does the topic change"
- Default `--chapter-gap 8` (8-second silence = new chapter) works for most podcasts/lectures; tune down for dense content
- `--chapter-format youtube` (default) outputs YouTube-ready timestamps; use `json` for programmatic use
- Always use `--chapters-file PATH` when combining chapters with a transcript output — avoids mixing chapter markers into the transcript text
- If the user only wants chapters (not the transcript), pipe stdout to a file with `-o /dev/null` and use `--chapters-file`
- Batch mode limitation: `--chapters-file` takes a single path — in batch mode, each file's chapters overwrite the previous. For batch chapter detection, omit `--chapters-file` (chapters print to stdout under `=== CHAPTERS (N) ===`) or use a separate run per file
Speaker audio export:
- Only add `--export-speakers DIR` when the user explicitly asks to save each speaker's audio separately
- Always pair with `--diarize` — it silently skips if no speaker labels are present
- Requires ffmpeg; outputs `SPEAKER_1.wav`, `SPEAKER_2.wav`, etc. (or real names if `--speaker-names` is set)
Language map:
- Only add `--language-map` in batch mode when the user has confirmed different languages across files
- Inline format: `"interview.mp3=en,lecture.mp3=fr"` — fnmatch globs on filename
- JSON file format: `@/path/to/map.json` where the file is `{"pattern": "lang_code"}`
RSS / Podcast:
- Only add `--rss URL` when the user provides a podcast RSS feed URL
- Default fetches 5 newest episodes; `--rss-latest 0` for all; `--skip-existing` to resume safely
- Always use `-o ` with `--rss` — without it, all episode transcripts print to stdout concatenated, which is hard to use; each episode gets its own file when `-o ` is set
Output format for agent relay:
- Search results (`--search`) → print directly to user; output is human-readable
- Chapter output → if no `--chapters-file`, chapters appear in stdout under `=== CHAPTERS (N) ===` header after the transcript; with `--format json`, chapters are also embedded in the JSON under `"chapters"` key
- Subtitle formats (SRT, VTT, ASS, LRC, TTML) → always write to `-o` file; tell the user the output path, never paste raw subtitle content
- Data formats (CSV, HTML, TTML, JSON) → always write to `-o` file; tell the user the output path, don't paste raw XML/CSV/HTML
- ASS format → for Aegisub, VLC, mpv; write to file and tell user they can open it in Aegisub or play it in VLC/mpv
- LRC format → timed lyrics for music players (Foobar2000, AIMP, VLC); write to file
- Multi-format (`--format srt,text`) → requires `-o `; each format goes to a separate file; tell user all paths written
- JSON format → useful for programmatic post-processing; not ideal to paste in full to user
- Text/transcript → safe to show directly to user for short files; summarise for long ones
- Stats output (`--stats-file`) → summarise key fields (duration, processing time, RTF) for the user rather than pasting raw JSON
- Language detection (`--detect-language-only`) → print the result directly; it's a single line
- ETA is printed automatically to stderr for batch jobs; no action needed
When NOT to use:
- Cloud-only environments without local compute
- Files <10 seconds where API call latency doesn't matter
faster-whisper vs whisperx:
This skill covers everything whisperx does — diarization (`--diarize`), word-level timestamps (`--word-timestamps`), SRT/VTT subtitles — so whisperx is not needed. Use whisperx only if you specifically need its pyannote pipeline or batch-GPU features not covered here.

💡 Examples

# Basic transcription
./scripts/transcribe audio.mp3
SRT subtitles
./scripts/transcribe audio.mp3 --format srt -o subtitles.srt
WebVTT subtitles
./scripts/transcribe audio.mp3 --format vtt -o subtitles.vtt
Transcribe from YouTube URL
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ --language en
Speaker diarization
./scripts/transcribe meeting.wav --diarize
Diarized VTT subtitles
./scripts/transcribe meeting.wav --diarize --format vtt -o meeting.vtt
Prime with domain terminology
./scripts/transcribe lecture.mp3 --initial-prompt "Kubernetes, gRPC, PostgreSQL, NGINX"
Batch process a directory
./scripts/transcribe ./recordings/ -o ./transcripts/
Batch with glob, skip already-done files
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/
Filter low-confidence segments
./scripts/transcribe noisy-audio.mp3 --min-confidence 0.6
JSON output with full metadata
./scripts/transcribe audio.mp3 --format json -o result.json
Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

⚙️ Configuration

Input:
  AUDIO                 Audio file(s), directory, glob pattern, or URL
                        Accepts: mp3, wav, m4a, flac, ogg, webm, mp4, mkv, avi, wma, aac
                        URLs auto-download via yt-dlp (YouTube, direct links, etc.)
Model & Language:
  -m, --model NAME      Whisper model (default: distil-large-v3.5; "turbo" = large-v3-turbo)
  --revision REV        Model revision (git branch/tag/commit) to pin a specific version
  -l, --language CODE   Language code, e.g. en, es, fr (auto-detects if omitted)
  --initial-prompt TEXT  Prompt to condition the model (terminology, formatting style)
  --prefix TEXT         Prefix to condition the first segment (e.g. known starting words)
  --hotwords WORDS      Space-separated hotwords to boost recognition
  --translate           Translate any language to English (instead of transcribing)
  --multilingual        Enable multilingual/code-switching mode (helps smaller models)
  --hf-token TOKEN      HuggingFace token for private/gated models and diarization
  --model-dir PATH      Custom model cache directory (default: ~/.cache/huggingface/)
Output Format:
  -f, --format FMT      text | json | srt | vtt | tsv | lrc | html | ass | ttml (default: text)
                        Accepts comma-separated list: --format srt,text writes both in one pass
                        Multi-format requires -o  when saving to files
  --word-timestamps     Include word-level timestamps (wav2vec2 aligned automatically)
  --stream              Output segments as they are transcribed (disables diarize/alignment)
  --max-words-per-line N  For SRT/VTT, split segments into sub-cues of at most N words
  --max-chars-per-line N  For SRT/VTT/ASS/TTML, split lines so each fits within N characters
                        Takes priority over --max-words-per-line when both are set
  --clean-filler        Remove hesitation fillers (um, uh, er, ah, hmm, hm) and discourse markers
                        (you know, I mean, you see) from transcript text. Off by default.
  --detect-paragraphs   Insert paragraph breaks (blank lines) in text output at natural boundaries.
                        A new paragraph starts when: silence gap ≥ --paragraph-gap, OR the previous
                        segment ends a sentence AND the gap ≥ 1.5s.
  --paragraph-gap SEC   Minimum silence gap in seconds to start a new paragraph (default: 3.0).
                        Used with --detect-paragraphs.
  --channel {left,right,mix}
                        Stereo channel to transcribe: left (c0), right (c1), or mix (default: mix).
                        Extracts the channel via ffmpeg before transcription. Requires ffmpeg.
  --merge-sentences     Merge consecutive segments into sentence-level chunks
                        (improves SRT/VTT readability; groups by terminal punctuation or >2s gap)
  -o, --output PATH     Output file or directory (directory for batch mode)
  --output-template TEMPLATE
                        Batch output filename template. Variables: {stem}, {lang}, {ext}, {model}
                        Example: "{stem}_{lang}.{ext}" → "interview_en.srt"
Inference Tuning:
  --beam-size N         Beam search size; higher = more accurate but slower (default: 5)
  --temperature T       Sampling temperature or comma-separated fallback list, e.g.
                        '0.0' or '0.0,0.2,0.4' (default: faster-whisper's schedule)
  --no-speech-threshold PROB
                        Probability threshold to mark segments as silence (default: 0.6)
  --batch-size N        Batched inference batch size (default: 8; reduce if OOM)
  --no-vad              Disable voice activity detection (on by default)
  --vad-threshold T     VAD speech probability threshold (default: 0.5)
  --vad-neg-threshold T VAD negative threshold for ending speech (default: auto)
  --vad-onset T         Alias for --vad-threshold (legacy)
  --vad-offset T        Alias for --vad-neg-threshold (legacy)
  --min-speech-duration MS  Minimum speech segment duration in ms (default: 0)
  --max-speech-duration SEC Maximum speech segment duration in seconds (default: unlimited)
  --min-silence-duration MS Minimum silence before splitting a segment in ms (default: 2000)
  --speech-pad MS       Padding around speech segments in ms (default: 400)
  --no-batch            Disable batched inference (use standard WhisperModel)
  --hallucination-silence-threshold SEC
                        Skip silent sections where model hallucinates (e.g. 1.0)
  --no-condition-on-previous-text
                        Don't condition on previous text (reduces repetition/hallucination loops;
                        auto-enabled for distil models per HuggingFace recommendation)
  --condition-on-previous-text
                        Force-enable conditioning on previous text (overrides auto-disable for distil models)
  --compression-ratio-threshold RATIO
                        Filter segments above this compression ratio (default: 2.4)
  --log-prob-threshold PROB
                        Filter segments below this avg log probability (default: -1.0)
  --max-new-tokens N    Maximum tokens per segment (prevents runaway generation)
  --clip-timestamps RANGE
                        Transcribe specific time ranges: '30,60' or '0,30;60,90' (seconds)
  --progress            Show transcription progress bar
  --best-of N           Candidates when sampling with non-zero temperature (default: 5)
  --patience F          Beam search patience factor (default: 1.0)
  --repetition-penalty F  Penalty for repeated tokens (default: 1.0)
  --no-repeat-ngram-size N  Prevent n-gram repetitions of this size (default: 0 = off)
Advanced Inference:
  --no-timestamps       Output text without timing info (faster; incompatible with
                        --word-timestamps, --format srt/vtt/tsv, --diarize)
  --chunk-length N      Audio chunk length in seconds for batched inference (default: auto)
  --language-detection-threshold T
                        Confidence threshold for language auto-detection (default: 0.5)
  --language-detection-segments N
                        Audio segments to sample for language detection (default: 1)
  --length-penalty F    Beam search length penalty; >1 favors longer, <1 favors shorter (default: 1.0)
  --prompt-reset-on-temperature T
                        Reset initial prompt when temperature fallback hits threshold (default: 0.5)
  --no-suppress-blank   Disable blank token suppression (may help soft/quiet speech)
  --suppress-tokens IDS Comma-separated token IDs to suppress in addition to default -1
  --max-initial-timestamp T
                        Maximum timestamp for the first segment in seconds (default: 1.0)
  --prepend-punctuations CHARS
                        Punctuation characters merged into preceding word (default: "'¿([{-)
  --append-punctuations CHARS
                        Punctuation characters merged into following word (default: "'.。,，!！?？:：")]}、")
Preprocessing:
  --normalize           Normalize audio volume (EBU R128 loudnorm) before transcription
  --denoise             Apply noise reduction (high-pass + FFT denoise) before transcription
Advanced:
  --diarize             Speaker diarization (requires pyannote.audio)
  --min-speakers N      Minimum number of speakers hint for diarization
  --max-speakers N      Maximum number of speakers hint for diarization
  --speaker-names NAMES Comma-separated names to replace SPEAKER_1, SPEAKER_2 (e.g. 'Alice,Bob')
                        Requires --diarize
  --min-confidence PROB Filter segments below this avg word confidence (0.0–1.0)
  --skip-existing       Skip files whose output already exists (batch mode)
  --detect-language-only
                        Detect language and exit (no transcription). Output: "Language: en (probability: 0.984)"
                        With --format json: {"language": "en", "language_probability": 0.984}
  --stats-file PATH     Write JSON stats sidecar after transcription (processing time, RTF, word count, etc.)
                        Directory path → writes {stem}.stats.json inside; file path → exact path
  --burn-in OUTPUT      Burn subtitles into the original video (single-file mode only; requires ffmpeg)
  --filter-hallucinations
                        Filter common Whisper hallucinations: music/applause markers, duplicate segments,
                        'Thank you for watching', lone punctuation, etc.
  --keep-temp           Keep temp files from URL downloads (useful for re-processing without re-downloading)
  --parallel N          Number of parallel workers for batch processing (default: sequential)
  --retries N           Retry failed files up to N times with exponential backoff (default: 0;
                        incompatible with --parallel)
Batch ETA:
  Automatically shown for sequential batch jobs (no flag needed). After each file completes,
  the next file's progress line includes:  [current/total] filename | ETA: Xm Ys
  ETA is calculated from average time per file × remaining files.
  Shown to stderr (surfaced to users via OpenClaw/Clawdbot output).
Language Map (per-file language override):
  --language-map MAP    Per-file language override for batch mode. Two forms:
                          Inline: "interview*.mp3=en,lecture.wav=fr,keynote.wav=de"
                          JSON file: "@/path/to/map.json"  (must be {pattern: lang} dict)
                        Patterns support fnmatch globs on filename or stem.
                        Priority: exact filename > exact stem > glob on filename > glob on stem > fallback.
                        Files not matched fall back to --language (or auto-detect if not set).
Transcript Search:
  --search TERM         Search the transcript for TERM and print matching segments with timestamps.
                        Replaces normal transcript output (use -o to save results to a file).
                        Case-insensitive exact substring match by default.
  --search-fuzzy        Enable fuzzy/approximate matching with --search (useful for typos, phonetic
                        near-misses, or partial words; uses SequenceMatcher ratio ≥ 0.6)
Chapter Detection:
  --detect-chapters     Auto-detect chapter/section breaks from silence gaps and print chapter markers.
                        Output is printed after the transcript (or to --chapters-file).
  --chapter-gap SEC     Minimum silence gap in seconds between consecutive segments to start a new
                        chapter (default: 8.0). Tune down for dense speech, up for sparse content.
  --chapters-file PATH  Write chapter markers to this file (default: stdout after transcript)
  --chapter-format FMT  youtube | text | json — chapter output format:
                          youtube: "0:00 Chapter 1" (YouTube description ready)
                          text:    "Chapter 1: 00:00:00"
                          json:    JSON array with chapter, start, title fields
                        (default: youtube)
Speaker Audio Export:
  --export-speakers DIR After diarization, export each speaker's audio turns concatenated into
                        separate WAV files saved in DIR. Requires --diarize and ffmpeg.
                        Output: SPEAKER_1.wav, SPEAKER_2.wav, … (or real names if --speaker-names set)
RSS / Podcast:
  --rss URL             Podcast RSS feed URL — extracts audio enclosures and transcribes them.
                        AUDIO positional is optional when --rss is used.
  --rss-latest N        Number of most-recent episodes to process (default: 5; 0 = all episodes)
Device:
  --device DEV          auto | cpu | cuda (default: auto)
  --compute-type TYPE   auto | int8 | int8_float16 | float16 | float32 (default: auto)
                        int8_float16 = hybrid mode for GPU (saves VRAM, minimal quality loss)
  --threads N           CPU thread count for CTranslate2 (default: auto)
  -q, --quiet           Suppress progress and status messages
  --log-level LEVEL     Set faster_whisper library logging level: debug | info | warning | error
                        (default: warning; use debug to see CTranslate2/VAD internals)Utility:
  --version             Print installed faster-whisper version and exit
  --update              Upgrade faster-whisper in the skill venv and exit

📋 Tips & Best Practices

"CUDA not available — using CPU": Install PyTorch with CUDA (see GPU Support above) Setup fails: Make sure Python 3.10+ is installed Out of memory: Use smaller model, --compute-type int8, or --batch-size 4 Slow on CPU: Expected — use GPU for practical transcription Model download fails: Check ~/.cache/huggingface/ permissions Diarization model fails: Ensure HuggingFace token exists and model agreements accepted; or pass token directly with --hf-token hf_xxx URL download fails: Check yt-dlp is installed (pipx install yt-dlp) No audio files in batch: Check file extensions match supported formats Check installed version: Run ./scripts/transcribe --version Upgrade faster-whisper: Run ./setup.sh --update (upgrades in-place, no full reinstall) Hallucinations on silence/music: Try --temperature 0.0 --no-speech-threshold 0.8 VAD splits speech incorrectly: Tune with --vad-threshold 0.3 (lower) or --min-silence-duration 300 Improve speech detection: Run ./setup.sh --update to upgrade faster-whisper to the latest version (includes Silero VAD V6).