# Audio Transcribe

by @zxkane
This skill should be used when the user explicitly asks to "transcribe a meeting", "transcribe audio", "transcribe a meeting recording", "convert audio to text", or similar.
## 1. Environment Setup
```bash
AUTO_YES=1 bash $SCRIPTS/setup_env.sh

# Or force CPU:
AUTO_YES=1 bash $SCRIPTS/setup_env.sh cpu
```
The setup script patches FunASR's spectral clustering to run in O(N²·k). Without this patch, recordings longer than about an hour hang for hours during speaker clustering.
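For context on why the patch matters: naive spectral clustering takes a full eigendecomposition of the N×N segment-affinity matrix, which is O(N³) in the number of segments. Computing only the k eigenvectors needed for k-way clustering brings this down to roughly O(N²·k). The sketch below illustrates the idea only; it is not FunASR's actual code, and the function name is invented for illustration.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_embedding(affinity: np.ndarray, k: int) -> np.ndarray:
    """Embed N audio segments into k dimensions for k-way speaker clustering.

    Illustrative only: computing just k eigenpairs of the graph Laplacian
    costs roughly O(N^2 * k), versus O(N^3) for a full eigendecomposition.
    """
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity
    # eigsh iteratively finds only the k smallest eigenpairs instead of all N.
    _, vectors = eigsh(laplacian, k=k, which="SM")
    return vectors  # one k-dim row per segment; feed these to k-means
```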
## 2. Run Transcription
Output files are written to the current working directory.
LLM cleanup (Phase 3) is opt-in. By default, transcription runs locally without contacting any external service. To enable LLM-powered ASR correction and speaker name refinement, pass --model; use LLM cleanup when polished, publication-quality output is worth sending transcript excerpts to an external provider.
> ⚠️ Data Privacy: When LLM cleanup is enabled via --model, transcript
> excerpts are sent to external LLM providers (AWS Bedrock, Anthropic, or
> OpenAI depending on the model ID). Use --skip-llm or omit --model to
> keep all data local. For Bedrock, boto3 uses the standard AWS credential
> chain (IAM role, SSO, ~/.aws/credentials, env vars).
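For reference, the credential chain means no explicit keys are needed for Bedrock when an IAM role or SSO session is available. A minimal sketch of such a call (illustrative, not the script's actual code; the request body follows Bedrock's Anthropic Messages format):

```python
import json
import boto3

# boto3 walks the standard credential chain automatically:
# IAM role -> SSO session -> ~/.aws/credentials -> AWS_* env vars.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="us.anthropic.claude-sonnet-4-6",  # same ID passed via --model
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Correct this ASR excerpt: ..."}],
    }),
)
result = json.loads(response["body"].read())
```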
```bash
# Chinese meeting with hotwords (local-only, no LLM)
python3 $SCRIPTS/transcribe.py meeting.wav \
  --lang zh --num-speakers 9 --hotwords hotwords.txt

# English meeting with speaker names
python3 $SCRIPTS/transcribe.py meeting.wav \
  --lang en --speakers "Alice,Bob,Carol,Dave"

# Auto-detect language (zh/en/ja/ko/yue)
python3 $SCRIPTS/transcribe.py meeting.wav \
  --lang auto --num-speakers 6

# Whisper for any language
python3 $SCRIPTS/transcribe.py meeting.wav \
  --lang whisper --num-speakers 4

# Enable LLM cleanup for polished output (requires --model)

# Bedrock (uses AWS credential chain: IAM role, SSO, ~/.aws/credentials)
python3 $SCRIPTS/transcribe.py meeting.wav \
  --lang zh --num-speakers 9 --hotwords hotwords.txt \
  --provider bedrock --model us.anthropic.claude-sonnet-4-6

# Bedrock "global" cross-region profile (recent AWS deployments)
python3 $SCRIPTS/transcribe.py meeting.wav \
  --provider bedrock --model global.anthropic.claude-sonnet-4-6

# Bedrock via litellm-style wrapper (supported; prefix is stripped for boto3)
python3 $SCRIPTS/transcribe.py meeting.wav \
  --provider bedrock --model amazon-bedrock/global.anthropic.claude-sonnet-4-6

# Anthropic API (requires ANTHROPIC_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
  --provider anthropic --model claude-sonnet-4-6

# OpenAI-compatible API (requires OPENAI_API_KEY env var)
python3 $SCRIPTS/transcribe.py meeting.wav \
  --provider openai --model gpt-4o

# Full pipeline with all supporting files + LLM (best quality)
python3 $SCRIPTS/transcribe.py episode.m4a \
  --lang zh --num-speakers 2 \
  --hotwords hotwords.txt \
  --speakers "关羽,张飞" \
  --speaker-context speaker-context.json \
  --reference show-notes.md \
  --model us.anthropic.claude-sonnet-4-6

# Resume interrupted LLM cleanup
python3 $SCRIPTS/transcribe.py meeting.wav \
  --skip-transcribe --model us.anthropic.claude-sonnet-4-6
```
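As the litellm-style example above notes, an amazon-bedrock/ prefix on the model ID is accepted and stripped before the ID reaches boto3. A minimal sketch of that normalization (the helper name is hypothetical, not the script's API):

```python
def normalize_bedrock_model_id(model: str) -> str:
    """Strip a litellm-style provider prefix so boto3 gets a bare model ID."""
    prefix = "amazon-bedrock/"
    return model[len(prefix):] if model.startswith(prefix) else model

assert normalize_bedrock_model_id(
    "amazon-bedrock/global.anthropic.claude-sonnet-4-6"
) == "global.anthropic.claude-sonnet-4-6"
```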
## 3. Verify Speaker Labels
If the transcript has swapped speaker labels (common with podcasts), the verification script can detect and fix mismatches using LLM analysis:
```bash
# Dry-run: check if host/guest are swapped
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
  --speakers "关羽,张飞" \
  --speaker-context speaker-context.json

# Apply the fix
python3 $SCRIPTS/verify_speakers.py podcast_raw_transcript.json \
  --speakers "关羽,张飞" \
  --speaker-context speaker-context.json --fix

# Multi-speaker meeting: full reassignment
python3 $SCRIPTS/verify_speakers.py meeting_raw_transcript.json \
  --speakers "Alice,Bob,Carol,Dave" \
  --speaker-context speaker-context.json --fix

# Then regenerate the markdown with corrected labels
python3 $SCRIPTS/transcribe.py original.m4a \
  --skip-transcribe --clean-cache
```
The script analyzes the first 5 minutes (configurable with --minutes) and auto-detects podcast mode (2 speakers, swap detection) versus meeting mode (N speakers, full reassignment).
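Conceptually, the podcast-mode fix is a straight label swap over the transcript's segments. A minimal sketch, assuming the raw transcript JSON is a list of segments with a speaker field (the actual schema is defined by the scripts, not shown here):

```python
import json

def swap_speaker_labels(path: str, a: str, b: str) -> None:
    """Swap two speaker labels in place (assumed segment-list schema)."""
    with open(path, encoding="utf-8") as f:
        segments = json.load(f)
    mapping = {a: b, b: a}
    for seg in segments:
        seg["speaker"] = mapping.get(seg["speaker"], seg["speaker"])
    with open(path, "w", encoding="utf-8") as f:
        json.dump(segments, f, ensure_ascii=False, indent=2)
```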
```bash
clawhub install zxkane-audio-transcriber-funasr
```