pdf-miner

by @baichenwzj

Extract text and tables from PDF files with robust support for global market data formats (currencies, percentages, units). Use when: (1) User asks to read/e...

Version v1.0.2
💡 Examples

Run commands from this skill directory.

Basic Extraction

# Full extraction (text + tables)
python scripts/extract_pdf.py input.pdf

Output to custom path

python scripts/extract_pdf.py input.pdf output.md

Specific pages

python scripts/extract_pdf.py input.pdf --pages 1-5,10,15-20
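
The page specification above mixes single pages and inclusive ranges. As a rough sketch of how such a string could be expanded (the function name `parse_page_spec` is hypothetical and this is not taken from `extract_pdf.py` itself; it assumes 1-based page numbers):

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10,15-20" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_page_spec("1-5,10,15-20"))
```

Using a set means overlapping ranges and repeated pages collapse to a single entry.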

Text or tables only

python scripts/extract_pdf.py input.pdf --text-only
python scripts/extract_pdf.py input.pdf --tables-only
python scripts/extract_pdf.py input.pdf --tables-only --json

Advanced Modes

# Search: find pages containing keywords with context
python scripts/extract_pdf.py report.pdf --search "Vietnam export penetration"

Metrics: extract lines with keywords + numeric values

python scripts/extract_pdf.py report.pdf --metrics "market size growth export penetration"
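
To illustrate what "lines with keywords + numeric values" might mean in practice, here is a minimal sketch. It assumes the keyword string is split on whitespace and that a line qualifies when it contains any keyword plus at least one number; both are assumptions, and `metric_lines` and the `NUMERIC` pattern are hypothetical names, not the skill's actual code.

```python
import re

# Assumed pattern: optional currency symbol, digits with thousands
# separators, optional decimals, optional percent sign.
NUMERIC = re.compile(r"[$€£¥]?\d[\d,]*(?:\.\d+)?%?")


def metric_lines(text: str, keywords: str) -> list[tuple[str, list[str]]]:
    """Return (line, numeric values) for lines that mention any keyword."""
    terms = [word.lower() for word in keywords.split()]
    hits = []
    for line in text.splitlines():
        lowered = line.lower()
        if not any(term in lowered for term in terms):
            continue
        # Drop a trailing comma picked up from sentence punctuation.
        values = [match.rstrip(",") for match in NUMERIC.findall(line)]
        if values:
            hits.append((line.strip(), values))
    return hits


sample = (
    "Market size reached $1,250.5 million in 2023\n"
    "No numbers here about growth\n"
    "Export penetration rose to 12.4%\n"
    "Unrelated line 42"
)
for line, values in metric_lines(sample, "market size growth export penetration"):
    print(values, "<-", line)
```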

TOC: extract table of contents / chapter structure (robust, multi-format)

python scripts/extract_pdf.py report.pdf --toc

Optionally adjust sensitivity (default: 3 entries per page required)

python scripts/extract_pdf.py report.pdf --toc --toc-min-entries 2
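
The sensitivity setting can be pictured as a per-page count of lines that look like table-of-contents entries. The sketch below assumes an entry is a title followed by dot leaders, a tab, or a run of spaces and then a page number; the real detector is described as multi-format and is likely more elaborate. `TOC_LINE`, `toc_entries`, and `is_toc_page` are hypothetical names.

```python
import re

# Assumed entry shape: "<title> ..... <page>" or "<title>   <page>".
TOC_LINE = re.compile(
    r"^\s*(?P<title>.+?)\s*(?:\.{2,}|\s{2,}|\t)\s*(?P<page>\d{1,4})\s*$"
)


def toc_entries(page_text: str) -> list[tuple[str, int]]:
    """Return (title, page number) for each line shaped like a TOC entry."""
    entries = []
    for line in page_text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append((match.group("title"), int(match.group("page"))))
    return entries


def is_toc_page(page_text: str, min_entries: int = 3) -> bool:
    """Treat a page as a TOC page once it has enough entry-like lines."""
    return len(toc_entries(page_text)) >= min_entries


page = (
    "Contents\n"
    "1. Introduction ........ 3\n"
    "2. Market Overview ..... 12\n"
    "3. Outlook   45\n"
    "Some prose that ends in a year 2023"
)
print(toc_entries(page), is_toc_page(page))
```

Under this reading, lowering the minimum to 2 would accept shorter contents pages at the cost of more false positives.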

Diff: compare two PDFs, show pages unique to each

python scripts/extract_pdf.py old_version.pdf new_version.pdf --diff
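
One plausible way to find pages unique to each document is to fingerprint the normalized text of every page and compare the two sets. This is only a sketch under that assumption (the skill may compare pages differently), and `unique_pages` is a hypothetical helper operating on already-extracted page text.

```python
import hashlib


def _fingerprint(page_text: str) -> str:
    """Hash page text after collapsing whitespace and case differences."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_pages(old_pages: list[str], new_pages: list[str]) -> tuple[list[int], list[int]]:
    """Return 1-based page numbers found only in old and only in new."""
    old_prints = {_fingerprint(text) for text in old_pages}
    new_prints = {_fingerprint(text) for text in new_pages}
    only_old = [
        number for number, text in enumerate(old_pages, start=1)
        if _fingerprint(text) not in new_prints
    ]
    only_new = [
        number for number, text in enumerate(new_pages, start=1)
        if _fingerprint(text) not in old_prints
    ]
    return only_old, only_new


old = ["Intro", "Market  size: 10", "Appendix"]
new = ["intro", "Market size: 12", "Appendix", "New chapter"]
print(unique_pages(old, new))
```

Exact hashing treats any edited page as unique to both sides, which matches a page-level diff but says nothing about what changed within the page.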

Chunk: split output into LLM-friendly chunks

python scripts/extract_pdf.py report.pdf --chunk  # single file, 8000 chars each
python scripts/extract_pdf.py report.pdf --chunk --max-chars 4000
python scripts/extract_pdf.py report.pdf --chunk --output-dir ./chunks  # separate files
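
For intuition about character-limited chunking, the sketch below packs paragraphs into chunks no longer than a limit and hard-splits any paragraph that is too long on its own. Splitting on blank lines is an assumption about how boundaries are chosen, and `chunk_text` is a hypothetical name; only the 8000-character default comes from the examples above.

```python
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Pack paragraphs into chunks of at most max_chars characters."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        # A paragraph longer than the limit is cut into fixed-size pieces.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 25
print([len(chunk) for chunk in chunk_text(demo, max_chars=22)])
```

Keeping paragraphs whole where possible tends to give an LLM more coherent context than cutting at arbitrary offsets.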

Clean headers/footers

python scripts/extract_pdf.py report.pdf --clean-headers
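
Header and footer cleanup is commonly done by spotting lines that recur at the top or bottom of most pages. The sketch below assumes that approach, treats digits as interchangeable so "Page 1" and "Page 2" count as the same line, and uses a hypothetical `strip_repeated_edges` helper with an assumed 60% cutoff. It is not the skill's actual logic.

```python
import re
from collections import Counter


def _edge_key(line: str) -> str:
    """Normalize a line so page numbers do not defeat the comparison."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Remove first/last lines that repeat across most pages."""
    counts: Counter[str] = Counter()
    for page in pages:
        lines = [line for line in page.splitlines() if line.strip()]
        if lines:
            counts.update({_edge_key(lines[0]), _edge_key(lines[-1])})
    needed = max(2, min_share * len(pages))
    repeated = {key for key, seen in counts.items() if seen >= needed}

    cleaned = []
    for page in pages:
        # Blank lines are dropped here to keep the sketch short.
        lines = [line for line in page.splitlines() if line.strip()]
        if lines and _edge_key(lines[0]) in repeated:
            lines = lines[1:]
        if lines and _edge_key(lines[-1]) in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


report = [
    "ACME Report 2024\nRevenue grew.\nPage 1",
    "ACME Report 2024\nCosts fell.\nPage 2",
    "ACME Report 2024\nOutlook.\nPage 3",
]
print(strip_repeated_edges(report))
```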

Batch: process multiple PDFs

python scripts/extract_pdf.py file1.pdf file2.pdf file3.pdf --output-dir ./extracted

OCR for Scanned/Image PDFs (Automatic by Default)

OCR is automatically triggered for pages with very little extractable text (default threshold: 100 characters). This helps handle scanned or image-based PDFs without requiring the --ocr flag.
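
The low-text rule can be sketched as a simple length check per page. The version below assumes the comparison is "stripped text shorter than the threshold" (whether the boundary is strict is a guess) and uses a hypothetical `pages_needing_ocr` helper whose parameters loosely mirror the --ocr, --no-auto-ocr, and --ocr-threshold flags.

```python
def pages_needing_ocr(
    page_texts: list[str],
    threshold: int = 100,
    force_all: bool = False,
    auto: bool = True,
) -> list[int]:
    """Return 1-based numbers of pages that would be sent to OCR."""
    if force_all:
        return list(range(1, len(page_texts) + 1))
    if not auto:
        return []
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < threshold  # assumed: strictly below threshold
    ]


texts = ["x" * 150, "  \n", "short caption", "y" * 100]
print(pages_needing_ocr(texts))                 # default threshold
print(pages_needing_ocr(texts, threshold=200))  # higher threshold, more pages
```

Note that a higher threshold selects more pages for OCR, and a lower one selects fewer.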

#### Usage Examples

# Automatic OCR (default behavior)
python scripts/extract_pdf.py scanned.pdf

Force OCR on all pages (ignore text length)

python scripts/extract_pdf.py scanned.pdf --ocr

Force OCR only on specific pages

python scripts/extract_pdf.py scanned.pdf --ocr --ocr-pages 1-5,10

Adjust OCR quality (DPI)

python scripts/extract_pdf.py scanned.pdf --ocr --ocr-dpi 300

Use a different vision model

python scripts/extract_pdf.py scanned.pdf --ocr --ocr-model "stepfun/step-3.5-flash:free"

Disable automatic OCR detection (if you want pure extraction only)

python scripts/extract_pdf.py file.pdf --no-auto-ocr

Change the low-text threshold (default 100 chars)

python scripts/extract_pdf.py file.pdf --ocr-threshold 200

#### Configuration

OCR requires a vision API key. See Initial Setup for OCR.

| Option | Default | Description |
|--------|---------|-------------|
| --ocr | off | Force OCR on pages (with auto-detect or --ocr-pages) |
| --auto-ocr | on | Automatically OCR low-text pages (hidden; use --no-auto-ocr to disable) |
| --no-auto-ocr | - | Disable automatic OCR detection |
| --ocr-pages | - | Comma-separated pages/ranges to OCR (requires --ocr) |
| --ocr-threshold | 100 | Minimum text length to consider a page as "sufficient" (characters) |
| --ocr-dpi | 200 | Image DPI for OCR rendering |
| --ocr-api-key | from env/config | Override API key |
| --ocr-base-url | from env/config | Override API base URL |
| --ocr-model | from env/config | Override vision model |

#### Troubleshooting

OCR failed with "No API key" → Configure your API key in config.json or via the OCR_API_KEY environment variable.

OCR model rejects images → The configured model might not support vision. Choose a vision-capable model (e.g., qwen/qwen3.6-plus:free, stepfun/step-3.5-flash:free). The script will attempt to fall back automatically to a known good model if the configured one lacks vision support.

Too many pages being OCR'd → Lower the threshold so fewer pages count as low-text (for example, --ocr-threshold 50), or pass --no-auto-ocr and OCR only selected pages with --ocr --ocr-pages.

Rate limit errors → Reduce concurrent OCR calls, switch to a paid model tier, or try a different provider.

Install
clawhub install pdf-miner