pdf-miner
by @baichenwzj
Extract text and tables from PDF files with robust support for global market data formats (currencies, percentages, units). Use when: (1) User asks to read/e...
Run commands from this skill directory.
Basic Extraction
# Full extraction (text + tables)
python scripts/extract_pdf.py input.pdf
# Output to custom path
python scripts/extract_pdf.py input.pdf output.md
# Specific pages
python scripts/extract_pdf.py input.pdf --pages 1-5,10,15-20
# Text or tables only
python scripts/extract_pdf.py input.pdf --text-only
python scripts/extract_pdf.py input.pdf --tables-only
python scripts/extract_pdf.py input.pdf --tables-only --json
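The `--pages` value above mixes single pages and ranges. As a rough illustration of how such a spec can be expanded, here is a minimal sketch; `parse_page_spec` is a hypothetical helper, not the function used by `extract_pdf.py`, and it assumes page numbers are 1-based and ranges are inclusive.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10,15-20" into sorted, unique page numbers.

    Assumption: pages are 1-based and ranges are inclusive on both ends.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_page_spec("1-5,10,15-20"))
# [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```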
Advanced Modes
# Search: find pages containing keywords with context
python scripts/extract_pdf.py report.pdf --search "Vietnam export penetration"
# Metrics: extract lines with keywords + numeric values
python scripts/extract_pdf.py report.pdf --metrics "market size growth export penetration"
# TOC: extract table of contents / chapter structure (robust, multi-format)
python scripts/extract_pdf.py report.pdf --toc
# Optionally adjust sensitivity (default: 3 entries per page required)
python scripts/extract_pdf.py report.pdf --toc --toc-min-entries 2
# Diff: compare two PDFs, show pages unique to each
python scripts/extract_pdf.py old_version.pdf new_version.pdf --diff
# Chunk: split output into LLM-friendly chunks
python scripts/extract_pdf.py report.pdf --chunk # single file, 8000 chars each
python scripts/extract_pdf.py report.pdf --chunk --max-chars 4000
python scripts/extract_pdf.py report.pdf --chunk --output-dir ./chunks # separate files
# Clean headers/footers
python scripts/extract_pdf.py report.pdf --clean-headers
# Batch: process multiple PDFs
python scripts/extract_pdf.py file1.pdf file2.pdf file3.pdf --output-dir ./extracted
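To make the `--chunk` / `--max-chars` behavior more concrete, the sketch below shows one plausible way to split extracted text into size-limited chunks. It is an assumption-laden illustration: `chunk_text` is a hypothetical function, and breaking on blank-line paragraph boundaries is a design choice made here, not necessarily what the script does.

```python
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Assumption: paragraphs are separated by blank lines, and chunks break
    on paragraph boundaries where possible.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        # Hard-split any paragraph that is too long to fit in a single chunk.
        pieces = [paragraph[i:i + max_chars]
                  for i in range(0, len(paragraph), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 25
for index, chunk in enumerate(chunk_text(sample, max_chars=22), start=1):
    print(index, len(chunk))
```

Keeping paragraphs intact where possible tends to give an LLM more coherent context per chunk than cutting at a fixed offset.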
OCR for Scanned/Image PDFs (Automatic by Default)
OCR is automatically triggered for pages with very little extractable text (default threshold: 100 characters). This helps handle scanned or image-based PDFs without requiring the --ocr flag.
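The low-text check described above can be pictured as a simple per-page length test. The following is a minimal sketch under stated assumptions: `select_ocr_pages` is a hypothetical name, whitespace is stripped before counting, and a page is selected when its text is strictly shorter than the threshold. The real script may count differently.

```python
def select_ocr_pages(
    page_texts: list[str],
    threshold: int = 100,
    auto_ocr: bool = True,
    force_all: bool = False,
) -> list[int]:
    """Return 1-based page numbers that would be sent to OCR.

    Assumptions: text length is measured after stripping whitespace, and a
    page qualifies when it has fewer than `threshold` characters.
    """
    if force_all:  # mirrors forcing OCR on every page
        return list(range(1, len(page_texts) + 1))
    if not auto_ocr:  # mirrors disabling automatic detection
        return []
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < threshold
    ]


pages = ["x" * 500, "", "short", "y" * 100]
print(select_ocr_pages(pages))  # [2, 3]
```

Under this reading, a higher threshold selects more pages for OCR and a lower threshold selects fewer.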
#### Usage Examples
# Automatic OCR (default behavior)
python scripts/extract_pdf.py scanned.pdf
# Force OCR on all pages (ignore text length)
python scripts/extract_pdf.py scanned.pdf --ocr
# Force OCR only on specific pages
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-pages 1-5,10
# Adjust OCR quality (DPI)
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-dpi 300
# Use a different vision model
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-model "stepfun/step-3.5-flash:free"
# Disable automatic OCR detection (if you want pure extraction only)
python scripts/extract_pdf.py file.pdf --no-auto-ocr
# Change the low-text threshold (default 100 chars)
python scripts/extract_pdf.py file.pdf --ocr-threshold 200
#### Configuration
OCR requires a vision API key. See Initial Setup for OCR.
| Option | Default | Description |
|--------|---------|-------------|
| --ocr | off | Force OCR on all pages, or only on the pages listed in --ocr-pages |
| --auto-ocr | on | Automatically OCR low-text pages (hidden; use --no-auto-ocr to disable) |
| --no-auto-ocr | - | Disable automatic OCR detection |
| --ocr-pages | - | Comma-separated pages/ranges to OCR (requires --ocr) |
| --ocr-threshold | 100 | Minimum number of extractable characters for a page to count as having sufficient text; pages below this are OCR'd automatically |
| --ocr-dpi | 200 | Image DPI for OCR rendering |
| --ocr-api-key | from env/config | Override API key |
| --ocr-base-url | from env/config | Override API base URL |
| --ocr-model | from env/config | Override vision model |
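The "from env/config" defaults in the table imply a lookup order for each setting. A minimal sketch of one such order for the API key follows. Several details are assumptions rather than documented behavior: the precedence (command-line flag, then the `OCR_API_KEY` environment variable, then `config.json`), the hypothetical config field name `ocr_api_key`, and the helper name `resolve_ocr_api_key`.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_ocr_api_key(
    cli_value: Optional[str],
    config_path: str = "config.json",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the API key from CLI, then environment, then config file.

    Assumptions: this precedence order and the "ocr_api_key" field name are
    illustrative only; check the script and its config.json for the real ones.
    """
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    if env.get("OCR_API_KEY"):
        return env["OCR_API_KEY"]
    path = Path(config_path)
    if path.is_file():
        config = json.loads(path.read_text(encoding="utf-8"))
        key = config.get("ocr_api_key")  # hypothetical field name
        if key:
            return key
    raise RuntimeError("No API key: set OCR_API_KEY or add one to config.json")


print(resolve_ocr_api_key(None, env={"OCR_API_KEY": "example-key"}))
# example-key
```

The same pattern would apply to the base URL and model overrides.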
#### Troubleshooting
OCR failed with "No API key"
→ Configure your API key in config.json or via the OCR_API_KEY env var.
OCR model rejects images
→ The configured model might not support vision. Choose a vision-capable model (e.g., qwen/qwen3.6-plus:free, stepfun/step-3.5-flash:free). The script will attempt to auto-fallback to a known good model if the configured one lacks vision support.
Too many pages being OCR'd
→ Lower the threshold (e.g., --ocr-threshold 50) so that fewer pages count as low-text, or pass --no-auto-ocr and OCR only selected pages with --ocr --ocr-pages.
Rate limit errors
→ Reduce concurrent OCR calls, switch to a paid model tier, or try a different provider.
clawhub install pdf-miner