pdf-extract-skill
by @secondport
OpenClaw PDF extraction skill using OpenDataLoader. Use when the user wants to extract and process PDF content for RAG, embeddings, or coordinate-based citat...
clawhub install pdf-extract-skill
About This Skill
name: pdf-extract-skill
description: "OpenClaw PDF extraction skill using OpenDataLoader. Use when the user wants to extract and process PDF content for RAG, embeddings, or coordinate-based citations."
license: "Apache-2.0"
metadata: {"openclaw":{"emoji":"π","requires":{"bins":["java","python3","opendataloader-pdf"],"runtimes":["Java 11+","Python 3.10+"]}},"clawdbot":{"emoji":"π","requires":{"bins":["java","python3","opendataloader-pdf"],"runtimes":["Java 11+","Python 3.10+"]}}}
SKILL: OpenClaw PDF Supercharger with OpenDataLoader
0) Modular Map (.md)
To improve maintainability and allow targeted calls to specific .md files, this skill relies on helper documents:
Usage rules:
1) Goal
This skill maximizes PDF reading quality for OpenClaw in ClawHub using OpenDataLoader PDF.
Pillars:
2) When to Use This Skill
Use this skill when the user needs to:
Do not use this skill for:
3) Core Architecture Rule (No MCP)
Because no MCP server exists yet, this skill must operate through the CLI only:
Do not create complex wrappers or intermediate services unless strictly needed.
4) Robust Prerequisites
Always validate before conversion:
Quick checks:
If Java fails on Windows, reopen the terminal and verify PATH.
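The prerequisite validation above could be scripted roughly as follows. This is a minimal sketch, not part of the skill: the function name `missing_bins` is hypothetical, the binary names are taken from the skill metadata, and the check only confirms that each binary is on PATH. It does not verify the Java 11+ or Python 3.10+ version requirements.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Binaries listed under "requires.bins" in the skill metadata.
REQUIRED_BINS = ("java", "python3", "opendataloader-pdf")


def missing_bins(
    bins: Iterable[str] = REQUIRED_BINS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required binaries that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in bins if which(name) is None]


if __name__ == "__main__":
    missing = missing_bins()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

If `java` shows up as missing on Windows right after installation, the note above applies: reopen the terminal so the updated PATH is picked up, then run the check again.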
5) Standard OpenClaw Operating Flow
Step A: Classify user intent
1. General reading/summary -> markdown
2. RAG with metadata and citations -> json,markdown
3. Complex tables or scanned PDF -> hybrid docling-fast
4. Charts with image descriptions -> hybrid + hybrid-mode full + enrich-picture-description
Step B: Run in batches (required)
Always process multiple files in a single invocation to avoid JVM startup overhead per call.
Recommended example:
opendataloader-pdf file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown
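Steps A and B can be combined into one small helper that maps an intent to flags and emits a single batched command. The sketch below is illustrative only: the intent labels (`summary`, `rag`, `complex_tables`, `charts`) and the function name `build_command` are hypothetical, and pairing the two hybrid intents with `-f json,markdown` is an assumption based on the hybrid examples elsewhere in this skill, since Step A does not name an output format for them.

```python
from typing import Dict, List, Sequence, Tuple

# (flags placed before the inputs, value for -f).
# Intent labels are hypothetical; flag values follow the Step A list.
INTENT_PROFILES: Dict[str, Tuple[List[str], str]] = {
    "summary": ([], "markdown"),
    "rag": ([], "json,markdown"),
    # Output format for the hybrid intents is an assumption.
    "complex_tables": (["--hybrid", "docling-fast"], "json,markdown"),
    "charts": (
        ["--hybrid", "docling-fast", "--hybrid-mode", "full"],
        "json,markdown",
    ),
}


def build_command(
    intent: str, inputs: Sequence[str], output_dir: str = "./output"
) -> List[str]:
    """Build ONE opendataloader-pdf invocation covering every input."""
    if not inputs:
        raise ValueError("at least one PDF file or folder is required")
    if intent not in INTENT_PROFILES:
        raise ValueError(f"unknown intent: {intent}")
    pre_flags, fmt = INTENT_PROFILES[intent]
    # All inputs go into the same argv so the JVM starts only once.
    return ["opendataloader-pdf", *pre_flags, *inputs, "-o", output_dir, "-f", fmt]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing an argument list rather than a shell string avoids quoting problems with file names that contain spaces. The `charts` intent additionally assumes the hybrid backend was started with `--enrich-picture-description`.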
Step C: Return a simple OpenClaw response format
Suggested response:
1. Status: ok or warning
2. Processed files
3. Output path
4. Generated formats
5. Suggested next action
Template: "Processing completed. N PDFs were converted to ./output with json,markdown format. If you want, I can now extract specific pages or enable OCR for scanned files."
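One way to produce the suggested response consistently is a small formatter. This is a sketch under assumptions: the function name `format_response` and the dictionary keys are hypothetical, and treating any collected warning as status "warning" is one possible reading of the "ok or warning" field.

```python
from typing import Dict, List, Optional

DEFAULT_NEXT_ACTION = (
    "If you want, I can now extract specific pages "
    "or enable OCR for scanned files."
)


def format_response(
    files: List[str],
    output_dir: str,
    formats: List[str],
    warnings: Optional[List[str]] = None,
    next_action: str = DEFAULT_NEXT_ACTION,
) -> Dict[str, object]:
    """Assemble the five suggested fields plus a message from the template."""
    warnings = warnings or []
    fmt = ",".join(formats)
    message = (
        f"Processing completed. {len(files)} PDFs were converted to "
        f"{output_dir} with {fmt} format. {next_action}"
    )
    return {
        "status": "warning" if warnings else "ok",
        "processed_files": list(files),
        "output_path": output_dir,
        "formats": list(formats),
        "next_action": next_action,
        "warnings": warnings,
        "message": message,
    }
```

Keeping the structured fields next to the rendered message lets OpenClaw show the short text to the user while still having the file list and formats available for follow-up steps.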
6) Ready-to-Use CLI Profiles
Profile 1: Fast LLM reading
opendataloader-pdf ./pdfs/ -o ./output -f markdown
Profile 2: Recommended for RAG
opendataloader-pdf ./pdfs/ -o ./output -f json,markdown
Profile 3: Specific pages only
opendataloader-pdf report.pdf -o ./output -f json --pages "1,3,5-7"
Profile 4: Sensitive data sanitization
opendataloader-pdf report.pdf -o ./output -f markdown --sanitize
Profile 5: Preserve line breaks
opendataloader-pdf report.pdf -o ./output -f markdown --keep-line-breaks
Profile 6: Embedded or external images
opendataloader-pdf report.pdf -o ./output -f json --image-output external
opendataloader-pdf report.pdf -o ./output -f json --image-output embedded
7) High-Precision Hybrid Mode
Use it when:
7.1 Start backend
Standard: opendataloader-pdf-hybrid --port 5002
Forced OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr
Multi-language OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "es,en"
With image descriptions: opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
7.2 Use backend from client
Hybrid auto mode: opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown
With timeout and fallback: opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 --hybrid-fallback file1.pdf ./folder/ -o ./output -f json
Image descriptions enabled (full required): opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf ./folder/ -o ./output -f json,markdown
Critical note: If the backend starts with --enrich-picture-description, the client must use --hybrid-mode full to include descriptions in output.
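The pairing rule in the critical note is easy to get wrong when the backend and client commands are typed by hand. The sketch below builds both from the same `picture_descriptions` setting so they cannot drift apart. The function names and parameters are hypothetical; the flags are the ones shown in sections 7.1 and 7.2.

```python
from typing import List, Optional, Sequence


def backend_command(
    port: int = 5002,
    force_ocr: bool = False,
    ocr_langs: Sequence[str] = (),
    picture_descriptions: bool = False,
) -> List[str]:
    """Command line for starting the hybrid backend (section 7.1)."""
    cmd = ["opendataloader-pdf-hybrid", "--port", str(port)]
    if force_ocr:
        cmd.append("--force-ocr")
    if ocr_langs:
        cmd += ["--ocr-lang", ",".join(ocr_langs)]
    if picture_descriptions:
        cmd.append("--enrich-picture-description")
    return cmd


def client_command(
    inputs: Sequence[str],
    output_dir: str = "./output",
    fmt: str = "json,markdown",
    picture_descriptions: bool = False,
    timeout_ms: Optional[int] = None,
    fallback: bool = False,
) -> List[str]:
    """Command line for the client talking to the backend (section 7.2)."""
    cmd = ["opendataloader-pdf", "--hybrid", "docling-fast"]
    if picture_descriptions:
        # Required whenever the backend enriches picture descriptions.
        cmd += ["--hybrid-mode", "full"]
    if timeout_ms is not None:
        cmd += ["--hybrid-timeout", str(timeout_ms)]
    if fallback:
        cmd.append("--hybrid-fallback")
    cmd += [*inputs, "-o", output_dir, "-f", fmt]
    return cmd
```

Calling both functions with the same `picture_descriptions` value keeps `--enrich-picture-description` on the backend and `--hybrid-mode full` on the client in step with each other.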
8) Key Robustness Parameters
9) Decision Matrix for OpenClaw
1. If the user wants speed and clean text: markdown.
2. If precise positional citations are needed: json (with bounding box) or json,markdown.
3. If output is empty/poor on scanned files: backend with --force-ocr.
4. If tables are very complex: enable --hybrid docling-fast.
5. If charts must be interpreted: backend with --enrich-picture-description and client with --hybrid-mode full.
10) Quick Troubleshooting
Problem: Java not found. Solution: install Java 11+ and verify with java -version.
Problem: Hybrid backend connection error. Solution: start opendataloader-pdf-hybrid in another terminal and verify port 5002.
Problem: Too slow. Solution: process in batches, increase hybrid timeout, and verify backend RAM.
Problem: Mixed columns. Solution: use default reading mode (xycut) and try --use-struct-tree for tagged PDFs.
Problem: Poor table quality. Solution: use json output + hybrid mode.
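For the hybrid backend connection error, the "verify port 5002" step can be automated with a plain TCP probe. This is a minimal sketch: the function name `backend_reachable` is hypothetical, and it assumes the backend listens on the local machine (127.0.0.1), which this skill does not state explicitly. A successful connection only shows that something is listening on the port, not that it is the OpenDataLoader backend.

```python
import socket


def backend_reachable(
    host: str = "127.0.0.1", port: int = 5002, timeout: float = 2.0
) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if backend_reachable():
        print("Port 5002 is accepting connections.")
    else:
        print("Nothing is listening on port 5002; start opendataloader-pdf-hybrid.")
```

Running this probe before a hybrid conversion lets OpenClaw report a clear "backend not started" warning instead of waiting for the client to hit its hybrid timeout.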
11) Best Practices for ClawHub
12) Skill Quality Checklist
13) Credits
This skill uses and credits the excellent OpenDataLoader project: https://opendataloader.org/
Official documentation used for this version: https://opendataloader.org/docs