ClawHub
MinerU PDF Extractor
by @a-i-r
Extract PDF content to Markdown using the MinerU API. Supports formulas, tables, and OCR. Provides parsing for both local files and online URLs.
Examples
cd scripts/
Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
Output: BATCH_ID=xxx UPLOAD_URL=xxx
Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf
Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
Output: FULL_ZIP_URL=xxx
Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
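The workflow above passes `$UPLOAD_URL`, `$BATCH_ID`, and `$FULL_ZIP_URL` between steps without showing how they get set. Here is a minimal sketch, assuming each script prints its results as `KEY=value` lines on stdout, as the output samples suggest. The helper name `get_value` and the sample values are illustrative and not part of the skill.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the first KEY=value line on stdin
# whose key matches $1. Everything after the first "=" is kept, so URLs
# that contain "=" survive intact.
get_value() {
  awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# Made-up sample standing in for the step 1 output.
step1_output='BATCH_ID=demo-batch-id
UPLOAD_URL=https://example.invalid/upload?sig=abc'

BATCH_ID=$(printf '%s\n' "$step1_output" | get_value BATCH_ID)
UPLOAD_URL=$(printf '%s\n' "$step1_output" | get_value UPLOAD_URL)

echo "BATCH_ID is $BATCH_ID"
echo "UPLOAD_URL is $UPLOAD_URL"
```

With the real script, the sample string would be replaced by something like `step1_output=$(./local_file_step1_apply_upload_url.sh /path/to/your.pdf)`, provided the script writes those lines to stdout.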
Script Descriptions
#### local_file_step1_apply_upload_url.sh
Apply for an upload URL and a batch_id.
Usage:
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]
Parameters:
language: ch (Chinese), en (English), auto (auto-detect); default ch
layout_model: doclayout_yolo (fast), layoutlmv3 (accurate); default doclayout_yolo
Output:
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
#### local_file_step2_upload_file.sh
Upload the PDF file to the presigned URL.
Usage:
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
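For readers who want to see what an upload to a presigned URL can look like, here is a minimal sketch using curl. It assumes the presigned URL accepts an HTTP PUT of the raw file body, which this page does not confirm. The function name `upload_pdf` is hypothetical, and the bundled script may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the bundled script: PUT a local file to a presigned URL.
upload_pdf() {
  local url="$1" file="$2"
  if [ -z "$url" ] || [ -z "$file" ]; then
    echo "usage: upload_pdf <upload_url> <pdf_file_path>" >&2
    return 2
  fi
  if [ ! -f "$file" ]; then
    echo "file not found: $file" >&2
    return 1
  fi
  # --upload-file sends the file as the request body (HTTP PUT for http/https URLs);
  # --fail turns HTTP error statuses into a non-zero exit code.
  curl --fail --silent --show-error --upload-file "$file" "$url"
}
```

Checking the file before calling curl keeps a mistyped path from consuming the presigned URL attempt.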
#### local_file_step3_poll_result.sh
Poll extraction results until completion or failure.
Usage:
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
Output:
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
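The `max_retries` and `retry_interval_seconds` arguments suggest a bounded polling loop. A minimal sketch of that pattern follows. The function `poll_until`, the exit-code convention (0 = done, 2 = failed, anything else = still running), and `fake_check` are assumptions made for illustration, not the script's actual implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a check command up to max_retries times,
# sleeping interval seconds after each inconclusive attempt.
poll_until() {
  local max_retries="$1" interval="$2"
  shift 2
  local attempt=1 rc
  while [ "$attempt" -le "$max_retries" ]; do
    rc=0
    "$@" || rc=$?
    case "$rc" in
      0) return 0 ;;  # check reports completion
      2) return 2 ;;  # check reports a terminal failure, stop retrying
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1  # retries exhausted without a final state
}

# Stand-in for a real status query: succeeds on the third call.
calls=0
fake_check() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

poll_until 5 0 fake_check && echo "done after $calls checks"
# prints "done after 3 checks"
```

Separating "failed" from "not ready yet" lets the loop stop immediately on a failed extraction instead of waiting out the remaining retries.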
#### local_file_step4_download.sh
Download the result ZIP and extract it.
Usage:
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
Output Structure:
extracted/
├── full.md # Markdown document (main result)
├── images/ # Extracted images
├── content_list.json # Structured content
└── layout.json # Layout analysis data
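A quick sanity check after extraction can confirm that the expected files are present. The sketch below assumes the four entries listed above always appear in the extract directory, which this page does not guarantee. `check_output` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report any expected entry missing from the extract directory.
check_output() {
  local dir="$1" missing=0 entry
  for entry in full.md images content_list.json layout.json; do
    if [ ! -e "$dir/$entry" ]; then
      echo "missing: $dir/$entry" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a throwaway directory that mimics the layout.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/images"
touch "$demo_dir/full.md" "$demo_dir/content_list.json" "$demo_dir/layout.json"
check_output "$demo_dir" && echo "output looks complete"
```

In the workflow above, the call would be `check_output extracted/` after step 4.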
Detailed Documentation
Complete Guide: See docs/Local_File_Parsing_Guide.md
clawhub install mineru-pdf-extractor