
Hwp Extract Pipeline

by @heoboong

HWP/HWPX/PDF extraction pipeline: attempt hwp-reader, then pyhwp, then OCR, with safe fallbacks. Use when agent needs reliable text extraction from Korean HW...

Version: v1.0.0
Downloads: 367
TERMINAL
clawhub install hwp-extract-pipeline

📖 About This Skill


name: hwp-extract-pipeline
description: "HWP/HWPX/PDF extraction pipeline: attempt hwp-reader, then pyhwp, then OCR, with safe fallbacks. Use when agent needs reliable text extraction from Korean HWP/HWPX or PDF/scan attachments."

hwp-extract-pipeline

A simple HWP/HWPX/PDF extraction pipeline skill. Its core goal is to reliably convert locally stored announcement documents (Hangul files) into text and return the result as JSON.

Quick usage

  • Script to run: scripts/extract_hwp.py
  • Input: a local file path (e.g. /home/vorox/.openclaw/agents/nalda-mail-opt/data//getImageFile.do)
  • Output: JSON printed to standard output and also saved in the data folder as _extracted.json
  • Priority (fallback order):
      1. Call hwp-reader (when the external skill can be invoked)
      2. pyhwp (venv) based extraction
      3. System OCR (poppler + tesseract); system packages may need to be installed
      4. strings-based fallback
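The fallback order above can be sketched as a small chain of extractors that are tried in turn until one returns non-empty text. This is a minimal illustration, not the contents of scripts/extract_hwp.py: the names `run_pipeline`, `strings_fallback`, and `to_json`, as well as the keys of the result dictionary, are assumptions made for this example. The hwp-reader, pyhwp, and OCR steps are represented only as callables that the caller supplies, since their actual invocation is not shown in the source.

```python
import json
import re
from typing import Callable, List, Optional, Tuple

# An extractor takes a file path and returns text, or None/"" on failure.
Extractor = Callable[[str], Optional[str]]


def run_pipeline(path: str, extractors: List[Tuple[str, Extractor]]) -> dict:
    """Try each (name, extractor) pair in order; stop at the first usable text."""
    errors = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # a failing step must not abort the chain
            errors.append({"method": name, "error": str(exc)})
            continue
        if text and text.strip():
            return {"path": path, "method": name, "text": text, "errors": errors}
        errors.append({"method": name, "error": "empty output"})
    # Every step failed: still return a well-formed result.
    return {"path": path, "method": None, "text": "", "errors": errors}


def strings_fallback(path: str, min_len: int = 4) -> str:
    """Last resort: collect printable ASCII runs from the raw bytes.

    ASCII-only by construction, so it cannot recover Korean text.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return "\n".join(run.decode("ascii") for run in runs)


def to_json(result: dict) -> str:
    """Serialize the result; ensure_ascii=False keeps Korean text readable."""
    return json.dumps(result, ensure_ascii=False, indent=2)
```

A caller would pass the steps in priority order, for example `run_pipeline(path, [("hwp-reader", call_hwp_reader), ("pyhwp", extract_with_pyhwp), ("ocr", run_ocr), ("strings", strings_fallback)])`, where the first three callables are hypothetical wrappers, and then print `to_json(result)`. Recording each failed step in `errors` makes it visible which fallback actually produced the text.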

Reference documents

  • scripts/README.md (brief usage examples and integration instructions)