🎁 Get the FREE AI Skills Starter Guide β€” Subscribe β†’
BytesAgainBytesAgain
πŸ¦€ ClawHub

Data Scraper

by @dinghaibin

Extract data from websites and APIs for analysis. Use when user needs to collect product prices from e-commerce sites, gather news articles, extract structur...

Versionv1.0.0
Downloads256
TERMINAL
clawhub install scraper-pro

πŸ“– About This Skill


name: data-scraper description: Extract data from websites and APIs for analysis. Use when user needs to collect product prices from e-commerce sites, gather news articles, extract structured data from web pages, build datasets from public sources, or automate data collection for research.

Data Scraper

Extract structured data from websites and APIs.

Quick Start

# Basic page scrape
python scripts/scrape.py https://example.com --output data.json

Core Features

  • CSS/XPath selectors: Target specific elements
  • Multiple output formats: JSON, CSV, Markdown
  • Pagination support: Scrape multiple pages
  • Rate limiting: Respect server limits
  • Authentication: Handle login/sessions
  • Usage

    python scripts/scrape.py [OPTIONS]

    Options: --url TEXT URL to scrape (required) --selector TEXT CSS selector for data extraction --output PATH Output file path --format FORMAT Output format: json, csv, markdown --limit NUM Maximum items to scrape --wait SECS Wait between requests --login URL Login URL for authenticated scraping

    Examples

    Product Price Collection

    python scripts/scrape.py \
      --url "https://example.com/products" \
      --selector ".product" \
      --output prices.json \
      --format json
    

    News Article Aggregation

    python scripts/scrape.py \
      --url "https://news.example.com/latest" \
      --selector "article" \
      --output news.md \
      --format markdown
    

    Configuration File

    Create scrape.yaml for complex scraping:

    url: https://example.com/products
    selectors:
      items: ".product-card"
      title: ".product-title"
      price: ".price::text"
      image: "img::attr(src)"
      link: "a::attr(href)"

    pagination: type: click button: ".next-page" max_pages: 10

    output: format: json file: products.json

    Best Practices

    1. Check robots.txt before scraping 2. Add delays between requests 3. Cache responses for development 4. Handle errors gracefully 5. Store raw HTML for debugging

    Legal Note

    Ensure you have permission to scrape target websites. Check Terms of Service and robots.txt.

    πŸ’‘ Examples

    Product Price Collection

    python scripts/scrape.py \
      --url "https://example.com/products" \
      --selector ".product" \
      --output prices.json \
      --format json
    

    News Article Aggregation

    python scripts/scrape.py \
      --url "https://news.example.com/latest" \
      --selector "article" \
      --output news.md \
      --format markdown
    

    πŸ“‹ Tips & Best Practices

    1. Check robots.txt before scraping 2. Add delays between requests 3. Cache responses for development 4. Handle errors gracefully 5. Store raw HTML for debugging