🎁 Get the FREE AI Skills Starter Guide β€” Subscribe β†’
BytesAgainBytesAgain
πŸ¦€ ClawHub

gpt-multimodal

by @lnj22

Analyze images and multi-frame sequences using OpenAI GPT series

Versionv0.1.0
⚑ When to Use
TriggerAction
- Extracting and reading text from images (OCR via vision models)
- Comparing multiple images to detect differences or changes
- Processing video frames to understand temporal progression
- Generating detailed image descriptions or captions
- Answering questions about visual content
πŸ“‹ Tips & Best Practices

Prompt Engineering for Vision

1. Be specific: "Count the number of people wearing red shirts" vs "Analyze this image" 2. Request structured output: Ask for JSON, lists, or tables when appropriate 3. Provide context: "This is a medical diagram showing..." helps the model understand 4. Use examples: Show the format you want in your prompt

Image Quality Guidelines

  • Use clear, well-lit images
  • Ensure text is readable at original size
  • Avoid extreme angles or distortions
  • Crop to relevant content to save tokens
  • Use standard orientations (avoid rotated images)
  • Multi-Image Analysis

  • Order matters: Present images in logical sequence
  • Reference images explicitly: "In the first image..."
  • Limit to 10-20 images per request for best results
  • Use low detail for large batches of similar images
  • View on ClawHub
    TERMINAL
    clawhub install pedestrian-traffic-counting-gpt-multimodal

    πŸ§ͺ Use this skill with your agent

    Most visitors already have an agent. Pick your environment, install or copy the workflow, then run the smoke-test prompt above.

    πŸ” Can't find the right skill?

    Search 60,000+ AI agent skills β€” free, no login needed.

    Search Skills β†’