
Dedupe

by @xueyetianya

Deduplication reference — exact matching, fuzzy matching, hash-based dedup, bloom filters, and data quality. Use when removing duplicate records, files, or data entries.

Version: v1.0.0
Downloads: 371
Installs: 2
TERMINAL
clawhub install dedupe

📖 About This Skill


name: "dedupe"
version: "1.0.0"
description: "Deduplication reference — exact matching, fuzzy matching, hash-based dedup, bloom filters, and data quality. Use when removing duplicate records, files, or data entries."
author: "BytesAgain"
homepage: "https://bytesagain.com"
source: "https://github.com/bytesagain/ai-skills"
tags: [dedupe, deduplication, data-quality, hash, fuzzy-match, etl, atomic]
category: "atomic"

Dedupe — Data Deduplication Reference

Quick-reference skill for deduplication strategies, algorithms, and data quality patterns.

When to Use

  • Removing duplicate rows from datasets or databases
  • Deduplicating files in storage systems
  • Implementing fuzzy matching for near-duplicate detection
  • Choosing between exact and probabilistic dedup methods
  • Building ETL pipelines with deduplication stages

Commands

    intro

    scripts/script.sh intro
    

    Overview of deduplication — types, strategies, and tradeoffs.

    exact

    scripts/script.sh exact
    

    Exact deduplication — hash-based, key-based, and sorting approaches.
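The output of `scripts/script.sh exact` is not reproduced on this page, so the following is only a minimal sketch of hash-based exact deduplication. The function names (`record_fingerprint`, `dedupe_exact`), the sample rows, and the choice to trim and lowercase key fields before hashing are all assumptions made for illustration.

```python
import hashlib
import json


def record_fingerprint(record, keys):
    """Hash the chosen key fields after light normalization (an assumption)."""
    normalized = [str(record.get(k, "")).strip().lower() for k in keys]
    payload = json.dumps(normalized, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedupe_exact(records, keys):
    """Keep the first record seen for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record, keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# Hypothetical sample data: the first two rows share an email after normalization.
rows = [
    {"email": "ann@example.com", "name": "Ann"},
    {"email": " ANN@example.com ", "name": "Ann B."},
    {"email": "bob@example.com", "name": "Bob"},
]
unique_rows = dedupe_exact(rows, keys=["email"])
```

Hashing only the key fields gives key-based dedup; passing every field name in `keys` would make it a whole-row comparison instead.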

    fuzzy

    scripts/script.sh fuzzy
    

    Fuzzy deduplication — similarity measures, blocking, and record linkage.
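As a rough illustration of similarity scoring combined with blocking, the sketch below uses the standard library's `difflib.SequenceMatcher`. The blocking key (first character of the normalized name), the `0.85` threshold, and the sample names are hypothetical choices, not values taken from the skill.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_near_duplicates(names, threshold=0.85):
    # Blocking: only compare names that share a first character, so the
    # number of pairwise comparisons stays far below n * (n - 1) / 2.
    blocks = defaultdict(list)
    for name in names:
        blocks[normalize(name)[:1]].append(name)

    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                score = similarity(block[i], block[j])
                if score >= threshold:
                    pairs.append((block[i], block[j], round(score, 2)))
    return pairs


names = ["Jon Smith", "John Smith", "Jane Doe", "jon  smith", "Alice Wong"]
pairs = find_near_duplicates(names)
```

A coarse blocking key like this one trades recall for speed: two records that differ in their first character are never compared, so real pipelines usually pick the key with that risk in mind.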

    files

    scripts/script.sh files
    

    File-level deduplication — fdupes, jdupes, rdfind, and storage dedup.
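One common way to find duplicate files is to group by size first and hash only the files whose sizes collide. The sketch below shows that idea with the standard library; it is not a description of how fdupes, jdupes, or rdfind are implemented, and the function names and temporary sample files are assumptions.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Return groups of paths under root whose contents hash identically."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_hash[file_sha256(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]


# Demo on throwaway files: a.txt and b.txt match; c.txt has the same size only.
with tempfile.TemporaryDirectory() as root:
    for name, data in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"diff")]:
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(data)
    duplicate_groups = find_duplicate_files(root)
    duplicate_names = [[os.path.basename(p) for p in g] for g in duplicate_groups]
```

The sketch only reports groups. Deleting or hard-linking duplicates is left out on purpose, since that step is destructive and deserves a separate review.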

    algorithms

    scripts/script.sh algorithms
    

    Dedup algorithms — bloom filters, HyperLogLog, MinHash, SimHash.
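To make the probabilistic side concrete, here is a small Bloom filter sketch used to drop repeats from a stream. The class, the bit-array size, the number of hash functions, and the double-hashing scheme built on SHA-256 are illustrative assumptions; a production filter would size these from the expected item count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def dedupe_stream(items, bloom):
    """Yield items not yet recorded in the filter."""
    for item in items:
        if item in bloom:
            continue  # probably seen before; a false positive drops a new item
        bloom.add(item)
        yield item


events = ["a", "b", "a", "c", "b"]
bloom = BloomFilter()
kept = list(dedupe_stream(events, bloom))
```

The tradeoff against an exact `set` is memory for accuracy: the filter uses a fixed number of bits regardless of item size, but a small fraction of genuinely new items may be discarded as duplicates.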

    sql

    scripts/script.sh sql
    

    SQL deduplication patterns — ROW_NUMBER, DISTINCT, GROUP BY strategies.
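The `ROW_NUMBER` pattern can be tried end to end with Python's built-in `sqlite3` module, as sketched below. The `customers` table, its columns, and the "keep the newest row per email" rule are hypothetical, and the query assumes the bundled SQLite is version 3.25 or later, where window functions are available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
    INSERT INTO customers (id, email, updated_at) VALUES
      (1, 'ann@example.com', '2024-01-01'),
      (2, 'ann@example.com', '2024-03-01'),
      (3, 'bob@example.com', '2024-02-01');
    """
)

# Rank rows within each email, newest first, then delete every row ranked > 1.
conn.execute(
    """
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY updated_at DESC, id DESC
                   ) AS rn
            FROM customers
        ) AS ranked
        WHERE rn > 1
    )
    """
)
remaining = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
```

Compared with `DISTINCT` or `GROUP BY`, ranking with `ROW_NUMBER` lets the `ORDER BY` clause decide which duplicate survives, here the most recently updated one.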

    cli

    scripts/script.sh cli
    

    Command-line dedup tools — sort, uniq, awk, and stream processing.
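For readers who want the same behavior inside a script, the sketch below mirrors two familiar shell idioms in Python: order-preserving dedup (what `awk '!seen[$0]++'` does) and counting repeated lines (similar in spirit to `sort | uniq -c`). Note that `uniq` on its own only collapses adjacent duplicates, which is why it is normally paired with `sort`. The function names and sample lines are assumptions.

```python
from collections import Counter


def unique_lines(lines):
    """Yield each distinct line once, in first-seen order (streaming)."""
    seen = set()
    for line in lines:
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            yield key


def duplicate_counts(lines):
    """Return {line: count} for lines that occur more than once."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return {line: n for line, n in counts.items() if n > 1}


sample = ["b\n", "a\n", "b\n", "c\n", "a\n", "b\n"]
first_seen = list(unique_lines(sample))
repeated = duplicate_counts(sample)
```

`unique_lines` accepts any iterable of lines, so it can be fed an open file or `sys.stdin` directly; memory grows with the number of distinct lines, not the total input size.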

    checklist

    scripts/script.sh checklist
    

    Deduplication quality checklist and validation steps.
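The checklist itself is not shown on this page. As one possible shape for the validation step, the sketch below compares a dataset before and after deduplication; the specific checks, the `validate_dedup` name, and the sample records are assumptions rather than the skill's actual checklist.

```python
from collections import Counter


def validate_dedup(before, after, key):
    """Compare key distributions before and after a dedup pass."""
    before_keys = Counter(key(record) for record in before)
    after_keys = Counter(key(record) for record in after)
    expected_removed = sum(n - 1 for n in before_keys.values())
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_removed": len(before) - len(after),
        # Every key should appear exactly once in the output.
        "no_duplicates_remain": all(n == 1 for n in after_keys.values()),
        # Dedup should drop repeats, never an entire key.
        "no_keys_lost": set(before_keys) == set(after_keys),
        # The drop in row count should equal the number of surplus rows.
        "removed_matches_expected": len(before) - len(after) == expected_removed,
    }


before = [{"id": 1}, {"id": 1}, {"id": 2}]
after = [{"id": 1}, {"id": 2}]
report = validate_dedup(before, after, key=lambda record: record["id"])
```

Running a report like this after each pipeline stage makes silent data loss visible: a `False` in `no_keys_lost` means the dedup step removed records it should have kept.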

    help

    scripts/script.sh help

    Show usage help for the commands listed above.

    version

    scripts/script.sh version

    Print the skill version.

    Configuration

    | Variable | Description |
    |----------|-------------|
    | DEDUPE_DIR | Data directory (default: ~/.dedupe/) |
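How the script reads `DEDUPE_DIR` is not documented here. A tool that follows the table above might resolve the directory roughly as in this sketch, where the `resolve_dedupe_dir` helper and the `/tmp/dedupe-data` override are hypothetical.

```python
import os
from pathlib import Path


def resolve_dedupe_dir(environ=None):
    """Return DEDUPE_DIR if set and non-empty, else the documented default."""
    env = os.environ if environ is None else environ
    raw = env.get("DEDUPE_DIR") or "~/.dedupe/"
    return Path(raw).expanduser()


default_dir = resolve_dedupe_dir({})
custom_dir = resolve_dedupe_dir({"DEDUPE_DIR": "/tmp/dedupe-data"})
```

Passing the environment as a parameter keeps the helper easy to test without modifying the real process environment.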


    *Powered by BytesAgain | bytesagain.com | hello@bytesagain.com*
