Dedupe
by @xueyetianya
clawhub install dedupe
📖 About This Skill
name: "dedupe"
version: "1.0.0"
description: "Deduplication reference — exact matching, fuzzy matching, hash-based dedup, bloom filters, and data quality. Use when removing duplicate records, files, or data entries."
author: "BytesAgain"
homepage: "https://bytesagain.com"
source: "https://github.com/bytesagain/ai-skills"
tags: [dedupe, deduplication, data-quality, hash, fuzzy-match, etl, atomic]
category: "atomic"
Dedupe — Data Deduplication Reference
Quick-reference skill for deduplication strategies, algorithms, and data quality patterns.
When to Use
Use when removing duplicate records, files, or data entries.
Commands
intro
scripts/script.sh intro
Overview of deduplication — types, strategies, and tradeoffs.
exact
scripts/script.sh exact
Exact deduplication — hash-based, key-based, and sorting approaches.
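As an illustration of the hash-based approach, here is a minimal Python sketch. It is not the output of `scripts/script.sh exact`; the function names `record_fingerprint` and `dedupe_exact` and the sample rows are hypothetical. It assumes records are JSON-serializable dicts and that two records are duplicates only when every field matches.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a SHA-256 hex digest of the record's canonical JSON form."""
    # sort_keys makes the digest independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_exact(records):
    """Yield each distinct record once, keeping the first occurrence."""
    seen = set()
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield record


# Hypothetical sample data.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"email": "a@example.com", "id": 1},  # same content, different key order
    {"id": 2, "email": "b@example.com"},
]
unique_rows = list(dedupe_exact(rows))
print(len(unique_rows))  # 2
```

Storing digests instead of whole records keeps memory bounded by the number of distinct records. For key-based dedup, the same loop works with a chosen field (for example the email) in place of the full-record digest.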
fuzzy
scripts/script.sh fuzzy
Fuzzy deduplication — similarity measures, blocking, and record linkage.
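A rough Python sketch of how a similarity measure and blocking fit together, using only the standard library. The blocking rule (first letter of the normalized name), the 0.85 threshold, and the sample names are assumptions chosen for illustration; real record linkage usually needs domain-specific blocking keys and tuned thresholds.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def block_key(name: str) -> str:
    """Hypothetical blocking rule: first character of the normalized name."""
    return normalize(name)[:1]


def find_fuzzy_duplicates(names, threshold=0.85):
    """Return (a, b, score) for pairs in the same block scoring >= threshold."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)

    pairs = []
    # Only compare within a block, which avoids the full n^2 comparison.
    for members in blocks.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = members[i], members[j]
                score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
                if score >= threshold:
                    pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical sample data.
names = ["Jon Smith", "John Smith", "Jane Doe", "jon  smith", "Mary Major"]
pairs = find_fuzzy_duplicates(names)
for a, b, score in pairs:
    print(f"{a!r} ~ {b!r} (score {score})")
```

Blocking trades recall for speed: two true duplicates that land in different blocks (for example a typo in the first letter) are never compared.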
files
scripts/script.sh files
File-level deduplication — fdupes, jdupes, rdfind, and storage dedup.
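The general idea behind content-based duplicate file finders can be sketched in Python as follows: group files by size first, then hash only the files whose sizes collide. This is a simplified illustration and does not reproduce how fdupes, jdupes, or rdfind are implemented; `file_digest` and `find_duplicate_files` are hypothetical names.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Return lists of paths under root that have identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"hello"), ("c.txt", b"world")]:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(data)
    duplicate_groups = find_duplicate_files(tmp)
    duplicate_names = [[os.path.basename(p) for p in g] for g in duplicate_groups]

print(duplicate_names)  # [['a.txt', 'b.txt']]
```

The sketch only reports duplicates. Deleting or hard-linking them is left out on purpose, since that step deserves a dry run and a backup.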
algorithms
scripts/script.sh algorithms
Dedup algorithms — bloom filters, HyperLogLog, MinHash, SimHash.
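To make one of these concrete, below is a small Bloom filter sketch in Python for "have I probably seen this item before?" checks. The class, its parameters, and the double-hashing scheme built on SHA-256 are illustrative assumptions, not a reference implementation. It assumes `expected_items` is at least 1. A Bloom filter never reports a false negative, but it can report a false positive, so a "seen" answer means "probably seen".

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter sized from an expected item count."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        ln2 = math.log(2)
        self.size = max(1, math.ceil(-expected_items * math.log(false_positive_rate) / ln2 ** 2))
        self.num_hashes = max(1, round(self.size / expected_items * ln2))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Hypothetical stream of URLs.
seen = BloomFilter(expected_items=1000)
unique_urls = []
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
    if url not in seen:
        seen.add(url)
        unique_urls.append(url)

print(len(unique_urls))
```

Because of false positives, a Bloom filter used this way may occasionally drop a record that was actually new. When that is unacceptable, it works better as a cheap pre-filter in front of an exact check.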
sql
scripts/script.sh sql
SQL deduplication patterns — ROW_NUMBER, DISTINCT, GROUP BY strategies.
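The ROW_NUMBER pattern can be tried out with Python's built-in `sqlite3` module, as in the sketch below. The `customers` table, its columns, and the rule "keep the earliest row per email" are hypothetical. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, created_at TEXT);
    INSERT INTO customers (email, created_at) VALUES
        ('a@example.com', '2024-01-01'),
        ('a@example.com', '2024-02-01'),
        ('b@example.com', '2024-01-15');
    """
)

# Number the rows within each email group, oldest first, then delete
# everything except the first row of each group.
conn.execute(
    """
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email ORDER BY created_at, id
                   ) AS rn
            FROM customers
        ) AS ranked
        WHERE rn > 1
    )
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT email, created_at FROM customers ORDER BY id"
).fetchall()
print(remaining)  # [('a@example.com', '2024-01-01'), ('b@example.com', '2024-01-15')]
```

The `ORDER BY` inside the window decides which row survives, so it is worth making it deterministic (here `id` breaks ties). Running the inner `SELECT` on its own first shows what would be deleted.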
cli
scripts/script.sh cli
Command-line dedup tools — sort, uniq, awk, and stream processing.
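For readers who want to see what the common shell idioms do, here is a Python sketch of two of them: order-preserving line dedup (the behavior of `awk '!seen[$0]++'`) and per-line counts (roughly what `sort | uniq -c` reports). The function names and the sample input are hypothetical.

```python
import io
from collections import Counter


def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        key = line.rstrip("\n")
        if key not in seen:
            seen.add(key)
            yield key


def count_lines(lines):
    """Count how many times each line occurs."""
    return Counter(line.rstrip("\n") for line in lines)


# Hypothetical input; any iterable of lines (such as an open file) works.
log = "b\na\nb\nc\na\n"
first_seen = list(unique_lines(io.StringIO(log)))
counts = count_lines(io.StringIO(log))

print(first_seen)              # ['b', 'a', 'c']
print(sorted(counts.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```

`unique_lines` is a generator, so it processes a stream one line at a time, but the `seen` set still grows with the number of distinct lines. Plain `uniq` without a preceding `sort` only collapses adjacent duplicates, which is why the two are usually paired.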
checklist
scripts/script.sh checklist
Deduplication quality checklist and validation steps.
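A few validation steps can be automated. The Python sketch below compares a dataset before and after deduplication; the specific checks, the `validate_dedup` name, and the sample data are assumptions for illustration and may not match the items printed by `scripts/script.sh checklist`.

```python
def validate_dedup(before, after, key):
    """Compare records before and after dedup using a key function."""
    before_keys = [key(record) for record in before]
    after_keys = [key(record) for record in after]
    return {
        "input_count": len(before_keys),
        "output_count": len(after_keys),
        "removed": len(before_keys) - len(after_keys),
        # Every key should appear at most once in the output.
        "no_duplicates_remain": len(after_keys) == len(set(after_keys)),
        # Dedup should drop repeats, not whole entities.
        "no_keys_lost": set(before_keys) == set(after_keys),
    }


# Hypothetical sample data.
before = [{"id": 1}, {"id": 1}, {"id": 2}]
after = [{"id": 1}, {"id": 2}]
report = validate_dedup(before, after, key=lambda record: record["id"])
print(report["removed"], report["no_duplicates_remain"], report["no_keys_lost"])  # 1 True True
```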
help
scripts/script.sh help
version
scripts/script.sh version
Configuration
| Variable | Description |
|----------|-------------|
| DEDUPE_DIR | Data directory (default: ~/.dedupe/) |
*Powered by BytesAgain | bytesagain.com | hello@bytesagain.com*