
Bigdata

by @bytesagain3

Split large files, run parallel processing, and stream batch analysis. Use when sampling datasets, aggregating logs, or transforming bulk data.

Version: v2.0.1
Downloads: 540
Installs: 1

Terminal:

    clawhub install bigdata

📖 About This Skill


name: bigdata
version: "2.0.0"
author: BytesAgain
homepage: https://bytesagain.com
source: https://github.com/bytesagain/ai-skills
license: MIT-0
tags: [bigdata, tool, utility]
description: "Split large files, run parallel processing, and stream batch analysis. Use when sampling datasets, aggregating logs, or transforming bulk data."

BigData

A comprehensive data processing toolkit for ingesting, transforming, querying, filtering, aggregating, and managing data workflows, all from the command line with local timestamped log storage.

Commands

| Command | Description |
|---------|-------------|
| bigdata ingest | Ingest raw data into the system. Without args, shows recent ingest entries |
| bigdata transform | Record a data transformation step. Without args, shows recent transforms |
| bigdata query | Log and track data queries. Without args, shows recent queries |
| bigdata filter | Apply and record data filters. Without args, shows recent filters |
| bigdata aggregate | Record aggregation operations. Without args, shows recent aggregations |
| bigdata visualize | Log visualization tasks. Without args, shows recent visualizations |
| bigdata export | Log export operations. Without args, shows recent exports |
| bigdata sample | Record data sampling operations. Without args, shows recent samples |
| bigdata schema | Track schema definitions and changes. Without args, shows recent schemas |
| bigdata validate | Log data validation checks. Without args, shows recent validations |
| bigdata pipeline | Record pipeline configurations. Without args, shows recent pipelines |
| bigdata profile | Log data profiling operations. Without args, shows recent profiles |
| bigdata stats | Show summary statistics across all entry types |
| bigdata search | Search across all log entries for a keyword |
| bigdata recent | Show the 20 most recent activity entries from the history log |
| bigdata status | Health check: version, data dir, total entries, disk usage, last activity |
| bigdata help | Show all available commands |
| bigdata version | Print version (v2.0.0) |

Each data command (ingest, transform, query, etc.) works the same way:

  • With arguments: saves the entry with a timestamp to its dedicated .log file and records it in the activity history
  • Without arguments: displays the 20 most recent entries from that command's log
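
The two modes described above can be sketched as a small shell function. This is an illustration only, not the skill's actual source: the function name `record_or_show`, the `BIGDATA_DIR` variable, and the exact line layout (including the three-field history line) are assumptions. It writes to a temporary directory by default so it does not touch any real logs.

```shell
# Hypothetical sketch of the "record or show" behaviour; not the real bigdata source.
# BIGDATA_DIR and record_or_show are illustrative names.
BIGDATA_DIR="${BIGDATA_DIR:-$(mktemp -d)}"

record_or_show() {
    cmd="$1"
    shift
    log="$BIGDATA_DIR/$cmd.log"

    if [ "$#" -eq 0 ]; then
        # No arguments: show the 20 most recent entries for this command.
        if [ -f "$log" ]; then
            tail -n 20 "$log"
        fi
        return 0
    fi

    ts="$(date '+%Y-%m-%d %H:%M')"
    # With arguments: append a timestamped entry to the command's own log...
    printf '%s|%s\n' "$ts" "$*" >> "$log"
    # ...and note the action in the shared history log (assumed layout).
    printf '%s|%s|%s\n' "$ts" "$cmd" "$*" >> "$BIGDATA_DIR/history.log"
}

record_or_show ingest "orders.csv loaded"   # records an entry
record_or_show ingest                       # prints the recent entries
```

Keeping one file per command plus a shared history file means each command's view stays a simple `tail`, while cross-command review only needs to read one file.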
Data Storage

All data is stored locally in plain-text log files:

    ~/.local/share/bigdata/
    ├── ingest.log          # Ingested data entries
    ├── transform.log       # Transformation records
    ├── query.log           # Query log
    ├── filter.log          # Filter operations
    ├── aggregate.log       # Aggregation records
    ├── visualize.log       # Visualization tasks
    ├── export.log          # Export operations
    ├── sample.log          # Sampling records
    ├── schema.log          # Schema definitions
    ├── validate.log        # Validation checks
    ├── pipeline.log        # Pipeline configurations
    ├── profile.log         # Profiling results
    └── history.log         # Unified activity log with timestamps

Each entry is stored as YYYY-MM-DD HH:MM| followed by the entry text, for easy parsing and export.
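
Because the timestamp comes first and is closed by a pipe, the logs can be read with standard shell tools. The following is a minimal sketch, assuming each line has the form `YYYY-MM-DD HH:MM|text`; the sample file and the `entries_per_day` helper are made up for illustration and are not part of the skill.

```shell
# Illustrative sample in the assumed "timestamp|text" layout (not real bigdata output).
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
2024-03-01 09:15|customer_orders_2024.csv loaded
2024-03-01 09:40|normalize dates to ISO-8601
2024-03-02 08:05|final dataset pushed
EOF

# Count entries per calendar day: the date is the first 10 characters of each line.
entries_per_day() {
    cut -c1-10 "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

# Split each line on the first "|"; anything after it, including further
# pipes, ends up in $text.
while IFS='|' read -r ts text; do
    printf 'when=%s what=%s\n' "$ts" "$text"
done < "$sample_log"

entries_per_day "$sample_log"
```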

Requirements

  • Bash 4.0+ (uses set -euo pipefail)
  • Standard UNIX utilities: date, wc, du, grep, head, tail, cat
  • No external dependencies or API keys required
  • Works offline: all data stays on your machine
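
A quick preflight check can confirm that the utilities listed above are on `PATH` before installing. This is a sketch rather than part of the skill; `missing_tools` is a hypothetical helper name.

```shell
# Print the name of every requested utility that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing="$(missing_tools date wc du grep head tail cat)"
if [ -n "$missing" ]; then
    echo "missing utilities: $missing" >&2
else
    echo "all required utilities found"
fi

# BASH_VERSION is only set when running under Bash; the requirement is 4.0 or later.
if [ -n "${BASH_VERSION:-}" ]; then
    echo "bash major version: ${BASH_VERSION%%.*}"
fi
```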

When to Use

1. Data pipeline tracking: Record each step of a multi-stage data workflow (ingest → transform → validate → export) with full timestamps for audit trails
2. Quick data logging: Capture observations, measurements, or notes about datasets directly from the terminal without opening a separate app
3. Schema management: Keep track of schema definitions, changes, and validation rules as your data evolves over time
4. Data quality monitoring: Log validation checks and profiling results to build a history of data quality metrics
5. Workflow documentation: Use the search and recent commands to review what data operations were performed, when, and in what order

Examples

Log a complete data workflow

    # Ingest raw data
    bigdata ingest "customer_orders_2024.csv - 1.2M rows loaded"

    # Transform it
    bigdata transform "normalize dates to ISO-8601, trim whitespace, deduplicate"

    # Validate the output
    bigdata validate "all required fields present, no nulls in customer_id"

    # Record the schema
    bigdata schema "orders: id(int), customer_id(int), amount(decimal), date(date)"

    # Export when ready
    bigdata export "final dataset pushed to analytics warehouse"

Search and review activity

    # Search across all logs for a keyword
    bigdata search "customer"

    # Check overall statistics
    bigdata stats

    # View recent activity across all commands
    bigdata recent

    # Health check
    bigdata status

Pipeline and profiling

    # Define a pipeline
    bigdata pipeline "daily-etl: ingest → clean → validate → load, runs at 02:00 UTC"

    # Profile a dataset
    bigdata profile "users table: 500K rows, 12 columns, 0.3% nulls in email field"

    # Sample data for testing
    bigdata sample "random 10% sample from transactions for QA testing"

    # Record an aggregation
    bigdata aggregate "monthly revenue by region - Q1 totals computed"

Filter and query tracking

    # Log a filter operation
    bigdata filter "removed records older than 2020-01-01, kept 850K of 1.2M rows"

    # Track a query
    bigdata query "SELECT region, SUM(revenue) FROM orders GROUP BY region"

    # Log a visualization
    bigdata visualize "bar chart: monthly revenue trend, exported as PNG"

Output

All commands print confirmation to stdout. Data is persisted in ~/.local/share/bigdata/. Use bigdata stats for a summary or bigdata search to find specific entries across all logs.
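
Since the logs are plain text, a similar summary and search can be approximated with ordinary tools when the CLI is not at hand. The sketch below works on a throwaway directory with made-up entries; it is not how `bigdata stats` or `bigdata search` are implemented, and `count_entries` and `search_logs` are hypothetical helpers.

```shell
# Throwaway directory with made-up entries in the assumed "timestamp|text" layout.
logs="$(mktemp -d)"
printf '%s\n' '2024-03-01 09:15|customer_orders_2024.csv loaded' > "$logs/ingest.log"
printf '%s\n' '2024-03-01 09:50|no nulls in customer_id' \
              '2024-03-01 10:05|all required fields present' > "$logs/validate.log"

# One "name count" line per log file.
count_entries() {
    for f in "$1"/*.log; do
        printf '%s %s\n' "$(basename "$f" .log)" "$(grep -c '' "$f")"
    done
}

# Case-insensitive keyword search across every log, prefixed with the file name.
search_logs() {
    grep -i -H -- "$2" "$1"/*.log
}

count_entries "$logs"
search_logs "$logs" customer
```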


*Powered by BytesAgain | bytesagain.com | hello@bytesagain.com*
