Bigdata
by @bytesagain3
Split large files, run parallel processing, and stream batch analysis. Use when sampling datasets, aggregating logs, or transforming bulk data.
clawhub install bigdata
About This Skill
name: bigdata
version: "2.0.0"
author: BytesAgain
homepage: https://bytesagain.com
source: https://github.com/bytesagain/ai-skills
license: MIT-0
tags: [bigdata, tool, utility]
description: "Split large files, run parallel processing, and stream batch analysis. Use when sampling datasets, aggregating logs, or transforming bulk data."
BigData
A comprehensive data processing toolkit for ingesting, transforming, querying, filtering, aggregating, and managing data workflows, all from the command line with local timestamped log storage.
Commands
| Command | Description |
|---------|-------------|
| bigdata ingest | Ingest raw data into the system. Without args, shows recent ingest entries |
| bigdata transform | Record a data transformation step. Without args, shows recent transforms |
| bigdata query | Log and track data queries. Without args, shows recent queries |
| bigdata filter | Apply and record data filters. Without args, shows recent filters |
| bigdata aggregate | Record aggregation operations. Without args, shows recent aggregations |
| bigdata visualize | Log visualization tasks. Without args, shows recent visualizations |
| bigdata export | Log export operations. Without args, shows recent exports |
| bigdata sample | Record data sampling operations. Without args, shows recent samples |
| bigdata schema | Track schema definitions and changes. Without args, shows recent schemas |
| bigdata validate | Log data validation checks. Without args, shows recent validations |
| bigdata pipeline | Record pipeline configurations. Without args, shows recent pipelines |
| bigdata profile | Log data profiling operations. Without args, shows recent profiles |
| bigdata stats | Show summary statistics across all entry types |
| bigdata search | Search across all log entries for a keyword |
| bigdata recent | Show the 20 most recent activity entries from the history log |
| bigdata status | Health check: version, data dir, total entries, disk usage, last activity |
| bigdata help | Show all available commands |
| bigdata version | Print version (v2.0.0) |
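Each data command (ingest, transform, query, etc.) works the same way:
With arguments, it saves the entry with a timestamp to the matching .log file and records it in the activity history.
Without arguments, it shows the recent entries for that command.
Data Storage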
All data is stored locally in plain-text log files:
~/.local/share/bigdata/
├── ingest.log      # Ingested data entries
├── transform.log   # Transformation records
├── query.log       # Query log
├── filter.log      # Filter operations
├── aggregate.log   # Aggregation records
├── visualize.log   # Visualization tasks
├── export.log      # Export operations
├── sample.log      # Sampling records
├── schema.log      # Schema definitions
├── validate.log    # Validation checks
├── pipeline.log    # Pipeline configurations
├── profile.log     # Profiling results
└── history.log     # Unified activity log with timestamps
Each entry is stored on one line as a YYYY-MM-DD HH:MM timestamp, a | separator, and then the entry text, for easy parsing and export.
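The storage behavior described above can be imitated in a few lines of shell. This is a minimal sketch, not the tool's actual source: the function name `record_entry` and the variable `BIGDATA_DEMO_DIR` are hypothetical, the history line format and the count of 20 listed entries are assumptions, and the sketch writes to a temporary directory rather than ~/.local/share/bigdata/.

```shell
set -eu

# Hypothetical demo directory; the real tool stores its logs in ~/.local/share/bigdata/.
BIGDATA_DEMO_DIR="$(mktemp -d)"

# record_entry <command> [text...]  (hypothetical name, not part of bigdata)
record_entry() {
  cmd=$1
  shift
  log="$BIGDATA_DEMO_DIR/$cmd.log"
  if [ "$#" -eq 0 ]; then
    # No text given: show recent entries, if any (the count of 20 is an assumption).
    if [ -f "$log" ]; then tail -n 20 "$log"; fi
  else
    stamp=$(date '+%Y-%m-%d %H:%M')
    # Per-command log line: "YYYY-MM-DD HH:MM|text"
    printf '%s|%s\n' "$stamp" "$*" >> "$log"
    # Assumed history format: timestamp, command name, entry text.
    printf '%s|%s|%s\n' "$stamp" "$cmd" "$*" >> "$BIGDATA_DEMO_DIR/history.log"
    printf 'saved to %s\n' "$log"
  fi
}

record_entry ingest "customer_orders_2024.csv loaded"   # writes one entry
record_entry ingest                                     # lists it back
```

Keeping one file per command plus a shared history file means a per-command listing is a plain `tail`, while cross-command review only needs to read a single file.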
Requirements
Bash, run in strict mode (set -euo pipefail)
Standard Unix utilities: date, wc, du, grep, head, tail, cat
When to Use
1. Data pipeline tracking: Record each step of a multi-stage data workflow (ingest → transform → validate → export) with full timestamps for audit trails
2. Quick data logging: Capture observations, measurements, or notes about datasets directly from the terminal without opening a separate app
3. Schema management: Keep track of schema definitions, changes, and validation rules as your data evolves over time
4. Data quality monitoring: Log validation checks and profiling results to build a history of data quality metrics
5. Workflow documentation: Use the search and recent commands to review what data operations were performed, when, and in what order
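For the pipeline-tracking case, the logged message can be built from measured values instead of being typed by hand. The following is a minimal sketch under stated assumptions: `log_ingest` is a hypothetical wrapper, the input is assumed to be a CSV with exactly one header row, and when `bigdata` is not on the PATH the wrapper only prints the command it would have run.

```shell
set -eu

# log_ingest <csv-file>: hypothetical wrapper that counts data rows and records them.
log_ingest() {
  file=$1
  # Count lines, then subtract the header row (assumes exactly one header line).
  total=$(wc -l < "$file" | tr -d ' ')
  rows=$((total - 1))
  message="$(basename "$file"): $rows rows loaded"
  if command -v bigdata >/dev/null 2>&1; then
    bigdata ingest "$message"
  else
    # bigdata is not installed here, so just show what would be run.
    printf 'would run: bigdata ingest "%s"\n' "$message"
  fi
}

# Demo input: one header line followed by three data rows.
demo_csv="$(mktemp)"
printf 'id,amount\n1,10\n2,20\n3,30\n' > "$demo_csv"
log_ingest "$demo_csv"
```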
Examples
Log a complete data workflow
# Ingest raw data
bigdata ingest "customer_orders_2024.csv - 1.2M rows loaded"
# Transform it
bigdata transform "normalize dates to ISO-8601, trim whitespace, deduplicate"
# Validate the output
bigdata validate "all required fields present, no nulls in customer_id"
# Record the schema
bigdata schema "orders: id(int), customer_id(int), amount(decimal), date(date)"
# Export when ready
bigdata export "final dataset pushed to analytics warehouse"
Search and review activity
# Search across all logs for a keyword
bigdata search "customer"
# Check overall statistics
bigdata stats
# View recent activity across all commands
bigdata recent
# Health check
bigdata status
Pipeline and profiling
# Define a pipeline
bigdata pipeline "daily-etl: ingest → clean → validate → load, runs at 02:00 UTC"
# Profile a dataset
bigdata profile "users table: 500K rows, 12 columns, 0.3% nulls in email field"
# Sample data for testing
bigdata sample "random 10% sample from transactions for QA testing"
# Record an aggregation
bigdata aggregate "monthly revenue by region - Q1 totals computed"
Filter and query tracking
# Log a filter operation
bigdata filter "removed records older than 2020-01-01, kept 850K of 1.2M rows"
# Track a query
bigdata query "SELECT region, SUM(revenue) FROM orders GROUP BY region"
# Log a visualization
bigdata visualize "bar chart: monthly revenue trend, exported as PNG"
Output
All commands print confirmation to stdout. Data is persisted in ~/.local/share/bigdata/. Use bigdata stats for a summary or bigdata search to find specific entries across all logs.
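Because the logs are plain text, entries can also be read directly with standard tools. The sketch below is an illustration rather than a feature of bigdata: it assumes each line is a timestamp, a `|`, and the entry text, and the helper name `log_to_csv` is hypothetical. It splits on the first `|` only, so entry text that itself contains `|` stays intact.

```shell
set -eu

# log_to_csv <logfile>: hypothetical helper that prints the entries as quoted CSV rows.
log_to_csv() {
  printf '"timestamp","entry"\n'
  while IFS= read -r line || [ -n "$line" ]; do
    stamp=${line%%|*}   # everything before the first "|"
    text=${line#*|}     # everything after the first "|"
    # Double any embedded quotes, as CSV quoting requires.
    text=$(printf '%s' "$text" | sed 's/"/""/g')
    printf '"%s","%s"\n' "$stamp" "$text"
  done < "$1"
}

# Demo log with made-up entries; the second one contains quotes and an extra "|".
demo_log="$(mktemp)"
printf '%s\n' \
  '2024-05-01 09:30|orders.csv loaded' \
  '2024-05-01 09:45|filter: status="paid" | region=EU' > "$demo_log"

csv_output="$(log_to_csv "$demo_log")"
printf '%s\n' "$csv_output"
```

To try it against real data, pass one of the files under ~/.local/share/bigdata/ instead of the demo log, after checking that the lines match the assumed format.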
*Powered by BytesAgain | bytesagain.com | hello@bytesagain.com*