The Ultimate CSV Scripting Guide
Automate data validation, cleaning, and reporting using standard Unix utilities like awk, sed, and join.
Counting Records
Standard counting often includes the header row. Use these methods to get accurate data counts.
# Count lines including header
wc -l data.csv
# Count data rows only (Excludes line 1)
awk 'NR > 1' data.csv | wc -l
# Count using variable for scripts
ROW_COUNT=$(tail -n +2 data.csv | grep -c .)
Column & Field Validation
Check if every row adheres to the expected schema (e.g., exactly 8 columns).
# Find rows that DO NOT have exactly 8 columns
awk -F',' 'NF != 8 { print "Error: Line " NR " has " NF " fields" }' data.csv
# Validate that Column 3 is always numeric
awk -F',' '$3 !~ /^[0-9]+$/ { print "Non-numeric ID at line " NR ": " $3 }' data.csv
Data Transformation
Convert case, replace delimiters, or calculate new fields on the fly.
# Convert CSV to Pipe-Delimited (|)
sed 's/,/|/g' data.csv > output.psv
# Uppercase the names in Column 2
awk -F',' 'BEGIN{OFS=","} {$2=toupper($2); print}' data.csv
# Add a "Total" column (Col4 = Col2 + Col3)
awk -F',' 'BEGIN{OFS=","} {if(NR>1) $4=$2+$3; else $4="Total"; print}' data.csv
Merging Files (Relational Joins)
Use the join command to combine two CSVs on a common ID column.
# Sort files first (Join requires sorted input)
sort -t',' -k1,1 users.csv > users_sorted.csv
sort -t',' -k1,1 orders.csv > orders_sorted.csv
# Join on Column 1 (ID)
join -t',' -1 1 -2 1 users_sorted.csv orders_sorted.csv > joined_data.csv
Note: The -1 1 -2 1 flags tell join to use Column 1 of File 1 and Column 1 of File 2 as the keys.
Common One-Liners
| Requirement | Command |
|---|---|
| Remove Header Row | sed '1d' file.csv |
| Sum values in Col 5 | awk -F, '{s+=$5} END{print s}' file.csv |
| Unique values in Col 1 | cut -d, -f1 file.csv | sort -u |
| Filter by Date (regex) | grep "2023-12-.." file.csv |