Text Processing (grep, sed, awk)
Why Text Processing Matters
Linux administration is mostly reading and transforming text: logs, configs, CSV exports, command output. grep, sed, and awk cover most ad-hoc analysis without loading data into a database.
grep — Search Patterns
grep "failed" /var/log/auth.log
grep -n "error" app.log # line numbers
grep -v "DEBUG" app.log # invert match (exclude)
grep -c "200" access.log # count matching lines
grep -w "root" /etc/passwd # whole word only
grep -i "warning" app.log # case insensitive
# Recursive with file filter
grep -r --include="*.conf" "listen" /etc/
grep -r --exclude-dir=.git "TODO" .
# Context lines
grep -C 3 "Exception" app.log # 3 lines before and after
grep -B 2 -A 5 "FATAL" app.log
Extended and Fixed-String Modes
grep -E "^(error|warn|fatal)" syslog # extended regex (ERE)
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log
grep -F "error[" app.log # fixed string (no regex; faster)
grep -Ff patterns.txt access.log # patterns from file
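To see why the mode matters, here is a small sketch on a throwaway file (the file contents are invented for illustration). The same pattern text matches differently depending on whether grep reads it as a regex or as a literal string.

```shell
#!/bin/sh
# Hypothetical sample data, written to a temp file so nothing real is touched.
tmp=$(mktemp)
printf '%s\n' 'price is 3.50' 'price is 3x50' 'error[42] raised' > "$tmp"

# Basic regex: "." matches any single character, so both price lines match.
regex_count=$(grep -c '3.50' "$tmp")

# Fixed string: "." is a literal dot, so only the first line matches.
fixed_count=$(grep -cF '3.50' "$tmp")

# "error[" has an unclosed bracket as a regex, but is a valid fixed string.
bracket_count=$(grep -cF 'error[' "$tmp")

echo "regex=$regex_count fixed=$fixed_count bracket=$bracket_count"
rm -f "$tmp"
```

When the search text comes from user input or a log line, -F is usually the safer default because nothing in it can be interpreted as a metacharacter.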
sed — Stream Editor
Non-interactive line editing and filtering:
# Print specific lines
sed -n '10,20p' file.txt
sed -n '/error/p' app.log # lines matching pattern
# Delete lines
sed '/^$/d' file.txt # blank lines
sed '/^#/d' config.conf # comments
# Substitute (first occurrence per line)
sed 's/foo/bar/' file.txt
sed 's/http:/https:/g' config.conf # global replace
# In-place edit (GNU sed — Linux default)
sed -i 's/old/new/g' file.txt
sed -i.bak 's/old/new/g' file.txt # backup first
# Multiple expressions
sed -e 's/ERROR/ERR/g' -e '/^DEBUG/d' app.log
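macOS/BSD sed: -i requires a separate backup-suffix argument, so pass an empty one for no backup: sed -i '' 's/old/new/' file.txt. GNU sed expects the suffix attached directly to -i instead.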
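One way to sidestep the GNU/BSD difference, sketched below, is to always attach a backup suffix directly to -i (no space), a form both implementations accept, and delete the backup afterwards. The helper name `sed_inplace` and the sample data are made up for this example.

```shell
#!/bin/sh
# sed_inplace EXPRESSION FILE
# Hypothetical helper: edits FILE in place using the -iSUFFIX form, which
# GNU sed and BSD/macOS sed both accept, then removes the backup copy.
sed_inplace() {
    sed -i.bak "$1" "$2" && rm -f "$2.bak"
}

# Demo on a temp file (contents invented for illustration).
demo=$(mktemp)
printf '%s\n' 'url=http://example.com' > "$demo"
sed_inplace 's/http:/https:/' "$demo"
result=$(cat "$demo")
echo "$result"
rm -f "$demo"
```

Because the backup is removed only when sed succeeds, a failed edit leaves the .bak file behind for inspection.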
sed Address Ranges
sed '1,5d' file.txt # delete lines 1-5
sed '/START/,/END/d' log.txt # delete from START through END, marker lines included
sed '0,/pattern/s//replacement/' file # replace first match in the file (0 address is GNU sed only)
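Pattern ranges are inclusive, which is easy to forget. The sketch below, run on invented input, contrasts deleting the whole range with deleting only the lines between the markers.

```shell
#!/bin/sh
# Sample input invented for illustration.
input=$(printf '%s\n' 'header' 'START' 'noise' 'END' 'footer')

# A /START/,/END/ range is inclusive: both marker lines are deleted too.
range_deleted=$(printf '%s\n' "$input" | sed '/START/,/END/d')

# To keep the markers and drop only what lies between them, nest
# negated addresses inside the range so the marker lines are spared.
inner_deleted=$(printf '%s\n' "$input" | sed '/START/,/END/{/START/!{/END/!d;};}')

printf '%s\n---\n%s\n' "$range_deleted" "$inner_deleted"
```

The first command leaves only header and footer; the second also keeps the START and END lines.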
awk — Column-Oriented Processing
awk treats input as records (lines) and fields (columns):
# Default FS = whitespace
awk '{ print $1, $3 }' data.txt
# Custom field separator
awk -F: '{ print $1 }' /etc/passwd # usernames
awk -F, '{ print $2 }' data.csv # CSV column 2 (assumes no quoted commas inside fields)
# Conditional printing
awk -F: '$3 == 0 { print $1 }' /etc/passwd # UID 0 users
# Aggregation
awk '{ sum += $2; count++ } END { if (count) print sum/count }' numbers.txt # guard against empty input
Log Analysis with awk
# Count HTTP status codes (column 9 in combined log format)
awk '{ codes[$9]++ } END { for (c in codes) print c, codes[c] }' access.log \
| sort -k2 -nr
# Average response time, assuming it is the last column
awk '{ sum += $NF; n++ } END { if (n) print sum/n }' timings.log
# Print lines where field 5 > 1000
awk '$5 > 1000 { print $0 }' metrics.log
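Because these one-liners depend on field positions, a quick sanity check before aggregating can catch a changed log format. The sketch below assumes the status code is expected in field 9 and counts lines where that field is not a three-digit number; the sample log lines are invented.

```shell
#!/bin/sh
# Two well-formed combined-format lines and one with an extra leading
# field, which shifts the status code out of column 9. All data invented.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /b HTTP/1.1" 502 0
web01 10.0.0.3 - - [01/Jan/2024:10:00:02 +0000] "GET /c HTTP/1.1" 200 64
EOF

# Count lines whose field 9 is not exactly three digits.
# "n + 0" forces a numeric 0 when no line was counted.
bad=$(awk '$9 !~ /^[0-9][0-9][0-9]$/ { n++ } END { print n + 0 }' "$log")
echo "lines with unexpected field 9: $bad"
rm -f "$log"
```

A non-zero count is a hint to inspect the format before trusting any per-status report.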
Combining Tools
# Top 10 client IPs
awk '{ print $1 }' access.log | sort | uniq -c | sort -nr | head -10
# Extract errors, strip timestamps, deduplicate
grep -i error app.log \
| sed 's/^\[[^]]*\] //' \
| sort -u
# CSV: sum column 3 for rows where column 1 = "US"
awk -F, '$1 == "US" { sum += $3 } END { print sum }' sales.csv
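# Multi-step pipeline with intermediate filter
# (-h drops the file-name prefix zgrep adds when given several files, which would otherwise corrupt $1)
zgrep -h "POST" /var/log/nginx/access.log.*.gz \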
| awk '$9 ~ /^5/ { print $1, $7, $9 }' \
| sort | uniq -c | sort -nr | head -20
When to Use Which
| Tool | Strength | Avoid when |
|---|---|---|
| grep | Fast line filtering | Complex field math |
| sed | One-off edits, simple transforms | Multi-column reports |
| awk | Reports, aggregations, field logic | Simple pattern match (use grep) |
Modern alternatives: ripgrep (rg) for search, jq for JSON, csvkit for CSV.
Performance Tips
# Fixed string faster than regex
grep -F "literal string" huge.log
# Limit input early
tail -n 100000 huge.log | awk '...'
# Parallel grep across multiple files (-n1 gives each grep one file; output order is not guaranteed)
printf '%s\0' /var/log/app/*.log | xargs -0 -n1 -P4 grep -h "pattern"
# Use zgrep for compressed logs
zgrep "error" /var/log/nginx/access.log.*.gz
For multi-GB logs, consider dedicated tools: lnav, goaccess, or shipping to Loki/ELK.
Best Practices
| Practice | Reason |
|---|---|
| Test sed on a copy first | In-place edits overwrite the file and cannot be undone without a backup |
| Single-quote awk programs in shell | Prevents the shell from expanding $1, $NF, and similar |
| Validate log column positions | Format changes break field indexes |
| Use -F for structured data | Explicit delimiter beats whitespace assumptions |
# Quote awk for shell safety
awk -F: '{ print $1 }' /etc/passwd
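Single quotes also mean shell variables are not expanded inside the awk program. A common way to pass a value in, sketched here with a made-up variable name `min_uid` and invented passwd-style data, is awk's -v option rather than splicing the variable into the quoted text.

```shell
#!/bin/sh
# passwd-style sample data (invented), so the example does not depend
# on the accounts present on the machine running it.
data=$(printf '%s\n' \
    'root:x:0:0::/root:/bin/sh' \
    'daemon:x:1:1::/:/usr/sbin/nologin' \
    'alice:x:1000:1000::/home/alice:/bin/bash')

min_uid=1000   # shell variable we want awk to see

# -v copies the shell value into an awk variable; the program itself
# stays fully single-quoted, so $1 and $3 reach awk untouched.
users=$(printf '%s\n' "$data" | awk -F: -v min="$min_uid" '$3 >= min { print $1 }')
echo "$users"
```

Keeping the program in one single-quoted string also avoids the quoting mistakes that appear when a script is built from concatenated pieces.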
Common Mistakes
| Mistake | Consequence |
|---|---|
| Wrong awk field number after log format change | Silent wrong reports |
| GNU vs BSD sed -i syntax | Accidental backup files or failed edits |
| Grepping huge files without limit | Minutes of I/O on production disk |
| Regex special chars unescaped | False matches or sed errors |
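The last row is easy to reproduce. The sketch below shows an unescaped IP address over-matching in a sed substitution, and one possible fix: a hypothetical helper, `escape_bre`, that backslash-escapes basic-regex metacharacters before the text is used as a pattern. The sample line is invented.

```shell
#!/bin/sh
# escape_bre STRING
# Hypothetical helper: backslash-escapes the characters that are special
# in a basic regex (plus "/", the delimiter used below) so STRING matches
# literally. "|" is used as the delimiter here to keep "/" unambiguous.
escape_bre() {
    printf '%s' "$1" | sed 's|[][\.*^$/]|\\&|g'
}

line='connect to 10.0.0.1 and 10a0b0c1'   # invented sample text
needle='10.0.0.1'

# Unescaped: each "." matches any character, so both tokens are replaced.
naive=$(printf '%s\n' "$line" | sed "s/$needle/HOST/g")

# Escaped: only the literal address is replaced.
safe=$(printf '%s\n' "$line" | sed "s/$(escape_bre "$needle")/HOST/g")

printf '%s\n%s\n' "$naive" "$safe"
```

The helper covers the pattern side only; a replacement string containing & or the delimiter would need its own escaping.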
Troubleshooting
sed not changing file: Missing -i flag — output goes to stdout only.
awk empty output: Check -F delimiter; CSV may need -F',' not default whitespace.
grep too slow on compressed logs: Use zgrep or decompress to tmpfs first.
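For the awk case, printing NF is a quick way to see how many fields awk actually found. A minimal sketch on an invented CSV row:

```shell
#!/bin/sh
csv_line='US,widget,19.99'   # invented sample row

# With the default separator the whole row is a single field.
default_nf=$(printf '%s\n' "$csv_line" | awk '{ print NF }')

# With -F, the row splits into three fields.
comma_nf=$(printf '%s\n' "$csv_line" | awk -F, '{ print NF }')

echo "default NF=$default_nf, comma NF=$comma_nf"
```

If NF is 1 where several columns are expected, the separator is the first thing to check.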
Production Scenario
During a traffic spike, ops needs top failing endpoints in the last hour:
# Assume ISO timestamp in field 4, in the same time zone as the shell; adjust for your format (date -d requires GNU date)
awk -v cutoff="$(date -d '1 hour ago' '+%Y-%m-%dT%H:%M:%S')" \
'$4 >= cutoff && $9 ~ /^5/ { urls[$7]++ }
END { for (u in urls) print urls[u], u }' /var/log/nginx/access.log \
| sort -nr | head -15
Results feed a Slack alert; root cause traced to a single API route returning 502 from upstream.
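The pipeline can be rehearsed on synthetic data before it is needed in an incident. The sketch below uses an invented log whose field 4 holds an ISO timestamp, as the scenario assumes, and a hard-coded cutoff in place of the date call so the result is reproducible.

```shell
#!/bin/sh
# Invented log: field 4 = ISO timestamp, field 7 = path, field 9 = status.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - 2024-01-01T09:59:59 +0000 "GET /api/orders HTTP/1.1" 502 0
10.0.0.2 - - 2024-01-01T10:15:00 +0000 "GET /api/orders HTTP/1.1" 502 0
10.0.0.3 - - 2024-01-01T10:20:00 +0000 "GET /api/orders HTTP/1.1" 502 0
10.0.0.4 - - 2024-01-01T10:25:00 +0000 "GET /api/users HTTP/1.1" 500 0
10.0.0.5 - - 2024-01-01T10:30:00 +0000 "GET /health HTTP/1.1" 200 2
EOF

# Fixed cutoff for the rehearsal; ISO timestamps sort correctly as strings.
cutoff='2024-01-01T10:00:00'

# Count 5xx responses per URL at or after the cutoff, busiest first.
report=$(awk -v cutoff="$cutoff" '
    $4 >= cutoff && $9 ~ /^5/ { urls[$7]++ }
    END { for (u in urls) print urls[u], u }' "$log" | sort -nr | head -15)
echo "$report"
rm -f "$log"
```

The 09:59:59 request falls before the cutoff and the 200 response is not a failure, so the report lists /api/orders with two hits and /api/users with one.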
Master grep, sed, and awk and most log investigations become a one-liner pipeline — the fastest path from “something is wrong” to “here is the evidence.”