Why Text Processing Matters

Linux administration is mostly reading and transforming text: logs, configs, CSV exports, command output. grep, sed, and awk cover most ad-hoc analysis without loading the data into a database.

grep — Search Patterns

  grep "failed" /var/log/auth.log
grep -n "error" app.log           # line numbers
grep -v "DEBUG" app.log           # invert match (exclude)
grep -c "200" access.log          # count matching lines
grep -w "root" /etc/passwd        # whole word only
grep -i "warning" app.log         # case insensitive

# Recursive with file filter
grep -r --include="*.conf" "listen" /etc/
grep -r --exclude-dir=.git "TODO" .

# Context lines
grep -C 3 "Exception" app.log     # 3 lines before and after
grep -B 2 -A 5 "FATAL" app.log
  

Extended and Fixed-String Modes

  grep -E "^(error|warn|fatal)" syslog     # extended regex (ERE)
grep -E "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}" access.log

grep -F "error[" app.log                 # fixed string (no regex; faster)
grep -Ff patterns.txt access.log         # patterns from file
  

sed — Stream Editor

Non-interactive line editing and filtering:

  # Print specific lines
sed -n '10,20p' file.txt
sed -n '/error/p' app.log                # lines matching pattern

# Delete lines
sed '/^$/d' file.txt                     # blank lines
sed '/^#/d' config.conf                  # comments

# Substitute (first occurrence per line)
sed 's/foo/bar/' file.txt
sed 's/http:/https:/g' config.conf       # global replace

# In-place edit (GNU sed — Linux default)
sed -i 's/old/new/g' file.txt
sed -i.bak 's/old/new/g' file.txt        # backup first

# Multiple expressions
sed -e 's/ERROR/ERR/g' -e '/^DEBUG/d' app.log
  

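macOS/BSD sed: -i requires a backup-suffix argument, so pass an empty one for no backup: sed -i '' 's/old/new/' file.txt. This form differs from GNU sed and the two are not interchangeable.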
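
One way to sidestep the difference is to attach a backup suffix directly to -i (no space), a form both GNU and BSD sed accept, and then delete the backup. The following is a minimal sketch; replace_in_place is a hypothetical helper name, not a standard command.

```shell
# Hypothetical helper: in-place substitution that behaves the same on GNU and BSD sed.
# Usage: replace_in_place 's/old/new/g' file
replace_in_place() {
    expr=$1
    file=$2
    # -i.bak (suffix attached, no space) is understood by both implementations;
    # the backup is removed only if sed succeeded
    sed -i.bak "$expr" "$file" && rm -f "$file.bak"
}

# Demo on a throwaway file
demo=$(mktemp)
printf 'old value\nkeep\n' > "$demo"
replace_in_place 's/old/new/g' "$demo"
cat "$demo"
rm -f "$demo"
```

If sed fails, the .bak file is left in place, which is usually what you want when an edit goes wrong.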

sed Address Ranges

  sed '1,5d' file.txt                      # delete lines 1-5
sed '/START/,/END/d' log.txt             # delete range between markers
sed '0,/pattern/s//replacement/' file    # replace first match only (GNU sed; the 0 address is an extension)
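
A marker range includes both marker lines, and if the closing marker never appears the range runs to the end of the file. The sketch below shows the inclusive behaviour on made-up input; the sample lines and variable names are illustrative only.

```shell
# Made-up input with a marked block
sample=$(mktemp)
printf '%s\n' 'keep 1' 'START' 'drop me' 'END' 'keep 2' > "$sample"

outside=$(sed '/START/,/END/d' "$sample")      # delete the range, markers included
inside=$(sed -n '/START/,/END/p' "$sample")    # print only the range

printf 'outside:\n%s\n' "$outside"   # keep 1, keep 2
printf 'inside:\n%s\n' "$inside"     # START, drop me, END
rm -f "$sample"
```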
  

awk — Column-Oriented Processing

awk treats input as records (lines) and fields (columns):

  # Default FS = whitespace
awk '{ print $1, $3 }' data.txt

# Custom field separator
awk -F: '{ print $1 }' /etc/passwd       # usernames
awk -F, '{ print $2 }' data.csv          # CSV column 2 (assumes no quoted commas)

# Conditional printing
awk -F: '$3 == 0 { print $1 }' /etc/passwd   # UID 0 users

# Aggregation
awk '{ sum += $2; count++ } END { if (count) print sum/count }' numbers.txt
  

Log Analysis with awk

  # Count HTTP status codes (column 9 in combined log format)
awk '{ codes[$9]++ } END { for (c in codes) print c, codes[c] }' access.log \
    | sort -k2 -nr

# Average response time, assuming it is the last column
awk '{ sum += $NF; n++ } END { if (n) print sum/n }' timings.log

# Print lines where field 5 > 1000
awk '$5 > 1000 { print $0 }' metrics.log
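
Because these one-liners depend on field positions, it can help to number the fields of one real line before trusting $7 or $9. A small sketch follows; show_fields is a hypothetical helper, and the log line in the demo is made up in the common combined-log layout.

```shell
# Hypothetical helper: print every whitespace-separated field of the first line with its index
show_fields() {
    head -n 1 "$1" | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# Demo with a made-up access log line
log=$(mktemp)
printf '%s\n' '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326' > "$log"
show_fields "$log"
rm -f "$log"
```

For this sample, field 7 is the request path and field 9 is the status code. A custom log format can shift both.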
  

Combining Tools

  # Top 10 client IPs
awk '{ print $1 }' access.log | sort | uniq -c | sort -nr | head -10

# Extract errors, strip timestamps, deduplicate
grep -i error app.log \
    | sed 's/^\[[^]]*\] //' \
    | sort -u

# CSV: sum column 3 for rows where column 1 = "US"
awk -F, '$1 == "US" { sum += $3 } END { print sum }' sales.csv

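# Multi-step pipeline with intermediate filter
# (-h drops the "file:" prefix that zgrep adds when given several files, which would otherwise end up in $1)
zgrep -h "POST" /var/log/nginx/access.log.*.gz \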
    | awk '$9 ~ /^5/ { print $1, $7, $9 }' \
    | sort | uniq -c | sort -nr | head -20
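
To see what these pipelines produce, here is a sketch that runs two of them on a small made-up log. The function names top_ips and failing_paths and all sample data are illustrative assumptions.

```shell
# Made-up access log: client IP in field 1, request path in field 7, status in field 9
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 512
203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "POST /api HTTP/1.1" 502 0
198.51.100.7 - - [10/Oct/2024:13:55:38 +0000] "GET /b HTTP/1.1" 200 128
203.0.113.5 - - [10/Oct/2024:13:55:39 +0000] "POST /api HTTP/1.1" 502 0
EOF

# Requests per client IP, busiest first
top_ips() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -nr | head -n 10
}

# 5xx responses grouped by request path, most frequent first
failing_paths() {
    awk '$9 ~ /^5/ { print $7 }' "$1" | sort | uniq -c | sort -nr
}

top_ips "$log"          # 203.0.113.5 appears 3 times, 198.51.100.7 once
failing_paths "$log"    # /api appears twice
rm -f "$log"
```

The count column from uniq -c is padded with leading spaces whose width varies between implementations, so pipe it through awk if the output has to be parsed further.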
  

When to Use Which

Tool | Strength | Avoid when
grep | Fast line filtering | Complex field math
sed | One-off edits, simple transforms | Multi-column reports
awk | Reports, aggregations, field logic | Simple pattern match (use grep)

Modern alternatives: ripgrep (rg) for search, jq for JSON, csvkit for CSV.

Performance Tips

  # Fixed string faster than regex
grep -F "literal string" huge.log

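# Limit input early
tail -n 100000 huge.log | awk '...'

# Parallel grep with xargs: one file per process, four at a time
# (without -n 1, xargs hands every file to a single grep; output order is not preserved)
printf '%s\0' /var/log/app/*.log | xargs -0 -n 1 -P 4 grep -h "pattern"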

# Use zgrep for compressed logs
zgrep "error" /var/log/nginx/access.log.*.gz
  

For multi-GB logs, consider dedicated tools: lnav, goaccess, or shipping to Loki/ELK.

Best Practices

Practice | Reason
Test sed on a copy first | In-place edits cannot be undone unless you kept a backup (-i.bak)
Single-quote awk programs | Stops the shell from expanding $1, $2, ... before awk sees them
Validate log column positions | Format changes break field indexes
Use -F for structured data | An explicit delimiter beats whitespace assumptions
  # Quote awk for shell safety
awk -F: '{ print $1 }' /etc/passwd
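
The difference the quoting makes can be seen in a short sketch. The function names double_quoted and field_of are hypothetical, and the sample line is made up.

```shell
line='alice:x:1000'

# Single quotes: awk receives $1 literally and prints the first field
printf '%s\n' "$line" | awk -F: '{ print $1 }'      # prints: alice

# Double quotes: the shell substitutes its own $1 first. Inside a function called
# with no arguments that is empty, so awk runs "{ print  }" and echoes the whole line.
double_quoted() {
    awk -F: "{ print $1 }"
}
printf '%s\n' "$line" | double_quoted               # prints: alice:x:1000

# To pass a shell value into awk, use -v and keep the program single-quoted
field_of() {
    printf '%s\n' "$1" | awk -F: -v n="$2" '{ print $n }'
}
field_of "$line" 3                                  # prints: 1000
```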
  

Common Mistakes

Mistake | Consequence
Wrong awk field number after log format change | Silent wrong reports
GNU vs BSD sed -i syntax | Accidental backup files or failed edits
Grepping huge files without limit | Minutes of I/O on production disk
Regex special chars unescaped | False matches or sed errors
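
The last row is easy to reproduce: an unescaped dot matches any character, so an IP address used as a pattern also matches unrelated strings. A minimal sketch with made-up sample lines; count_regex and count_literal are hypothetical helper names.

```shell
# Count matching lines, treating the pattern as a regex or as a literal string
count_regex()   { grep -c -- "$1" "$2"; }
count_literal() { grep -cF -- "$1" "$2"; }

sample=$(mktemp)
printf '%s\n' 'client 10.0.0.1 ok' 'build 10a0b0c1 ok' > "$sample"

count_regex   '10.0.0.1'    "$sample"   # 2: each dot matches any character
count_regex   '10\.0\.0\.1' "$sample"   # 1: dots escaped
count_literal '10.0.0.1'    "$sample"   # 1: -F disables regex interpretation
rm -f "$sample"
```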

Troubleshooting

sed not changing file: Missing -i flag — output goes to stdout only.
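
A quick way to confirm this is to compare the file before and after, as in the sketch below. The file contents and variable names are made up for illustration.

```shell
conf=$(mktemp)
printf 'mode=old\n' > "$conf"

preview=$(sed 's/old/new/' "$conf")   # without -i the result only goes to stdout
unchanged=$(cat "$conf")              # the file still holds the old text

sed -i.bak 's/old/new/' "$conf"       # -i rewrites the file; .bak keeps the original
changed=$(cat "$conf")

printf 'preview:   %s\nbefore -i: %s\nafter -i:  %s\n' "$preview" "$unchanged" "$changed"
rm -f "$conf" "$conf.bak"
```

Running sed without -i first doubles as a dry run: inspect the preview, then repeat the command with -i.bak.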

awk empty output: Check -F delimiter; CSV may need -F',' not default whitespace.
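
Printing NF is a fast check for this. The sketch below assumes a simple CSV without quoted commas; second_column is a hypothetical helper and the data is made up.

```shell
csv=$(mktemp)
printf '%s\n' 'US,widget,10' 'DE,gadget,20' > "$csv"

# Default whitespace splitting: each line is a single field, so $2 is empty
awk '{ print NF, "[" $2 "]" }' "$csv"
# prints:
# 1 []
# 1 []

# With the delimiter set, $2 is the second column
second_column() {
    awk -F',' '{ print $2 }' "$1"
}
second_column "$csv"
# prints:
# widget
# gadget
rm -f "$csv"
```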

grep too slow on compressed logs: Use zgrep to search .gz files directly; for repeated searches, decompress once to tmpfs and grep the plain copy.

Production Scenario

During a traffic spike, ops needs top failing endpoints in the last hour:

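  # Assumes an ISO 8601 timestamp in field 4, in the same timezone as the server clock,
# so plain string comparison is chronological; adjust for your format.
# date -d is GNU date syntax. The cutoff keeps minutes and seconds so the window is exactly one hour.
awk -v cutoff="$(date -d '1 hour ago' '+%Y-%m-%dT%H:%M:%S')" \
    '$4 >= cutoff && $9 ~ /^5/ { urls[$7]++ }
     END { for (u in urls) print urls[u], u }' /var/log/nginx/access.log \
    | sort -nr | head -15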
  

The results feed a Slack alert, and the root cause is traced to a single API route returning 502 from its upstream.

Master grep, sed, and awk and most log investigations become a one-liner pipeline — the fastest path from “something is wrong” to “here is the evidence.”