Joins and Subqueries
JOIN Fundamentals
Joins combine rows from two or more tables based on related columns. The join condition belongs in ON (for INNER/LEFT/RIGHT) — put row filters in WHERE.
-- Correct: join condition in ON, filter in WHERE
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.total > 100 AND o.status = 'completed';
INNER JOIN
Returns only rows where the join condition matches in both tables.
SELECT u.id, u.name, o.id AS order_id, o.total, o.created_at
FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.created_at >= '2024-01-01'
ORDER BY o.created_at DESC;
If a user has no orders, they do not appear in the result.
LEFT (OUTER) JOIN
Returns all rows from the left table; unmatched right-side columns are NULL.
SELECT u.name, COUNT(o.id) AS order_count, COALESCE(SUM(o.total), 0) AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id AND o.status = 'completed'
GROUP BY u.id, u.name
ORDER BY total_spent DESC;
Filter on the right table in ON (not WHERE) when you want to preserve unmatched left rows:
-- Wrong: converts LEFT JOIN to INNER JOIN
SELECT u.name FROM users u LEFT JOIN orders o ON u.id = o.user_id WHERE o.status = 'completed';
-- Right: filter in ON clause
SELECT u.name FROM users u LEFT JOIN orders o ON u.id = o.user_id AND o.status = 'completed';
RIGHT JOIN and CROSS JOIN
-- RIGHT JOIN — rarely used; swap tables and use LEFT JOIN instead
SELECT * FROM orders o RIGHT JOIN users u ON o.user_id = u.id;
-- CROSS JOIN — Cartesian product (every combination)
SELECT p.name AS product, c.name AS category
FROM products p
CROSS JOIN categories c
WHERE p.category_id = c.id; -- usually equivalent to INNER JOIN
Self Join
SELECT e.name AS employee, m.name AS manager, e.department
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id
WHERE e.status = 'active';
Useful for hierarchies: org charts, category trees, threaded comments.
Multiple Joins
SELECT
u.name,
o.id AS order_id,
p.name AS product,
oi.quantity,
oi.unit_price
FROM users u
INNER JOIN orders o ON u.id = o.user_id
INNER JOIN order_items oi ON o.id = oi.order_id
INNER JOIN products p ON oi.product_id = p.id
WHERE o.created_at >= CURDATE() - INTERVAL 30 DAY;
Join order is chosen by the optimizer — ensure indexes exist on all join columns.
Subqueries
Scalar Subquery
SELECT name, (
SELECT COUNT(*) FROM orders WHERE user_id = u.id
) AS order_count
FROM users u;
Correlated subqueries execute once per outer row — can be slow. Rewrite as JOIN when possible.
IN Subquery
SELECT * FROM products
WHERE category_id IN (
SELECT id FROM categories WHERE active = 1
);
For large inner result sets, prefer EXISTS:
SELECT * FROM products p
WHERE EXISTS (
SELECT 1 FROM categories c
WHERE c.id = p.category_id AND c.active = 1
);
EXISTS short-circuits on first match and often performs better.
Derived Tables (Subquery in FROM)
SELECT tier, COUNT(*) AS user_count, AVG(avg_total) AS mean_avg
FROM (
SELECT u.id, u.tier, AVG(o.total) AS avg_total
FROM users u
INNER JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.tier
) AS user_stats
GROUP BY tier;
Common Table Expressions (CTEs)
MySQL 8.0+ supports WITH clauses:
WITH monthly_revenue AS (
SELECT
DATE_FORMAT(created_at, '%Y-%m') AS month,
SUM(total) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE_FORMAT(created_at, '%Y-%m')
)
SELECT month, revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue;
Recursive CTEs handle hierarchies:
WITH RECURSIVE category_tree AS (
SELECT id, name, parent_id, 0 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.depth + 1
FROM categories c
INNER JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY depth, name;
Window Functions
SELECT
user_id,
id AS order_id,
total,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn,
SUM(total) OVER (PARTITION BY user_id) AS user_lifetime_total
FROM orders;
UNION and UNION ALL
SELECT email, 'user' AS source FROM users WHERE status = 'active'
UNION
SELECT email, 'subscriber' AS source FROM newsletter_subscribers;
-- UNION ALL keeps duplicates — faster when duplicates are acceptable
SELECT id FROM active_users
UNION ALL
SELECT id FROM archived_users;
Performance Tips
- Index all columns used in
JOINandWHERE - Avoid
SELECT *in joins — fetch only needed columns - Replace correlated subqueries with JOINs when the subquery runs per row
- Use
EXPLAIN ANALYZEto compare join strategies - Limit result sets — do not join millions of rows without filters
Common Mistakes
| Mistake | Fix |
|---|---|
| Filter on right table in WHERE with LEFT JOIN | Move filter to ON clause |
| Subquery returns multiple rows in scalar context | Add LIMIT 1 or use aggregation |
| Missing index on FK column | Add index matching foreign key |
| Implicit join (comma syntax) | Use explicit INNER JOIN ... ON |
Troubleshooting
Unexpected duplicate rows:
-- Diagnose: join may be one-to-many
SELECT o.id, COUNT(*) FROM orders o JOIN order_items oi ON o.id = oi.order_id GROUP BY o.id HAVING COUNT(*) > 1;
-- Fix: aggregate order_items first, then join
Production Scenario
A dashboard query aggregates daily revenue by product category using a CTE for filtered orders, a JOIN to products and categories, and a window function for running totals. The query runs against a read replica with a 15-minute cached materialized summary table refreshed by a background job.
Joins and subqueries are how relational databases deliver their core value — combining normalized data into meaningful business answers.