JOIN Fundamentals

Joins combine rows from two or more tables based on related columns. The join condition belongs in ON (for INNER/LEFT/RIGHT) — put row filters in WHERE.

  -- Correct: join condition in ON, filter in WHERE
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.total > 100 AND o.status = 'completed';
  

INNER JOIN

Returns only rows where the join condition matches in both tables.

  SELECT u.id, u.name, o.id AS order_id, o.total, o.created_at
FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.created_at >= '2024-01-01'
ORDER BY o.created_at DESC;
  

If a user has no orders, they do not appear in the result.

LEFT (OUTER) JOIN

Returns all rows from the left table; unmatched right-side columns are NULL.

  SELECT u.name, COUNT(o.id) AS order_count, COALESCE(SUM(o.total), 0) AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id AND o.status = 'completed'
GROUP BY u.id, u.name
ORDER BY total_spent DESC;
  

Filter on the right table in ON (not WHERE) when you want to preserve unmatched left rows:

  -- Wrong: converts LEFT JOIN to INNER JOIN
SELECT u.name FROM users u LEFT JOIN orders o ON u.id = o.user_id WHERE o.status = 'completed';

-- Right: filter in ON clause
SELECT u.name FROM users u LEFT JOIN orders o ON u.id = o.user_id AND o.status = 'completed';
  

RIGHT JOIN and CROSS JOIN

  -- RIGHT JOIN — rarely used; swap tables and use LEFT JOIN instead
SELECT * FROM orders o RIGHT JOIN users u ON o.user_id = u.id;

-- CROSS JOIN — Cartesian product (every combination)
SELECT p.name AS product, c.name AS category
FROM products p
CROSS JOIN categories c
WHERE p.category_id = c.id;  -- usually equivalent to INNER JOIN
  

Self Join

  SELECT e.name AS employee, m.name AS manager, e.department
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id
WHERE e.status = 'active';
  

Useful for hierarchies: org charts, category trees, threaded comments.

Multiple Joins

  SELECT
    u.name,
    o.id AS order_id,
    p.name AS product,
    oi.quantity,
    oi.unit_price
FROM users u
INNER JOIN orders o ON u.id = o.user_id
INNER JOIN order_items oi ON o.id = oi.order_id
INNER JOIN products p ON oi.product_id = p.id
WHERE o.created_at >= CURDATE() - INTERVAL 30 DAY;
  

Join order is chosen by the optimizer — ensure indexes exist on all join columns.

Subqueries

Scalar Subquery

  SELECT name, (
    SELECT COUNT(*) FROM orders WHERE user_id = u.id
) AS order_count
FROM users u;
  

Correlated subqueries execute once per outer row — can be slow. Rewrite as JOIN when possible.

IN Subquery

  SELECT * FROM products
WHERE category_id IN (
    SELECT id FROM categories WHERE active = 1
);
  

For large inner result sets, prefer EXISTS:

  SELECT * FROM products p
WHERE EXISTS (
    SELECT 1 FROM categories c
    WHERE c.id = p.category_id AND c.active = 1
);
  

EXISTS short-circuits on first match and often performs better.

Derived Tables (Subquery in FROM)

  SELECT tier, COUNT(*) AS user_count, AVG(avg_total) AS mean_avg
FROM (
    SELECT u.id, u.tier, AVG(o.total) AS avg_total
    FROM users u
    INNER JOIN orders o ON u.id = o.user_id
    GROUP BY u.id, u.tier
) AS user_stats
GROUP BY tier;
  

Common Table Expressions (CTEs)

MySQL 8.0+ supports WITH clauses:

  WITH monthly_revenue AS (
    SELECT
        DATE_FORMAT(created_at, '%Y-%m') AS month,
        SUM(total) AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY DATE_FORMAT(created_at, '%Y-%m')
)
SELECT month, revenue,
    revenue - LAG(revenue) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue;
  

Recursive CTEs handle hierarchies:

  WITH RECURSIVE category_tree AS (
    SELECT id, name, parent_id, 0 AS depth
    FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, c.name, c.parent_id, ct.depth + 1
    FROM categories c
    INNER JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY depth, name;
  

Window Functions

  SELECT
    user_id,
    id AS order_id,
    total,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn,
    SUM(total) OVER (PARTITION BY user_id) AS user_lifetime_total
FROM orders;
  

UNION and UNION ALL

  SELECT email, 'user' AS source FROM users WHERE status = 'active'
UNION
SELECT email, 'subscriber' AS source FROM newsletter_subscribers;

-- UNION ALL keeps duplicates — faster when duplicates are acceptable
SELECT id FROM active_users
UNION ALL
SELECT id FROM archived_users;
  

Performance Tips

  1. Index all columns used in JOIN and WHERE
  2. Avoid SELECT * in joins — fetch only needed columns
  3. Replace correlated subqueries with JOINs when the subquery runs per row
  4. Use EXPLAIN ANALYZE to compare join strategies
  5. Limit result sets — do not join millions of rows without filters

Common Mistakes

Mistake Fix
Filter on right table in WHERE with LEFT JOIN Move filter to ON clause
Subquery returns multiple rows in scalar context Add LIMIT 1 or use aggregation
Missing index on FK column Add index matching foreign key
Implicit join (comma syntax) Use explicit INNER JOIN ... ON

Troubleshooting

Unexpected duplicate rows:

  -- Diagnose: join may be one-to-many
SELECT o.id, COUNT(*) FROM orders o JOIN order_items oi ON o.id = oi.order_id GROUP BY o.id HAVING COUNT(*) > 1;
-- Fix: aggregate order_items first, then join
  

Production Scenario

A dashboard query aggregates daily revenue by product category using a CTE for filtered orders, a JOIN to products and categories, and a window function for running totals. The query runs against a read replica with a 15-minute cached materialized summary table refreshed by a background job.

Joins and subqueries are how relational databases deliver their core value — combining normalized data into meaningful business answers.