Schema Design Principles

Good schema design balances normalization (reduce duplication), query performance (strategic denormalization), and operational simplicity (migrations, backups). Start normalized; denormalize only when profiling proves a bottleneck.

Numeric Types

  CREATE TABLE metrics (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    product_id INT UNSIGNED NOT NULL,
    view_count INT UNSIGNED NOT NULL DEFAULT 0,
    price DECIMAL(10, 2) NOT NULL,
    tax_rate DECIMAL(5, 4) NOT NULL DEFAULT 0.0825,
    weight_kg DOUBLE,
    is_featured TINYINT(1) NOT NULL DEFAULT 0
);
  
Type Storage Use Case
TINYINT 1 byte Flags, small enums (0–255)
INT / BIGINT 4 / 8 bytes IDs, counts — use UNSIGNED for non-negative
DECIMAL(p,s) Variable Money, exact fractions — never FLOAT for currency
FLOAT / DOUBLE 4 / 8 bytes Scientific approximations only
  -- Wrong: floating point money
price FLOAT   -- 0.1 + 0.2 != 0.3

-- Right: fixed precision
price DECIMAL(10, 2)
  

Strings and Text

  CREATE TABLE articles (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    slug VARCHAR(64) NOT NULL,
    title VARCHAR(255) NOT NULL,
    summary VARCHAR(500),
    body MEDIUMTEXT,
    metadata JSON,
    UNIQUE KEY uk_slug (slug),
    FULLTEXT KEY ft_title_body (title, body)
);
  
Type Max Length Use Case
CHAR(n) Fixed n Country codes (US), fixed tokens
VARCHAR(n) Up to 65,535 bytes Names, emails, slugs
TEXT / MEDIUMTEXT / LONGTEXT 64KB / 16MB / 4GB Long content
JSON Document storage Flexible attributes with indexing

Index only the prefix of long VARCHAR columns when necessary:

  CREATE INDEX idx_body_prefix ON articles (body(100));
  

Date and Time

  CREATE TABLE events (
    id INT UNSIGNED PRIMARY KEY,
    starts_at DATETIME(3) NOT NULL,
    ends_at DATETIME(3) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    timezone VARCHAR(50) DEFAULT 'UTC'
);
  
Type Range Notes
DATE 1000-01-01 to 9999-12-31 Date only
TIME Time of day Duration or clock time
DATETIME 1000 to 9999 No timezone conversion — store UTC
TIMESTAMP 1970–2038 (32-bit) Converts to session timezone

Best practice: store UTC in DATETIME; convert to local timezone in application layer.

ENUM and SET

  CREATE TABLE orders (
    id INT UNSIGNED PRIMARY KEY,
    status ENUM('pending', 'paid', 'shipped', 'delivered', 'cancelled') DEFAULT 'pending',
    flags SET('gift', 'fragile', 'express') DEFAULT ''
);
  

ENUM is compact but altering values requires DDL. For frequently changing options, use a lookup table:

  CREATE TABLE order_statuses (
    id TINYINT UNSIGNED PRIMARY KEY,
    code VARCHAR(20) NOT NULL UNIQUE,
    label VARCHAR(50) NOT NULL
);
  

JSON Columns (MySQL 8.0+)

  CREATE TABLE products (
    id INT PRIMARY KEY,
    attributes JSON,
    CHECK (JSON_VALID(attributes))
);

INSERT INTO products VALUES (1, '{"color": "red", "size": "L", "tags": ["sale"]}');

SELECT id FROM products
WHERE attributes->>'$.color' = 'red';

CREATE INDEX idx_color ON products ((CAST(attributes->>'$.color' AS CHAR(20))));
  

Use JSON for semi-structured data that does not warrant a separate table — not as a replacement for proper relational design.

Constraints

  CREATE TABLE order_items (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    order_id INT UNSIGNED NOT NULL,
    product_id INT UNSIGNED NOT NULL,
    quantity INT UNSIGNED NOT NULL,
    unit_price DECIMAL(10, 2) NOT NULL,
    CONSTRAINT chk_quantity CHECK (quantity > 0 AND quantity <= 999),
    CONSTRAINT fk_order FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
    UNIQUE KEY uk_order_product (order_id, product_id)
);
  
Constraint Purpose
PRIMARY KEY Unique row identifier
FOREIGN KEY Referential integrity
UNIQUE Alternate key (email, slug)
NOT NULL Required field
CHECK Domain validation (8.0.16+)
DEFAULT Fallback value

Normalization vs Denormalization

Third Normal Form (3NF): every non-key column depends on the key, the whole key, and nothing but the key.

  -- Normalized: no redundant user name on every order
CREATE TABLE orders (
    id INT PRIMARY KEY,
    user_id INT NOT NULL,
    total DECIMAL(10,2),
    FOREIGN KEY (user_id) REFERENCES users(id)
);
  

Denormalize when reads dominate and joins are expensive:

  -- Denormalized: cache user name on order for reporting speed
ALTER TABLE orders ADD COLUMN user_name VARCHAR(100);
-- Maintain via trigger or application on user update
  

Index-Friendly Schema Choices

  1. Use smallest integer type for IDs (INT UNSIGNED until you exceed 4 billion rows)
  2. Put high-cardinality columns first in composite indexes
  3. Avoid storing computed values unless they are expensive — use generated columns:
  ALTER TABLE order_items
ADD COLUMN line_total DECIMAL(10,2) AS (quantity * unit_price) STORED;
  

Best Practices

  1. Use utf8mb4 for all text columns
  2. Add created_at / updated_at to every table
  3. Use soft deletes (deleted_at) when audit history matters
  4. Name tables and columns consistently (snake_case, singular or plural — pick one)
  5. Version-control every schema change

Common Mistakes

Mistake Impact
VARCHAR(255) everywhere Wastes index space — size to actual need
No foreign keys “for performance” Orphan rows, application-enforced integrity fails
Storing dates as strings Breaks sorting and range queries
Over-normalizing 12-table join for a simple dashboard

Troubleshooting

“Row size too large” — MySQL row limit is ~65KB. Split wide tables or use TEXT/JSON off-page storage.

“Data too long for column” — check actual byte length with multibyte UTF-8:

  SELECT LENGTH('emoji😀'), CHAR_LENGTH('emoji😀');
-- LENGTH = 9 bytes, CHAR_LENGTH = 6 characters
  

Production Scenario

A multi-tenant SaaS uses tenant_id on every table with composite indexes leading with tenant_id. Row-level security is enforced in application queries (WHERE tenant_id = ?), and foreign keys within tenant scope prevent cross-tenant references. Schema migrations run through Flyway in CI with rollback scripts tested on staging clones.

Thoughtful schema design prevents costly migrations and performance problems as data grows from thousands to billions of rows.