Data Types and Schema Design
Schema Design Principles
Good schema design balances normalization (reduce duplication), query performance (strategic denormalization), and operational simplicity (migrations, backups). Start normalized; denormalize only when profiling proves a bottleneck.
Numeric Types
CREATE TABLE metrics (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
product_id INT UNSIGNED NOT NULL,
view_count INT UNSIGNED NOT NULL DEFAULT 0,
price DECIMAL(10, 2) NOT NULL,
tax_rate DECIMAL(5, 4) NOT NULL DEFAULT 0.0825,
weight_kg DOUBLE,
is_featured TINYINT(1) NOT NULL DEFAULT 0
);
| Type | Storage | Use Case |
|---|---|---|
| TINYINT | 1 byte | Flags, small enums (0–255) |
| INT / BIGINT | 4 / 8 bytes | IDs, counts — use UNSIGNED for non-negative |
| DECIMAL(p,s) | Variable | Money, exact fractions — never FLOAT for currency |
| FLOAT / DOUBLE | 4 / 8 bytes | Scientific approximations only |
-- Wrong: floating point money
price FLOAT -- 0.1 + 0.2 != 0.3
-- Right: fixed precision
price DECIMAL(10, 2)
Strings and Text
CREATE TABLE articles (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
slug VARCHAR(64) NOT NULL,
title VARCHAR(255) NOT NULL,
summary VARCHAR(500),
body MEDIUMTEXT,
metadata JSON,
UNIQUE KEY uk_slug (slug),
FULLTEXT KEY ft_title_body (title, body)
);
| Type | Max Length | Use Case |
|---|---|---|
| CHAR(n) | Fixed n | Country codes (US), fixed tokens |
| VARCHAR(n) | Up to 65,535 bytes | Names, emails, slugs |
| TEXT / MEDIUMTEXT / LONGTEXT | 64KB / 16MB / 4GB | Long content |
| JSON | Document storage | Flexible attributes with indexing |
Index only the prefix of long VARCHAR columns when necessary:
CREATE INDEX idx_body_prefix ON articles (body(100));
Date and Time
CREATE TABLE events (
id INT UNSIGNED PRIMARY KEY,
starts_at DATETIME(3) NOT NULL,
ends_at DATETIME(3) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
timezone VARCHAR(50) DEFAULT 'UTC'
);
| Type | Range | Notes |
|---|---|---|
| DATE | 1000-01-01 to 9999-12-31 | Date only |
| TIME | Time of day | Duration or clock time |
| DATETIME | 1000 to 9999 | No timezone conversion — store UTC |
| TIMESTAMP | 1970–2038 (32-bit) | Converts to session timezone |
Best practice: store UTC in DATETIME; convert to local timezone in application layer.
ENUM and SET
CREATE TABLE orders (
id INT UNSIGNED PRIMARY KEY,
status ENUM('pending', 'paid', 'shipped', 'delivered', 'cancelled') DEFAULT 'pending',
flags SET('gift', 'fragile', 'express') DEFAULT ''
);
ENUM is compact but altering values requires DDL. For frequently changing options, use a lookup table:
CREATE TABLE order_statuses (
id TINYINT UNSIGNED PRIMARY KEY,
code VARCHAR(20) NOT NULL UNIQUE,
label VARCHAR(50) NOT NULL
);
JSON Columns (MySQL 8.0+)
CREATE TABLE products (
id INT PRIMARY KEY,
attributes JSON,
CHECK (JSON_VALID(attributes))
);
INSERT INTO products VALUES (1, '{"color": "red", "size": "L", "tags": ["sale"]}');
SELECT id FROM products
WHERE attributes->>'$.color' = 'red';
CREATE INDEX idx_color ON products ((CAST(attributes->>'$.color' AS CHAR(20))));
Use JSON for semi-structured data that does not warrant a separate table — not as a replacement for proper relational design.
Constraints
CREATE TABLE order_items (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
order_id INT UNSIGNED NOT NULL,
product_id INT UNSIGNED NOT NULL,
quantity INT UNSIGNED NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL,
CONSTRAINT chk_quantity CHECK (quantity > 0 AND quantity <= 999),
CONSTRAINT fk_order FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE,
UNIQUE KEY uk_order_product (order_id, product_id)
);
| Constraint | Purpose |
|---|---|
| PRIMARY KEY | Unique row identifier |
| FOREIGN KEY | Referential integrity |
| UNIQUE | Alternate key (email, slug) |
| NOT NULL | Required field |
| CHECK | Domain validation (8.0.16+) |
| DEFAULT | Fallback value |
Normalization vs Denormalization
Third Normal Form (3NF): every non-key column depends on the key, the whole key, and nothing but the key.
-- Normalized: no redundant user name on every order
CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT NOT NULL,
total DECIMAL(10,2),
FOREIGN KEY (user_id) REFERENCES users(id)
);
Denormalize when reads dominate and joins are expensive:
-- Denormalized: cache user name on order for reporting speed
ALTER TABLE orders ADD COLUMN user_name VARCHAR(100);
-- Maintain via trigger or application on user update
Index-Friendly Schema Choices
- Use smallest integer type for IDs (
INT UNSIGNEDuntil you exceed 4 billion rows) - Put high-cardinality columns first in composite indexes
- Avoid storing computed values unless they are expensive — use generated columns:
ALTER TABLE order_items
ADD COLUMN line_total DECIMAL(10,2) AS (quantity * unit_price) STORED;
Best Practices
- Use
utf8mb4for all text columns - Add
created_at/updated_atto every table - Use soft deletes (
deleted_at) when audit history matters - Name tables and columns consistently (
snake_case, singular or plural — pick one) - Version-control every schema change
Common Mistakes
| Mistake | Impact |
|---|---|
| VARCHAR(255) everywhere | Wastes index space — size to actual need |
| No foreign keys “for performance” | Orphan rows, application-enforced integrity fails |
| Storing dates as strings | Breaks sorting and range queries |
| Over-normalizing | 12-table join for a simple dashboard |
Troubleshooting
“Row size too large” — MySQL row limit is ~65KB. Split wide tables or use TEXT/JSON off-page storage.
“Data too long for column” — check actual byte length with multibyte UTF-8:
SELECT LENGTH('emoji😀'), CHAR_LENGTH('emoji😀');
-- LENGTH = 9 bytes, CHAR_LENGTH = 6 characters
Production Scenario
A multi-tenant SaaS uses tenant_id on every table with composite indexes leading with tenant_id. Row-level security is enforced in application queries (WHERE tenant_id = ?), and foreign keys within tenant scope prevent cross-tenant references. Schema migrations run through Flyway in CI with rollback scripts tested on staging clones.
Thoughtful schema design prevents costly migrations and performance problems as data grows from thousands to billions of rows.