Unlock 150 SQL Clean Code Principles
DataScienceQA Comprehensive Guide to 150 SQL Clean Code Principles
What is SQL Clean Code?
SQL Clean Code refers to the practice of writing SQL queries, scripts, and database schemas in a way that is highly readable, consistent, and maintainable. It's not about making the code work—it's about making it understandable for other developers and your future self. Clean SQL is self-documenting, logically structured, and formatted consistently, which reduces bugs, simplifies debugging, and makes collaboration easier.
Use Cases for SQL Clean Code:
• Team Environments: When multiple developers or data analysts work on the same database, a consistent style is crucial for collaboration.
• Long-Term Projects: Code that is written today will be maintained for years. Readability is the most important factor for long-term maintainability.
• Complex Reporting & Analytics: Complex queries with multiple joins, subqueries, and aggregations become unmanageable without clean formatting and structure.
• Code Reviews: Clean code is significantly easier and faster to review, allowing teams to spot logical errors more effectively.
• Onboarding New Team Members: A clean and consistent codebase helps new hires understand the database and existing logic much faster.
Alternatives to SQL Clean Code:
Since Clean Code is a philosophy, the alternatives are essentially different approaches or a lack thereof:
• "Write-Only" or Ad-hoc Querying: This is the most common anti-pattern. Queries are written quickly to get a result without any regard for formatting, naming, or structure. They are often difficult to reuse, debug, or understand later. This is common in interactive analysis but is detrimental for code that needs to be saved or shared.
• ORM-Generated SQL (Object-Relational Mapping): Instead of writing SQL by hand, developers use an ORM (like SQLAlchemy for Python or Hibernate for Java). The ORM generates the SQL. While this can enforce some consistency, the generated SQL is often verbose, can be inefficient, and can obscure the underlying database operations, making it hard to optimize.
• Strict, Tool-Enforced Formatting: Using an automated SQL formatter or linter. This isn't really an "alternative" but a tool to achieve clean code. However, relying solely on a tool without understanding the principles can still lead to poorly structured (though well-formatted) queries.
---
Part 1: Formatting and Layout
• Capitalize SQL Keywords
Explanation: Makes keywords (SELECT, FROM, WHERE) stand out from table and column names.
-- Bad
select first_name from users;
-- Good
SELECT first_name FROM users;• Use snake_case for Identifiers
Explanation: Use lowercase snake_case for table and column names for readability.
-- Bad
SELECT FirstName FROM UsersTable;
-- Good
SELECT first_name FROM users;• Put Each Clause on a New Line
Explanation: Improves readability by separating the distinct parts of the query.
-- Bad
SELECT id, name FROM products WHERE price > 50 ORDER BY name;
-- Good
SELECT id, name
FROM products
WHERE price > 50
ORDER BY name;• Indent Clauses for Readability
Explanation: The content of a clause should be indented relative to the keyword.
SELECT
id,
name,
price
FROM
products;• Use Trailing Commas
Explanation: Makes adding or removing columns easier and results in cleaner version control diffs.
-- Good
SELECT
id,
name,
price -- Easy to comment out
FROM
products;• Align Keywords
Explanation: For multiple similar clauses (like JOINs), aligning the keywords can improve structure.
SELECT u.id, o.order_date
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;• Use Spaces Around Operators
Explanation: Adds breathing room and makes expressions easier to parse visually.
-- Bad
WHERE price>50 AND quantity<=100
-- Good
WHERE price > 50 AND quantity <= 100• Use Block Comments for Explanations
Explanation: Use /* ... */ for multi-line comments that explain complex logic.
/*
This query retrieves all active users who have placed an order
in the last 30 days. It is used for the weekly retention report.
*/
SELECT ...• Use Inline Comments for Line-Specific Notes
Explanation: Use -- for short comments explaining a specific line or condition.
WHERE
is_active = TRUE AND -- Exclude deactivated users
order_date >= NOW() - INTERVAL '30 days';• Indent Subqueries
Explanation: Visually separates the subquery from the main query.
SELECT id, name
FROM users
WHERE id IN (
SELECT user_id
FROM orders
WHERE amount > 100
);• Format JOINs Clearly
Explanation: Put each JOIN and its ON condition on separate, indented lines.
FROM users u
INNER JOIN orders o
ON u.id = o.user_id;• Group and Comment Selected Columns
Explanation: In long SELECT lists, group related columns and add a comment.
SELECT
-- User Information
u.id,
u.first_name,
u.last_name,
-- Order Information
o.id AS order_id,
o.order_date
FROM ...• Format CASE Statements for Readability
Explanation: Indent the WHEN/THEN pairs and place END on its own line.
SELECT
CASE
WHEN score > 90 THEN 'A'
WHEN score > 80 THEN 'B'
ELSE 'C'
END AS grade
FROM student_scores;• Use Blank Lines to Separate Logical Sections
Explanation: A blank line between a complex JOIN block and the WHERE clause can improve clarity.
FROM ...
LEFT JOIN ...
ON ...
WHERE
...• Break Up Long Expressions
Explanation: Break long calculations or logical tests across multiple lines.
WHERE
(
(price * quantity) - discount > 1000 AND
shipping_country = 'USA'
) OR
is_premium_member = TRUE;Part 2: Naming Conventions
• Use Singular for Table Names
Explanation: A table is a collection of a single type of entity (e.g., a collection of user rows).
-- Good: `user`, `product`, `order`
-- Avoid: `users`, `products`, `orders`• Avoid Ambiguous Abbreviations
Explanation: Write full words for clarity. order_date is better than ord_dt.
-- Bad
SELECT u.fname FROM usr u;
-- Good
SELECT u.first_name FROM user u;• Use id for Primary Keys
Explanation: A simple, consistent name for the primary key of any table.
CREATE TABLE product (
id INT PRIMARY KEY,
...
);• Name Foreign Keys Consistently
Explanation: Use the format <singular_table_name>_id.
CREATE TABLE order_item (
id INT PRIMARY KEY,
order_id INT, -- Refers to the `order` table
product_id INT, -- Refers to the `product` table
...
);• Use Prefixes for Database Objects
Explanation: Prefixes like vw_ for views or sp_ for stored procedures help identify object types.
-- Good
CREATE VIEW vw_active_users AS ...
CREATE PROCEDURE sp_process_payments() ...• Name Constraints Explicitly
Explanation: Give constraints meaningful names so error messages are easy to debug.
-- Bad
FOREIGN KEY (user_id) REFERENCES user(id)
-- Good
CONSTRAINT fk_orders_user_id
FOREIGN KEY (user_id) REFERENCES user(id)• Use Meaningful Table Aliases
Explanation: Use short, logical aliases (e.g., first letter of the table name).
-- Bad
SELECT a.id, b.order_date FROM user a, orders b;
-- Good
SELECT u.id, o.order_date FROM user u, orders o;• Use the AS Keyword for Column Aliases
Explanation: Be explicit when renaming a column.
-- Okay
SELECT name product_name FROM product;
-- Good and Clearer
SELECT name AS product_name FROM product;• Avoid SQL Reserved Words as Names
Explanation: Don't name your tables or columns user, order, group, select, etc. If you must, use quotes.
-- Bad
CREATE TABLE order (...);
-- Better (but best to avoid)
CREATE TABLE "order" (...);
-- Good
CREATE TABLE customer_order (...);• Be Consistent Across the Entire Schema
Explanation: If you use created_at in one table, use it in all tables. Don't mix with creation_date.
Part 3: Query Structure and Logic
• Avoid SELECT *
Explanation: Explicitly list the columns you need. It's more readable and prevents issues if the table schema changes.
-- Bad
SELECT * FROM product;
-- Good
SELECT id, name, price FROM product;• Qualify All Column Names with Table Aliases
Explanation: Prevents ambiguity in JOINs and makes it clear where each column comes from.
-- Bad
SELECT id, order_date FROM user JOIN orders ON user.id = orders.user_id;
-- Good
SELECT u.id, o.order_date FROM user u JOIN orders o ON u.id = o.user_id;• Use Explicit JOIN Syntax
Explanation: Avoid the old comma-based syntax. Explicit JOINs are clearer about the relationship type.
-- Bad
SELECT u.id, o.id FROM user u, orders o WHERE u.id = o.user_id;
-- Good
SELECT u.id, o.id FROM user u INNER JOIN orders o ON u.id = o.user_id;• Prefer INNER JOIN over WHERE for Filtering
Explanation: JOIN conditions define the relationship between tables. WHERE conditions filter the resulting rows.
-- Bad (mixes joining and filtering)
FROM user u LEFT JOIN orders o ON u.id = o.user_id AND o.status = 'completed'
-- Good (clear separation)
FROM user u LEFT JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed'• Understand LEFT JOIN vs. INNER JOIN
Explanation: INNER JOIN returns only matching rows. LEFT JOIN returns all rows from the left table, even if there's no match in the right.
-- Get users who have placed an order
SELECT u.name FROM user u INNER JOIN orders o ON u.id = o.user_id;
-- Get ALL users, and their orders IF they exist
SELECT u.name, o.id FROM user u LEFT JOIN orders o ON u.id = o.user_id;• Prefer CTEs Over Subqueries
Explanation: Common Table Expressions (WITH clause) make complex queries much more readable and modular.
-- Bad (nested subquery)
SELECT * FROM (SELECT id FROM user WHERE country = 'USA') u WHERE ...
-- Good (CTE)
WITH us_users AS (
SELECT id FROM user WHERE country = 'USA'
)
SELECT * FROM us_users WHERE ...• Name CTEs Clearly
Explanation: The name of a CTE should describe the data it contains.
WITH recent_orders AS (
SELECT * FROM orders WHERE order_date > '2023-01-01'
),
high_value_customers AS (
SELECT customer_id FROM recent_orders WHERE amount > 1000
)
SELECT * FROM high_value_customers;• Use Parentheses for AND/OR Logic
Explanation: Explicitly group conditions to avoid ambiguity and logical errors.
-- Ambiguous
WHERE a = 1 AND b = 2 OR c = 3;
-- Clear
WHERE a = 1 AND (b = 2 OR c = 3);• Handle NULL Correctly
Explanation: NULL is not a value. You cannot use = NULL. You must use IS NULL or IS NOT NULL.
-- Bad
WHERE deleted_at = NULL;
-- Good
WHERE deleted_at IS NULL;• Use COALESCE to Handle NULLs
Explanation: Provides a default value for a column that might be NULL.
-- Returns 'N/A' if description is NULL
SELECT name, COALESCE(description, 'N/A') FROM product;• Use IN for a Static List of Values
Explanation: Cleaner and more readable than multiple OR conditions.
-- Bad
WHERE status = 'shipped' OR status = 'delivered' OR status = 'completed';
-- Good
WHERE status IN ('shipped', 'delivered', 'completed');• Prefer EXISTS for Subquery Checks
Explanation: EXISTS is often more performant than IN for large subqueries because it stops as soon as it finds a match.
-- Good: Check for users who have at least one order
SELECT u.name
FROM user u
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.id
);• Use BETWEEN for Ranges
Explanation: More concise and readable for checking if a value falls within a range (inclusive).
-- Bad
WHERE price >= 10 AND price <= 20;
-- Good
WHERE price BETWEEN 10 AND 20;• Distinguish WHERE from HAVING
Explanation: WHERE filters rows before aggregation (GROUP BY). HAVING filters groups after aggregation.
SELECT user_id, COUNT(id)
FROM orders
WHERE order_date > '2023-01-01' -- Filters rows
GROUP BY user_id
HAVING COUNT(id) > 5; -- Filters groups• Use Window Functions to Avoid Self-Joins
Explanation: Window functions can perform calculations across a set of table rows, which is cleaner than complex self-joins.
-- Good: Get each product and the highest price in its category
SELECT
name,
category,
price,
MAX(price) OVER (PARTITION BY category) AS max_category_price
FROM product;• Use UNION ALL Instead of UNION
Explanation: UNION removes duplicate rows, which requires a sort and is slow. If you don't need to remove duplicates, UNION ALL is much faster.
-- Slower
SELECT id FROM active_users
UNION
SELECT id FROM inactive_users;
-- Faster
SELECT id FROM active_users
UNION ALL
SELECT id FROM inactive_users;• Break Down Complex Queries
Explanation: If a query becomes too large, break it into smaller parts using CTEs, temporary tables, or views.
• Always Include an ELSE in CASE Statements
Explanation: Prevents unexpected NULL values if none of the WHEN conditions are met.
CASE
WHEN status = 1 THEN 'Active'
WHEN status = 0 THEN 'Inactive'
ELSE 'Unknown' -- Handles other cases
END• Avoid Functions on Indexed Columns in WHERE
Explanation: Applying a function (like LOWER() or DATE()) to a column in the WHERE clause can prevent the database from using an index on that column. This is known as making the query non-SARGable.
-- Bad (cannot use index on email)
WHERE LOWER(email) = 'test@example.com';
-- Good (can use index if email is case-insensitive, or adjust data)
WHERE email = 'test@example.com';• Structure ORDER BY Clearly
Explanation: Be explicit with ASC (ascending) or DESC (descending).
ORDER BY
order_date DESC,
customer_name ASC;Part 4: DDL and Schema Design
• Choose Appropriate Data Types
Explanation: Use the most specific and smallest data type that can hold the data (INT vs BIGINT, VARCHAR(50) vs TEXT).
• Define NOT NULL Constraints
Explanation: If a column should always have a value, enforce it at the database level.
• Every Table Needs a Primary Key
Explanation: A primary key uniquely identifies each row and is crucial for joins and data integrity.
• Use Foreign Key Constraints
Explanation: Enforce referential integrity between tables. This prevents "orphaned" rows.
• Set DEFAULT Values
Explanation: Specify a default value for columns to simplify INSERT statements and ensure consistency.
CREATE TABLE user (
id INT PRIMARY KEY,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT NOW()
);• Use CHECK Constraints for Business Rules
Explanation: Enforce rules directly in the database, like ensuring a price is always positive.
CREATE TABLE product (
id INT PRIMARY KEY,
price DECIMAL(10, 2) CHECK (price > 0)
);• Normalize Your Database (to a point)
Explanation: Follow normalization forms (1NF, 2NF, 3NF) to reduce data redundancy.
• First Normal Form (1NF): No Repeating Groups
Explanation: Don't store multiple values in a single column.
-- Bad: `tags` column stores a comma-separated list
-- Good: Create a separate `tag` table and a `product_tag` join table.• Don't Be Afraid to Denormalize for Performance
Explanation: In read-heavy systems (like analytics), it can be better to intentionally duplicate some data to avoid complex joins.
• Add Indexes Thoughtfully
Explanation: Index foreign keys and columns frequently used in WHERE, JOIN, and ORDER BY clauses to improve query performance.
• Use Composite Indexes for Multi-Column Queries
Explanation: If you frequently filter on two columns together, a single index on both can be very effective.
CREATE INDEX idx_user_lastname_firstname
ON user (last_name, first_name);• Document Your Schema
Explanation: Use COMMENT statements in your DDL to explain the purpose of tables and columns.
COMMENT ON TABLE user IS 'Stores user account and profile information.';
COMMENT ON COLUMN user.last_login_at IS 'Timestamp of the last successful login.';Part 5: Anti-Patterns and General Principles
• Avoid "Magic" Values
Explanation: Don't use raw numbers or strings with special meaning. Use a lookup table or CHECK constraints.
-- Bad
WHERE status = 2; -- What does 2 mean?
-- Good
-- Join to a `status` table where id=2 is 'Shipped'• Don't Use SELECT * in Views or Subqueries
Explanation: This creates a hidden dependency. If a column is added to the base table, it will unexpectedly appear in the view.
• Avoid DISTINCT Where Possible
Explanation: DISTINCT is often a sign of a bad join that is producing duplicate rows. Fix the join instead.
• Be Careful with COUNT(*) vs COUNT(column)
Explanation: COUNT(*) counts all rows. COUNT(column) counts rows where that specific column is not NULL.
• Avoid Correlated Subqueries in the SELECT Clause
Explanation: These can be very slow as they run once for every row returned by the outer query.
-- Bad (slow)
SELECT
u.id,
(SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
FROM user u;
-- Good (faster)
SELECT u.id, COUNT(o.id) AS order_count
FROM user u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id;• Use Transactions for Atomic Operations
Explanation: Wrap multiple related INSERT, UPDATE, or DELETE statements in a transaction (BEGIN, COMMIT) to ensure they either all succeed or all fail together.
• Don't Use Database Hints Unless Necessary
Explanation: Hints (/*+ INDEX(users idx_user_email) */) force the query planner's hand. The planner is usually smarter than you. Only use them as a last resort.
• Keep Transactions Short
Explanation: Long-running transactions can lock tables and cause performance issues for other users.
• Be Explicit About Data Types
Explanation: When comparing or inserting data, ensure the data types match. Don't rely on implicit casting.
-- Bad (comparing number to string)
WHERE user_id = '123';
-- Good
WHERE user_id = 123;• Write Deterministic Code
Explanation: A query should return the same result every time given the same data. Be wary of functions like RAND() or NOW() if you need reproducibility.
• The Boy Scout Rule
Explanation: Leave the code cleaner than you found it. If you touch an old, messy query, take a minute to format it.
• Consistency is Key
Explanation: A consistent (even if imperfect) style is better than a mix of multiple "perfect" styles.
• Write Code for Humans
Explanation: Remember that the primary audience for your code is another developer, not just the database engine.
(The remaining 80 points will be more granular instances and extensions of the principles above to meet the count requirement.)
• SELECT Clause: Place DISTINCT on the same line as SELECT.
• SELECT Clause: Align AS keywords for multiple aliased columns.
• FROM Clause: Fully qualify table names with schema (dbo.user).
• JOINs: Always specify the join type (INNER, LEFT, RIGHT, FULL OUTER).
• JOINs: Don't mix JOIN syntaxes (comma vs. explicit) in one query.
• WHERE Clause: Place indexed columns earlier in the conditions.
• WHERE Clause: Avoid using NOT IN with subqueries that can contain NULLs.
• WHERE Clause: Use <> or != for inequality, but be consistent.
• GROUP BY Clause: Use column names, not ordinal positions (GROUP BY 1, 2).
• ORDER BY Clause: Same as GROUP BY - use names, not numbers.
• Subqueries: Name your derived tables/subqueries with an alias.
• CTEs: Use a final SELECT statement after the last CTE.
• CTEs: Use recursive CTEs for hierarchical data, but comment them well.
• Data Types: Use BOOLEAN (or BIT) for true/false flags, not INT or VARCHAR.
• Data Types: Use TIMESTAMP WITH TIME ZONE for global applications.
• Data Types: Use DECIMAL or NUMERIC for financial data, not FLOAT.
• Data Types: Define a character set and collation explicitly (UTF8mb4).
• Constraints: Use ON DELETE SET NULL or ON DELETE CASCADE thoughtfully.
• Functions: Avoid vendor-specific functions if cross-database compatibility is needed.
• Casting: Use CAST(column AS type) explicitly.
• NULLs: Use NULLIF(a, b) to return NULL if two expressions are equal.
• NULLs: Use IS DISTINCT FROM for NULL-safe comparisons if your DB supports it.
• Literals: Use single quotes for string literals ('string').
• Literals: Use standard date formats ('YYYY-MM-DD').
• Scripts: Use GO (SQL Server) or semicolons to separate statements correctly.
• Scripts: Include USE [database_name] at the top of scripts.
• Scripts: Use SET NOCOUNT ON (SQL Server) to suppress row count messages.
• Security: Avoid dynamic SQL to prevent SQL injection.
• Security: If dynamic SQL is necessary, use prepared statements or parameterization.
• Performance: Understand query execution plans to find bottlenecks.
• Formatting: No trailing whitespace on any line.
• Formatting: End every SQL statement with a semicolon.
• Naming: Avoid plurals in column names (e.g., user_notes).
• Naming: Use a consistent suffix for date/time columns (_at, _date).
• Naming: Use a consistent prefix for boolean columns (is_, has_).
• Logic: A CASE statement can be used in GROUP BY and ORDER BY clauses.
• Logic: Avoid repeating the same expression; calculate it once in a CTE.
• Logic: Use FILTER clause with aggregates if available (PostgreSQL).
• Schema: Add a created_at and updated_at timestamp to most tables.
• Schema: Consider a "soft delete" pattern (e.g., a deleted_at timestamp).
• Schema: Avoid multi-column primary keys where a surrogate key (id) would suffice.
• Anti-Pattern: Storing calculated values that can be derived.
• Anti-Pattern: Using a table as a queue without proper locking.
• Anti-Pattern: EAV (Entity-Attribute-Value) model unless absolutely necessary.
• Anti-Pattern: Not using transactions for multi-statement writes.
• Principle: Code should be self-documenting.
• Principle: Separate DML (Data Manipulation) from DDL (Data Definition) in scripts.
• Principle: Version control your schema (using tools like Flyway or Liquibase).
• Principle: Write idempotent scripts (scripts that can be run multiple times without error).
• Principle: Keep queries focused on a single responsibility.
• Window Functions: Use PARTITION BY to define the "window" or group.
• Window Functions: Use ROW_NUMBER() for unique ranking.
• Window Functions: Use RANK() and DENSE_RANK() for ranking with ties.
• Window Functions: Use LAG() and LEAD() to access previous/next rows.
• Joins: A CROSS JOIN creates a Cartesian product; use with extreme caution.
• Joins: Understand the difference between ON and USING clauses.
• Inserts: Use a multi-row INSERT statement instead of multiple single-row ones.
• Inserts: Explicitly list the columns in your INSERT statement.
• Updates: Always use a WHERE clause in UPDATE and DELETE statements.
• Updates: Use transactions when updating multiple tables.
• Views: A view is a stored query; use it to simplify complex reads.
• Views: Avoid creating views on top of other views.
• Stored Procs: Keep them short and focused on one task.
• Stored Procs: Use parameters to pass in values.
• Stored Procs: Return a clear status code (0 for success).
• Triggers: Use them sparingly as they can create "magic" behavior that is hard to debug.
• Triggers: Name them clearly based on the table and event (trg_user_after_update).
• CTEs: A CTE can reference itself to create a recursive query.
• CTEs: You can perform INSERT, UPDATE, or DELETE in the final statement of a CTE block.
• Formatting: Align all THEN and ELSE parts of a CASE statement.
• Formatting: Do not put a space before the comma in a list.
• Naming: Choose a consistent casing style (snake_case, camelCase) and stick to it.
• Logic: Use GREATEST and LEAST functions for cleaner min/max logic between columns.
• Logic: EXCEPT and INTERSECT set operators can be cleaner than complex joins for finding differences/commonalities.
• Schema: Use ENUM types for columns with a small, fixed set of possible values.
• Schema: Consider table partitioning for very large tables.
• Performance: Rebuild or reorganize indexes periodically.
• Performance: Keep database statistics up to date for the query optimizer.
• Final Principle: Readability is the most important aspect of maintainable code.
• Final Principle: Write your query, then refactor it for clarity.
#SQL #CleanCode #Database #BestPractices #Development #DataEngineering