Optimizing MySQL Queries with Indexing: Higher Cardinality vs. Lower Cardinality for Ranges

2024-07-27

An index is a special data structure in a database table that helps speed up retrieval of specific rows. It's like an organized catalog in a library that allows you to quickly find books based on author, title, or other criteria.
Indexes work by creating sorted entries for one or more columns in a table. These entries map the column values to the corresponding row locations.

Cardinality

Cardinality refers to the number of distinct values in a column. A low cardinality column has few unique values (e.g., a gender column with only "male" and "female"), while a high cardinality column has many (e.g., a user ID column with a unique value for each user).
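
To see what cardinality looks like on real data, you can count distinct values directly or read MySQL's own estimate. The sketch below assumes a table named products with category and product_id columns, like the one used in the examples later in this article; substitute your own table and column names.

```sql
-- Exact cardinality: number of distinct values in each column.
SELECT COUNT(DISTINCT category)   AS category_cardinality,
       COUNT(DISTINCT product_id) AS product_id_cardinality,
       COUNT(*)                   AS total_rows
FROM products;

-- Estimated cardinality per indexed column (see the Cardinality column).
-- These figures come from stored statistics and are only estimates.
SHOW INDEX FROM products;

-- Refresh the statistics if the estimates look stale.
ANALYZE TABLE products;
```

A column whose cardinality is close to total_rows is high cardinality; one with a handful of distinct values is low cardinality.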

Impact on Performance with Ranges

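When a query combines equality conditions with a range condition, the order of columns in a composite index becomes crucial for optimal performance.
Generally, you want the columns tested for equality first and the range column last. A B-Tree index is sorted by its first column, then by the second column within each value of the first, and so on. Once MySQL applies a range condition to one column, it cannot use the columns after it to narrow the scan any further; at best it filters on them while reading the index. In practice the equality column is often the lower cardinality one (a category, a status) and the range column the higher cardinality one (a date, a price), which is where the rule of thumb "lower cardinality first" comes from. Cardinality alone does not decide the order; the type of condition does.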
Imagine searching for books by genre ("Science Fiction") in a library catalog. If the catalog is indexed by genre first (low cardinality) and then by author (high cardinality), you'll quickly find all science fiction books without having to check every author.

Example

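Suppose you have a table orders with columns customer_id (comparatively low cardinality, since each customer places many orders) and order_date (high cardinality, many distinct dates). An index on (customer_id, order_date) suits queries that filter by customer_id with equality and then apply a range condition on order_date (e.g., finding all orders for a specific customer between two dates). MySQL seeks to that customer's entries, which are already sorted by date, and reads only the matching slice. An index on (order_date, customer_id) would instead have to scan every order in the date range, for all customers.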
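
As a minimal sketch of this example, assuming the orders table already exists with the two columns named above (the index name and the literal values are made up for illustration):

```sql
-- Composite index with customer_id first and order_date second.
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

-- All orders for one customer in January 2024.
-- The customer id 42 and the dates are illustrative values.
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_date >= '2024-01-01'
  AND order_date <  '2024-02-01';
```

The half-open date range (>= start, < next month) is a design choice: it keeps working if order_date is later changed to a DATETIME column.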

However, there are exceptions:

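If your query tests every column in the index for equality, column order makes little difference to that query, because the whole key is used either way. In that case, putting the higher cardinality column first can still pay off: other queries that filter only on the leading column can use the same index, and a more selective leading column discards more rows for them.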
The optimal column order depends on your specific workload and query patterns. It's always recommended to analyze your query patterns and use tools like EXPLAIN in MySQL to understand how the database engine is using indexes and identify potential improvements.
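
A short sketch of what that analysis can look like, reusing the orders example from above (the literal values are illustrative). The comments list the output columns that are usually the most informative when judging index use.

```sql
-- Show the plan MySQL chooses, without running the query.
EXPLAIN
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_date BETWEEN '2024-01-01' AND '2024-01-31';

-- Columns worth reading in the output:
--   key     : the index actually chosen (NULL means none)
--   key_len : how many bytes of the index key are used; a larger value
--             usually means more index columns take part in the lookup
--   rows    : estimated number of rows examined
--   Extra   : notes such as "Using index condition" or "Using filesort"

-- MySQL 8.0.18 and later can also execute the query and report
-- actual row counts and timings.
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_date BETWEEN '2024-01-01' AND '2024-01-31';
```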

Imagine a table products with columns:

product_id (INT, primary key, high cardinality - many unique products)
category (VARCHAR(50), low cardinality - limited number of categories)
price (DECIMAL, medium cardinality)
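
For readers who want to follow along, here is one possible definition of that table. The precision DECIMAL(10,2), the NOT NULL constraints, AUTO_INCREMENT, and the sample rows are assumptions added for illustration; the column list above only fixes the names and base types.

```sql
-- Hypothetical definition matching the column list above.
CREATE TABLE products (
  product_id INT NOT NULL AUTO_INCREMENT,
  category   VARCHAR(50) NOT NULL,
  price      DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (product_id)
);

-- A few illustrative rows.
INSERT INTO products (category, price) VALUES
  ('Electronics', 149.99),
  ('Electronics', 899.00),
  ('Books',        12.50);
```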

Case 1: Filtering by category and then range on price (Equality column - category - first, range column - price - last)

This query retrieves products within a specific price range for a given category:

SELECT *
FROM products
WHERE category = 'Electronics'
  AND price BETWEEN 100 AND 200;

Here, even though price has the higher cardinality, category should come first in the index, because category is tested for equality while price is tested with a range. With price first, MySQL would have to scan every index entry priced between 100 and 200 across all categories and discard those that are not 'Electronics'.

Index:

CREATE INDEX product_category_price ON products (category, price);

Explanation:

The index seeks directly to the entries for category = 'Electronics'.
Within that category, entries are sorted by price, so MySQL reads only the contiguous slice from 100 to 200.
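
One way to check which column order suits this query on your own data is to create both candidate indexes and compare the plans. The index names below are hypothetical, and FORCE INDEX is used only so that MySQL considers one candidate at a time.

```sql
-- Two candidate orderings of the same columns.
CREATE INDEX idx_cmp_category_price ON products (category, price);
CREATE INDEX idx_cmp_price_category ON products (price, category);

-- Plan when only the category-first index may be used.
EXPLAIN
SELECT *
FROM products FORCE INDEX (idx_cmp_category_price)
WHERE category = 'Electronics'
  AND price BETWEEN 100 AND 200;

-- Plan when only the price-first index may be used.
EXPLAIN
SELECT *
FROM products FORCE INDEX (idx_cmp_price_category)
WHERE category = 'Electronics'
  AND price BETWEEN 100 AND 200;

-- Compare key_len and rows between the two plans, then clean up
-- the experiment; every extra index slows down writes.
DROP INDEX idx_cmp_category_price ON products;
DROP INDEX idx_cmp_price_category ON products;
```

MySQL can use index columns to bound a scan only up to and including the first range condition, so the plan that applies the equality test on the leading column would usually be expected to report the smaller rows estimate.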

Case 2: Filtering by category with no range (category must be the leftmost index column)

This query retrieves all products in a specific category:

SELECT *
FROM products
WHERE category = 'Books';

For this scenario, an index with category first would be more beneficial:

CREATE INDEX product_category ON products (category, price);

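Since category is the leftmost column of the index, MySQL can seek directly to the matching entries. An index that starts with price could not be used this way, because the query puts no condition on price.
price can still be used for sorting or filtering within the narrowed-down category results, since the entries within each category are stored in price order.
Keep in mind that a single value of a low cardinality column may match a large share of the table; the optimizer may then prefer a full table scan over the index, especially for SELECT *.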
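
As a sketch of the sorting case, assuming the product_category index on (category, price) from above exists (the LIMIT of 10 is an arbitrary illustrative value):

```sql
-- Ten cheapest books. Entries for category = 'Books' are stored in price
-- order inside the index, so MySQL can read them in order and stop early.
EXPLAIN
SELECT product_id, category, price
FROM products
WHERE category = 'Books'
ORDER BY price
LIMIT 10;

-- If the index is used for ordering, the Extra column should not
-- contain "Using filesort".
```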

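Covering Indexes:

A covering index includes every column a query references, both for filtering (WHERE clause) and for retrieving data (SELECT clause). This lets MySQL answer the query from the index alone, without reading the table rows, which can improve performance significantly. EXPLAIN reports this as "Using index" in the Extra column.
However, covering indexes can become large, and every additional index slows down writes, so they require careful design to avoid redundancy and maintain efficiency.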

If your query often retrieves both category and price after filtering by category:

SELECT category, price
FROM products
WHERE category = 'Electronics';

A covering index on (category, price) would be ideal.
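
A minimal way to confirm that the index covers the query, assuming the products table from above (the index name is hypothetical):

```sql
-- Skip this statement if an index on (category, price) already exists.
CREATE INDEX idx_cover_category_price ON products (category, price);

EXPLAIN
SELECT category, price
FROM products
WHERE category = 'Electronics';

-- "Using index" in the Extra column indicates that the query was
-- answered from the index alone, without reading table rows.
```

Adding a column to the SELECT list that is not part of the index (and not the primary key, on InnoDB) means the index no longer covers the query.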

Multiple Indexes and Index Merging:

You can create separate indexes on each column involved in the WHERE clause, especially for OR conditions.
MySQL can then use a technique called "index merge" to combine the results from multiple indexes. For OR conditions across different columns, a single composite index cannot serve both sides of the OR, so index merge can offer better performance.

If your query filters by either category or price:

SELECT *
FROM products
WHERE category = 'Electronics' OR price BETWEEN 100 AND 200;

Create separate indexes on category and price. MySQL might use index merging to leverage both indexes for faster retrieval.
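
A sketch of this setup, with hypothetical index names. Whether the optimizer actually picks index merge depends on the data distribution, so the plan should be checked rather than assumed.

```sql
-- One single-column index per side of the OR.
CREATE INDEX idx_category ON products (category);
CREATE INDEX idx_price    ON products (price);

EXPLAIN
SELECT *
FROM products
WHERE category = 'Electronics' OR price BETWEEN 100 AND 200;

-- If index merge is chosen, the type column shows index_merge and the
-- Extra column names the algorithm and the indexes it combined.

-- Alternative formulation: one indexed lookup per branch.
-- UNION (without ALL) removes rows that match both branches.
SELECT * FROM products WHERE category = 'Electronics'
UNION
SELECT * FROM products WHERE price BETWEEN 100 AND 200;
```

The UNION form is sometimes easier to reason about, because each branch can be inspected with EXPLAIN on its own.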

Materialized Views (Limited use case):

MySQL has no built-in materialized views, but in specific scenarios you can emulate one with a summary table (a pre-computed table summarizing data from the base table) that is refreshed by triggers, scheduled events, or application code. Maintaining such a table adds overhead and requires careful management for data consistency.
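
Since MySQL does not provide materialized views natively, the usual approach is a hand-maintained summary table. The sketch below is one possible shape; the table name, its columns, and the refresh strategy are all assumptions chosen for illustration.

```sql
-- Hypothetical summary table: one row per category.
CREATE TABLE category_price_summary (
  category      VARCHAR(50)   NOT NULL,
  product_count INT           NOT NULL,
  min_price     DECIMAL(10,2) NULL,
  max_price     DECIMAL(10,2) NULL,
  PRIMARY KEY (category)
);

-- Refresh: recompute every category and overwrite its existing row.
-- Note that this does not delete rows for categories that no longer
-- exist in products; a full rebuild would need a DELETE first.
REPLACE INTO category_price_summary
  (category, product_count, min_price, max_price)
SELECT category, COUNT(*), MIN(price), MAX(price)
FROM products
WHERE category IS NOT NULL
GROUP BY category;
```

Queries that only need per-category aggregates can then read the small summary table instead of scanning products, at the cost of the data being as stale as the last refresh.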

Denormalization (Extreme caution advised):

In rare cases, denormalization, where you strategically duplicate data in certain tables, might improve query performance. However, this can lead to data redundancy and consistency issues. It should only be considered as a last resort.
