Unlocking Flexibility: Using DISTINCT ON for Distinct Rows with Custom Ordering in PostgreSQL
In PostgreSQL, DISTINCT ON is an extension of the SELECT clause (not a window function) that keeps only the first row of each group of rows sharing the same values in the listed expressions. It's particularly useful when you want the "first" occurrence of a distinct value within a group; which row counts as "first" is unpredictable unless ORDER BY sorts the rows within each group.
How it Works:
- Specify Columns for Distinction: You provide a comma-separated list of columns within parentheses after DISTINCT ON. These columns determine how rows are grouped to identify duplicates.
- Order the Results (Optional): You can use the ORDER BY clause to specify how the rows within each distinct group should be ordered. This becomes crucial if you want a specific "first" row to be returned.
Challenge: Conflicting Requirements
- DISTINCT ON relies on sorting to identify distinct groups, so the DISTINCT ON expressions must match the leftmost ORDER BY expressions of the same query.
- You might want the distinct rows ordered differently than the grouping criteria.
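The conflict shows up as soon as both requirements are put into a single query. The following is a minimal sketch, assuming a purchases table with the columns address_id, product_id, and purchase_date used in this article's example:

```sql
-- Assumed schema, taken from the article's purchases example.
CREATE TEMP TABLE IF NOT EXISTS purchases (
    address_id    integer,
    product_id    integer,
    purchase_date date
);

-- Rejected: ORDER BY does not start with the DISTINCT ON expression.
-- PostgreSQL reports:
--   ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions
SELECT DISTINCT ON (address_id) *
FROM purchases
ORDER BY product_id;

-- Accepted: address_id leads the ORDER BY, but the result now comes back
-- ordered by address_id rather than by product_id.
SELECT DISTINCT ON (address_id) *
FROM purchases
ORDER BY address_id, purchase_date DESC;
```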
Solution: Subquery with Nested Sorting
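- Inner Query: Here, you use DISTINCT ON with the grouping columns. Its ORDER BY starts with those same columns, followed by whatever decides which row of each group is kept. You can omit ORDER BY in this inner query only if any row per group will do.
- Outer Query: Wrap the inner query with a SELECT statement. Use ORDER BY here to specify the final desired order for the distinct rows.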
Example:
Imagine a table purchases with columns address_id, product_id, and purchase_date. You want the most recent purchase for each distinct address_id, but ordered by product_id in ascending order:
SELECT *
FROM (
SELECT DISTINCT ON (address_id) *
FROM purchases
ORDER BY address_id, purchase_date DESC -- Group by address, sort DESC for most recent
) AS sub
ORDER BY product_id ASC; -- Final order by product (ascending)
Explanation:
- The inner query uses DISTINCT ON (address_id) to keep one row per distinct address_id value.
- It sorts by address_id first (required, because the DISTINCT ON expressions must lead the ORDER BY of the same query) and then by purchase_date descending (DESC) to ensure the most recent purchase for each address is selected.
- The outer query wraps the inner query and uses ORDER BY product_id ASC to sort the final results by product_id in ascending order.
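To see the two sort orders at work, the query can be run against a handful of rows. This is only an illustrative sketch: the sample values are made up, and a CTE built from VALUES stands in for the real purchases table.

```sql
-- Made-up sample rows standing in for the purchases table.
WITH purchases (address_id, product_id, purchase_date) AS (
    VALUES
        (1, 30, DATE '2024-01-05'),
        (1, 10, DATE '2024-03-01'),
        (2, 20, DATE '2024-02-10'),
        (2, 40, DATE '2024-01-15'),
        (3,  5, DATE '2024-02-20')
)
SELECT *
FROM (
    SELECT DISTINCT ON (address_id) *
    FROM purchases
    ORDER BY address_id, purchase_date DESC  -- keep the latest purchase per address
) AS sub
ORDER BY product_id ASC;                     -- then reorder the survivors by product

-- Result:
--  address_id | product_id | purchase_date
--           3 |          5 | 2024-02-20
--           1 |         10 | 2024-03-01
--           2 |         20 | 2024-02-10
```

Each address appears once, with its most recent purchase, and the rows are no longer in address_id order.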
Key Points:
- The leading columns in ORDER BY of the inner query (the one containing DISTINCT ON) must match the columns specified in DISTINCT ON. The outer ORDER BY is free of this restriction.
- For optimal performance, consider creating indexes on the columns involved in DISTINCT ON and ORDER BY.
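As a rough sketch of the indexing advice, an index whose column order and sort direction mirror the inner ORDER BY gives the planner the option of reading rows already sorted. The index name below is hypothetical, and whether the index is actually used depends on table size and statistics, so checking the plan is worthwhile:

```sql
-- Assumed schema, taken from the article's purchases example.
CREATE TEMP TABLE IF NOT EXISTS purchases (
    address_id    integer,
    product_id    integer,
    purchase_date date
);

-- Hypothetical index name; columns and DESC mirror the inner ORDER BY.
CREATE INDEX IF NOT EXISTS purchases_address_date_idx
    ON purchases (address_id, purchase_date DESC);

-- Inspect the plan to see whether the planner chooses the index.
EXPLAIN
SELECT DISTINCT ON (address_id) *
FROM purchases
ORDER BY address_id, purchase_date DESC;
```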
This example uses the orders table with columns customer_id, product_name, and order_date, joined to a customers table to look up customer_name:
SELECT c.customer_name, sub.product_name, sub.order_date
FROM customers c
JOIN (
SELECT DISTINCT ON (customer_id) customer_id, product_name, order_date
FROM orders
ORDER BY customer_id, order_date DESC -- Group by customer, sort DESC for most recent
) AS sub
ON c.customer_id = sub.customer_id
ORDER BY sub.product_name ASC; -- Final order by product name (ascending)
- We join the customers table with a subquery that uses DISTINCT ON (customer_id) to keep one row per customer.
- The inner query sorts by customer_id first (required to match DISTINCT ON) and then by order_date descending (DESC) to select the most recent order for each customer.
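If a customer has two orders with the same order_date, sorting by date alone does not determine which of them is kept. One way to make the choice repeatable is to add a tie-breaker to the ORDER BY. The sketch below assumes a unique order_id column, which is hypothetical and not part of the schema described above, and uses made-up rows:

```sql
-- order_id is a hypothetical unique column; the rows are made-up sample data.
WITH orders (order_id, customer_id, product_name, order_date) AS (
    VALUES
        (1, 1, 'lamp',  DATE '2024-05-01'),
        (2, 1, 'desk',  DATE '2024-05-01'),  -- same date as order 1
        (3, 2, 'chair', DATE '2024-04-10')
)
SELECT DISTINCT ON (customer_id) customer_id, product_name, order_date
FROM orders
ORDER BY customer_id,
         order_date DESC,
         order_id DESC;  -- tie-breaker: among same-day orders, keep the highest order_id

-- Result:
--  customer_id | product_name | order_date
--            1 | desk         | 2024-05-01
--            2 | chair        | 2024-04-10
```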
Example 2: Finding Highest-Earning Employees by Department (Ordered by Salary)
This example assumes an employees table with columns department_id, employee_name, and salary:
SELECT sub.department_id, sub.employee_name, sub.salary
FROM (
SELECT DISTINCT ON (department_id) department_id, employee_name, salary
FROM employees
ORDER BY department_id, salary DESC -- Group by department, sort DESC for highest salary
) AS sub
ORDER BY sub.salary DESC; -- Final order by salary (descending)
- The inner query sorts by department_id first (required to match DISTINCT ON) and then by salary descending (DESC) to find the employee with the highest salary in each department.
- The outer query uses ORDER BY sub.salary DESC to sort the final results by employee salary in descending order (highest to lowest).
The subquery with GROUP BY and MAX is suitable when you want to retrieve the "maximum" value (like the most recent date or highest value) of a specific column within each distinct group. However, if several rows in a group tie on that maximum, joining back on it returns all of them.
SELECT c.customer_name, o.product_name, o.order_date
FROM customers c
JOIN (
SELECT customer_id, product_name, MAX(order_date) AS max_order_date
FROM orders
GROUP BY customer_id, product_name -- Group by customer and product
) AS sub
ON c.customer_id = sub.customer_id
JOIN orders o -- Join back to orders to fetch the matching row
ON o.customer_id = sub.customer_id
AND o.product_name = sub.product_name -- Join on both customer and product
AND o.order_date = sub.max_order_date; -- Match on maximum date
- The subquery groups rows by customer_id and product_name.
- It uses MAX(order_date) to find the maximum order date (most recent) within each group.
- The outer query joins back to the orders table to retrieve the specific row matching the maximum date.
Window Function ROW_NUMBER():
This approach uses the ROW_NUMBER() window function to assign a unique row number within each distinct group based on the specified ORDER BY criteria. You can then filter for rows with a row number of 1 (the first occurrence).
SELECT c.customer_name, sub.product_name, sub.order_date
FROM customers c
JOIN (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders
) AS sub
ON c.customer_id = sub.customer_id
WHERE sub.row_num = 1; -- Filter for rows with row number 1 (first occurrence)
- The subquery uses ROW_NUMBER() with PARTITION BY customer_id and ORDER BY order_date DESC to assign a row number for each customer, starting with 1 for the most recent order.
- The outer query joins back to the customers table and filters for rows where row_num is 1, selecting the first occurrence for each customer.
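Because the row number is an ordinary column, the same pattern extends beyond the first row, which DISTINCT ON cannot do. The sketch below keeps the two most recent orders per customer; the sample rows are made up, and a CTE built from VALUES stands in for the orders table:

```sql
-- Made-up sample rows standing in for the orders table.
WITH orders (customer_id, product_name, order_date) AS (
    VALUES
        (1, 'lamp',  DATE '2024-01-10'),
        (1, 'desk',  DATE '2024-02-15'),
        (1, 'chair', DATE '2024-03-20'),
        (2, 'rug',   DATE '2024-01-05')
)
SELECT customer_id, product_name, order_date, row_num
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_date DESC) AS row_num
    FROM orders
) AS ranked
WHERE row_num <= 2               -- keep the two most recent orders per customer
ORDER BY customer_id, row_num;

-- Result:
--  customer_id | product_name | order_date | row_num
--            1 | chair        | 2024-03-20 |       1
--            1 | desk         | 2024-02-15 |       2
--            2 | rug          | 2024-01-05 |       1
```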
LIMIT with ORDER BY (Limited to Top N)
If you only need the top N rows of the whole result based on an ORDER BY clause, you can use LIMIT with ORDER BY. However, LIMIT applies to the entire result set, not to each distinct group, so this approach cannot return one row per group.
SELECT *
FROM (
SELECT customer_id, product_name, order_date
FROM orders
ORDER BY customer_id, order_date DESC -- Sort by customer and order date DESC
) AS sub
LIMIT 1; -- Retrieve only the first row (top 1) of the whole sorted result
- The subquery sorts the data by customer_id and order_date descending.
- The outer query uses LIMIT 1 to retrieve only the first row of the sorted result. That row is the most recent order of the customer with the lowest customer_id, not one row per customer.
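Since LIMIT applies to the whole result set rather than to each group, a per-customer variant needs the LIMIT to run once per customer. One way to sketch that is a LATERAL subquery, a different technique from the plain LIMIT query above. The sample rows and customer names here are made up:

```sql
-- Made-up sample rows standing in for the customers and orders tables.
WITH customers (customer_id, customer_name) AS (
    VALUES (1, 'Ada'), (2, 'Grace')
),
orders (customer_id, product_name, order_date) AS (
    VALUES
        (1, 'lamp', DATE '2024-01-10'),
        (1, 'desk', DATE '2024-02-15'),
        (2, 'rug',  DATE '2024-01-05')
)
SELECT c.customer_name, latest.product_name, latest.order_date
FROM customers c
CROSS JOIN LATERAL (
    -- Evaluated once per customer row, so LIMIT 1 is applied per customer.
    SELECT o.product_name, o.order_date
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY o.order_date DESC
    LIMIT 1
) AS latest
ORDER BY latest.product_name;

-- Result:
--  customer_name | product_name | order_date
--  Ada           | desk         | 2024-02-15
--  Grace         | rug          | 2024-01-05
```

With CROSS JOIN LATERAL, customers that have no orders drop out of the result.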
Choosing the Right Method:
- Use the subquery with GROUP BY and MAX when you specifically want the maximum value within each group (e.g., most recent date, highest value).
- Consider ROW_NUMBER() for more flexibility in filtering based on row number within each distinct group.
- Use LIMIT with ORDER BY if you only require the top N rows overall and don't need a row from every group.
- DISTINCT ON with a different ORDER BY in an outer query might be the most efficient choice if performance is a concern and you need both distinct groups and a specific order within those groups.