This is one of the most common questions asked by developers who write SQL queries against PostgreSQL. There are multiple ways a sub-select or lookup can be framed in a SQL statement. The PostgreSQL optimizer is very smart at optimizing queries, and many queries can be rewritten/transformed internally for better performance.
Let’s discuss the topic with an example, using the schema created by pgbench.
Note: For those not familiar with pgbench, it is a micro benchmarking tool shipped with PostgreSQL. A sample pgbench schema can be initialized with some data as follows:
pgbench -i -s 10
For this example, I have updated the branch balance of a couple of branches:
UPDATE pgbench_branches SET bbalance = 4500000 WHERE bid IN (4, 7);
INclusion Queries
The SQL challenge for this example is: find the number of accounts per branch from pgbench_accounts for those branches where the branch-level balance is greater than zero. This query can be written in four different ways per the ANSI SQL standard.
1. Using IN Clause
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
2. Using ANY Clause
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid = ANY (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
3. Using EXISTS Clause
SELECT count(aid), bid FROM pgbench_accounts
WHERE EXISTS
  (SELECT bid FROM pgbench_branches
   WHERE bbalance > 0
   AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
4. Using INNER JOIN
SELECT count(aid), a.bid
FROM pgbench_accounts a
JOIN pgbench_branches b ON a.bid = b.bid
WHERE b.bbalance > 0
GROUP BY a.bid;
While writing the query, one might assume that EXISTS and INNER JOIN are better because they can use all the logic and optimization for joining two tables, while the IN and ANY clauses need to deal with subqueries. However, PostgreSQL (at least PG 10 and above) is smart enough to produce the same execution plan for all four options!
All of the above queries generate the same execution plan:
 HashAggregate  (cost=31132.65..31132.75 rows=10 width=12) (actual time=279.625..279.626 rows=2 loops=1)
   Group Key: a.bid
   ->  Hash Join  (cost=1.15..30132.65 rows=200000 width=8) (actual time=63.686..242.956 rows=200000 loops=1)
         Hash Cond: (a.bid = b.bid)
         ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.012..86.250 rows=1000000 loops=1)
         ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.016..0.016 rows=2 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                     Filter: (bbalance > 0)
                     Rows Removed by Filter: 8
 Planning Time: 0.257 ms
 Execution Time: 279.703 ms
(12 rows)
Note: Parallel execution was suppressed for better readability and a simpler execution plan. Even with parallel execution enabled, all the queries produce the same execution plan.
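For reference, one way to suppress parallelism is at the session level (a sketch using the standard max_parallel_workers_per_gather setting; the post does not say exactly how the plans were captured):

```sql
-- Disable parallel workers for this session so EXPLAIN ANALYZE shows a
-- simple, serial plan; RESET restores the default afterwards.
SET max_parallel_workers_per_gather = 0;

EXPLAIN ANALYZE
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;

RESET max_parallel_workers_per_gather;
```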
So can we conclude that we can write the query however we are comfortable and PostgreSQL’s intelligence will take care of the rest? Wait! Things can go differently if we take the exclusion scenario.
Exclusion Queries
The SQL challenge becomes: find the number of accounts per branch from pgbench_accounts, excluding those branches where the branch-level balance is greater than zero.
So the four ways to write the query become:
1. Using NOT IN
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid NOT IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
2. Using <> ALL
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid <> ALL (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
3. Using NOT EXISTS
SELECT count(aid), bid FROM pgbench_accounts
WHERE NOT EXISTS
  (SELECT bid FROM pgbench_branches
   WHERE bbalance > 0
   AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
4. Using LEFT JOIN and IS NULL
SELECT count(aid), a.bid
FROM pgbench_accounts a
LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0
WHERE b.bid IS NULL
GROUP BY a.bid;
The “NOT IN” and “<> ALL” forms produce execution plans with subqueries (SubPlan). They are, respectively:
 HashAggregate  (cost=31395.13..31395.23 rows=10 width=12) (actual time=395.297..395.299 rows=8 loops=1)
   Group Key: pgbench_accounts.bid
   ->  Seq Scan on pgbench_accounts  (cost=1.13..28895.13 rows=500000 width=8) (actual time=0.042..250.086 rows=800000 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 200000
         SubPlan 1
           ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                 Filter: (bbalance > 0)
                 Rows Removed by Filter: 8
 Planning Time: 0.197 ms
 Execution Time: 395.363 ms
(11 rows)
and
 HashAggregate  (cost=601394.00..601394.10 rows=10 width=12) (actual time=731.987..731.989 rows=8 loops=1)
   Group Key: pgbench_accounts.bid
   ->  Seq Scan on pgbench_accounts  (cost=0.00..598894.00 rows=500000 width=8) (actual time=0.041..579.264 rows=800000 loops=1)
         Filter: (SubPlan 1)
         Rows Removed by Filter: 200000
         SubPlan 1
           ->  Materialize  (cost=0.00..1.14 rows=2 width=4) (actual time=0.000..0.000 rows=2 loops=1000000)
                 ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                       Filter: (bbalance > 0)
                       Rows Removed by Filter: 8
 Planning Time: 0.203 ms
 Execution Time: 732.142 ms
(12 rows)
Meanwhile, NOT EXISTS and LEFT JOIN produce the same execution plan, without a sub-plan:
 HashAggregate  (cost=41245.15..41245.25 rows=10 width=12) (actual time=500.193..500.195 rows=8 loops=1)
   Group Key: a.bid
   ->  Hash Anti Join  (cost=1.15..37245.15 rows=800000 width=8) (actual time=0.041..344.845 rows=800000 loops=1)
         Hash Cond: (a.bid = b.bid)
         ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.013..110.645 rows=1000000 loops=1)
         ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.018..0.018 rows=2 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.011..0.012 rows=2 loops=1)
                     Filter: (bbalance > 0)
                     Rows Removed by Filter: 8
 Planning Time: 0.248 ms
 Execution Time: 500.266 ms
(12 rows)
A direct hash anti join between the tables is the smartest way to answer this query, and this stands as a strong reason for recommending the EXISTS or JOIN syntax. So the general rule of thumb favoring EXISTS/JOINs holds good.
But wait! Did we see a better execution time with the NOT IN clause, even though it used a sub-plan? Yes. PostgreSQL did an excellent optimization here, preparing a hash of the sub-plan result: NOT (hashed SubPlan 1). So PostgreSQL has a good understanding of how to deal with an IN clause, which matches the way many people naturally think and write queries. But in this case the sub-plan returned very few rows (two). The same happens even if the subquery returns a few hundred rows.
What if the subquery returns a large number of rows, say a few hundred thousand? Let’s try a simple example:
CREATE TABLE t1 AS
SELECT * FROM generate_series(0, 500000) id;

CREATE TABLE t2 AS
SELECT (random() * 4000000)::integer id FROM generate_series(0, 4000000);

ANALYZE t1;
ANALYZE t2;

EXPLAIN SELECT id FROM t1
WHERE id NOT IN (SELECT id FROM t2);
In this case, the execution plan is:
 Seq Scan on t1  (cost=0.00..2583268004.52 rows=250000 width=4)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..9333.01 rows=400001 width=4)
           ->  Seq Scan on t2  (cost=0.00..5770.01 rows=400001 width=4)
(5 rows)
In this case, the execution plan switches to materializing the result of the sub-plan, and the estimated cost jumps to 2,583,268,004.52! (With PostgreSQL default settings, if the number of rows from t2 is less than approximately 100k, it uses the hashed sub-plan as we discussed.)
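The approximate 100k-row cutoff is not a fixed constant: the planner hashes the sub-plan only when it estimates that the hash table will fit in work_mem. As a rough session-level sketch (the exact value needed depends on your data):

```sql
-- With a larger work_mem, the planner can afford the hashed sub-plan again
-- for bigger subquery results; RESET restores the configured default.
SET work_mem = '256MB';
EXPLAIN SELECT id FROM t1 WHERE id NOT IN (SELECT id FROM t2);
RESET work_mem;
```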
This materialize-based plan results in substantial performance degradation. So the IN clause works great only when the sub-plan selects a small number of rows. (Also beware that NOT IN behaves surprisingly with NULLs: a single NULL in the subquery result makes the predicate return no rows at all, which is another reason to prefer NOT EXISTS when the column is nullable.)
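One way around the problem is to rewrite the NOT IN as NOT EXISTS, which lets the planner use a hash anti join no matter how many rows the subquery returns. A sketch of the equivalent rewrite, assuming t1.id and t2.id are non-null as in the example:

```sql
-- Equivalent anti-join form: the planner can use a Hash Anti Join here
-- instead of probing a materialized sub-plan result for every row of t1.
EXPLAIN SELECT id FROM t1
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id);
```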
The catch is that during development there are usually only a few rows in the tables, so everything looks fine; as the number of rows increases, the execution plan drifts, which can cause big performance issues in live production.
Is There More Complexity We Should Be Aware Of?
Yes. For instance, datatype conversions can happen when we write the query in a different way.
For example, a statement like:
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY (ARRAY['M','F']);
results in an implicit datatype conversion of the field’s values to text:
 Seq Scan on emp  (cost=0.00..1.04 rows=2 width=43) (actual time=0.023..0.026 rows=3 loops=1)
   Filter: ((gen)::text = ANY ('{M,F}'::text[]))
Please note the datatype conversion: (gen)::text. On a big table, this conversion has overhead, and casting the column also prevents a plain index on it from being used. PostgreSQL does a better job with the IN clause:
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ('M','F');

 Seq Scan on emp  (cost=0.00..1.04 rows=3 width=43) (actual time=0.030..0.034 rows=3 loops=1)
   Filter: (gen = ANY ('{M,F}'::bpchar[]))
Even though the IN clause is converted into an ANY clause internally, there is no datatype conversion of the “gen” field. Instead, the specified values ‘M’ and ‘F’ are converted into bpchar, which is the internal equivalent of CHAR.
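If the ANY/ARRAY form is preferred, the implicit conversion can be avoided by casting the array to the column’s type instead of letting the column be cast to text. A sketch, assuming emp.gen is a CHAR column as the bpchar filter above suggests:

```sql
-- Casting the array to bpchar[] keeps the comparison on gen's native type,
-- yielding the same filter as the IN form: (gen = ANY ('{M,F}'::bpchar[]))
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY (ARRAY['M','F']::bpchar[]);
```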
Summary
My intention in writing this blog post is not to favor any particular way of writing a query, but to shed some light on where things can go wrong and what should be considered.
In general, I suggest to developers that the key to writing a good SQL statement is to follow a step-by-step process:
- First, make a list of tables from which the data should be retrieved.
- Then think about how to JOIN those tables.
- Think about how to have the minimum records participating in the join condition.
- Avoid thinking in terms of how to break the logic into subqueries.
- Never assume that a query is fine just because it performs well with a small amount of data in the table.
- Use EXPLAIN to understand what is going on in the background.
In general, EXISTS and a direct JOIN of tables often produce good results. PostgreSQL optimizes the IN clause to a hashed sub-plan in many cases, and IN can even yield a better plan and execution in some specific situations. Again, everything depends on how a query is rewritten/transformed internally, so it is worth investing time in rewriting queries for better performance.