This is one of the most common questions asked by developers who write SQL queries against PostgreSQL. There are multiple ways a sub-select or lookup can be framed in a SQL statement. The PostgreSQL optimizer is very smart at optimizing queries, and many queries can be rewritten/transformed internally for better performance.
Let’s discuss the topic with an example, using the schema created by pgbench.
Note: For those not familiar with pgbench, it is a micro benchmarking tool shipped with PostgreSQL. A sample pgbench schema can be initialized with some data as follows:
pgbench -i -s 10
For this example, I have updated the branch balance of a couple of branches:
UPDATE pgbench_branches SET bbalance = 4500000 WHERE bid IN (4, 7);
INclusion Queries
The SQL challenge for this example is: find the number of accounts per branch from pgbench_accounts for those branches where the branch-level balance is greater than zero. This query can be written in four different ways per the ANSI SQL standard.
1. Using IN Clause
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
2. Using ANY Clause
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid = ANY (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
3. Using EXISTS Clause
SELECT count(aid), bid FROM pgbench_accounts
WHERE EXISTS
  (SELECT bid FROM pgbench_branches
   WHERE bbalance > 0
   AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
4. Using INNER JOIN
SELECT count(aid), a.bid
FROM pgbench_accounts a
JOIN pgbench_branches b ON a.bid = b.bid
WHERE b.bbalance > 0
GROUP BY a.bid;
While writing the query, one might assume that EXISTS and INNER JOIN are better because they can use all the logic and optimization for joining two tables, while the IN and ANY clauses need to deal with subqueries. However, PostgreSQL (at least PG 10 and above) is smart enough to produce the same execution plan for all four options!
All of the above queries generate the same execution plan:
 HashAggregate  (cost=31132.65..31132.75 rows=10 width=12) (actual time=279.625..279.626 rows=2 loops=1)
   Group Key: a.bid
   ->  Hash Join  (cost=1.15..30132.65 rows=200000 width=8) (actual time=63.686..242.956 rows=200000 loops=1)
         Hash Cond: (a.bid = b.bid)
         ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.012..86.250 rows=1000000 loops=1)
         ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.016..0.016 rows=2 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                     Filter: (bbalance > 0)
                     Rows Removed by Filter: 8
 Planning Time: 0.257 ms
 Execution Time: 279.703 ms
(12 rows)
Note: Parallel execution was suppressed for better readability and a simpler execution plan. Even with parallel execution enabled, all the queries produce the same execution plan.
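For reference, one way to suppress parallelism is at the session level (a sketch using the standard max_parallel_workers_per_gather setting; the post does not say exactly how the plans were captured):

```sql
-- Disable parallel workers for this session so EXPLAIN ANALYZE shows a
-- simple, serial plan; RESET restores the default afterwards.
SET max_parallel_workers_per_gather = 0;

EXPLAIN ANALYZE
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;

RESET max_parallel_workers_per_gather;
```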
So can we conclude that we can write the query however we are comfortable and PostgreSQL’s intelligence will take care of the rest? Wait! Things can go differently if we take the exclusion scenario.
Exclusion Queries
The SQL challenge becomes: find the number of accounts per branch from pgbench_accounts, excluding those branches where the branch-level balance is greater than zero.
So the four ways to write the query become:
1. Using NOT IN
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid NOT IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
2. Using <> ALL
SELECT count(aid), bid FROM pgbench_accounts
WHERE bid <> ALL (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
3. Using NOT EXISTS
SELECT count(aid), bid FROM pgbench_accounts
WHERE NOT EXISTS
  (SELECT bid FROM pgbench_branches
   WHERE bbalance > 0
   AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
4. Using LEFT JOIN and IS NULL
SELECT count(aid), a.bid
FROM pgbench_accounts a
LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0
WHERE b.bid IS NULL
GROUP BY a.bid;
The “NOT IN” and “<> ALL” forms produce execution plans with subqueries (SubPlan). They are, respectively:
 HashAggregate  (cost=31395.13..31395.23 rows=10 width=12) (actual time=395.297..395.299 rows=8 loops=1)
   Group Key: pgbench_accounts.bid
   ->  Seq Scan on pgbench_accounts  (cost=1.13..28895.13 rows=500000 width=8) (actual time=0.042..250.086 rows=800000 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 200000
         SubPlan 1
           ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                 Filter: (bbalance > 0)
                 Rows Removed by Filter: 8
 Planning Time: 0.197 ms
 Execution Time: 395.363 ms
(11 rows)
and
 HashAggregate  (cost=601394.00..601394.10 rows=10 width=12) (actual time=731.987..731.989 rows=8 loops=1)
   Group Key: pgbench_accounts.bid
   ->  Seq Scan on pgbench_accounts  (cost=0.00..598894.00 rows=500000 width=8) (actual time=0.041..579.264 rows=800000 loops=1)
         Filter: (SubPlan 1)
         Rows Removed by Filter: 200000
         SubPlan 1
           ->  Materialize  (cost=0.00..1.14 rows=2 width=4) (actual time=0.000..0.000 rows=2 loops=1000000)
                 ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                       Filter: (bbalance > 0)
                       Rows Removed by Filter: 8
 Planning Time: 0.203 ms
 Execution Time: 732.142 ms
(12 rows)
Meanwhile, NOT EXISTS and LEFT JOIN produce the same execution plan, without a sub-plan:
 HashAggregate  (cost=41245.15..41245.25 rows=10 width=12) (actual time=500.193..500.195 rows=8 loops=1)
   Group Key: a.bid
   ->  Hash Anti Join  (cost=1.15..37245.15 rows=800000 width=8) (actual time=0.041..344.845 rows=800000 loops=1)
         Hash Cond: (a.bid = b.bid)
         ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.013..110.645 rows=1000000 loops=1)
         ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.018..0.018 rows=2 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.011..0.012 rows=2 loops=1)
                     Filter: (bbalance > 0)
                     Rows Removed by Filter: 8
 Planning Time: 0.248 ms
 Execution Time: 500.266 ms
(12 rows)
A direct hash anti join between the tables is the smartest way to answer this query, and this stands as a strong reason for recommending the EXISTS or JOIN syntax. So the general rule of thumb favoring EXISTS/JOINs holds good.
But wait! Did we see a better execution time with the NOT IN clause, even though it used a sub-plan? Yes. PostgreSQL did an excellent optimization here, preparing a hash of the sub-plan result: NOT (hashed SubPlan 1). So PostgreSQL has a good understanding of how to deal with an IN clause, which matches the way many people naturally think and write queries. But in this case the sub-plan returned very few rows (two). The same happens even if the subquery returns a few hundred rows.
What if the subquery returns a large number of rows, say a few hundred thousand? Let’s try a simple example:
CREATE TABLE t1 AS
SELECT * FROM generate_series(0, 500000) id;

CREATE TABLE t2 AS
SELECT (random() * 4000000)::integer id FROM generate_series(0, 4000000);

ANALYZE t1;
ANALYZE t2;

EXPLAIN SELECT id FROM t1
WHERE id NOT IN (SELECT id FROM t2);
In this case, the execution plan is:
 Seq Scan on t1  (cost=0.00..2583268004.52 rows=250000 width=4)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..9333.01 rows=400001 width=4)
           ->  Seq Scan on t2  (cost=0.00..5770.01 rows=400001 width=4)
(5 rows)
In this case, the execution plan switches to materializing the result of the sub-plan, and the estimated cost jumps to 2,583,268,004.52! (With PostgreSQL default settings, if the number of rows from t2 is less than approximately 100k, it uses the hashed sub-plan as we discussed.)
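The approximate 100k-row cutoff is not a fixed constant: the planner hashes the sub-plan only when it estimates that the hash table will fit in work_mem. As a rough session-level sketch (the exact value needed depends on your data):

```sql
-- With a larger work_mem, the planner can afford the hashed sub-plan again
-- for bigger subquery results; RESET restores the configured default.
SET work_mem = '256MB';
EXPLAIN SELECT id FROM t1 WHERE id NOT IN (SELECT id FROM t2);
RESET work_mem;
```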
This materialize-based plan results in substantial performance degradation. So the IN clause works great only when the sub-plan selects a small number of rows. (Also beware that NOT IN behaves surprisingly with NULLs: a single NULL in the subquery result makes the predicate return no rows at all, which is another reason to prefer NOT EXISTS when the column is nullable.)
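One way around the problem is to rewrite the NOT IN as NOT EXISTS, which lets the planner use a hash anti join no matter how many rows the subquery returns. A sketch of the equivalent rewrite, assuming t1.id and t2.id are non-null as in the example:

```sql
-- Equivalent anti-join form: the planner can use a Hash Anti Join here
-- instead of probing a materialized sub-plan result for every row of t1.
EXPLAIN SELECT id FROM t1
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id);
```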
The catch is that during development there are usually only a few rows in the tables, so everything looks fine; as the number of rows increases, the execution plan drifts, which can cause big performance issues in live production.
Is There More Complexity We Should Be Aware Of?
Yes. For instance, datatype conversions can happen when we write the query in a different way.
For example, a statement like:
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY (ARRAY['M','F']);
results in an implicit datatype conversion of the field’s values to text:
 Seq Scan on emp  (cost=0.00..1.04 rows=2 width=43) (actual time=0.023..0.026 rows=3 loops=1)
   Filter: ((gen)::text = ANY ('{M,F}'::text[]))
Please note the datatype conversion: (gen)::text. On a big table, this conversion has overhead, and casting the column also prevents a plain index on it from being used. PostgreSQL does a better job with the IN clause:
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ('M','F');

 Seq Scan on emp  (cost=0.00..1.04 rows=3 width=43) (actual time=0.030..0.034 rows=3 loops=1)
   Filter: (gen = ANY ('{M,F}'::bpchar[]))
Even though the IN clause is converted into an ANY clause internally, there is no datatype conversion of the “gen” field. Instead, the specified values ‘M’ and ‘F’ are converted into bpchar, which is the internal equivalent of CHAR.
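If the ANY/ARRAY form is preferred, the implicit conversion can be avoided by casting the array to the column’s type instead of letting the column be cast to text. A sketch, assuming emp.gen is a CHAR column as the bpchar filter above suggests:

```sql
-- Casting the array to bpchar[] keeps the comparison on gen's native type,
-- yielding the same filter as the IN form: (gen = ANY ('{M,F}'::bpchar[]))
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY (ARRAY['M','F']::bpchar[]);
```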
Summary
My intention in writing this blog post is not to favor any particular way of writing a query, but to shed some light on where things can go wrong and what should be considered.
In general, I suggest to developers that the key to writing a good SQL statement is to follow a step-by-step process:
- First, make a list of tables from which the data should be retrieved.
- Then think about how to JOIN those tables.
- Think about how to have the minimum records participating in the join condition.
- Avoid thinking in terms of how to break the logic into subqueries.
- Never assume that a query is fine just because it performs well with a small amount of data in the table.
- Use EXPLAIN to understand what is going on in the background.
In general, EXISTS and a direct JOIN of tables often produce good results. PostgreSQL optimizes the IN clause to a hashed sub-plan in many cases, and IN can even yield a better plan and execution in some specific situations. Again, everything depends on how a query is rewritten/transformed internally, so it is worth investing time in rewriting queries for better performance.