In last blog series we learned basics of Indexing. In this blog series part 1 we worked on why query is not using the index. Basically sometimes what happens is, when we run a query we thought it should use a particular index but it actually either use diff. index or do not use index at all and do sequential scan. So in this blogs series we will work on how postgresql choose index.
When we ask Postgres to execute a query it is executed in three steps:
- Parse – to validate the query sytax
- Planner – to plan how the query is executed like whether to do index scan or sequential scan or bitmap scan
- Executor – it will finally execute the planned tasks
Here we are more concerned about the Planner part, the entire purpose why the query needs planning is to reduce the total I/O (reading from disks/memory).
In last part we will work on example where Postgres use the index on the order by column and why it does that.
In this part we will work on why Postgres is doing Bitmap Scan rather than Sequential Scan.
First Lets understand a bit about Index Scan and Bitmap Scan:
- Index Scan
- Postgres scan the index first and then go to the table and scan the table
- In this it does random scan of the table row by row
- It is faster when the index data is sorted in a way which is required by the query.
- Bitmap Scan
- Postgres scan the index completely and mark all the pages which contains a matching value rather than checking the rows
- After completed the index scan (mostly entire index) and have list of all the pages which contains the data
- It then scan all the tables pages (marked one) and again check the validation conditions in the rows.
- It is chosen over index when cost of random scan with index is higher.
How we can get a hint whether Index scan is preferred or Bitmap Scan
- This is based on the correlation between the order in data and its physical representation(in filesystem)
- Lets understand it by example.
- First create a table:
- create table orders as select s as orderno , md5(random()::text) as orderitem , now() as order_created from generate_Series(1,1000000) s;
- You can we we create a table with a generated series 1-1000000 that is generated sequentially and dumped in the same order as well.
- Postgresql calculate certain statistics about the column order (increasing or decreasing) with table physical order increasing or decreasing and this data can be see using the following query:
- First analyze the table : ANALYZE orders;
- Now run: SELECT attname, correlation FROM pg_stats WHERE tablename = ‘orders1’;
- you will see correlation for orderno column is 1
- and this correlation varies from -1 to 1 – desc to asc.
- First create a table:
- When the correlation is high Postgres prefers Index scan but when correlation is less postgresql prefer Bitmap.
Lets create a case where Index Scan will be preferred:
First create table and indexes:
create table orders as select s as orderno , md5(random()::text) as orderitem , now() as order_created from generate_Series(1,1000000) s;
create index orderno_idx on orders(orderno);
create index order_created_idx on orders(order_created);
Now lets check the correlation value :
SELECT attname, correlation FROM pg_stats WHERE tablename = 'orders';
attname | correlation
---------------+-------------
orderno | 1
orderitem | 0.00561977
order_created | 1
(3 rows)
As you can see correlation value is 1 , there is high chance Postgresql will use Index Scan:
Now lets query :
explain analyze select * from orders where orderno < 80000 limit 100 ;
QUERY PLAN
------------------------------------------------------------------------------------
------------------------------------------------
Limit (cost=0.42..4.22 rows=100 width=45) (actual time=7.494..7.602 rows=100 loops
=1)
-> Index Scan using orderno_idx on orders (cost=0.42..3028.31 rows=79879 width=
45) (actual time=7.492..7.579 rows=100 loops=1)
Index Cond: (orderno < 80000)
Planning time: 0.156 ms
Execution time: 7.653 ms
(5 rows)
As we can see here the Postgres is using Index Scan.
Lets create a case where Bitmap Scan will be preferred:
First create table and indexes:
create table orders as select random()*10000 as orderno , md5(random()::text) as orderitem , now() as order_created from generate_Series(1,10000000) s;
create index ord_idx on orders(orderno);
create index order_created_idx on orders(order_created);
Here one important change is rather than using s (generated series) we are generating a random number (random()*10000) such that correlation should be less.
Lets see the coorelation:
analyze orders;
SELECT attname, correlation FROM pg_stats WHERE tablename = 'orders5';
attname | correlation
---------------+-------------
orderno | 0.00156549
orderitem | 0.000700192
order_created | 1
Here we can see that the correlation is very less.
Now lets query the table:
explain analyze select * from orders where orderno >9000;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on orders (cost=19145.75..135021.51 rows=1022621 width=49) (actual time=105.159..1042.678 rows=1002305 loops=1)
Recheck Cond: (orderno > '9000'::double precision)
Rows Removed by Index Recheck: 5746548
Heap Blocks: exact=37001 lossy=66089
-> Bitmap Index Scan on ord_idx (cost=0.00..18890.09 rows=1022621 width=0) (actual time=99.446..99.447 rows=1002305 loops=1)
Index Cond: (orderno > '9000'::double precision)
Planning time: 0.158 ms
Execution time: 1067.465 ms
explain analyze select * from orders where orderno >0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Seq Scan on orders5 (cost=0.00..228093.00 rows=10000000 width=49) (actual time=0.032..1013.239 rows=10000000 loops=1)
Filter: (orderno > '0'::double precision)
Planning time: 0.282 ms
Execution time: 1257.460 ms
(4 rows)
Now we can see the system used Bitmap Scan.
Also one more important thing to remember here is that system will sometimes will switch to Sequential Scan also rather than Bitmap Scan when the number of rows selected are very high.
So guys in this tutorial we learned about how Postgres select index primarily Bitmap vs Index.
[…] next blog we will figure out why postgres in doing bitmap scan rather than index […]
LikeLike
is index only scan preferred over index scan ? Should one ensure to use index only scan ? if Yes How ?
LikeLike
Index only scans are like index scans only , its just that when the columns required in the result is available via the index only then index only scan is used and when column required in the result are not present in index then index scan will be used .
Secondly one should always prefer to do index only scan if the columns required in the query are available via index.
Like in the example case if i query like explain analyze select orderno from orders where orderno > 999990 limit 100; then index only scan will be used
LikeLike