Improving Query Performance Related to Schedules


Strong foundations are essential for any building, and databases play that role for applications. As the bridge between users and data, applications rely heavily on efficient database performance. This story, from a few years back, exemplifies that perfectly. We identified a slow query related to "schedules" by using Application Performance Monitoring (APM).

Let's simulate this case using PostgreSQL. The application needs to fetch schedules for a particular "livestreaming" in several situations, such as

  1. an ongoing schedule (the focus of this post)
  2. some upcoming schedules
  3. both the current and upcoming schedules

CREATE TABLE livestreamings (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255) NOT NULL
);

CREATE TABLE livestreaming_schedules (
  id SERIAL PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  livestreaming_id BIGINT NOT NULL,
  start_time TIMESTAMP WITH TIME ZONE NOT NULL,
  end_time TIMESTAMP WITH TIME ZONE NOT NULL
);

Let's seed the tables. The cross join below generates one schedule per 20-minute slot for each of the 200 livestreamings, which works out to roughly 900 thousand rows in total.

WITH numbers AS (
  SELECT * FROM generate_series(1, 200)
)
INSERT INTO livestreamings (name)
SELECT 'VIDIO - ' || numbers.generate_series FROM numbers;


WITH dates AS (
  SELECT * FROM generate_series(
    NOW() - INTERVAL '2 months', 
    NOW() + INTERVAL '5 days', 
    '20 minutes'::interval
  )
)
INSERT INTO livestreaming_schedules (title, livestreaming_id, start_time, end_time)
SELECT
  MD5(random()::text) as title,
  livestreamings.id as livestreaming_id,
  dates.generate_series as start_time,
  dates.generate_series + INTERVAL '19 minutes' as end_time
FROM dates, livestreamings;
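As a rough sanity check on the seeded volume (not part of the original investigation): each of the 200 livestreamings gets one row per 20-minute slot over a bit more than two months, so the total should land somewhere around 900 thousand rows. The exact count depends on when you run the seed.

```sql
-- Approximate expectation: ~66 days * 72 slots/day * 200 livestreamings ≈ 950,000
SELECT COUNT(*) AS total_rows,
       COUNT(*) / COUNT(DISTINCT livestreaming_id) AS rows_per_livestreaming
FROM livestreaming_schedules;
```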

Since the query will filter on livestreaming_id, start_time, and end_time, the intuitive move is to create a composite index covering all three:

CREATE INDEX CONCURRENTLY index_livestreaming_schedules_on_livestreaming_start_end on livestreaming_schedules (livestreaming_id, start_time, end_time);

An ongoing schedule is one whose start_time is already in the past while its end_time is not, so the following query will be executed.
Let's run it with EXPLAIN ANALYZE to get the execution plan:

EXPLAIN ANALYZE
SELECT *
FROM livestreaming_schedules
WHERE livestreaming_schedules.livestreaming_id = 135
  AND livestreaming_schedules.start_time <= now()
  AND livestreaming_schedules.end_time > now()
LIMIT 1;

Here is the result

 Limit  (cost=0.43..4.82 rows=1 width=61) (actual time=0.423..0.424 rows=1 loops=1)
   ->  Index Scan using index_livestreaming_schedules_on_livestreaming_start_end on livestreaming_schedules  (cost=0.43..1450.61 rows=330 width=61) (actual time=0.421..0.422 rows=1 loops=1)
         Index Cond: ((livestreaming_id = 135) AND (start_time <= now()) AND (end_time > now()))
 Planning Time: 0.357 ms
 Execution Time: 0.450 ms

The execution plan shows an Index Scan using index_livestreaming_schedules_on_livestreaming_start_end, and the total time is good, so everything looks fine.

Over time the data grows, so let's simulate that with a much larger volume.

INSERT INTO livestreamings (id, name) VALUES (99999, 'Livestreaming with lots of schedules');

WITH dates AS (
  SELECT * FROM generate_series(
    NOW() - INTERVAL '6 years', 
    NOW() + INTERVAL '7 days',
    '5 minutes'::interval
  )
)
INSERT INTO livestreaming_schedules (title, livestreaming_id, start_time, end_time)
SELECT
  MD5(random()::text) as title,
  livestreamings.id as livestreaming_id,
  dates.generate_series as start_time,
  dates.generate_series + INTERVAL '4 minutes' as end_time
FROM dates, livestreamings
WHERE livestreamings.id = 99999;

Then execute the same query

EXPLAIN ANALYZE
SELECT *
FROM livestreaming_schedules
WHERE livestreaming_schedules.livestreaming_id = 99999
  AND livestreaming_schedules.start_time <= now()
  AND livestreaming_schedules.end_time > now()
LIMIT 1;

Here is the result

 Limit  (cost=0.43..3.87 rows=1 width=61) (actual time=42.020..42.021 rows=1 loops=1)
   ->  Index Scan using index_livestreaming_schedules_on_livestreaming_start_end on livestreaming_schedules  (cost=0.43..59385.33 rows=17284 width=61) (actual time=42.017..42.018 rows=1 loops=1)
         Index Cond: ((livestreaming_id = 99999) AND (start_time <= now()) AND (end_time > now()))
 Planning Time: 0.159 ms
 Execution Time: 42.057 ms

The execution time increased significantly even though the index is still used.
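One way to see where that time goes (not shown above, but standard PostgreSQL tooling) is to add the BUFFERS option to EXPLAIN, which reports how many index and heap pages the scan had to read:

```sql
-- BUFFERS exposes the page reads behind the latency: the scan walks
-- a large slice of the index before the first matching row appears
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM livestreaming_schedules
WHERE livestreaming_schedules.livestreaming_id = 99999
  AND livestreaming_schedules.start_time <= now()
  AND livestreaming_schedules.end_time > now()
LIMIT 1;
```

A high "Buffers: shared hit=..." count on the index scan node confirms the work is happening inside the index itself, not in the table.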

Why does this happen?
index_livestreaming_schedules_on_livestreaming_start_end is a composite index with the column order (livestreaming_id, start_time, end_time), and there are far more past schedules than future ones.
The PostgreSQL documentation on multicolumn indexes explains why that matters: https://www.postgresql.org/docs/current/indexes-multicolumn.html

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.
Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned.
For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5.
Index entries with c >= 77 would be skipped, but they'd still have to be scanned through.
This index could in principle be used for queries that have constraints on b and/or c with no constraint on a — but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.

So for the condition "WHERE livestreaming_schedules.livestreaming_id = 99999 AND livestreaming_schedules.start_time <= now() ...", the scan over index_livestreaming_schedules_on_livestreaming_start_end has to walk through every entry whose start_time is in the past (six years' worth here) and only then check end_time, so this index is not actually optimal for the query.
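The skew is easy to confirm with two counts. For livestreaming 99999 the start_time predicate matches nearly all six years of rows, while the end_time predicate matches only the few days of future rows (exact counts depend on when the data was seeded):

```sql
-- Rows the (livestreaming_id, start_time, ...) index must walk through
SELECT COUNT(*) FROM livestreaming_schedules
WHERE livestreaming_id = 99999 AND start_time <= now();

-- Rows a (livestreaming_id, end_time) index would walk through
SELECT COUNT(*) FROM livestreaming_schedules
WHERE livestreaming_id = 99999 AND end_time > now();
```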

Since there are far fewer future schedules, leaning on the condition

WHERE livestreaming_schedules.livestreaming_id = 99999 AND livestreaming_schedules.end_time > now()

leaves much less data to scan, so let's add another index to match it.

CREATE INDEX CONCURRENTLY index_livestreaming_schedules_on_livestreaming_end on livestreaming_schedules (livestreaming_id, end_time);

Then check the query execution again

EXPLAIN ANALYZE
SELECT *
FROM livestreaming_schedules
WHERE livestreaming_schedules.livestreaming_id = 99999
  AND livestreaming_schedules.start_time <= now()
  AND livestreaming_schedules.end_time > now()
LIMIT 1;

Here is the result

 Limit  (cost=0.43..2.74 rows=1 width=61) (actual time=0.027..0.029 rows=1 loops=1)
   ->  Index Scan using index_livestreaming_schedules_on_livestreaming_end on livestreaming_schedules  (cost=0.43..37981.06 rows=16480 width=61) (actual time=0.025..0.026 rows=1 loops=1)
         Index Cond: ((livestreaming_id = 99999) AND (end_time > now()))
         Filter: (start_time <= now())
 Planning Time: 0.290 ms
 Execution Time: 0.053 ms
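As a possible cleanup, and assuming no other query still relies on the three-column index, it could be dropped to cut write overhead. This wasn't part of the original fix, so verify the index is actually idle (for example via pg_stat_user_indexes) before removing it:

```sql
-- Only safe once pg_stat_user_indexes shows the index is no longer scanned
DROP INDEX CONCURRENTLY index_livestreaming_schedules_on_livestreaming_start_end;
```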

With the improved performance, we're on track to ensure the backend's stability.