Explaining the Explain

Explain? Explain the Explain?!

Problem

Recently, I had to work through a PostgreSQL query going through performance issues, but only under specific conditions. In this case, we were using Django to build querysets from various tables to return an intersection of data.

For example, if we were managing a "to-do" list, but within the list we could have different filters and sections, from "job-related", "grocery-related", and so on. These filters and sections only sometimes applied, depending on how you want to use your list, and could be turned on/off. For our sake, these sections come with pre-determined items, such as "read Pragmatic Programmer", "go for a run", or "buy eggs", and filters would be only viewing "job-related", or more/less.

The query was taking a long time when our list had various items in it, sometimes it would be part of section, sometimes it wasn't, but the way we were fetching data was like so:

  • Grab the user's to-do items
  • Grab all of the available filters
  • Grab all of the available sections
  • Start checking which of the to-do items are filtered and not filtered as part of filters
  • Start checking with of the to-do items are part of which sections, and are they supposed to show
  • Return the user's to-dos

At first glance it makes sense, we want to make sure we only show the appropriate items! But digging deeper, there was some red flags.

How we get there

Normally I turn to Django and start grepping through querysets to find any obvious flags. Such as not prefetching on related tables, selecting specific values, unnecessary joins, not utilizing the queryset for doing calculations, etc. But this time, I turned to the q u e r y p l a n.

PostgreSQL's EXPLAIN is powerful, but super confusing for someone who has never used it. You can use it by wrapping a SQL query with EXPLAIN ...your query... and it will return an estimated plan (this is important!) of the steps it will take to execute and return/"do" your query.

EXPLAIN
SELECT *
FROM user_todos

Would return a single step, SEQ SCAN with some magic numbers surrounding it. You wonder, what the heck is a SEQ SCAN? No problem, this is a Node which every query plan consists of. These nodes can be nested, are executed from the inner-most node outwards. There are A LOT of types of nodes, which can be found in the source code here: Source

Great, we have some idea of what the query is doing, and we're still figuring what it all means, but what if we want to make sure this estimated query plan meets reality? No problem! We can utilize one of EXPLAIN's parameters, specifically ANALYZE which will actually execute the query.

Small pro tip: if you want to test a query that mutates your database, wrap your EXPLAIN ANALYZE in a transaction.

Moving on. We can now use EXPLAIN ANALYZE {query} to execute our problematic query and figure out what's all happening. With each node we can see a cost=0.00...1234, which represents STARTUP COST...TOTAL COST. My first intuition was to go find the "costliest" step which turned out fruitful. Without too much detail, I ran into another node type, the NESTED LOOP, which loops through every record in one set, and tries to find matching records in another set. It was taking 40% of our query! Digging more into the nodes that also existed here were multiple SEQ SCANS across the same table (why!).

Stepping out

I wanted to confirm my hypothesis and if this dreary NESTED LOOP was really the problem, as they aren't inherently bad, but when mixed with poor join conditions and scanning the same table on multiple similar conditions, it definitely stood out. I extracted out the behemoth subquery from its home and saw a pattern, we had tons of these JOIN on x.id = y.id on the same tables in different subqueries! This poor table. Finding the related Django querysets wasn't hard to work through, but sometimes it obfuscates what's really happening, which is why stepping out of your abstraction and diving into the query plan is helpful.

This wasn't meant to be a tutorial or anything, just my experience using query plans for the first time, there's still lots I don't understand, and I feel like I'm only using 10% of its power, but I'll get there!