donderdag 26 februari 2015

Deploying indices to improve database performance

Part of my work at Odoo involves dealing with support questions on a regular basis. One of the common requests is how to improve database performance. Database tuning is a complex subject, and this post won't deal with all the intricacies of this field. But I'll outline a strategy that can yield significant performance gains by simply adding some well chosen indices. The idea is to offer admins without deep database knowledge the information required to speed up their queries without having to wade through reams and reams of documentation.

For starters, I assume the standard low-hanging fruit has already been dealt with. For example, tuning work_mem and shared_buffers to your particular DB load (assuming PostgreSQL since that's what Odoo uses) should already be done before looking any further.

That being said, some queries may still be slow because it involves searching on columns that haven't been indexed. An index in a database serves much the same purpose as an index in a book: it allows you to look up a term by looking at the index and finding the page where the term is defined, rather than having to read the entire book until you find the word in question. The same applies to databases: adding an index on a column will speed up searches involving this column by allowing the database server to home in on the correct values without having to scan the entire table.

So where do we start? Well, first of all, you need to measure which queries are performing badly. For this we need to gather information on all our queries, and pass this through a tool like pgbadger. The details of this setup are well documented elsewhere; I'll simply refer to the pgbadger documentation, or this nice tutorial.

Once the pgbadger report is generated, the interesting bits (for this post) are located in the "Top" menu. The "Time Consuming Queries (N)" and to a lesser extent "Slowest Individual Queries" are the first ones that should be looked at. For the slowest individual queries, the parameter is simple: just look at the duration, and if that's way too high, identify target columns and create indices on them if necessary. For the time consuming queries, both the average duration and the max. duration are parameters to look at before deciding whether a fix is worth it.

For my latest customer, I saw slowest individual queries lasting up to 12 minutes, and time consuming queries with an average duration of several seconds, up to 4m8s for the slowest of the bunch. Clearly a lot of room for improvement.

Let's take this last one to see how it could be improved. The tool had the entire query on one line, which makes it hard to read. But by using a SQL formatter (an online tool can be found here), we can make the code a lot more readable:

SELECT   l.account_id                                                         AS id,
         COALESCE ( Sum ( l.credit ) , 0 )                                    AS credit,
         COALESCE ( Sum ( l.debit ) , 0 ) - COALESCE ( Sum ( l.credit ) , 0 ) AS balance,
         (
                SELECT
                       CASE
                              WHEN currency_id IS NULL THEN 0
                              ELSE COALESCE ( Sum ( l.amount_currency ) , 0 )
                       END
                FROM   account_account
                WHERE  id IN ( l.account_id ) ) AS foreign_balance,
         COALESCE ( Sum ( l.debit ) , 0 )       AS debit
FROM     account_move_line l
WHERE    l.account_id IN ( ... )
AND      l.state <> ''
AND      l.period_id IN
         (
                SELECT id
                FROM   account_period
                WHERE  fiscalyear_id IN ( ... ) )
AND      l.move_id IN
         (
                SELECT id
                FROM   account_move
                WHERE  date >= ''
                AND    date <= '' )
AND      l.journal_id       IN ( ... )
AND      l.cost_analytic_id IN ( ... )  -- This was a list of thousands of zeros and a false
AND      l.account_id IN ( ... )
GROUP BY l.account_id;

What we do here is find all columns involved in a WHERE clause. Now the first WHERE is in the inner SELECT statement:

WHERE id IN (l.account_id)) AS foreign_balance,

This one's not very helpful, as id is a primary key and automatically indexed. The same happens for columns that are part of a UNIQUE constraint, because this constraint requires fast lookup at creation time, to verify uniqueness.

The next WHERE however, involves more columns:
  • l.account_id IN ( ... )
  • l.state <> ''
  • l.period_id IN ( subquery 1 )
  • l.move_id IN ( subquery 2 )
  • l.journal_id IN ( ... )
  • l.cost_analytic_id IN ( ... )
  • l.account_id IN ( ... )
Looking at the database, we can quickly determine whether l, which is an alias for account_move_line in this query, has indices defined on those columns already. For this particular customer, I found that indices were lacking for both state and cost_analytic_id. Therefore, my advice for this particular query starts with adding the following two indices to this table:
CREATE INDEX account_move_line_cost_analytic_id_index ON account_move_line USING btree (cost_analytic_id);
CREATE INDEX account_move_line_state_index ON account_move_line USING btree (state);
A combined index (see example below) may offer even more gains, but there's a limit to the number of columns that can be involved in an index before it becomes ridiculous. A combined index on 7 columns definitely falls in this category, so we're skipping multi-column indices for this particular query.

Looking further, there are two subqueries. Sometimes, they may yield performance gains as well when analyzed. In this case, we have the following:

SELECT id
FROM   account_period
WHERE  fiscalyear_id IN ( ... )
and
SELECT id
FROM   account_move
WHERE  date >= ''
AND    date <= ''

The first one only uses fiscalyear_id for account_period. This index was missing, so:
CREATE INDEX account_period_fiscalyear_id_index ON account_period USING btree (fiscalyear_id);
The second one uses the date column on account_move. Looking at his DB, I saw this index is already present, so no need defining it again.

Another interesting query is the following:
SELECT Min (depreciated_value)
FROM   account_asset_depreciation_line
WHERE  move_check = TRUE
       AND depreciation_date > '' :: DATE
       AND asset_id = 0;
This one was executed a total of 25,683 times, with an average runtime of 901ms, and a max. runtime of 1.67s. Compared to some of the other queries, this is a rather small one. But it is still the third entry in the Time Consuming Queries table, clocking in at 6h25m43s total runtime.

What's interesting about this one is that it searches on multiple columns, and it's a big contributor to the performance problems. Instead of indexing the individual columns, we can improve performance even more by creating a combined index on all 3 columns:
CREATE INDEX aadl_depreciation_date_asset_id_move_check_index ON account_asset_depreciation_line USING btree (depreciation_date,asset_id,move_check);
The benefit of having a multi-column index is that all searches involving either all of the columns, or the first n, will be sped up. The order in which you define these columns matters: if you add a combined index on columns a and b, searches involving both a and b (combined by an AND) will be sped up. Searches on a alone will likewise be sped up, albeit less than an index on only a would achieve (the index becomes bigger, so takes more time to scan). However, a combined index on a and b, in that order, will do nothing to speed up searches on column b alone. To determine the best order, one would have to look at all queries over that table, involving any of those columns, and find the distribution of those columns over the entire set of queries. Since all top queries involving this table use all 3 columns, this case was simple: the order doesn't matter.

Had we created 3 individual indices instead, it would still be faster than using no indices at all. But if the query is sufficiently common a multi-column index is justifiable.


For more information, the PostgreSQL documentation can be consulted. Or you can just do a Google search on "postgresql index" and "postgresql multi-column index" to home in on the relevant bits directly.

Furthermore, while indices can have a dramatic effect on performance, there are many more factors which contribute to database performance. Explaining all the intricacies of database tuning, or index optimisation, falls outside the scope of this post. More tips can be found here and here. But I hope that this at least offers some entry points for the non-DBA to start improving their database performance before needing to delve deeper into the documentation.

Geen opmerkingen:

Een reactie posten