Notes on Analyzing Database Queries

This is an interesting topic to dive into. While I personally prefer PostgreSQL most of the time, this post will mostly have my notes on MySQL side of things.

What’s index cardinality?

Cardinality literally means number of elements of a given set. For the very interested, the relation of having the same cardinality is called equinumerosity, and this is an equivalence relation on the class of all sets.

What does it have to do with the database or queries then?

Well, considering a table holds a set of rows and columns, the number of records in the table can be referred as it’s cardinality. For an index, cardinality is considered to be the number of unique values in the index.

A unique index would have cardinality equal to the number of rows in the table. But a non-unique index can have a cardinality of anywhere from 1 to the number of rows in the table, depending on how many times each index key appears in the table.

Indexes with relatively few unique values are considered to be low cardinality indexes. If a column can only take a yes or no would have a cardinality of 2. Columns like status, color, currency would have a small set of unique values, hence they are good examples to low cardinality indexes.

The lower the cardinality, the more duplicated elements you would have in a column. Thus, a column with the lowest possible cardinality would have the same value for every row.

Indexes work best for columns that have a high cardinality relative to the number of rows in the table (that is, columns that have many unique values and few duplicates).

If a column contains many different age values, an index will differentiate rows readily. An index will not help for a column that is used to record sex and contains only the two values ‘M’ and ‘F’.

If the values occur about equally, you’ll get about half of the rows whichever value you search for. Under these circumstances, the index might never be used at all, because the query optimizer generally skips an index in favor of a full table scan if it determines that a value occurs in a large percentage of a table’s rows.

Low cardinality isn’t always bad.

Say your col1 can only take two distinct values and your FOO:BAR ratio looks like below.

mysql> SELECT col1, COUNT(*) FROM test GROUP BY col1;

col1	COUNT(*)
FOO	5909
BAR	10

We've got low cardinality (with just 2 possible values) yet if we are just looking for "BAR" rows then they should be easy to find with the index.

I haz an index, why is it not used!?

Just because you have the index there it doesn’t mean that MySQL will decide to use it. MySQL can be smart enough to know about the distribution of the data and can decide that it would be much faster to just go through all of the data rather than the n% that it could return from the index lookup. When the query is restrictive enough however, it would know it will be better to use the index.

Notes on indexes

Index short values

Use smaller data types when possible. Don’t use a BIGINT column if a MEDIUMINT is large enough to hold the values you need to store. Don’t use CHAR(100) if none of your values are longer than 25 characters.
Shorter values can be compared more quickly, so index lookups are faster.
Smaller values result in smaller indexes that require less disk I/O.
With shorter key values, index blocks in the key cache hold more key values. MySQL can hold more keys in memory at once, which improves the likelihood of locating key values without reading additional index blocks from disk.
If you’re indexing a string column, specify a prefix length whenever it makes sense.

Leftmost prefixes

When you create an n-column composite index, you actually create n indexes that MySQL can use. A composite index serves as several indexes because any leftmost set of columns in the index can be used to match rows. Such a set is called a “leftmost prefix.”

Given a table with a composite index on columns named state, city, and zip, rows in the index are sorted in state, city, zip order, so they’re automatically sorted in state, city order and also in state order. This means that MySQL can take advantage of the index even if you specify only state values in a query, or only state and city values. Thus, the index can be used to search the following combinations of columns: state, city, zip or state, city or just state

MySQL cannot use the index for searches that don’t involve a leftmost prefix. So, if you search by city or zip, the index isn’t used at all.

Avoid over-indexing

Every additional index takes extra disk space and hurts performance of write operations.
Indexes must be updated and possibly reorganized when you modify the contents of your tables. The more indexes you have, the longer this will take.
MySQL considers indexes when generating an execution plan for retrievals. Creating extra indexes creates more work for the query optimizer.
It’s also possible (if unlikely) that MySQL will fail to choose the best index to use when you have too many indexes. Maintaining only the indexes you need helps the query optimizer avoid making such mistakes.

If you’re thinking about adding an index to a table that is already indexed, consider whether the index you’re thinking about adding is a leftmost prefix of an existing multiple-column index. If so, don’t bother adding the index because, in effect, you already have it. (Following the example above, if you already have an index on state, city, and zip, there is no point in adding an index on state.)

Oh also, Use the slow-query log to identify queries that are taking too long.

Understand how the query optimizer works

The query optimizer has different goals, but it will primarily aim to use the most restrictive index in order to eliminate as many rows as possible, as soon as possible. The faster it can eliminate rows from consideration, the more quickly the rows that do match your criteria can be found.

How do I help the optimizer?

Try to compare columns that have the same data type.
Try to make indexed columns stand alone in comparison expressions.
- If you use a column in a function call or as part of a more complex term in an arithmetic expression, MySQL can’t use the index because it must compute the value of the expression for every row. Sometimes this is unavoidable, but often you can rewrite a query to get the indexed column to appear by itself.
- Say you have an indexed created_date column and you do a query like: SELECT * FROM test WHERE YEAR(created_date) < 1990;
  - The expression doesn’t compare 1990 to an indexed column; it compares 1990 to a value calculated from the column, and that value must be computed for each row.
  - As a result, the index on created_date is not used because performing the query requires a full table scan.
  - You can fix this by just using a literal date like WHERE created_date < 1990-01-01
Don’t use wildcards at the beginning of a LIKE pattern if you’re really looking for the string only when it occurs at the beginning of the column, leave out the first %.
Use EXPLAIN to verify optimizer operations. EXPLAIN FORMAT=JSON is very, very cool. See the must-reads section for an awesome read about this one.
Avoid overuse of MySQL’s automatic type conversion.

if int_col is an integer column, each of these queries will return the same result:

SELECT * FROM test WHERE int_col = 7;

SELECT * FROM test WHERE int_col = '7';

BUT, the second query involves a small performance penalty for converting the integer and string to double to perform the comparison. A more serious problem is that if int_col is indexed, a comparison that involves type conversion may prevent the index from being used. Same can happen when comparing a string column to a numeric value.

There’s a lot more to touch when it comes to analyzing queries, indexes and the inner workings of the optimizer that I may come back here with more notes. I’ve listed a few great readings below for the interested.