Database indexing: the last frontier. Well, okay, not the last frontier perse…more like the frontier that I’ve been reading about a lot recently. While learning about writing about smart and efficient migrations, I stumbled upon a rabbit hole that I had to restrain myself from going down: the rabbit hole of database indexing. But this week, I allowed myself to explore and learn some more about how indexes work.
As the cat gif above might already suggest, an index in a database is much like an index in a book: a place where you can quickly look up the exact information that you need. We already know that indexing can be super helpful when it comes to application performance. Using indexes forces our database to use integers to look up rows – which are just representations of items or objects – in our database. The reason that they’re so efficient is because looking up something in a database is both fast and cheap if we use an integer to do it (using a string on the other hand, is much, much slower and more expensive). By implementing a simple index, we can speed up a single query by seconds!
But it turns out that even a single index can be complicated. And that’s because there are so many kinds of indexes available for us to use. In fact, there’s a whole world of different types of database indexes out there. Of course, knowing the options available to us when it comes to database indexing is just half the battle; the other half is knowing when it’s the right time to use all these different types of indexes. The best way to learn is by playing around with some indexing ourselves – so let’s dive in!
Indexes All Around Us
The most common index type that we have dealt with so far are single-column indexes. They work pretty much exactly as their name would suggest: they create an index on a specific column of a database. So far, almost all of the indexes we’ve generated have been indexed on a specific single column of a table.
Let’s think for a second: when we use a PostgreSQL to run a migration, the one column that always gets generated is an id
column, which is unique for every row in that database, even if the object is deleted. If we think about it, that’s an example of an index. In fact, PostgreSQL automatically generates an index for the primary key of any table, and always generates a new, unique key for each new row in the database.
One of the very first migrations we ran for our Bookstore application looked like this:
1 2 3 4 5 6 7 |
|
and generated a tiny little table, which didn’t have all that much – but it did have an index as its id
primary key column, which we can confirm by looking at our structure.sql
file:
1 2 3 4 |
|
The advantage of having an id integer
column here is that our database can very easily look up a specific row in our books
table by using the index. We can also see that there’s automatically a NOT NULL
validation that prevents any row in the database from being created without an id
. This isn’t something that we wrote – this is something that Postgres does automatically! And although it might seem pretty obvious and a bit simple right now, it’s important to note – especially since indexes can get rather complicated, rather fast.
Another form of indexing that we’ve played around with is adding our own indexes to has_many
and belongs_to
associations, often by using references
to alias those relationships:
1 2 3 4 5 |
|
And if we already have a relationship set up, we can just write a quick migration to manually add an index to our books
table:
1 2 3 4 5 |
|
This migration adds an index on genre_id
column within our books
table, which makes it very quick and easy to look up a book by it’s corresponding genre. This is still only a single-column, or independent index, because it is only dealing with a single column within a database table. But if there are single indexes, that must mean that the idea of “multiple” indexes is also must be a real thing, right?
Two Indexes Are (Sometimes) Better Than One
Indexes are easy to add; what’s harder is knowing what kind of index to add. Sometimes, a single-column index does the trick. But it’s also possible that it’s quite the right tool for the job.
Let’s say that we are doing an overhaul on our Bookstore application and scaling for size. We’ve decided that in addition to selling our own curated selection of books, we’re also going to allow new, lesser-known bookstores to sign up for our service and sell their own books. Our vendors will have their own admin panel (think vendor portal), and the bookstore staff will have a separate admin panel from which to monitor all the sales across all the vendors who are signed up for our service.
All of our admins (staff and vendors) have an user_id
. But, our staff and our vendors aren’t just simple users of our application – they have special roles, and need access to specific pages, depending on their roles. Our admins also have an admin_id
in addition to their base user_ids
, while our vendors have a vendor_id
, based on their roles as smaller-scale booksellers who will be using our software to sell their products.
Within our staff admin panel, we want our admins to be able to quickly view all sales across our signed up vendors, in addition to one specific vendors sales reports. If an admin logs into the admin panel and clicks on a specific bookseller, the panel should immediately load all the information pertinent to that specific vendor.
The first thing we might think to do is what we already know – namely, add an index on the columns we know we want to look up:
1 2 3 4 5 6 |
|
However, these indexes don’t actually do what we think they’ll do. This adds an index to the primary key column on our users
table, which will allow us to quickly look up a single User
. It also adds an index to our vendor_id
column, and allows us to look up a User
instance based on its vendor_id
, if it has one.
But here’s the rub: we’re only adding a single index in this migration. That is to say, we’re adding an index on our user_ids
, and we’re adding an index on our vendor_ids
. What we really need is an index that first sorts our data by user_id
, and then filters down to the users that match our vendor_id
.
Don’t worry, this is totally doable! We just need to index two columns, instead of one. Actually, there’s a perfect name – or set of names – for what we need: a compound index (aka a concatenated, multi-column, composite or combined index).
Ultimately, we want our database to execute a query that looks something like this:
1
|
|
so that we can avoid writing a SQL statement in our UsersController
like we have now (ew, let’s not do that):
1 2 3 4 5 6 |
|
So how are we going to manage all of this, you might ask? Well, pretty easily. Writing a compound index is almost as easy as writing a single-column index:
1 2 3 4 5 |
|
And that’s it! But what’s happening, exactly? Well, we’re still creating a single index, but we’re doing it across multiple columns. The first column (user_id
) is the primary “sort criterion”, and the second column (vendor_id
) is the secondary “sort criterion”. The important thing to remember here is that order matters. We’ll only ever look up a vendor by user_id
first, and then by vendor_id
. This makes sense if you think about it, since we’ll first want to authenticate by the user currently logged in (in other words, using the current session).
I really liked the way that Markus Winand describes two-column indexing on his blog:
The ordering of a two-column index is therefore like the ordering of a telephone directory: it is first sorted by surname, then by first name. That means that a two-column index does not support searching on the second column alone; that would be like searching a telephone directory by first name.
Compound indexes are super cool because they let you quickly look up the first field/column in a database, and then quickly look up another field/column in a database – but only within the records that were returned by the first index. At the end of the query, you’ll only have rows that satisfy parts of that AND
SQL query we wrote earlier!
Indexes Like None Other
I think we can all agree that indexing can help us narrow down a lot of rows in a database and is probably the coolest filtration system ever invented by and for developers. But that’s not all that indexes can do! They can also prevent some pretty sticky situations that we often overlook.
By now, something that we’re all familiar with is validations. One of the most common kinds of validations that we see in our Rails models all of the time is uniqueness:true
, or validates_uniqueness_of
. However, here’s something we don’t always think about (or perhaps never even realized): these validations only occur at the the ActiveRecord level. And this fact can pose some problems.
Imagine two vendors are signing up to use our Bookstore app. They both want to name their store the same thing, but we definitely don’t want that to happen. So we add this line to our Users
table:
1 2 3 4 5 |
|
Cool, so this is fine then, right? Wrong! Because we just sent out an email blast telling all these potential vendors that they could sign up for a free 60-day trial. Now, all of a sudden, tons of vendors are signing up! And it just so happens, that two of them signed up at almost the exact same second, and wanted to use the same name: Super Cool Books
(I know, I know, what are the chances, right?!)
Here’s the problem: the moment that both of these potential vendors signed up for our service, no row in our polymorphic vendors
database existed with a store_name
that corresponded to the string, Super Cool Books
. So what did ActiveRecord do? It created a new row! Actually…it created two new rows. At almost the exact same time. With the exact same store name. Uh oh.
So how can we fix this? We need to take our uniqueness
validation down another level. In fact, we need to make our database responsible for validating uniqueness. And we can do this by adding a unique index.
Again, this is pretty easy. We just need to add a unique: true
constraint on the column that we want to be able to index:
1 2 3 4 5 |
|
Now, when two vendors try to sign up to be called Super Cool Books
, only one will actually be able to have that name. As soon as an unique index is created on the store_name
column of our vendors
database, any other record that tries to be created with that index will raise a ActiveRecord::RecordNotUnique
error. This is going to be super helpful to us in avoiding inconsistant data, particularly when we know that a lot of data is going to be created at once.
The world of indexes is mostly uncharted territory. It’s also important to keep in mind that you never want to create too many indexes, and only ones that we actually need, since they will slow down the “write” time to the database (whenever we use SQL commands like UPDATE
or CREATE
, for example). But that doesn’t mean we can’t learn about the different kinds of indexes that exist and be completely wowed at how amazing databases are. See, this cat is completely floored – and quite frankly, I am too:
tl;dr?
- There are three main types of indexing: single-column, compound, and unique indexes. Compound indexes create an index on two ore more columns in a database, while unique indexes create a restraint on a single column index.
- This StackOverflow answer is the best explanation of how compound indexes actually work in terms of running a query. If you’re still confused, give it a read!
- Uniqueness validations are super cool! Read more about them over here, and check out some real-life examples of how to use them here.