How Database Indexes Really Work
InfluxData sponsored this post.
I have a growing love of databases that leads me to ask a lot of questions about how they work, and my recent obsession is database indexes.
Previously, I knew that if I wanted a particular column or field to be faster than the rest of them, I indexed it. That was as much as my brain could handle when I was first learning to code, but growing as a developer means expanding my knowledge of the fundamentals, like what exactly a database index is and why it exists.
What Is a Database Index?
Imagine we’re walking through San Francisco looking for our friend Chris. We know she lives in San Francisco, but we don’t know her address. While each building has a unique address that allows us to find it easily, we need a way to tie the person we’re looking for, Chris, with her unique address. In this case, we have to knock on every door in San Francisco until we find her. That’s not very efficient (and probably not a good idea). But if we had a directory tying Chris to her address, 123 Nunya Lane, we could walk straight there.
This is the same principle as a database index. A database index is a type of data structure, like an array or a hash. It’s just one way we can organize data. In this example, we would have an index of names that pointed to addresses.
So why do we need a data structure inside of our database, which is, you know, a big data structure?
Why Do Databases Need Indexes?
We keep so much stuff in our databases — literally anything we think we might need later, from user credentials all the way to the latitude and longitude of the pizza being delivered to our house. Without an index, the database is stuck knocking on every door in San Francisco or searching through every record in a linear fashion. Sometimes, this works just fine. Then again, some databases store hundreds of millions of records, so searching linearly could take ten steps or 525,600 or 300 million. We have to consider the potential number of steps. Without indexes, the database can end up with extremely slow queries as it searches each record for a match, which can then cause a buildup of waiting queries. Latency and overall response time would increase, and anyone waiting on the results of those queries either has to get a hobby, or more likely, use a different application.
What Should Be Indexed?
When deciding to add indexes to our database, we need to consider our data. Indexing every column or field can also have negative effects. If we create ten indexes, writing a single record to the database turns into 11 writes: one to the database, and one to each of the indexes (assuming that record includes all of the indexed columns/fields). As a guiding principle, we want to index the data that is looked up most frequently. The cost of writing to the index is offset by the improved performance of a significant number of our database queries.
What Is the Difference Between Relational Database Indexes and NoSQL Database Indexes?
In relational databases, indexes are created by column. We can choose any column or even a combination of columns to create our index.
NoSQL databases can have indexes, too! There is far less convention in the world of non-relational databases because of the variety of databases, but most of them have excellent docs on how to index data. I like InfluxDB’s indexing in which data inserted as a tag is indexed and data inserted as a field is not, so I don’t have to think about the index more than that if I don’t want to.
Database indexes are a fundamental part of understanding how our databases spend time and resources, and creating them encourages us to understand more about our applications and the data they produce. I feel smarter already.
InfluxData is a sponsor of The New Stack.
Original artwork by Katy Farmer.