Tuesday, September 27, 2005

What is a Clustered Index?

While at the local User Group meeting tonight, I was asked "What is a clustered index?" (because this question has been popping up in a lot of job interviews lately). Note to self: Start asking candidates this question during interviews....

At least from the SQL Server standpoint, you can think of a clustered index as being the physical ordering of the database records as stored on disk. Yeah, yeah, this is not technically correct, but if you start thinking in this way, then you will have the gist of what a clustered index is.

SQL Server stores data in 8 KB pages. If you can fit 1000 different rows of a particular table in one 8 KB page, then great! But, one page will contain at least one row of data.

Tables without a clustered index will not adhere to any sort of ordering. This might be similar to collecting a list of names and phone numbers by having a random collection of people fill out a single petition. In database terms, this is a heap table. If you want to find a person in the list, then you start at the beginning, and scan through the table until you find the data that you're looking for.

Now think about a phone book. Each page has many records (listings), and all of the listings are in alphabetical order by last name (and then by first name). That is, the phone book uses a clustered index (if you will) set to the last name + first name. You can use a well known search algorithm to quickly find a name in a lot less time than starting at the beginning and scanning through all of the entries.

What happens if we need to insert "Flintstone, Fred" in between "Flintrock, Gerald" and "Flintstone, W."? We would have to bump all of the records down one spot, which makes some records fall off of the current page and onto another. In other words, this is probably not something that we would want to repeat over and over again if we have a lot of names to add.

What we could do instead is to assign an incremental number to each name/number that is added to the phone book, and always put new entries at the end of the book. This identifier is the Primary Key of the entry, and because the phone book is physically sorted by this key, that means that it also serves as our clustered index. (The takeaway here is that when you assign a Primary Key to a table that does not already have a clustered index, then a new clustered index will be created by default using the PK).

How do we find people in the book, then, without resorting to table scans? Well, we would need an index to be created containing every name, and list the Primary Key value for that name. In this way, we can look up the name in the index, and then search for the Primary Key to find the entry on the page in the book, and then we can get the phone number.

Likewise, we could also create an index of every phone number, and be able to find the entry in the very same way (by searching for the Primary Key) to see who's phone number that is.

Each index entry contains the clustered index key data, so we want to keep the clustered index values small. This is one reason why auto-incrementing integer (identity) fields are popular as primary keys.

It is commonly argued that you should use GUID (uniqueidentifier) values as primary keys (and therefore, clustered indexes by default) instead of identity fields. The problem with this is that first of all, a GUID is wider than a 32-bit integer, so more disk space is required. Secondly, SQL Server's newid() function, at least, does not produce sequential GUIDs, so if you have a lot of rows to insert, then the database server will be moving around a lot of records in order to keep the clustered index sorted.

UPDATE (10/20/2005): I just learned that in SQL Server 2005, there is a newsequentialid() function that overcomes the previously mentioned issue with the newid() function. Cool!