Simply put:
I prefer using surrogate keys because natural keys are by default a subject to change which is a bad behavior for a row identifier.
But let’s dig a bit deeper into each key type to see why this is. Here’s a little table with column names that tell us what kind of a key each column is.
Surrogate keys
A surrogate key is a row identifier that has no connection to the data attributes in the row but simply makes the whole row unique. And that property is also the downside of it. Because it has no connection to the data attributes we can have two rows with the exact same data in all columns except the key column. This is usually handled at the application side and is an acceptable downside.
An example of a surrogate key is an integer identity or a GIUD unique identifier. I’ve never seen another data type being used as a surrogate key successfully. Both have their pros and cons though.
GIUD unique identifier
GUID is globally unique 16 byte long data type that can have 2128 different values. This makes it ideal for scenarios with multiple server moving data from one to another like replication.
However for a key 16 bytes is really a lot. This causes less data to be available on a single data page which in turn causes extra IO activity because it has to retrieve more data pages. Another issue about it is that is causes perfect page splits in a clustered index because it has random 100% selectivity in it’s entire data type range.
Integer identity
Integer identity is either 4 byte INT with range from -2,147,483,648 to 2,147,483,647 or 8 byte BIGINT with range from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. In 99.9% of cases this range is split in half because the default setting in SQL Server is to start any integer data type from 0. As this is a surrogate key this makes no sense and there’s no reason it shouldn’t start from the min value.
It is a small data type which gives it the advantage of having more data in the data pages thus needing less IO for the same amount of data. Unlike the GIUD unique identifier the integer identity has ever increasing 100% selectivity in it’s entire data type range. This makes it a perfect candidate for a clustered because it doesn’t cause page splits. If it actually is an appropriate candidate for a clustered index is a different matter.
Its downside is that it is not ideal for multi server scenarios although it can be done by using another tinyint column identifying a location and making it a covering row identifier over ID and LocationId columns.
And remember: Never tie any business logic to the surrogate key other than simple CRUD operations.
Natural Keys
A natural key is a row identifier composed of data that uniquely describes data using its own attributes. An example of a natural key is social security number or other government issued number.
However this presents a huge problem from the physical database implementation point of view. In most databases a row identifier is usually also the basis for the clustered index and non-clustered indexes. But natural keys are by definition a subject to change. When the clustered index key is changed ALL indexes have to be rebuilt because non-clustered indexes contain the full key of the clustered index. So every time the natural key, which is also a clustered index changes, all indexes have to be rebuilt. And this is not including changing the actual data type or it’s size, jut the key value.
At this point someone might say: Yes Mladen you’re right about the theory of this but how many times have you seen the Natural key really change? Well so far I’ve seen it 2 times both with heavy consequences. It was 2 times too many.
Natural Key Fail Case 1:
It was a standard customer, product, order type of application. The key in this case was the 7 char long customer ID. It was a mix of first 3 letters of the customer name plus 4 numbers that also had some business meaning. The company got acquired by another company and a new customer numbering was introduced. Every key in that database had to be changed. Due to fully breaking changes to the database the whole application had to be modified and the store went offline for 3 months loosing the company a lot of profit. All this wouldn’t have happened if they had used surrogate keys.
Natural Key Fail Case 2:
This one was even more far reaching. In Slovenia (my home country) we have something called a Tax ID. This is an ID that is unique for companies and individuals so every person and every company has one for tax purposes. Many systems in Slovenia used it as the natural never changing key which sounded like a reasonable thing at the time. And it was so for over 30 years. Applications came and went. But in 2004 Slovenia entered into the European Union. So we had to modify the TaxId to European standards which means that every application using it had to be changed. I know of at least one company that went out of business because of this change. Again had they used a surrogate key the only change would be the length of the TaxId column.
Because of all this I’ve come to the prefer the surrogate keys in majority of cases.
Hopefully this gives you some insight why surrogates are in my opinion better suited as row identifiers. Although whichever you choose is still a matter of common sense and your business problem. The answer is always “It depends”.