In my databases I’ve gotten into the habit of giving every table I make an auto-incrementing integer primary key named id, so that I have a unique lookup for any particular row.
Is this considered a bad idea? Are there any drawbacks to doing it this way? Sometimes I’ll have multiple indices, like id, profile_id, subscriptions, where id is the unique identifier, profile_id links to the foreign id of a Profile table, etc.
Or are there scenarios where you don’t want to add such a field?
It’s never a bad idea to have a guaranteed unique row identifier. Well, I shouldn’t say never, but the overwhelming majority of the time it’s a good idea.
Theoretical downsides include an extra index to maintain and extra storage space used. That has never been reason enough for me not to use one.
TL;DR: Use UUIDs instead of auto-increment if you don’t already have a unique way of identifying each row.
I disagree with all the previous answers. There are many reasons why it is a bad idea to add an auto-increment field to every table.
If you have a table where there are no obvious keys, an auto-increment field seems like a good idea. After all, you don’t want to select * from blog where body = '[10000 character string]'. You’d rather select * from blog where id = 42. I’d argue that in most of these cases, what you really want is a unique identifier, not a sequential unique identifier. You probably want to use a universally unique identifier instead.
There are functions in most databases to generate unique identifiers (UUID() in MySQL, gen_random_uuid() in PostgreSQL, NEWID() in SQL Server). These let you generate data in multiple databases, on different machines, at any time, with no network connection between them, and still merge the data with zero conflicts. This makes it much easier to set up multiple servers and even data centers, as with microservices, for example.
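As a minimal sketch of what that looks like (assuming PostgreSQL 13 or later, where gen_random_uuid() is built in; the blog_post table and its columns are invented for illustration):

-- PostgreSQL: let the database generate a random (version-4) UUID per row
CREATE TABLE blog_post (
    post_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    body    text NOT NULL
);

-- rows created on different servers can later be merged without key clashes
INSERT INTO blog_post (body) VALUES ('[10000 character string]')
RETURNING post_id;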
This also avoids attackers guessing URLs to pages they shouldn’t have access to. If there’s a https://example.com/user/1263, there’s probably a https://example.com/user/1262 as well. This could allow automation of a security exploit against the user profile page.
There are also a lot of cases where a UUID column is useless or even harmful. Let’s say you have a social network, with a users table and a friends table. The friends table contains two user-id columns and an auto-increment field. You want 3 to be friends with 5, so you insert 3,5 into the database. The database adds an auto-increment id and stores 1,3,5. Somehow, user 3 clicks the “add friend” button again. You insert 3,5 into the database again, the database adds an auto-increment id and inserts 2,3,5. But now 3 and 5 are friends with each other twice! That’s a waste of space, and if you think about it, so is the auto-increment column. All you need to see whether a and b are friends is to select the row with those two values. Together, they are a unique row identifier. (You would probably also want to write some logic to make sure 3,5 and 5,3 are deduplicated.)
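A sketch of that keyless friends design (PostgreSQL-flavoured; the table and column names are made up, and the CHECK constraint is one way of handling the 3,5 versus 5,3 deduplication mentioned above):

-- minimal users table so the example is self-contained
CREATE TABLE users (
    user_id bigint PRIMARY KEY,
    name    varchar(100) NOT NULL
);

CREATE TABLE friendship (
    user_a bigint NOT NULL REFERENCES users (user_id),
    user_b bigint NOT NULL REFERENCES users (user_id),
    -- store the pair in canonical order so (3,5) and (5,3) are the same row
    CHECK (user_a < user_b),
    -- the pair itself is the unique row identifier; no auto-increment needed
    PRIMARY KEY (user_a, user_b)
);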
There are still cases where sequential ids can be useful, like when building a URL shortener, but mostly (even with the URL shortener) a randomly generated unique id is what you really want to use instead.
Auto-increment keys have mostly advantages.
But there are some possible drawbacks:
- If you have a business key, you have to add a unique index on that column (or columns) as well in order to enforce the business rules (see the sketch at the end of this answer).
- When transferring data between two databases, especially when the data lives in more than one table (i.e. master/detail), it’s not straightforward, since sequences are not synced between databases. You first have to build an equivalence table, using the business key as the match, to know which ID in the origin database corresponds to which ID in the target database. That shouldn’t be a problem when transferring data from/to isolated tables, though.
- Many enterprises have ad-hoc, graphical, point-and-click, drag-and-drop reporting tools. Since auto-increment IDs are meaningless, these users will find it hard to make sense of the data outside “the app”.
- If you accidentally modify the business key, chances are you will never recover that row, because you no longer have anything humans can use to identify it. That caused a fault in the Bitcoin platform once.
- Some designers add an ID to a join table between two tables when the PK should simply be composed of the two foreign IDs. Obviously, if the join table spans three or more tables, an auto-increment ID makes sense, but then you have to add a unique key on the combination of FKs, where applicable, to enforce the business rules.
Here’s a Wikipedia article section on the disadvantages of surrogate keys.
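To make the first bullet concrete, a rough sketch (MySQL-flavoured; the employee table and its employee_number business key are invented for the example). The surrogate key does not replace the uniqueness rule on the business key, so you end up maintaining both:

CREATE TABLE employee (
    id              bigint AUTO_INCREMENT PRIMARY KEY,   -- surrogate key
    employee_number varchar(20)  NOT NULL,               -- business key
    full_name       varchar(200) NOT NULL,
    -- the business rule still has to be enforced separately
    CONSTRAINT uq_employee_number UNIQUE (employee_number)
) ENGINE=InnoDB;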
Just to be contrary, No, you do NOT need to always have a numeric AutoInc PK.
If you analyse your data carefully you often identify natural keys in the data. This is often the case when the data has intrinsic meaning to the business. Sometimes the PKs are artefacts from ancient systems that the business users utilize as a second language to describe attributes of their system. I’ve seen vehicle VIN numbers used as the primary key of a “Vehicle” table in a fleet management system for example.
However it originated, IF you already have a unique identifier, use it. Don’t create a second, meaningless primary key; it’s wasteful and may cause errors.
Sometimes you can use an AutoInc PK to generate a customer-meaningful value, e.g. policy numbers: set the start value to something sensible and apply business rules about leading zeros, etc. This is probably a “best of both worlds” approach.
When you have a small number of values that are relatively static, use values that make sense to the system user. Why use 1, 2, 3 when you could use L, C and H, where L, C and H represent Life, Car and Home in an insurance “Policy Type” context? Or, returning to the vehicle example, how about using “TO” for Toyota? It’s one less thing for users to remember, makes it less likely for them to introduce programming and user errors, and may even be a usable surrogate for a full description in management reports, making the reports simpler to write and maybe quicker to generate.
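A small sketch of such a lookup table, using the insurance codes above (portable SQL; the table and column names are my own):

CREATE TABLE policy_type (
    code        char(1) PRIMARY KEY,    -- meaningful to users: L, C, H
    description varchar(50) NOT NULL
);

INSERT INTO policy_type (code, description) VALUES
    ('L', 'Life'),
    ('C', 'Car'),
    ('H', 'Home');

-- a policy row now reads naturally without a join for most purposes,
-- e.g. WHERE policy.policy_type = 'H'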
A further development of this is probably “a bridge too far”, and I don’t generally recommend it, but I’m including it for completeness because you may find a good use for it: use the description as the primary key. For rapidly changing data this is an abomination. For very static data that is reported on All The Time, maybe not. I’m just mentioning it so it’s sitting there as a possibility.
I DO use AutoInc PKs; I just engage my brain and look for better alternatives first. The art of database design is making something meaningful that can be queried quickly. Having too many joins hinders this.
EDIT
One other crucial case where you do not need an auto-generated PK is a table that represents the intersection of two other tables. To stick with the car analogy: a Car has 0..n Accessories, and each Accessory can be found on many Cars. To represent this you create a Car_Accessory table containing the PKs from Car and Accessory, plus other relevant information about the link (dates, etc.).
What you don’t (usually) need is an AutoInc PK on this table; it will only ever be accessed via the Car (“tell me what accessories are on this car”) or via the Accessory (“tell me what cars have this accessory”).
Many tables already have a natural unique id. Do not add another unique id column (auto-increment or otherwise) to these tables; use the natural unique id instead. If you add another unique id, you essentially have a redundancy (duplication or dependency) in your data, which goes against the principles of normalization. One unique id is dependent on the other for accuracy, which means they have to be kept perfectly in sync at all times in every system that manages these rows. It’s just another fragility in your data integrity that you don’t really want to have to manage and validate over the long term.
Most tables these days don’t really need the very minor performance boost that an additional unique id column would give (and sometimes it even detracts from performance). As a general rule in IT, avoid redundancy like the plague! Resist it everywhere it is suggested to you. It’s anathema. And take heed of the quote: “Everything should be as simple as possible, but not simpler.” Don’t have two unique ids where one will suffice, even if the natural one seems less tidy.
On larger systems, the ID is a consistency booster; use it almost everywhere. In this context, individual (per-table) primary keys are NOT recommended; they are expensive at the bottom line (read on for why).
Every rule has an exception, so you might not need an integer auto-increment ID on staging tables used for export/import and on similar one-way or temporary tables. You would also prefer GUIDs instead of IDs on distributed systems.
Many answers here suggest taking an existing unique key. Well, even if it has 150 characters? I don’t think so.
Now my main point:
It looks like the opponents of the auto-increment integer ID are talking about small databases with up to 20 tables. There they can afford an individual approach to each table.
BUT once you have an ERP with 400+ tables, having an integer auto-increment ID everywhere (except in the cases mentioned above) just makes great sense. You do not rely on other unique fields, even if they are present and secured for uniqueness.
- You benefit from a universal, time-saving, effort-saving, easy-to-remember convention.
- In most cases you JOIN tables without needing to check what the keys are.
- You can have universal code routines working with your integer auto-increment column.
- You can extend your system with new tables or user plugins not foreseen before, simply by referring to the IDs of existing tables. They are already there from the beginning, at no extra cost to add them later.
On larger systems, it can be worth ignoring the minor benefits of those individual primary keys and consistently using an integer auto-increment ID in most cases. Using existing unique fields as primary keys may save a few bytes per record, but the additional storage or indexing time poses no issue in today’s database engines. Actually, you lose much more money and resources in the wasted time of developers and maintainers. Today’s software should be optimized for the time and effort of programmers, which the consistent-ID approach satisfies much better.
It is not good practice to create superfluous designs; i.e., it is not good practice to always have an auto-increment int primary key when one is not needed.
Let’s see an example where one is not needed.
You have a table for articles: it has an int primary key id and a varchar column named title.
You also have a table full of article categories: id int primary key, varchar name.
One row in the Articles table has an id of 5 and a title “How to cook goose with butter”. You want to link that article with the following rows in your Categories table: “Fowl” (id: 20), “Goose” (id: 12), “Cooking” (id: 2), “Butter” (id: 9).
Now, you have 2 tables: articles and categories. How do you create the relationship between the two?
You could have a table with 3 columns: id (primary key), article_id (foreign key), category_id (foreign key). But now you have something like:

| id | a_id | c_id |
|----|------|------|
| 1  | 5    | 20   |
| 2  | 5    | 12   |
| 3  | 5    | 2    |
A better solution is to have a primary key that is made up of 2 columns.
| a_id | c_id |
|------|------|
| 5    | 20   |
| 5    | 12   |
| 5    | 2    |
This can be accomplished by doing:
create table articles_categories (
article_id bigint,
category_id bigint,
primary key (article_id, category_id)
) engine=InnoDB;
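With that composite key in place, inserting the same article/category pair twice fails at the database level instead of silently creating a duplicate row, for example:

INSERT INTO articles_categories (article_id, category_id) VALUES (5, 20);  -- first link succeeds
INSERT INTO articles_categories (article_id, category_id) VALUES (5, 20);  -- second attempt fails with a duplicate-key error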
Another reason not to use an auto increment integer is if you are using UUIDs for your primary key.
UUIDs are by their definition unique, which accomplishes the same thing that using unique integers does. They also have their own added benefits (and cons) over integers. For instance, with a UUID, you know that the unique string you’re referring to points to a particular data record; this is useful in cases where you do not have 1 central database, or where applications have the ability to create data records offline (then upload them to the database at a later date).
In the end, you shouldn’t think about primary keys as a thing in themselves; think of them in terms of the function they perform. Why do you need primary keys? To be able to uniquely identify specific sets of data in a table using a field that will not change in the future. Do you need a particular column called id to do this, or can you base this unique identification on other (immutable) data?
Or are there scenarios where you don’t want to add such a field?
Sure.
First of all, there are databases with no auto-increment columns (e.g., Oracle, which is certainly not one of the smallest contenders around, traditionally uses sequences instead). This should be a first indication that not everybody likes or needs them.
More important, think about what the ID actually is: it is a primary key for your data. If a table has a different primary key, then you do not need an ID and should not have one. For example, a table (EMPLOYEE_ID, TEAM_ID) (where each employee can be in several teams concurrently) has a clearly defined primary key consisting of those two IDs. Adding an auto-increment ID column, which would then also be a primary key for this table, would make no sense at all. Now you are lugging two primary keys around, and the first word in “primary key” should give you a hint that you really should have only one.
I usually use an “identity” column (auto-incrementing integer) when defining new tables for “long-lived” data (records I expect to insert once and keep around indefinitely, even if they end up “logically deleted” by setting a bit field).
There are a few situations I can think of when you don’t want to use them, most of which boil down to scenarios where one table on one instance of the DB cannot be the authoritative source for new ID values:
- When incremental IDs would be too much information for a potential attacker. Using an identity column for “public-facing” data services makes you vulnerable to the “German Tank Problem”: if record 10234 exists, it stands to reason that records 10233, 10232, etc. exist, back to at least record 10001, and it’s then easy to check for records 1001, 101 and 1 to figure out where your identity column started. Version-4 GUIDs, composed mainly of random data, break this incremental behavior by design, so that just because one GUID exists, a GUID created by incrementing or decrementing a byte of it does not necessarily exist, making it harder for an attacker to use a service intended for single-record retrieval as a dump tool. There are other security measures that can better restrict access, but this helps.
- In M:M cross-reference tables. This one’s kind of a gimme but I’ve seen it done before. If you have a many-to-many relationship between two tables in your database, the go-to solution is a cross-reference table containing foreign key columns referencing each table’s PK. This table’s PK should virtually always be a compound key of the two foreign keys, to get the built-in index behavior and to ensure uniqueness of the references.
- When you plan on inserting and deleting in bulk on this table a lot. Probably the biggest disadvantage of identity columns is the extra hoopla you have to go through when inserting rows from another table or query while wanting to keep the original table’s key values. You have to turn “identity insert” on (however that’s done in your DBMS), manually make sure the keys you’re inserting are unique, and then, when you’re done with the import, set the identity counter in the table’s metadata to the maximum value present (there’s a sketch of this after the list). If this operation happens a lot on this table, consider a different PK scheme.
- For distributed tables. Identity columns work great for single-instance databases, failover pairs, and other scenarios where one database instance is the sole authority on the entire data schema at any given time. However, there’s only so big you can go and still have one computer be fast enough. Replication or transaction log shipping can get you additional read-only copies, but there’s a limit to that solution’s scale as well. Sooner or later you’ll need two or more server instances handling inserts of data and then synchronizing with each other. When that situation comes, you’ll want a GUID field instead of an incremental one, because most DBMSes come pre-configured to use a portion of the GUIDs they generate as an instance-specific identifier, then generate the rest of the identifier either randomly or incrementally. In either case, the odds of a collision between two GUID generators are nil, while an identity integer column is a nightmare to manage in this situation (you can go even/odd by offsetting seeds and setting the increment to 2, but if one server sees more activity than the other you’re wasting IDs).
- When you have to enforce uniqueness across multiple tables in the DB. It’s common in accounting systems, for instance, to manage the General Ledger (with a row for each credit or debit of every account that has ever occurred, so it gets very big very quickly) as a sequence of tables each representing one calendar month/year. Views can then be created to hook them together for reporting. Logically, this is all one very big table, but chopping it up makes the DB’s maintenance jobs easier. However, it presents the problem of how to manage inserts into multiple tables (allowing you to begin logging transactions in the next month while still closing out the last) without ending up with duplicate keys. Again, GUIDs instead of identity integer columns are the go-to solution, as the DBMS is designed to generate these in a truly unique way, so that a single GUID value will be seen once and only once in the entire DBMS.
There are workarounds that allow use of identity columns in these situations, as I’ve hopefully mentioned, but in most of these, upgrading from the identity integer column to a GUID is simpler and solves the problem more completely.
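A rough sketch of the bulk-copy steps from the “inserting and deleting in bulk” bullet above (SQL Server syntax; the OrdersArchive and Orders tables, their columns, and the date cutoff are hypothetical):

-- allow explicit values in the identity column while copying rows
SET IDENTITY_INSERT dbo.OrdersArchive ON;

INSERT INTO dbo.OrdersArchive (OrderId, CustomerId, OrderDate)
SELECT OrderId, CustomerId, OrderDate
FROM   dbo.Orders
WHERE  OrderDate < '2020-01-01';

SET IDENTITY_INSERT dbo.OrdersArchive OFF;

-- reset the identity counter to the maximum value now present in the column
DBCC CHECKIDENT ('dbo.OrdersArchive', RESEED);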
An auto-incremented (identity) primary key is a good idea, with the caveat that it is meaningless outside the context of the database and its immediate clients. For example, if you transfer and store some of the data in another database and then proceed to write different data to both database tables, the ids will diverge; i.e., data with an id of 42 in one database won’t necessarily match the data with an id of 42 in the other.
Given this, if you still need to be able to identify rows uniquely outside of the database (and you frequently do), then you must have a different key for this purpose. A carefully selected business key will do, but you’ll often end up needing a large number of columns to guarantee uniqueness. Another technique is to have an Id column as an auto-increment clustered primary key and another uniqueidentifier (guid) column as a non-clustered unique key, for the purpose of uniquely identifying the row wherever it exists in the world. The reason you still have an auto-increment key in this case is that it’s more efficient to cluster and index the auto-incrementing key than it is to do the same with a guid.
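A sketch of that two-key layout (SQL Server syntax; the Customer table and its columns are illustrative):

CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1) NOT NULL,
    RowGuid    uniqueidentifier NOT NULL CONSTRAINT DF_Customer_RowGuid DEFAULT NEWID(),
    FullName   nvarchar(200) NOT NULL,
    -- the narrow, ever-increasing key keeps the clustered index cheap
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId),
    -- the guid travels with the row wherever it is copied or merged
    CONSTRAINT UQ_Customer_RowGuid UNIQUE NONCLUSTERED (RowGuid)
);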
One case where you might not want an auto-incrementing key would be a many-to-many table where the primary key is a compound of the Id columns of two other tables (you could still have an auto-incrementing key here, but I don’t see the point of it).
One other question is the datatype of the auto-incremented key. Using an Int32 gives you a large, but relatively limited range of values. Personally I frequently use bigint columns for the Id, in order to practically never need to worry about running out of values.
As other people have made the case for an incrementing primary key, I will make one for a GUID:
- It is guaranteed to be unique
- You can save a trip to the database for data in your application. (For a types table, for instance, you can store the GUID in the application and use it to retrieve the record. If you use an identity, you need to query the database by name; I have seen many an application do this to get the PK and then query again later for the full details.)
- It is useful for hiding data. www.domain.com/Article/2 lets me know you only have two articles, whereas www.domain.com/article/b08a91c5-67fc-449f-8a50-ffdf2403444a tells me nothing.
- You can merge records from different databases easily.
- MSFT uses GUIDs for identity.
Edit: Duplicate Point
As a principle of good design, every table should have a reliable way to uniquely identify a row. Although that is what a primary key is for, unique row identification doesn’t always require a primary key. Adding a primary key to every table is not a bad practice, since it provides for unique row identification, but it may be unnecessary.
To maintain reliable relationships between the rows of two or more tables, you need to do it via foreign keys, hence the need for primary keys in at least some tables. Adding a primary key to every table makes it easier to extend your database design when it comes time to add new tables or relationships to existing data. Planning ahead is always a good thing.
As a basic principle (hard rule perhaps), the value of a primary key should never change throughout the life of its row. It’s wise to assume that any business data in a row is subject to change over its lifetime, so any business data will be a poor candidate for a primary key. This is why something abstract like an auto-incremented integer is often a good idea. However, auto-incremented integers do have their limitations.
If your data will only ever have a life within your database, auto-incremented integers are fine. But, as has been mentioned in other answers, if you ever want your data to be shared, synchronized, or otherwise have a life outside your database, auto-incremented integers make poor primary keys. A better choice is a GUID (aka UUID, “universally unique id”).
The question, and many of the answers, miss the important point that all the natural keys for each table reside solely in the logical schema for the database, while all the surrogate keys for each table reside solely in the physical schema. Other answers discuss only the relative merits of integer versus GUID surrogate keys, without discussing why surrogate keys are properly used, and when.
BTW: let us avoid the ill-defined and imprecise term “primary key”. It is an artifact of pre-relational data models that was first co-opted (unwisely) into the relational model, and then co-opted back into the physical domain by various RDBMS vendors. Its use serves only to confuse the semantics.
Note from the relational model that, in order for the database logical schema to be in first normal form, every table must have a user-visible set of fields, known as a natural key, that uniquely identifies each row of the table. In most cases such a natural key is readily identified, but on occasion one must be constructed, whether as a tie breaker field or otherwise. However such a constructed key is always still user visible, and thus always resides in the logical schema of the database.
By contrast any surrogate key on a table resides purely in the physical schema for the database (and thus must always, both for security reasons and for maintenance of database integrity, be entirely invisible to database users). The sole reason for introducing a surrogate key is to address performance issues in the physical maintenance and use of the DB; whether those be joins, replication, multiple hardware sources for data, or other.
Since the sole reason for introducing a surrogate key is performance, let us presume that we wish it to be performant. If the performance issue at hand is joins, then we necessarily wish to make our surrogate key as narrow as possible (without getting in the way of the hardware, so short integers and bytes are usually out). Join performance relies on minimal index height, so a 4-byte integer is a natural solution. If your performance issue is insertion rate, a 4-byte integer may also be a natural solution (depending on your RDBMS’s internals). If your performance issue for a table is replication or multiple data sources, then some other surrogate key technology, be it a GUID or a two-part key (host ID + integer), may be more suitable. I am not personally a fan of GUIDs, but they are convenient.
To sum up, not all tables will require a surrogate key (of any type); they should only be used when deemed necessary for the performance of the table under consideration. Regardless of which common surrogate key technology you prefer, think carefully about the actual needs of the table before making a choice; changing the surrogate key technology choice for a table will be exhausting work. Document the key performance metric for your table so that your successors will understand the choices made.
Special Cases
- If your business requirements mandate a sequential numbering of transactions for audit (or other) purposes, then that field is not a surrogate key; it is a natural key (with extra requirements). From the documentation, an auto-incrementing integer only generates surrogate keys, so find another mechanism to generate it. Obviously some sort of monitor will be necessary, and if you are sourcing your transactions from multiple sites, then one site will be special, by virtue of being the designated host site for the monitor.
- If your table will never have more than about a hundred rows, then index height is irrelevant; every access will be a table scan. However, string comparisons on long strings will still be much more expensive than comparison of a 4-byte integer, and more expensive than comparison of a GUID.
- A table of code values keyed by a char(4) code field should be as performant as one keyed by a 4-byte integer. Although I have no proof of this, I use the assumption frequently and have never had reason to rue it.
Not only is it not good practice, it is actually described as an anti-pattern in Bill Karwin’s SQL Antipatterns book.
Not every table needs a pseudokey (a primary key with an arbitrary value rather than something that has semantic value for the model), and there’s no reason to always call it id.
This is pretty universal; otherwise you would need to validate that the key is actually unique, which would mean checking it against all of the other keys, and that would be time-consuming. Having an incremental key does get expensive as your record number nears the key’s overflow value.
I usually give the foreign-key pointers more obvious field names, like ref_{table} or something similar.
If it’s not necessary to point to a record externally, then you don’t need an id.
I wouldn’t say it should always be done. I’ve got a table here with no unique key, and it doesn’t need one: it’s an audit log. There will never be an update; queries will return all changes to whatever is being logged, but that is the best that can reasonably be done, since it takes a human to decide whether a change was wrongful. (If the code could tell, it would have disallowed the change in the first place!)
Maintaining an auto-increment counter for the primary key in your application is not a good idea, because you need to go back to the database to find the next key and increment it by one before inserting your data.
That being said, I would generally use whatever the database can provide for the primary key rather than managing it in the application. By letting the database provide it natively, it can guarantee that the key is unique where it needs to be.
Of course, not all databases support this. In that case I generally use a table that stores key buckets, with high and low ranges managed in the application (a sketch follows below). This is the most performant solution I’ve found, because you get a range of 10000 numbers and increment them within the application instance; another application instance can pick up another bucket of numbers to work with. You do need a sufficiently large primary key primitive, such as a 64-bit long.
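A minimal sketch of that bucket table (PostgreSQL-flavoured; the key_bucket table, the 'order_id' sequence name, and the bucket size of 10000 are all assumptions for illustration):

CREATE TABLE key_bucket (
    sequence_name varchar(50) PRIMARY KEY,
    next_hi       bigint NOT NULL
);

INSERT INTO key_bucket (sequence_name, next_hi) VALUES ('order_id', 0);

-- each application instance claims a bucket inside a transaction;
-- the row lock taken by the UPDATE keeps two instances from grabbing the same bucket
BEGIN;
UPDATE key_bucket SET next_hi = next_hi + 1 WHERE sequence_name = 'order_id';
SELECT next_hi FROM key_bucket WHERE sequence_name = 'order_id';
COMMIT;

-- the instance then hands out ids locally, with no further round trips:
--   (next_hi - 1) * 10000  up to  next_hi * 10000 - 1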
I don’t use UUIDs as primary keys, because the cost of building and storing them is much higher than that of incrementing a long value by one. UUIDs are also subject to the birthday paradox, in that a duplicate can theoretically arise.