Is it possible to design a distributed system to generate unique ids where there is a requirement that every generated id is guaranteed to be unique with no possibility of collision?

I didn’t think that this was possible, given the fact that there is a finite number of ids which can fit into some structure. So for instance, if we generate ids with the numbers 0-9, and store them in 32 bytes, then there are at most 10^32 possibilities for ids. This gives a very small possibility that there will be a collision, but there is still a possibility. Am I missing something, or is there a way to engineer a system such there there is guaranteed to never be a collision (i.e., all ids are truly unique?)

Even if we generate ids based on a timestamp, say, then it’s still not clear to me that there will never be a collision.

12

there is a finite number of ids which can fit into some structure

Correct. If you have n bits in your structure, after you have generated 2^n IDs, the next one must be a collision; this is the pigeonhole principle.

However, assuming you are dealing with an actual physical system rather than a pure thought experiment, there are “only” 10^80 = 2^265 electrons in the universe so if your data structure is bigger than 265 bits or so then, given a perfect algorithm, you’ll never be able to generate all the IDs.

6

You want version 1 UUIDs. They don’t collide but they leak the generating computer’s MAC address and require a monotonic clock.

From RFC4122 we have this format

8 bytes time-low
2 bytes time-mid
2 bytes time-high & version (high nybble is a 1)
2 bytes clock-seq-high & reserved (high two bits are 1 and 0)
2 bytes clock-seq-low
6 bytes MAC address

Clock-sequence is actually the subsecond portion of the clock now.

The thing about the two byte fiels is they are not two one byte fields. If you mess the code up, you get different results on little endian and big endian machines and you have a chace of collision again. The string form looks like this no matter what your endian is:

????????-????-1???-[89AB]???-?[02468ACE]??????????

The thing about this is when it works, it works. If you don’t have MAC address collisions (the general rule is if you get colliding MAC addresses, RMA both cards) and if you actually have a monotonic clock, this works 100% of the time.

There’s a work around for not having a monotonic clock. That is, do it the old way. Serialize all UUID generations and use clock-seq as a counter that increments with each generation.

The thing about this is you will have bad generation and most likely collisions if your PC battery dies on any node that generates.

If you can’t tolerate leaking the MAC addresses, you’re stuck with lesser means. V4 UUIDs are really good for all terrestrial use under modern hardware assumptions (has thermal diode or other hardware RNG). V5 UUIDs can do the job under specific circumstances (they’re a hash of something else; the main problem is they’re subject to adversarial attack).

6

Is it possible, when K different hosts are generating IDs? Yes.
And we can do it with readily available libraries.

ULIDs do a pretty good job already,
but they’re not collision free.
It looks like the format may soon be known as
UUIDv7.
We allocate 48 bits for a millisecond timestamp that won’t
overflow for millennia, plus 80 bits of entropy, for a 128-bit GUID.

So by the
birthday paradox,
those K servers would (collectively) need to generate on the
order of 2 ** 40 random IDs within a single millisecond, to have a
decent chance of producing a collision.
Suppose K is “large”, and servers are fast, and
your appetite for risk says the odds aren’t looking good enough to you.
Fair enough.
We could allocate 208 bits of entropy, but the collision risk is
still non-zero. There must be another way.

Use a Distributed Consensus protocol, perhaps based on
Raft,
to enroll all servers in your scheme and assign each
a distinct host_id in the range 1 .. K.
I don’t know how many servers you have.
So let us allocate 32 bits to store such IDs.
That’s enough to accommodate all publicly accessible IPv4 destinations
in the global internet.
The GUID prefix will be {timestamp, host_id}.
Now we have a mere 48 bits of entropy to work with.
So if any single host rolled on the order of 2 ** 24
random IDs within a millisecond, it would have
a fair chance of self colliding.
That’s a mere 16 million IDs,
achievable if we spend about a tenth of a nanosecond
generating each ID.
Perhaps plausible for a core router that puts IDs
on each forwarded packet.

But wait!
By hypothesis these IDs are all being
generated locally on a single host.
So we can just use a counter for uniqueness.
Or several counters, one per functional sub-unit,
if for example our router has several line cards
or our GPU has several cores.

If all sub-units consume IDs at roughly the same rate,
then we can essentially stretch the 32-bit host_id
into something longer which encompasses the sub-units.
If that’s still not enough, we just request additional host_ids.

For sub-units that process IDs at diverse rates,
it’s easy to make some central thread hand out big counter blocks
of a thousand or a million.
Oracle database servers have been allocating
PK IDs
in this way for decades.

So there you have it.
Your hosts, and their sub-units, can only generate
so many IDs per second, and 2 ** 80 is a large number.
Allocate prefix bits to hosts and optionally to sub-units,
and then you just need counter(s) for the remaining bits.

This makes the (reasonably weak) assumptions that a host
can learn approximately what the current time is,
that it can persist such timestamps so upon reboot it
only lets the clock march monotonically forward,
and that it can keep track of what happened in the
last millisecond or so, to coordinate ID uniqueness
across its sub-units.
If you require even greater rates of ID generation,
or if some assumption can’t feasibly be engineered
into your use case, then just allocate more than 128 bits total
for each distinct ID.

2

It is easy enough, but require some degree of coordination.

The general scheme is relatively simple: partition the bit-space of the ID in two sections, one with a “generator ID” unique to each generator, and one with a strictly monotonically increasing sequence.

For example: the UUID v1 scheme uses MAC addresses as generator ID, and then timestamp + counter as strictly monotonically increasing sequence.

There are 2 challenges to solve:

  • Ensuring uniqueness of the generator ID.
  • Ensuring strict monotonicity of the sequence.

The first challenge typically requires some form of coordination across generators. This can be a one-off coordination system — assign a unique ID when setting up the generator — or it can be a more active form of coordination — such as registering with a central entity on start-up.

The second challenge requires either knowledge of the system, guardrails, or persistence:

  • Persistence: each generated sequence number is persisted, on start-up a generator restarts from the last persisted number. In practice, persistence is likely best performed in batch: acquire a batch of N numbers (bumping the persisted item by N), then distribute them, rinse and repeat.
  • Knowledge: if the operation for which IDs are provided is known not to exceed a generation rate of N/s, then the sequence number can be seeded on start-up by a timestamp of appropriate resolution, which is then simply incremented.
  • Guardrails: Same as Knowledge, except that a check is made that the maximum rate of ID generation is not exceeded.

As a practical example, I’ve regularly implemented such a scheme to generate unique 64-bits ID within a multithreaded application. In such a case, the partition key was a thread index (from 0 to N) and the sequence was seeded with a nanosecond resolution timestamp stored in a thread-local variable, and incremented on each use. The maximum rate of generation being 1/ns/thread, we knew it wouldn’t be exceeded — our application was plenty fast, but still.

For export, the ID was prepended with a 64-bits ID unique to the running process, obtained from a shared Zoo Keeper cluster. The running process would lock the use of this ID for the duration — via a lease system — ensuring no other process could possibly use it, and thus we had a 128-bits ID guaranteed unique across all applications.

3

No,
but the chance the Earth is hit in soon future by meteor big enough to eradicate most or all of the human existence including the computer that generates UUID is much higher than to actually have collision in UUID.

So you should better watch for meteors rather than UUID collisions :).

In general – you just throw an error if collision happens, the data associated to it will be lost, if that happens once in your lifetime, you will be still quite lucky and the business impact should not be that critical for one-time accident. Anyway the internet is unstable place, there is quite usual chance the request is lost on the way, the server or client critially quit without finish what started etc. – just handle the possible duplication in same way as some of such critical error happens.

7

There are GUID and UUID schemes that allow independently operating machines to generate universally unique identifiers with near certainty.

But the question asks about a distributed system. Generally, distributed systems involve communication between the nodes. So, yes, you could have a distributed system that generates truly universally unique identifiers by coordinating across a network.

In the simplest case, a central entity generates all IDs and passes them out to requesting nodes. You can improve scalability in a variety of ways. For one, you could carve up the ID space into blocks and relegate authority of those blocks to different nodes.

If you consider this cheating, consider that GUID schemes that rely on a machine’s MAC address work exactly like this, because the MAC address space is divvied up by a central authority that gave vendors authority over non-overlapping blocks.

Is it possible to design a distributed system to generate unique ids where there is a requirement that every generated id is guaranteed to be unique with no possibility of collision?

It is possible if.

  1. You can guarantee that each node in the system has a unique ID.
  2. Your nodes have a clock (or counter) which you can guarantee will increase monotonicly.
  3. Your nodes can be trusted to actually follow the prescribed generation process.
  4. You are ok with anyone who sees the UUIDs being able to determine which UUIDs were generated by the same node.
  5. You are ok with with your system having an “expiry date” at some time in the future.

This is what “version 1” UUIDs do. A clock (which will roll over around 3400AD) is combined with the MAC address of the node (which is supposed to be uniquely allocated by network card vendors based on allocations they receive from the IEEE).

This gives a very small possibility that there will be a collision, but there is still a possibility.

Ultimately there are all sorts of low proability random events that could ruin your day. At a certain point you just have to accept the risk. For any reasonable number of UUID’s the chance of random UUID or hash based UUIDs generated from unique non-malicious input accidentally colliding is extremely low.

IMO a far bigger concern than two UUIDs randomly colliding is how you ensure that the systems generating IDs and introducing them to your larger distributed system are doing so non-maliciously.

I didn’t think that this was possible,

This answer shows one rather easy possibility to create such a system. It can work as described in certain circumstances (and has been implemented like this in business systems I have been involved with), but some aspects would be impractical on larger scales. Still, it is a proof of concept that it is, indeed, possible.

  • Create a central DB which stores all generated UUIDs. This needs to be nothing special – it just needs enough storage to store all of them. If that should be an issue over time, there would be tricks to get around that (“high water marks” and such). The DB requires a single standard RDBMS feature only (the possibility to create a simple unique index).
  • Every system that wants to generate a new UUID can do so freely by any algorithm you might think of – for example the one based on a relatively unique MAC address, timestamp, and counter. Or it may simply generate a random number by any means.
  • As last step of the UUID creation, that UUID has to be sent to the central DB. The DB will insert the UUID into its table of UUIDs. If at that point a duplicate is detected, it will tell the generating system and refuse the UUID. Else it reports success. Collisions are impossible since we assume standard sane/safe RDBMS behaviour.
  • If the generator receives a failure answer, then it generates another UUID. This is repeated until success.

This is still respecting all specifications in OPs post – OP does not require that the distributed system is allowed to split into different disjunct partitions, so we can assume connectivity to the central system; it is not specified that this needs to work across different companies etc. either, so we can waive many practical aspects for now.

If your goal is having no collisions then there is a simple strategy that does is at the expense of having many collisions if there are any:

You have a large number of processes that generate UUIDs. For example my computer might be one such process. Each process generates one random UUID, and from then on returns the next UUID every time. If two processes each generate a million UUIDs then you get a collision only if the initial UUIDs are less than a million apart. However, if you have one collision, you will have many. The probability to have any collision at all is much smaller. The expected number of collisions is exactly the same as with random UUIDs. What is very important is that the first random number is truly 128 bits or more.

0

Seed each local part of the distributed system with a prefix. After that each local system creates a series of ids by adding one to the previous one.

Simple, easy system for a distributed system to generate an unbounded number of guaranteed unique identifiers.

System 1 would generate the identifiers 1:0 , 1:1, 1:2, and so on. System 432 would have 432:0, 432:1, 432:2, and so on; with no possibility of collision.

In practice however, it is generally not worth the effort, for reasons that have been explained in other answers.