When GUIDs Collide

What happens when 2 identifiers or keys collide?  Bad stuff.  Databases originally used integers for their record identifiers.  If you wanted to add a new record you must query or ask database what was the last integer and add one to it.

This is fine in isolation, but what happens if you want to insert a bunch of records into a remote database that you can only connect to once per hour?  Oh and there are multiple users.  You can’t use Integers anymore you must use something like a UUID/GUID.

Globals Unique Identifiers (GUID) or Universally Unique Identifiers provide a “random” 128bit value that looks like the following:


They work great, and once you switch from Integers to GUIDs your database is truly distributed and can be updated on the moon!  They are a bit ugly though.  Also they take a huge amount of storage compared to the 64 bits of an integer.  Data size matters sometimes.  Wait, GUIDs are 128 bits!

But still we have a slight probability of generating or guessing the same ID with the birthday problem.

Screen Shot 2019-10-27 at 6.42.43 AM.png

If you are FAANG then you have to start worrying about stuff like this.  There will be collisions.  What happens if you add another GUID or two, increasing the bits to 512?

So there is still the GUID Soup to consider…  Start putting multiple GUIDs in an URL and it is pretty gross.

Enter CUID, which has some interesting objectives:

Collision-resistant ids optimized for horizontal scaling and binary search lookup performance.

Screen Shot 2019-10-27 at 6.29.57 AM.png

How do they do this?  with a string 25 characters.  Timestamping is critical to its binary search capabilities.  Ordering by timestamp allows the lookup to be logarithmic.  So if you need to find one record in a billion.  It would take about 30 tries to find it.  Sorting is key to speed here.

If you need speed and no creation time privacy CUIDs may be a viable option.  Anonymous systems must never use CUIDs.

Only slightly larger in storage size than a GUID… 25 Characters * 8 Bits/Character = 200 Bits.  Guessing CUIDs I feel would be easier…

Screen Shot 2019-10-27 at 6.50.42 AM.png

I’ll stick with my GUIDs soup for now.

Until Next Time. Happy Merging Distributed Datasets,



P.S. Microsoft trying to brand it, er recapitulate:

Screen Shot 2019-10-27 at 7.12.13 AM.png