UUIDv5 Explanation

UUIDv5 Explanation

Introduction

So, I was pondering how to support the <podcast:guid> tag of the podcasting 2.0 namespace in my RSS Feed Generator for podcasts, and I was reading the documentation. First, I discover that the guid is a UUIDv5 string. I make some research, and find out that UUIDv5 stands for Universally Unique Identifier version 5, also known as GUID (or Globally Unique IDentifier). Okay, makes sense, the tag docs mention that its purpose is to identify the podcast. They also provide two different tools to generate this kinds of strings online: UUID Tools and RSS Blue, and I check that, providing the same parameters (including the podcast namespace, with UUID of ead4c236-bf58-58c6-a2c6-a6b28d128cb6) they spit out the same result. I try inspecting both pages to get the code they use, but everything is either obfuscated or wrapped in what I think is jQuery layers of callbacks. I start looking for already made code, of which I only find npm packages (not applicable because I'm doing this in browser JS, not node, which uses different types and API's in this specific case) and a very incomplete implementation in C in the specification. Speaking of the specification, it failed to mention some very important details, but it super helpfully made a point to reiterate the endianness of the data needed for the algorithm multiple times /j. I ended up finding a super short npm package I can't find anymore, I studied its code and ended up adapting it to what I needed. The clever bits of the code, like using padStart(), or applying a regular expression to filter the namespace, are thanks to that package. So for anyone trying to implement UUIDv5's in browser JS, here's the code, and for anyone wanting to understand the standard or implement it in other languages, here's also a thorough explanation:

The explanation

UUIDv5's are deterministic, meaning that, as mentioned before, the same inputs will always produce the same outputs. Version 4 UUIDs are not, for example, they depend on the time they were generated. You could argue that really they are also deterministic (and therefore all computing, probably) and that time is another input, but that's another conversation we won't be having today. This is the "byte anatomy" of any UUID.

0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          time_low                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       time_mid                |         time_hi_and_version   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|clk_seq_hi_res |  clk_seq_low  |         node (0-1)            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         node (2-5)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Once you know how a UUID looks (something like a5de3ad2-5d30-5c05-aa56-30c24b857264), you'll start seeing v4 and v5's everywhere, as they aren't only used for podcasts, at least that was my experience. To know which UUID version you are seeing, look at the second group, and the first number should tell you, for example, the previous one was a v5 (a5de3ad2-5d30-5c05-aa56-30c24b857264).

Version 5 requires 2 inputs, a namespace (there's also standard namespaces), and the string to encode. As an example, we will be generating a guid for a podcast which RSS feed resides in https://media.example.com/feed.xml/. This means our namespace is "ead4c236-bf58-58c6-a2c6-a6b28d128cb6" and our input string will be "media.example.com/feed.xml" (protocol scheme and trailing slashes stripped off, as mandated by the podcast 2.0 specification). We will be using arrays, so we can control byte order, and we will need bit logic, so make sure you know that kind of syntax in your language of choice.

"Accessibility" notice: code is rendered with the Cascadia Code font, which is open source and made by Microsoft, designed to be used in the Visual Studio IDE, with ligatures, making some characters look different. For example, !== becomes !==, and => becomes =>. I find this useful for reading (especially JS code), but if you don't, feel free to copy the code and paste it somewhere else with a different font without ligatures, and the characters will show up normaly.

  1. Dealing with the namespace

    First, we need to process the namespace. Take that string, eliminate or ignore the dashes, take it pair by pair of characters, and translate each pair to its corresponding hexadecimal value; take those values and store them in an array. In the following function, a regular expression is used inside the array's replace() method to separate the pairs, and a function that takes each one (labeled as hex, for hexadecimal) and plugs them into parseInt() to make the translation from a hexadecimal string representing a byte to its integer value (you can copy it with a button that appears by hovering over the code on desktop or pressing it on mobile):

    function uuidToBytes(uuid) {
      let bytes = [];
      uuid.replace(/[a-fA-F0-9]{2}/g, function(hex) { bytes.push( parseInt(hex, 16) ) });
      return bytes;
    }
  2. Dealing with the string input or "name"

    If you just look at the code, this may look equivalent to what we just did to the namespace, but don't be fooled, my friend. This time we need to take the input and translate each character to it's UTF-16 char code, then store those values in an array. There's a convenient function in JavaScript that let's us do exactly that, charCodeAt(). Here I initialize the array differently, but don't be scared:

    function stringToBytes(str) {
      let bytes = new Array(str.length);
      for(let i = 0; i < str.length; i++) { bytes[i] = str.charCodeAt(i) }
      return bytes;
    }
  3. SHA-1 and "the promise"

    Okay, now you have both inputs parsed as needed for our next step, applying the sha-1 encription/digest algorithm. For this, JS again comes to our aid (I wasn't going to implement it on my own, I don't know much about cryptography) with window.crypto.subtle.digest(). This method returns a Promise object that resolves to the encrypted string. We just need to pass the concatenation of the namespace with the "name" as the argument. This particular method, in JS, needs us to specify 'SHA-1' as our first argument to select the algorithm, and requires the argument to encrypt to be a buffer. This is how we can get our sha-1 result:

    crypto.subtle.digest('SHA-1', new Uint8Array(namespace.concat(name)) )
  4. Getting our hash back

    Let's call the result of the previous step hash. We now need to convert this into a manipulable byte array. This implies segmenting our result to the first 16 bytes, because sha-1 returns more bytes than we need, and in our JS case, turning that buffer into an array we can manipulate (because buffers aren't useful for what we want to do next):

    let bytes = Array.from( new Uint8Array(hash, 0, 16) );
  5. Byte logic

    To comply with the UUIDv5 spec, we need to set some bits. Specifically, we need to turn the first half of byte 5 into a 5 in binary (0b0101) to specify the version, and set the first two bits of byte 7 to 0b10. We can do each of this operations with a bitwise OR and an AND operation:

    bytes[6] = (bytes[6] & 0x0f) | 0x50;
    bytes[8] = (bytes[8] & 0x3f) | 0x80;

    To set the 5, we zero out the first half of the byte by bitwise ANDing with 0x0f (0b0000.1111), then setting it to 5 by ORing it with 0x50 (0b0101.0000). To set the 0b10, we do a similar operation. First we zero the bits with a bitwise AND with 0x3f (0b0011.1111) and then we set them with an OR with 0x80 (0b1000.0000) (Added dots in binary notation for legibility).

  6. Aaaand, from byte array to final string we gooooo!

    Finally, take that manipulated byte array, and make each byte into a UTF-16 character (flip what we did in step 1 on it's head, it all comes full circle!). Place dahses after character 7, 11, 15 and 19. To do this in JS, we can use the handy map() method to apply the toString() function to each element of the array. This can leave us with unique values instead of pairs, for example 'b' instead of '0b', so we use padStart() to detect and remedy those cases all in one. Then, join('') just converts the character array into a proper string object. To add the dashes, I use a template literal:

    let uuidS = bytes.map( v => v.toString(16).padStart(2,'0') ).join('');
    uuidS = `${uuidS.substring(0,8)}-${uuidS.substring(8,12)}-${uuidS.substring(12,16)}-${uuidS.substring(16,20)}-${uuidS.substring(20)}`;

All together, your 3 functions for success:

Here is all the steps we took, compiled into 3 functions with some additional error handling:

function uuidToBytes(uuid) {
  let bytes = [];
  uuid.replace(/[a-fA-F0-9]{2}/g, function(hex){ bytes.push(parseInt(hex, 16)) });
  return bytes;
}
function stringToBytes(str) {
  let bytes = new Array(str.length);
  for(let i = 0; i < str.length; i++){bytes[i] = str.charCodeAt(i);}
  return bytes;
}
function v5(name, namespace) {
  if(typeof name === 'string') name = stringToBytes(name);
  if(typeof namespace === 'string') namespace = uuidToBytes(namespace);
  if(!Array.isArray(name)) throw TypeError('name must be an array of bytes');
  if(!Array.isArray(namespace) || namespace.length !== 16) throw TypeError('namespace must be uuid string or an Array of 16 byte values');
  return crypto.subtle.digest( 'SHA-1', new Uint8Array(namespace.concat(name)) ).then( hash => {
    let bytes = Array.from(new Uint8Array(hash, 0, 16));
    bytes[6] = (bytes[6] & 0x0f) | 0x50;
    bytes[8] = (bytes[8] & 0x3f) | 0x80;
    let uuidS = bytes.map(v => v.toString(16).padStart(2,'0')).join('');
    uuidS = `${uuidS.substring(0,8)}-${uuidS.substring(8,12)}-${uuidS.substring(12,16)}-${uuidS.substring(16,20)}-${uuidS.substring(20)}`;
    return uuidS;
  }).catch(e=>console.error(e));
}

Note that because crypto returns a Promise, I used the then syntax to do the next steps and return the final result. To use the code, you just need to paste the previous functions somewhere and call

v5('media.example.com/feed.xml','ead4c236-bf58-58c6-a2c6-a6b28d128cb6').then(res=>{ })

and the string will be available inside the the then() statement as res. There's also async/await syntax, so use whatever you prefer.

You can check it yourself, but both this code and the online tools mentioned return a5de3ad2-5d30-5c05-aa56-30c24b857264 for our example inputs. You can use this to check your implementation is working right!

This topic touched on many fields and techniques in programming, so I hope you found this interesting or useful in some way. Thanks for reading!