Asset Identifiers
Following on from my article on Content Addressable Storage, this post addresses some design considerations around choosing a format for asset identifiers in a pipeline. Asset identifiers are the means by which each asset in the project is uniquely identified and referenced.
There’s a lot of consideration that goes into choosing how an asset pipeline refers to assets. Each way naturally has pros and cons, and it’s not always clear-cut what the best approach might be. The choice will depend on the specifics of the project and the rest of the tech stack involved.
I’m assuming you’ve already read the previous article and that you have an idea of what an asset pipeline is, at least insofar as that article is concerned. The primary bit to remember is that content creators author source assets which are the primary inputs to the asset pipeline.
Filename vs GUIDs vs DBIDs
There are several major mechanisms of identifying assets that are in wide use today. The two most popular by far are using either the asset filenames (or hashes thereof) or using GUIDs.
Filenames
The first and most obvious idea is to just use filenames. Referencing a texture that’s checked into source control as assets/textures/environments/rocks/granite_diffuse.tga by the string "assets/textures/environments/rocks/granite_diffuse.tga" is pretty direct. A great number of pipelines start out here just by virtue of how simple this mechanism is.
Using filenames doesn’t require a lot of infrastructure (… at first). An asset compiler can easily take a filename referenced by an asset and open it, read in the contents, and go on its merry way. The OS itself works with filenames. The source control repository that all the assets are checked into works with filenames. Human beings pick and can easily understand filenames.
Storing a filename in an asset file is easy since it’s just a string (… mostly). String pathnames have additional benefits; for text assets, anyone can open up the asset file and easily see the asset identifiers and recognize what they are. Users who diff two versions of the asset can easily recognize whether a reference has changed. Diagnostics about missing assets are clearly presented as missing file errors and are easily understood.
Naturally, there are downsides.
Perhaps the biggest downside of using filenames is the difficulty of supporting renames. Content creators rename files. Over the lifetime of a product, the asset files might be renamed a lot. Sometimes in bulk. Whole groups of assets are copied and modified to form new sets of assets. Maintaining the integrity of references becomes difficult when using filenames.
Tooling can help here, with some work. The editor or related tools can handle renames by opening and “fixing” all other asset files that reference the renamed file. However, what happens if another creator is creating new assets referencing the old asset identifier? Eventually either some broken assets will be checked in or someone is going to have a lot of pain to sort through after fetching latest content.
Another solution for renames is the use of a “redirector” asset. Consider a rename of stone.tga to rock.tga. The tools might create a stone.tga.redirect file after the rename; the contents of stone.tga.redirect would point to the new name rock.tga. The rest of the tooling, if it tries to open stone.tga, will also look for stone.tga.redirect; if the latter is found, the new filename is read from the redirector and that new file is opened instead. To keep all these redirector files from piling up in the source repository, some nightly job or the like will have to do the full asset validation (which you want anyway) to find all the redirectors and backpatch all assets with incoming references. This approach certainly works and might be familiar to a lot of readers, since it is used by Unreal. However, it can be clunky and difficult to diagnose when it does inevitably go wrong.
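The redirector lookup described above can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: the redirector is a plain text file holding the new filename relative to its own folder, the redirector wins if both it and the old file exist, and MAX_HOPS is an arbitrary guard against redirector cycles. None of this is how Unreal actually implements its redirectors.

```python
from pathlib import Path

REDIRECT_SUFFIX = ".redirect"  # naming convention taken from the example above
MAX_HOPS = 16                  # arbitrary guard against redirector cycles


def resolve_asset_path(path: Path) -> Path:
    """Follow redirector files until a real asset file is found."""
    current = path
    for _ in range(MAX_HOPS):
        redirect = current.with_name(current.name + REDIRECT_SUFFIX)
        if redirect.exists():
            # Assumption: the redirector stores the new filename as plain
            # text, relative to the folder that contains the redirector.
            target = redirect.read_text(encoding="utf-8").strip()
            current = redirect.parent / target
            continue
        if current.exists():
            return current
        raise FileNotFoundError(f"missing asset: {path}")
    raise RuntimeError(f"too many redirects starting at {path}")
```

Note that every tool that opens assets has to go through something like resolve_asset_path, which is a good part of why this approach can feel clunky.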
Another issue with relying on filenames is guaranteeing stability of runtime asset identifiers. The obvious choice for pipelines using filenames as source asset identifiers is to use hashes of the filenames as the efficient runtime identifier, since a hash is deterministically computable by asset compilers and tooling. However, that means file renames can have wide impact across build artifacts and create a lot of churn. The churn can increase the storage space used in a CAS, increase patch size, reduce network compatibility profiles, and otherwise just introduce cost.
Case sensitivity problems can also come up more often than people might think (literally every large project I’ve ever worked on). Those will happen no matter what asset identifier is used, but using filenames means those problems can easily percolate out into the whole pipeline and build artifacts and so on.
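To make the last two points concrete, here is a minimal sketch of a hashed-filename runtime identifier. The function names are hypothetical, the truncated SHA-256 is just a convenient stand-in for whatever hash a real engine would pick, and lower-casing every path is only one possible policy for the case sensitivity problem.

```python
import hashlib


def normalize_asset_path(path: str) -> str:
    """Canonical spelling of a path: forward slashes, lower case."""
    # Assumption: the project treats asset paths as case-insensitive everywhere.
    return path.replace("\\", "/").lower()


def runtime_id(path: str) -> int:
    """64-bit runtime identifier derived from the normalized source path."""
    digest = hashlib.sha256(normalize_asset_path(path).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


# Different spellings of the same path agree once normalized...
same = runtime_id("Assets/Rocks/Stone.tga") == runtime_id("assets\\rocks\\stone.tga")
# ...but a rename produces a brand new identifier, which is where the churn comes from.
changed = runtime_id("assets/rocks/stone.tga") != runtime_id("assets/rocks/rock.tga")
```

Every artifact that embeds the old identifier has to be rebuilt after the rename, even though none of the actual content changed.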
Ultimately, filenames certainly work and are an entirely valid way to refer to assets, albeit with some downsides to keep in mind.
GUIDs
Another very common approach is to use GUIDs or UUIDs for asset identification. This might be a true RFC 4122 compliant UUID or a custom scheme like that popularized by Microsoft; it doesn’t really matter in this case.
The important thing for this article is that a GUID is a large (128-bit, or 16-byte) number that can be generated on one machine with an astronomically low probability of colliding with any other GUIDs generated anywhere (even on other machines!).
The primary advantage of using a GUID is that the asset can be renamed without breaking any references. stone.mat wouldn’t refer to stone.tga by name, but rather would refer to some GUID like 6e46fe1d-d46b-4eb7-aeb7-c4f717e0b2d2. That big ol’ GUID is then (somehow) associated with the file stone.tga. If a user renames the texture to rock.tga, the GUID is then (somehow) reassociated with the new filename, and the references from stone.mat keep on working. Likewise, new assets checked in by another creator would also just reference that stable GUID and would keep working after the rename is submitted.
That stability also impacts choices at runtime. Runtime asset identifiers might still be the same GUIDs or hashes thereof, so the stable source asset identifiers necessitate less churn on asset artifacts. That’s a win across the board.
Of course, GUIDs aren’t a panacea.
The first problem with GUIDs is that “somehow” I mentioned a couple times a few paragraphs back. Where is the GUID stored? How does the tooling map the GUID to actual filename and back?
One option for storing the GUID is to embed it in the asset file itself. This works reasonably well for custom asset file types owned and manipulated by the engine’s associated tools. It works less well for standard or third-party file formats that have limited or missing support for user extension.
For example, it’s not obvious how one might store a GUID in an FBX file. Some common file formats have official ways of embedding custom metadata, and DCC tools aren’t supposed to lose or overwrite that data, but the reality is that some popular tools won’t maintain that custom metadata across edits or exports. Embedding metadata in file formats that aren’t under your direct control is risky and prone to failure.
Another option to store the GUID is to use a sidecar file of some kind. Probably the most well-known example of this approach is Unity’s .meta files. The idea here is that each asset file has an accompanying second file that stores metadata. Given a file like stone.tga there will be a second file stone.tga.meta which contains the GUID.
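As a rough sketch of the sidecar idea, the following Python reads the GUID for an asset from its sidecar and creates the sidecar on first sight. The JSON layout with a single "guid" field is my own invention for illustration; it is not Unity’s actual .meta format, and the function names are hypothetical.

```python
import json
import uuid
from pathlib import Path

SIDECAR_SUFFIX = ".meta"


def sidecar_path(asset: Path) -> Path:
    """stone.tga -> stone.tga.meta, in the same folder."""
    return asset.with_name(asset.name + SIDECAR_SUFFIX)


def load_or_create_guid(asset: Path) -> uuid.UUID:
    """Return the asset's GUID, writing a new sidecar if none exists yet."""
    meta = sidecar_path(asset)
    if meta.exists():
        # Assumption: the sidecar is a small JSON document with a "guid" field.
        return uuid.UUID(json.loads(meta.read_text(encoding="utf-8"))["guid"])
    guid = uuid.uuid4()  # random GUID, generated locally with no coordination
    meta.write_text(json.dumps({"guid": str(guid)}, indent=2), encoding="utf-8")
    return guid
```

The sidecar then gets checked into source control right next to the asset, so the GUID travels with the file.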
Aside from storage, using GUIDs will require having a local database or index for mapping GUIDs to actual files on disk and back. A mature asset pipeline will have some kind of database like this anyway, but it’s worth keeping in mind the upfront need when planning out how to bootstrap the pipeline.
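A bare-bones version of that index might look like the sketch below. It is purely in-memory and the class and method names are hypothetical; a real pipeline would persist this somewhere and rebuild it by scanning the source tree.

```python
import uuid


class GuidIndex:
    """In-memory, two-way mapping between GUIDs and asset paths."""

    def __init__(self) -> None:
        self._path_by_guid: dict[uuid.UUID, str] = {}
        self._guid_by_path: dict[str, uuid.UUID] = {}

    def register(self, guid: uuid.UUID, path: str) -> None:
        if guid in self._path_by_guid:
            raise ValueError(f"GUID already registered: {guid}")
        if path in self._guid_by_path:
            raise ValueError(f"path already registered: {path}")
        self._path_by_guid[guid] = path
        self._guid_by_path[path] = guid

    def rename(self, old_path: str, new_path: str) -> None:
        """Re-associate the GUID with a new path; the GUID itself never changes."""
        if new_path in self._guid_by_path:
            raise ValueError(f"path already registered: {new_path}")
        guid = self._guid_by_path.pop(old_path)
        self._guid_by_path[new_path] = guid
        self._path_by_guid[guid] = new_path

    def path_of(self, guid: uuid.UUID) -> str:
        return self._path_by_guid[guid]

    def guid_of(self, path: str) -> uuid.UUID:
        return self._guid_by_path[path]


index = GuidIndex()
stone = uuid.UUID("6e46fe1d-d46b-4eb7-aeb7-c4f717e0b2d2")
index.register(stone, "stone.tga")
index.rename("stone.tga", "rock.tga")
# References that store the GUID still resolve after the rename.
resolved = index.path_of(stone)
```

After the rename, anything holding the GUID resolves to rock.tga without any asset file having been touched.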
A significant downside to GUIDs is that they are not really introspectable. Looking at an asset file in a diff tool or text editor can be obtuse; seeing that the material references 6e46fe1d-d46b-4eb7-aeb7-c4f717e0b2d2 is only helpful if you somehow know what that GUID maps to. Most of us are not going to remember one GUID much less the thousands and thousands that will eventually be created. This is especially painful when dangling asset references happen (and they will), because the “missing asset 6e46fe1d...” error messages are not likely to be terribly useful for most folks. Yet more tooling can help lessen these problems but can never fully eliminate them.
A further downside to GUIDs is that they make asset file copies treacherous. A copy of an asset (or its sidecar) means that two different files would then have the same GUID. The tools have to resolve this somehow, such as rewriting one of the files to have a new GUID… and hopefully the tools chose correctly (e.g. heuristically by modifying the asset file that the pipeline hasn’t seen already). It’s solvable but at the cost of extra complexity.
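Here is one way the duplicate handling might be sketched, assuming the tools have already scanned the source tree into a path-to-GUID mapping and keep a set of paths the pipeline has seen before. Both inputs and both function names are hypothetical, and the "previously seen file keeps the GUID" rule is just the heuristic mentioned above, not a guaranteed-correct policy.

```python
import uuid
from collections import defaultdict


def find_duplicate_guids(guid_by_path: dict[str, str]) -> dict[str, list[str]]:
    """Group paths by GUID and keep only the GUIDs claimed by several files."""
    paths_by_guid: dict[str, list[str]] = defaultdict(list)
    for path, guid in guid_by_path.items():
        paths_by_guid[guid.lower()].append(path)
    return {g: sorted(p) for g, p in paths_by_guid.items() if len(p) > 1}


def resolve_duplicates(guid_by_path: dict[str, str],
                       known_paths: set[str]) -> dict[str, str]:
    """Return fresh GUIDs for the files that should be rewritten."""
    reassigned: dict[str, str] = {}
    for paths in find_duplicate_guids(guid_by_path).values():
        # Heuristic: a file the pipeline has already seen keeps the GUID;
        # if none was seen before, fall back to the first path alphabetically.
        keepers = [p for p in paths if p in known_paths]
        keeper = keepers[0] if keepers else paths[0]
        for path in paths:
            if path != keeper:
                reassigned[path] = str(uuid.uuid4())
    return reassigned
```

The caller would then rewrite the sidecar (or asset file) for each path in the returned mapping, and ideally tell the user about it, since the guess can be wrong.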
The GUID choice permeates the editor and user experience. UI needs to be built to see and search both the GUID and filename of every asset. Errors and validation will be “speaking” in GUIDs and tools need to translate those back into human-friendly filenames so they make sense (when possible).
GUIDs, like filenames, are certainly a viable option, but not without problems.
Other Options
There certainly are other options.
A few pipelines built specifically for live collaborative environments use DBIDs, aka the unique ids of rows or objects in a database. This might be the AUTO_INCREMENT PRIMARY KEY in a MySQL table or the _id field in a Mongo database. (And yes, I realize I’m carbon dating my familiarity with server technology here.) There is a massive host of downsides to this approach, and GUIDs can solve most of the same problems, so DBIDs are not especially popular outside niche cases and probably best avoided. GUIDs also allow creators to work offline without needing a connection to any central database, and allow merging and branching to work using standard tools.
Some pipelines use hybrid asset identifiers of some kind, such as referring to an asset by both a GUID and a logical name. The idea is to take the logical names/indices needed to support “sub-assets” (such as multiple meshes in an FBX file) and plumb that as a general concept throughout the entire asset pipeline. An opposite approach of course is to not even support sub-assets at all and require container asset types like FBX to be exported with only a single entity. The latter is vastly simpler to implement, maintain, and debug; it’s slightly more of a burden for creators, but likely only slightly.
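For a sense of what a hybrid identifier could look like, here is a small sketch that pairs a GUID with an optional sub-asset name. The AssetRef type and the "#" separator are entirely made up for this example; actual pipelines that do this each have their own syntax.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AssetRef:
    """A GUID plus an optional logical name for a sub-asset."""
    guid: uuid.UUID
    sub_asset: Optional[str] = None

    def __str__(self) -> str:
        if self.sub_asset:
            return f"{self.guid}#{self.sub_asset}"
        return str(self.guid)

    @classmethod
    def parse(cls, text: str) -> "AssetRef":
        # Hypothetical syntax: "<guid>" or "<guid>#<sub-asset name>".
        guid_text, _, name = text.partition("#")
        return cls(uuid.UUID(guid_text), name or None)


whole_file = AssetRef.parse("6e46fe1d-d46b-4eb7-aeb7-c4f717e0b2d2")
one_mesh = AssetRef.parse("6e46fe1d-d46b-4eb7-aeb7-c4f717e0b2d2#rock_lod0")
```

Even in this tiny form, every tool that handles references now has to understand two-part identifiers, which hints at why dropping sub-assets entirely is so much simpler.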
Recommendation
For simple and small projects, like most indie or hobby engines, I’d recommend the filename approach. It’s dead simple to implement, easy to debug and diagnose, and the downsides won’t manifest too much if there’s not a huge number of files or creators.
For large projects, the GUID approach will fit naturally. GUIDs avoid many of the more complicated edge cases when hundreds of creators are iterating locally or across branches. GUIDs may require more intricate tooling, but most of that tooling will be necessary anyway to support a team of varying technical levels.
For medium projects, either approach works fine. If there’s a chance the project will grow, or might make use of outsourcing, the GUID approach can pay dividends over time but will impose higher upfront costs. If using filenames for identifiers, I’d advise at least trying to settle on a naming and folder structure scheme early in the project so mass renames are less likely to be necessary.
I strongly recommend staying away from niche or fancy schemes like DBIDs. Using a database or other services to coordinate and collaborate interactively is great, but just use GUIDs as identifiers to maintain global uniqueness.
This is a small aside, but worth pointing out I think.
No matter how assets are identified, I very strongly recommend storing source assets and metadata in self-describing structured formats. You really want the ability to scan a file with zero additional code or schema knowledge and to accurately pick out all the asset identifiers with zero false negatives or false positives.
As an example, consider a JSON encoding like the following. Replace the filenames with GUID strings if you like; it doesn’t matter for this example. What matters is that the asset identifiers are just strings, and without knowing the schema of the file there’s no way to know for sure which strings refer to assets.
{ "name": "granite",
"texture": "stone.tga",
"shader": "solid.lit" }
Adding structure to the encoding will make it possible to write tooling that picks out the asset identifiers programmatically.
{ "name": "granite",
"texture": { "$type": "asset", "$id": "stone.tga" },
"shader": "solid.lit"
}
With the latter structure, a tool can easily search for all JSON objects where $type is asset and unambiguously know that the $id field is meant to be an asset identifier. All other strings can be ignored. Specifically, the tooling can see that stone.tga is an asset identifier but solid.lit is not.
Building asset formats this way will drastically simplify the writing of many tools and allow assets to be processed reliably by different versions of the tools (which may not all have the same schema used to produce the asset).
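A schema-free scanner for the structured encoding above fits in a dozen lines. This sketch uses only Python’s standard json module; the function name find_asset_ids is hypothetical, and it assumes the $type/$id convention from the example.

```python
import json


def find_asset_ids(node):
    """Yield every $id from objects tagged {"$type": "asset"}, at any depth."""
    if isinstance(node, dict):
        if node.get("$type") == "asset" and isinstance(node.get("$id"), str):
            yield node["$id"]
            return  # a reference object has no nested references of its own
        for value in node.values():
            yield from find_asset_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_asset_ids(item)


material = json.loads("""
{ "name": "granite",
  "texture": { "$type": "asset", "$id": "stone.tga" },
  "shader": "solid.lit"
}
""")
references = list(find_asset_ids(material))
```

Here references contains only stone.tga; the scanner never needed to know what a material is or which fields it has.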
Sidecar File Considerations
A sidecar file is any file that contains additional data about a primary file. For example, the .meta files in Unity are sidecar files to the primary asset data.
In an asset pipeline, sidecar files don’t get asset identifiers of their own. They’re strongly associated with an actual asset file. That file gets an asset identifier and the sidecar file is found automatically.
For Unity’s case “automatically” means just slapping a .meta extension on the filename for any asset, though there are certainly other options. The nice thing about the extension approach is that it makes it easier for content creators to understand what their source control submissions should look like: e.g. for every asset, expect a related .meta file. Tools could also opt to just hide the .meta files from users, or just reduce their visibility like GitHub does.
There are multiple reasons to use these kinds of files. Unity for example also stores tons of other metadata about assets in the .meta files, like the compression settings for textures. An engine that expects most assets to be primarily produced externally (with standard or third-party DCC tools) will likely find sidecar files to be necessary.
There can also be performance considerations that lead to the creation of sidecar files. Terrain data stored in a heightmap format is probably better stored in a binary blob, but everything else that can reasonably be text should be for the sake of merging, diffing, and introspection. Breaking up just these kinds of assets between “easy text” stuff and “binary only” stuff can end up being really handy. In my experience, decoding a 10MB JSON file is likely to be prohibitively expensive compared to reading a separate binary sidecar file.
Sidecar files of any kind do impose a few costs. Approaches that rely on extra extensions, like .meta files, require that any asset rename or move be mirrored by renaming or moving the accompanying sidecar files in the same way. Thankfully the large popularity of Unity means that a lot of tools already understand the importance of keeping .meta files with their associated asset files and simplify this process, which can be leveraged by reusing the same .meta naming structure.
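For tools that do have to handle this themselves, the rename boils down to moving two files as a pair. The sketch below works directly on the filesystem with a hypothetical rename_asset helper; in a real project the move would go through the source control client instead so that history follows both files.

```python
from pathlib import Path


def rename_asset(old: Path, new: Path, sidecar_suffix: str = ".meta") -> None:
    """Rename or move an asset and keep its sidecar file alongside it."""
    old_sidecar = old.with_name(old.name + sidecar_suffix)
    new_sidecar = new.with_name(new.name + sidecar_suffix)
    # Refuse to clobber anything; a silent overwrite could destroy a GUID.
    if new.exists() or new_sidecar.exists():
        raise FileExistsError(f"destination already exists: {new}")
    new.parent.mkdir(parents=True, exist_ok=True)
    old.rename(new)
    if old_sidecar.exists():
        old_sidecar.rename(new_sidecar)
```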
Considering early on how an asset pipeline might use sidecar files can help clarify what sorts of asset identifiers might best suit the needs of the project.
CAS for Source Assets
The previous article covered the use of Content Addressable Storage for asset artifacts. There are also reasons to use CAS for (some) source asset files.
The full breadth of that topic is way out of scope for this article. The gist is that large binary assets and the like can be stored in a CAS. The regular asset files in source control would then all be smaller and simpler text files. The text asset files would contain the CAS identifiers of any of their binary data.
As a workflow example, consider an artist importing an FBX file. On import, that specific FBX file would be copied into the CAS. A new asset file with a GUID and other relevant metadata would be created which would contain the CAS identifier of the FBX and the filename of the original imported file. When the asset pipeline processes that asset file it can pull up the FBX data from the CAS.
If the original FBX file is ever modified or edited, the pipeline would need to reimport that file by copying its new version into the CAS and updating the asset file with the new CAS identifier.
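The import and reimport steps could be sketched roughly as follows. Everything here is an assumption for illustration: the CAS is a plain folder keyed by SHA-256, the asset file is a small JSON document, and the field names (guid, cas_id, source_filename) and function names are made up.

```python
import hashlib
import json
import shutil
import uuid
from pathlib import Path


def hash_file(path: Path) -> str:
    """SHA-256 of the file contents, used here as the CAS identifier."""
    hasher = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def cas_store(cas_root: Path, source: Path) -> str:
    """Copy a file into the CAS (if not already present) and return its id."""
    cas_id = hash_file(source)
    dest = cas_root / cas_id[:2] / cas_id
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, dest)
    return cas_id


def import_asset(cas_root: Path, source: Path, asset_file: Path) -> dict:
    """Import or reimport a source file; the GUID survives reimports."""
    if asset_file.exists():
        record = json.loads(asset_file.read_text(encoding="utf-8"))
    else:
        record = {"guid": str(uuid.uuid4())}
    record["source_filename"] = source.name
    record["cas_id"] = cas_store(cas_root, source)
    asset_file.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record


def needs_reimport(source: Path, record: dict) -> bool:
    """True when the source file no longer matches what is in the CAS."""
    return hash_file(source) != record["cas_id"]
```

Hashing the whole file on every check is the simplest way to detect changes but not the cheapest; a real tool would probably look at timestamps or sizes first.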
There are multiple potential benefits to this approach. For the purposes of this article, a notable benefit is that it eliminates the need for ever managing separate sidecar files even if using GUIDs as asset identifiers. The natural downside of course is all the work and tooling in building that CAS, managing when items are stored or synchronized, and accurately and reliably detecting when files need to be reimported.
Summary
Asset identifiers are a core part of an asset pipeline, but there isn’t an obvious answer or “best” solution. Pipelines built for higher budget projects with large teams are likely to find the trade-offs of using GUIDs to be the best fit, while small to medium projects can find the up-front simplicity of filenames to outweigh other considerations.
Thankfully, neither of the two major options presented here will be disastrously wrong for any project. Filenames work even at large scale, just not as well. GUIDs are completely fine even in a very small project; they just take a little more work to get up and running. Either choice works.