Asset Pipelines and Content Addressable Storage
It’s been almost three years since I last published anything here. My last foray into writing for this site included a series of articles on designing and implementing an asset pipeline for games. I never finished that series. This article is the first in a new attempt at exploring the mysteries of building pipelines for game assets.
This time around, I’m trying a new tack. Instead of an orderly series attempting to describe things step by step, I’ll be approaching some individual components of a mature asset pipeline. Each of these topics is expansive in its own right and certainly deserves an article (or several) of its own.
I’ve rewritten this article four or five times before posting, trying over and over to find the right way to structure this information and the right way to properly explain all the topics. I finally decided to just pull the trigger and post despite the problems I knew were still here. This article is at best a high-level overview and skips over a lot of details. Hopefully it’s still useful to some readers.
This article describes the general concept and uses of a Content Addressable Storage (CAS) mechanism for storing compiled asset artifacts.
See also my second article on this topic which covers the selection of asset identifiers.
Asset Pipeline Basics
It will take several more articles to dive into all the details of designing a full asset pipeline. We only need a high-level overview for the purpose of this article.
Content creators – like artists, designers, musicians and foley artists, writers, and so on – create source assets using digital content creation (DCC) tools or using specialized tools built directly into a game engine’s editor application. These source asset files are designed for efficient editing and modification.
The source asset files are usually not appropriate for loading and consumption by the game itself. The engine doesn’t want to consume raw Photoshop
.psd files with all their edit-time information about layers and filters. Rendering code will be much happier with textures that are stored in a GPU-friendly compressed texture format with all mip-levels resolved.
There’s also a lot of runtime content that may not be directly authored by content creators but which instead is computed by tools. Lighting information for a level is often computed programmatically and stored in a “pre-baked” format to make game runtime efficient. A single source asset for a level might thus need many corresponding asset artifacts to be generated for the engine to consume at runtime.
The part of the asset pipeline most relevant to this article is the asset build pipeline. Its job is to take all those raw source assets and efficiently and reliably convert them into the proper runtime-friendly build artifacts required by the engine. When a creator wants to play the game to see their work in the virtual world, they need to be able to rely on the pipeline to ensure all the necessary assets are compiled and ready to use. The creator further relies on the pipeline to avoid needlessly recompiling assets, especially given that some asset compilations can take many minutes or even hours to complete!
Game source assets are managed in a version control repository, such as Perforce or Subversion or even git(-lfs). Assets have all the same concerns about revisions, branching, and so on that coders deal with for source code.
Content creators fetch the latest set of assets a few times a day, potentially getting new revisions of hundreds of large assets. There’s branches to switch between. There’s local edits, and shared shelves or private branches for users to share their local edits before they are merged into the mainline branch. There’s a need to check out old revisions of assets, or to bisect between revisions of assets to find problems.
The compiled asset artifacts, however, do not live in version control. This is similar to how the object files from C or C++ source code are generally not checked in, which means both code and assets have the same concerns around fetching the latest revision possibly causing a lengthy rebuild. Unfortunately, the “compile time” lost to building source code is often a small fraction of the time lost to building asset artifacts.
It only gets trickier with game assets as we look deeper. The “correctness” of an asset artifact can change regularly throughout the course of development (as runtime engine formats change and evolve). Asset artifacts are a product of the source assets and the compiler that built them, plus other configuration or target factors. Identifying the correct asset artifact for a given combination of engine build + source asset revision + target platform gets complicated quickly. Storing compiled asset artifacts by their original asset identifier just doesn’t cut it when so many factors necessitate rebuilding the artifacts.
Files and Folders
All those source asset files are saved on disk in a traditional file structure. Version control software tracks files and hierarchies and recreates those when checking a revision out from the repository. That’s all well and good and is friendly to the content creators’ expectations.
In a simple asset pipeline, the compiled asset artifacts are also stored in a traditional file structure. Source assets are compiled into one or more artifacts and stored as files in a build folder using some deterministic file name derived from the source asset.
There’s a number of problems with storing assets as basic files, however. Our filesystems are optimized to help humans organize large numbers of mostly unrelated files. Filesystems aren’t designed to store large numbers of tightly-coupled, immutable objects.
Perhaps the biggest problem for the purposes of this article is that regular files and folders can’t deal with revisions or history very well. A file’s contents can be modified and there’s no way to restore the prior contents. There’s no way to know which version of a file’s contents actually exists on disk. For source assets (and code), the version control software does its best to manage all these states. We don’t store build artifacts in version control so we can’t rely on that software to help us for artifacts.
The mutability of regular files also plays havoc with synchronization or validation tools. Synchronizing files from a source folder to a destination folder is expensive. The tools can’t see which file contents are missing from the destination because the destination might have the file but with different contents. Synchronization thus requires comparing each file byte-for-byte. The software also has to decide what to do on a conflict: error, overwrite, copy to a new name, ask the user, or something else?
A naive approach to supporting branches or versions might be to make a separate folder for each. However, most source assets will be the same on each branch and hence most compiled artifacts will be the same. Storing them twice wastes a lot of space. Features like hard linking (allowing multiple file names to point to the same file object in the filesystem) let us eliminate that waste, but only if we somehow know a priori that the files will actually be identical.
We have a bunch of objects (asset files) we need to store, but identifying them by name alone incurs heavy overhead for synchronization and prevents us from cheaply switching between revisions or branches.
Content Addressable Storage
The CAS concept is an alternative approach to storing objects. The main idea of CAS is that objects are identified by their contents. All that means in practice is that the object identifier is a hash of the object’s contents instead of a name.
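The core idea fits in a few lines of code. Here’s a minimal in-memory sketch (SHA-256 is used here for illustration; the article doesn’t prescribe a particular hash function, and `ContentStore` is an invented name):

```python
import hashlib

class ContentStore:
    """Toy CAS: the identifier IS the hash of the contents.
    A real implementation would persist objects to disk or a database."""

    def __init__(self):
        self._objects = {}  # identifier -> bytes

    def put(self, data: bytes) -> str:
        """Store an object; its identifier is derived from its contents."""
        identifier = hashlib.sha256(data).hexdigest()
        self._objects[identifier] = data  # storing twice is a harmless no-op
        return identifier

    def get(self, identifier: str) -> bytes:
        return self._objects[identifier]

    def contains(self, identifier: str) -> bool:
        return identifier in self._objects

store = ContentStore()
obj_id = store.put(b"compiled texture data")
assert store.get(obj_id) == b"compiled texture data"
# Identical contents always map to the same identifier:
assert store.put(b"compiled texture data") == obj_id
```

Note that `put` never overwrites anything meaningful: if the identifier already exists, the contents are by definition the same.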
This has a ton of practical benefits. The biggest advantage of CAS for an asset pipeline is that we can reliably check if a specific version of an asset artifact has been stored into the CAS and hence avoid rebuilding the asset. We need to know the CAS identifier of potential build artifacts before we try building them, of course; we’ll discuss how that’s done later in this article.
The strict one-to-one mapping between identifier and content also guarantees that there can be no duplicate objects. Object contents do not need to be inspected to determine if two identifiers refer to the same data or not. This means that synchronizing two CAS object stores is about as cheap as it can get.
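That synchronization cheapness falls directly out of the identifiers: computing what’s missing from a destination store is a set difference over identifiers, with no content comparison at all. A sketch:

```python
def objects_to_transfer(source_ids, destination_ids):
    """Sync two CAS stores: only identifiers absent from the destination
    need to be copied; contents never need byte-for-byte comparison."""
    return set(source_ids) - set(destination_ids)

missing = objects_to_transfer({"aa11", "bb22", "cc33"}, {"bb22"})
assert missing == {"aa11", "cc33"}
```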
The uniqueness and immutability are key to how we address the permutation of asset artifacts. If the asset artifact produced by a build changes for any reason then it will have a different hash and different unique identifier in the CAS. Likewise, if a build process ever produces the same output for an asset, like if a user reverts some local changes or switches back and forth between two branches, then there is no extra storage required for the duplicate.
There is, however, one very glaring problem with a CAS: there are no file names!
The engine will load assets based on some relevant asset identifier, probably something derived from the filename of the original source asset (or an associated GUID or the like). The engine knows it’s looking for the map asset called “level one” but there are no names associated with objects in the CAS. A solution is required.
Thankfully, dear reader, we have such a solution!
Regular hierarchical filesystems don’t store names inside file objects, either. Names are stored in the folder objects on disk. Finding a file by name first requires finding the file’s parent folder. We basically need something similar to folders but for our CAS.
We’ll call these manifests instead of folders, as there are a few differences between the two. For starters, since any notion of hierarchy is typically unnecessary for runtime asset consumption, manifests can be flat. A single manifest object can be a big table of all asset identifiers.
The second big difference is more of an intrinsic property of using a CAS. The contents of the files referenced in a filesystem folder can change, but the contents of objects in a manifest are immutable by the very nature of a CAS. If a new version of an asset is written to the CAS then it will have a new hash and a new manifest will need to be generated that maps the asset name to the new CAS identifier.
The manifest is just another object. It’s some encoding of a list of asset names and a list of CAS identifiers. We can hash the manifest and thus we can store the manifest directly into the CAS itself. If we ensure the manifest contents are deterministically sorted, we thus gain all the uniqueness and de-duplication benefits of the CAS for our manifests.
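A manifest can be sketched as follows, assuming a JSON encoding with sorted keys for determinism (the encoding choice here is illustrative, not something the article prescribes):

```python
import hashlib
import json

def store_manifest(cas: dict, assets: dict) -> str:
    """Encode a name -> CAS-identifier table deterministically (sorted keys),
    then store the manifest itself as just another CAS object."""
    encoded = json.dumps(assets, sort_keys=True, separators=(",", ":")).encode()
    manifest_id = hashlib.sha256(encoded).hexdigest()
    cas[manifest_id] = encoded
    return manifest_id

cas = {}  # toy object store: identifier -> bytes
a = store_manifest(cas, {"level_one": "44f90c6", "hero_mesh": "9ab12de"})
b = store_manifest(cas, {"hero_mesh": "9ab12de", "level_one": "44f90c6"})
assert a == b  # insertion order doesn't matter: the encoding is deterministic
```

Because the encoding is deterministic, two builds that produce the same set of artifacts produce the same manifest object and it deduplicates like everything else.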
When the engine begins to load assets, all it needs now is the identifier of a particular manifest. From there, the engine can look up an asset name to find the CAS object identifier and load the contents. There’s not really any other difference to the engine; it has no need to ever care about branches or revisions or any of that. Shimming the manifest and CAS lookup into most engines is remarkably easy given that production engines already feature an abstraction layer for asset loading.
The only remaining problem is identifying the specific manifest to use. Since each build of a set of assets will potentially produce a new manifest, there will very quickly be more than one. The CAS itself won’t help us here; we will need some external mechanism to identify the “current” manifest identifier for use at runtime.
This is largely the problem of the asset build pipeline. Whenever the pipeline finishes processing a batch of assets and building the artifacts, it can generate a new manifest of all known assets and their build artifacts. The manifest identifier can be written to the asset database or to a file or other place where the engine and other tools can retrieve it.
A CAS all by itself doesn’t actually help us much.
Given a bunch of source assets, the build pipeline still has to do something to generate the CAS object identifiers. If the identifiers are derived from the contents of the build artifacts then the build pipeline must have some way to know the hashes of artifacts before it builds them. That’s certainly a problem.
The solution to this problem is caching build metadata. Recall that an asset compiler is dependent on various factors like the source asset and contents thereof, any dependency assets and their contents, compiler version, and other configuration. If any of these inputs change, the asset compiler must run and produce new build artifacts.
We can save this information. After building a set of artifacts, the compiler can write an additional object that contains the list of all objects it produced. Before rebuilding an asset, we can cheaply check for and load that object to see if we have known build artifacts and, if so, see if they’re all still stored in the CAS.
The tricky part is mapping all of those inputs to that output object. We don’t want to identify our list of artifacts by a hash of the list of artifacts. Instead, we want to identify our list of artifacts by a hash of the list of inputs to the asset compiler. It’s almost the same as the hashes for CAS identifiers… only it’s no longer “content addressable.”
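One way to sketch that input-derived key is to hash a deterministic encoding of the compiler’s inputs (the names here, like `texc-1.4`, are invented for illustration):

```python
import hashlib
import json

def build_key(source_hash: str, dep_hashes: list, compiler_version: str,
              config: dict) -> str:
    """Hash of the compiler *inputs*, used as the lookup key for the
    metadata object that lists which artifacts a build produced."""
    payload = json.dumps({
        "source": source_hash,
        "deps": sorted(dep_hashes),
        "compiler": compiler_version,
        "config": config,
    }, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

metadata = {}  # build_key -> list of artifact CAS identifiers

key = build_key("44f90c6", ["9ab12de"], "texc-1.4", {"platform": "ps5"})
metadata[key] = ["1f00aa3", "5c77e21"]  # recorded after a successful build

# Next time around: identical inputs -> identical key -> cache hit, no rebuild.
assert build_key("44f90c6", ["9ab12de"], "texc-1.4", {"platform": "ps5"}) in metadata
```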
The metadata can be stored in a very similar way as all the rest of the CAS objects. Depending on the strictness of the underlying CAS technology, the metadata could be written into the CAS with a “wrong” identifier. If all the CAS is doing is providing a way to map a hash key to an object, that would work just fine. However, it does complicate the ability to have integrity verification on the CAS which is a big loss.
Thankfully, even if the metadata isn’t stored in the CAS, most of the underlying code or technology for the CAS can be reused. It’s a small wrinkle, but one easily handled.
Actually determining how to store CAS objects can be nuanced. There’s a lot of possibilities.
A very simple option is to store CAS objects as loose files on the filesystem. Given a CAS identifier hash, convert that into a file path. For example, in a CAS located at
c:\build, the object
44f90c6 might be written as a file named
C:\build\objects\44\f9\44f90c6. Loose files can incur a lot of OS overhead, though, and they make it relatively easy to corrupt the CAS by modifying files.
We could take a page out of the book of big database systems and build a custom object store that’s backed by one or a small handful of native OS files and manage the indices and storage allocation ourselves. That can provide a lot of efficiency by reducing file I/O and system call overhead at runtime, which is the main reason game engines often ship assets in big pack files of some kind. However, a custom CAS like this is a lot of complex code.
An existing production database could be used directly. There are several database engines (both SQL and NoSQL) which can efficiently store and access (very) large
BLOB objects which lets them act like general purpose filesystems or object stores. Using such a database may not be as optimal as a custom solution can be, but they can be efficient enough for early development.
A very mature and battle-tested CAS for a large production game is probably going to benefit from the custom CAS storage mechanism. However, for most projects I’d actually recommend the loose filesystem, especially when first building out the technology!
Tossing a CAS onto an asset build pipeline can help users on their local machines when they switch branches or revisions. The CAS by itself isn’t going to do anything to help share the asset compilation load across a whole team or whole studio.
To level up the pipeline for the whole team, a remote cache is required.
The remote cache is some shared network-accessible location that stores asset build artifacts and build metadata. Since we’re using a CAS approach, the remote cache itself is yet another CAS object store.
Using a remote CAS is, on the surface, really simple. If the build pipeline is looking for an object but doesn’t find it in the local CAS, then the pipeline looks in one or more remote CAS locations. If the object is found remotely then the object is downloaded and stored in the local CAS. The pipeline then carries on as if the object had always been in the local CAS. That’s pretty much it.
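The local-then-remote fallthrough can be sketched with the stores modeled as plain dictionaries (a toy; real stores would be a local filesystem and an HTTP endpoint or network share):

```python
def fetch_object(identifier, local, remotes):
    """Look in the local store first; on a miss, try each remote in priority
    order, copying any hit into the local store before returning it."""
    if identifier in local:
        return local[identifier]
    for remote in remotes:
        if identifier in remote:
            local[identifier] = remote[identifier]  # backfill the local CAS
            return local[identifier]
    return None  # a genuine miss: the asset has to be built

local_cas = {}
office_cache = {"44f90c6": b"baked lighting"}
assert fetch_object("44f90c6", local_cas, [office_cache]) == b"baked lighting"
assert "44f90c6" in local_cas  # now cached locally for future lookups
```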
The remote cache may be slightly more than just the CAS, since those asset build metadata files will also need to be made available remotely. If you decide to be strict and store metadata separately from the CAS itself, that means that you need a (very slightly) more complicated API or configuration to handle both CAS objects and metadata objects.
The remote cache, including the CAS and metadata, would be populated by build servers. These might run on every commit to the asset repository, or nightly, or really whatever cadence makes sense for the project. The goal is to have the remote cache populated by the time most users would be likely to fetch new asset revisions from version control.
There may even be multiple remote caches. If a studio has multiple offices, for example, there may be a cache on each premises. A fully remote team might have caches in the cloud in different geographic regions. Different caches may be updated at different frequencies, be written by different jobs, or have different access controls.
It’s also possible to allow users’ machines to write to a remote cache. A relatively simple REST API can efficiently handle writes as easily as reads. A client would POST a list of object identifiers it wishes to read or write and would then receive the list of objects to GET or PUT, which reduces the total number of requests necessary. When an object is PUT the server can validate or compute the identifier. HTTP can handle streaming large files just fine, and HTTP/2 and HTTP/3 only improve that capability.
If there is a writeable remote cache, I recommend having two caches, with the first higher-priority (preferred for reads) cache being read-only and populated only by the stable build servers. The writeable cache should be wiped clean on a regular basis. Doing so reduces the risk that some unstable version of the tools on a user’s machine somehow writes bad data into the cache, which could then spread across the whole team.
Loose Files CAS
I recommended starting with a loose filesystem approach so I should probably dive into a bit more detail on the topic.
To recap, the loose filesystem approach means picking some folder, say
c:\build\objects for the sake of examples, and writing CAS objects as files in that folder. The name of the files would be the CAS identifier encoded in hexadecimal, possibly with the first octet or two used to form child folders (to avoid performance issues with having too many files in a single folder). An object identified by
44f90c6 would be written as a file named
C:\build\objects\44\f9\44f90c6. Pretty simple.
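The identifier-to-path mapping is nearly a one-liner; a sketch using Python’s `pathlib`, with the same two-octet fan-out as the example above:

```python
from pathlib import Path

def object_path(root: Path, identifier: str) -> Path:
    """Map a hex identifier to a loose file path, using the first two
    octets as subfolders so no single folder grows too large."""
    return root / identifier[0:2] / identifier[2:4] / identifier

p = object_path(Path(r"C:\build\objects"), "44f90c6")
assert p.parts[-3:] == ("44", "f9", "44f90c6")
```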
The main benefit of using loose files for CAS is that you can leverage tons of existing tooling. This is critical in early development. New code has bugs and your asset pipeline will be no exception. Aim for an implementation that is trivial to debug and comes with a ton of powerful debugging and maintenance tools built right into the operating system.
Being able to browse the CAS in File Explorer or a terminal window is a pretty big win. The files can be opened right up with an appropriate application or editor. With other CAS backend implementations you’d need an (admittedly very simple) custom tool to browse entries, and you’d have to first copy objects out into loose files before being able to open them in other programs.
Another benefit to loose files is that when someone is seeing a discrepancy in what an asset build produces for them vs what is produced on another machine, you can quickly find any differences between them using any diff tool out there. Using a custom object store means you’ll need to build custom tools for these comparisons (and hope those don’t have bugs). Even using an existing database for the object store means you’ll be limited to the specialized table diffing tools that exist for that database. It’s hard to argue with being able to diagnose asset pipeline builds with WinMerge or Beyond Compare or whatever your favorite diff tool might be.
Synchronization of two CAS object stores is also a lot easier using loose files. Just about every OS comes with some kind of GUI app and command line tool that lets you copy files from one place to another; you can populate one CAS from another with Windows Explorer. Automated processes can use
rsync rather than implementing and maintaining a custom process. Since CAS files are identified by their hash and should not be modified, there’s no need to overwrite files or compare contents.
However, a loose filesystem also lacks any innate protection against corruption or bugs. A bug in an asset pipeline could write a file with the wrong name/identifier. The user or some other local process might mutate an object stored as a loose file. Setting the files to read-only permissions helps avoid accidental writes, but doesn’t eliminate the possibility. These issues are rare, based on experience, but they do happen! A simple tool can be written (just a few lines in most scripting languages) to validate that the files in the CAS match their identifiers and remove those that do not.
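Such a validation tool might look like this sketch, assuming SHA-256 identifiers stored as hex file names:

```python
import hashlib
import tempfile
from pathlib import Path

def validate_cas(root: Path) -> list:
    """Re-hash every loose object and report files whose contents no longer
    match their name. A caller could delete these to repair the store."""
    corrupt = []
    for path in root.rglob("*"):
        if path.is_file():
            actual = hashlib.sha256(path.read_bytes()).hexdigest()
            if actual != path.name:
                corrupt.append(path)
    return corrupt

# Demo: one valid object and one tampered-with object.
root = Path(tempfile.mkdtemp())
data = b"compiled shader"
(root / hashlib.sha256(data).hexdigest()).write_bytes(data)
(root / ("0" * 64)).write_bytes(b"tampered")  # name doesn't match contents
assert [p.name for p in validate_cas(root)] == ["0" * 64]
```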
That need for CAS validation implies that storing build metadata in a loose filesystem CAS is an especially bad idea. Thankfully, using the loose filesystem means that there’s almost no logic to reimplement to store build metadata in a similar hash-based fashion. Store the metadata in
c:\build\meta instead of
c:\build\objects, or something like that.
Perhaps counter-intuitively, there are also a few potential performance advantages to the loose file store. Writing a new object to the CAS is a chicken-and-egg situation: until the full contents of the object are known, the object identifier cannot be determined. For small assets that can be generated in memory that’s no problem, but it can be an issue for very large objects. By leveraging the native filesystem, these objects can be streamed out to a temporary file and then renamed to their appropriate location in the CAS. This is very efficient on virtually every OS so long as the temporary file and CAS location are on the same disk. Consider using a
c:\build\temp folder or similar.
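A sketch of that stream-then-rename pattern, using `os.replace` for the atomic rename (the folder layout is flattened here for brevity; a real store would fan objects out into subfolders as described above):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def stream_into_cas(root: Path, chunks) -> str:
    """Stream data of unknown size to a temp file while hashing it, then
    rename the finished file to its content-derived name. The rename is
    cheap and atomic as long as the temp file is on the same disk."""
    root.mkdir(parents=True, exist_ok=True)
    hasher = hashlib.sha256()
    fd, tmp_name = tempfile.mkstemp(dir=root)
    with os.fdopen(fd, "wb") as tmp:
        for chunk in chunks:
            hasher.update(chunk)
            tmp.write(chunk)
    identifier = hasher.hexdigest()
    os.replace(tmp_name, root / identifier)  # atomic on the same volume
    return identifier

root = Path(tempfile.mkdtemp()) / "objects"
obj_id = stream_into_cas(root, [b"giant ", b"lightmap"])
assert (root / obj_id).read_bytes() == b"giant lightmap"
```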
That same benefit lends itself well to middleware. Plenty of products in the game industry have their own custom asset build tools that deposit a bunch of their artifacts into some target output directory with no means to slot in a custom file abstraction mechanism. These middleware files can also be cheaply renamed to their proper location in the CAS if using loose files, but would have to be copied if using a more sophisticated storage mechanism.
A large chunk of the efficiency problems with loose files can be avoided by using an index file. An index is a file (maybe a SQLite database) that stores all the object identifiers along with metadata like creation time and file size. That avoids the need to frequently invoke system calls or read individual files on disk. Since the index is not a source of truth, the index file can be deleted or recreated on demand; that implies there’s no need to migrate or upgrade index files if the underlying format ever needs to change. The index adds some small complexity, but it won’t be a single point of failure that invalidates the whole CAS; if it ever gets corrupted then delete and repopulate it.
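A disposable SQLite index along those lines might be rebuilt like this sketch (the schema is invented for illustration):

```python
import sqlite3
import tempfile
from pathlib import Path

def rebuild_index(root: Path) -> sqlite3.Connection:
    """Recreate the object index from scratch by scanning the loose files.
    The index is a disposable cache, never a source of truth: if it is
    ever corrupt or stale, delete it and rebuild."""
    conn = sqlite3.connect(":memory:")  # a real index would live on disk
    conn.execute(
        "CREATE TABLE objects (id TEXT PRIMARY KEY, size INTEGER, mtime REAL)")
    for path in root.rglob("*"):
        if path.is_file():
            st = path.stat()
            conn.execute("INSERT INTO objects VALUES (?, ?, ?)",
                         (path.name, st.st_size, st.st_mtime))
    conn.commit()
    return conn

root = Path(tempfile.mkdtemp())
(root / "44f90c6").write_bytes(b"artifact one")
(root / "9ab12de").write_bytes(b"artifact two")
index = rebuild_index(root)
assert index.execute("SELECT COUNT(*) FROM objects").fetchone()[0] == 2
```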
There’s always the option of a hybrid approach as well. The worst cases for a loose filesystem approach are many small files, and there will be plenty of small assets. A hybrid approach might use a database or custom mechanism for storing any object below a certain size threshold while using loose files for larger objects. Hybrid approaches are likely to be the most complex but also can offer “best of both worlds” efficiency.
Finally, with loose files, a remote cache can be a regular networked file share as supported by virtually every OS. There’s absolutely no custom software or extra code necessary to support remote access. There’s some IT setup and maintenance required but that’s true of every remote cache option. I very strongly recommend making the file share read-only and only populating it from the stable build jobs if using loose files, given the risk of unstable tools writing bad objects.
The final piece of the puzzle to putting a CAS into action is garbage collection (GC). A CAS is by its nature append-only, but we still need a way to remove stale or unused files, especially on users’ local machines. We’re talking about many gigabytes of assets, and even with the deduplication behavior of CAS, all those permutations of asset artifacts will quickly add up.
Automatic garbage collection is a process of finding the “garbage” (unused) files in the CAS and “collecting” (removing) them. The process of actually removing an item is easy but deciding which items to remove has some nuance. There is also the choice of how often to run the GC, though “daily” is a very safe bet.
The naive approach is to just delete CAS objects that are older than some epoch, like “7 days ago.” A slightly better approach is to age CAS objects based on last access time, though tracking access time adds a little overhead to every fetch, which can add up. I would suggest trying both creation time and access time and seeing which works better for your specific project and team.
Using time as the sole metric still has some problems. Objects in the CAS that are consumed by a stable playtest build or external builds would be expired unless those builds are actively used by the team. We want a way to pin some objects into the CAS so they aren’t expired.
Manifests once again provide a solution. The GC process can be provided one or more manifests corresponding to pinned builds. The GC would not expire an object from the CAS if it is referenced in any of the pinned manifests.
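A sketch of that pinned-manifest GC, with the object store modeled as a plain dictionary holding creation times (a toy stand-in for real file timestamps):

```python
import time

def collect_garbage(objects: dict, pinned_manifests, max_age_seconds, now=None):
    """Expire objects older than the cutoff unless any pinned manifest still
    references them. `objects` maps identifier -> (creation_time, data)."""
    now = now if now is not None else time.time()
    pinned = set()
    for manifest in pinned_manifests:
        pinned.update(manifest.values())  # manifest: asset name -> identifier
    expired = [obj_id for obj_id, (created, _) in objects.items()
               if now - created > max_age_seconds and obj_id not in pinned]
    for obj_id in expired:
        del objects[obj_id]
    return expired

week = 7 * 24 * 3600
store = {"old1": (0, b"stale"),
         "old2": (0, b"pinned"),
         "new1": (1000 + week, b"fresh")}
removed = collect_garbage(store, [{"level_one": "old2"}], week,
                          now=1000 + week + 1)
# "old1" aged out; "old2" is just as old but pinned; "new1" is too young.
assert removed == ["old1"]
assert set(store) == {"old2", "new1"}
```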
For users’ local machines, the only manifest likely to really matter is from their most recent build. Given that the GC is only running once a day or so and that objects are only expired if they’re at least a few days old, the current manifest is mostly only used to help pin stable objects that are in active use.
For a remote cache or build machine, the GC process might be fed a list of manifests corresponding to any pinned builds. A pinned build could be all the recent nightly builds, the latest stable QA approved builds, and any builds explicitly pinned (such as a build that is earmarked for an important upcoming demo). Since the team also needs the executable builds and installers – maybe even separate asset packs – in addition to the asset manifests, and a way to browse and find those builds, there is already a need for a separate tool or database that can manage the pinned builds and provide the list of manifests to the GC process.
Edit: An excellent additional note from David Clyde is to be aware of race conditions if GC is running at the same time as any write to a remote cache. The remote write process might return an indication that an asset is already present and doesn’t need to be uploaded right as the GC is marking that asset for collection. The low tech option is to block remote writes during GC or to only run GC in “off hours” (early morning, weekends, etc.). More sophisticated options that coordinate between the GC and writes are certainly possible for studios with 24/7 high-volume usage of the cache.
Users of git may have recognized a lot of the ideas mentioned in this article. Indeed, git is built on the idea of content addressable storage.
Files in git are identified by a SHA1 hash of their contents. Each commit corresponds to a tree object that maps names to file hashes, and the tree object is itself identified by a hash of its contents, just like a manifest object in our definition. Branches and tags pin specific commits and trees, and a
git gc command removes any unreferenced objects.
Under the hood, git takes the “hybrid” approach to storing objects. New objects are stored as loose files in the
.git/objects folder. This has all the pros and cons we mentioned previously about using a loose filesystem approach to storing CAS objects. Because of the performance degradation when reading lots of small files (which is especially common in a source code repository), git will periodically condense all those loose objects into pack files stored under .git/objects/pack.
The specific data structures used for git packfiles can be somewhat informative for our purposes, but remember that git is optimized for source code. The Linux kernel git repo, which is considered to be fairly large, is only ~2GB in size. The final asset build size of even a modest AA game is likely at least an order of magnitude larger, and the CAS is going to be larger still. The packfile mechanism used by git might work well for the many thousands of little asset artifact objects.
Remote access is a huge part of git and of course there are several protocols supported by git which may also be informative. In particular the “smart” HTTP protocol may be of interest, especially the portion for negotiating and uploading packs. I do not at all believe that just slotting in the git algorithms or formats is appropriate for our purposes, but they might help folks grok the concepts a bit better.
Games are really big, especially those in the AA and higher budget ranges. A modest AAA game might have hundreds of gigabytes of source assets that all need to be compiled into runtime-ready artifacts. Enabling the caching and reuse of asset artifacts will empower content creators to quickly fetch new versions of content, share in-development assets, switch between content branches, or check out older assets.
Using content addressable storage, manifests, and a little asset build metadata will provide a powerful mechanism to reuse and share asset build artifacts across an entire studio.
I’ve personally implemented most of the concepts described in this article and have some battle scars to show for it. While this article is at best just an overview, I hope it serves to share some of the useful knowledge I’ve acquired on the topic.
The advice to start simple and build initial implementations that focus on debuggability and tool reuse is born from experience gained both from following and from ignoring this advice. Build a reliable foundation first and you’ll have plenty of time to optimize and refine later. That applies to far more than game asset pipelines, too.
Reach out for feedback, questions, or discussion on Twitter!