Game Development by Sean

Resource Pipelines (Part 5 - File Scanning)

In our previous installment of the series, we talked about file dependencies. This time we’re going to talk about their impact and applicability to the resources we store on disk, both in source and run-time formats.

Processing Files

It might seem weird that I'm only getting to this in part 5 of the series, given that processing files is kind of the whole point of the resource pipeline. I wanted to make sure we had some groundwork out of the way first, though.

The resource pipeline needs to find all given source files and run them through the pipeline’s processor. This will be the meat of the pipeline, handling things like texture compression, mesh optimization, and other kinds of format conversion.

The naive approach works remarkably well: find all the source resources in the game’s resource folder and process each one of them. Really, that works!
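
As a concrete illustration, here is a minimal sketch of that naive loop, assuming a C++17 toolchain with std::filesystem and leaving the actual conversion to a caller-supplied callback:

    #include <filesystem>
    #include <functional>

    namespace fs = std::filesystem;

    // Walk the entire resource folder and hand every source file to the converter.
    void process_all(const fs::path& resource_root,
                     const std::function<void(const fs::path&)>& process_resource)
    {
        for (const auto& entry : fs::recursive_directory_iterator(resource_root))
        {
            if (entry.is_regular_file())
                process_resource(entry.path());
        }
    }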

The optimized version of the naive approach minimizes processing by skipping any source whose run-time resource has already been generated and is up to date. This is a job for the dependency database in part, but mostly for a new conversion database.

A more refined approach can use a feedback loop from processing. This will mostly only be relevant in large projects with lots of history and lots of unused resources. This approach uses the dependency database to find the dependencies required by the final game and uses those as a seed input set, rather than scanning all inputs. Any dependencies discovered during this run of the processor are then also processed, allowing newly added files and dependencies to be picked up.
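
A sketch of that feedback loop, written as a simple work queue; the seed set and the converter callback are stand-ins for whatever the real dependency database and processor provide:

    #include <deque>
    #include <functional>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Process the seed set from the dependency database, then keep processing any
    // newly discovered dependencies until the queue drains. The callback converts
    // one resource and returns the source dependencies it discovered.
    void process_from_roots(const std::vector<std::string>& roots,
                            const std::function<std::vector<std::string>(const std::string&)>& process_resource)
    {
        std::deque<std::string> queue(roots.begin(), roots.end());
        std::unordered_set<std::string> seen(roots.begin(), roots.end());

        while (!queue.empty())
        {
            const std::string current = queue.front();
            queue.pop_front();

            // Feed any dependencies found during processing back into the queue.
            for (const std::string& dep : process_resource(current))
            {
                if (seen.insert(dep).second)
                    queue.push_back(dep);
            }
        }
    }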

For many projects, though, the naive or optimized naive approach will be more than sufficient. Larger AA or AAA games will need to pull out more stops to minimize and optimize processing.

Conversion Database

I previously mentioned a conversion database. The idea of this database is to track metadata about a processed source resource. An entry records the source file, its source dependencies, and the outputs it generated, storing not only names or resource identifiers but also hashes of the files.

This database allows us to detect when a source file has changed: if the hash stored in the database differs from the hash of the file on disk, the file has been edited and needs to be reprocessed.

If any of the generated files mentioned in the database are missing or have different hashes, even if the source is unchanged, the file should be reprocessed (or the generated files should be pulled from a cache; we'll talk about that later) so that the generated output folder matches the expected output for a given set of inputs.
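
To make that concrete, here is a rough sketch of what one entry and its up-to-date check might look like; the hash_file helper is hypothetical and stands in for whatever content hash the pipeline actually uses:

    #include <cstdint>
    #include <filesystem>
    #include <vector>

    namespace fs = std::filesystem;

    std::uint64_t hash_file(const fs::path& path);  // hypothetical content-hash helper

    struct ConversionEntry
    {
        struct FileRecord
        {
            fs::path path;
            std::uint64_t hash = 0;
        };

        FileRecord source;                // the source resource itself
        std::vector<FileRecord> inputs;   // source dependencies (e.g. a Material's Shader)
        std::vector<FileRecord> outputs;  // generated run-time files
    };

    // A resource is up to date only if the source, every source dependency, and
    // every generated output still match the hashes recorded at last conversion.
    bool is_up_to_date(const ConversionEntry& entry)
    {
        const auto matches = [](const ConversionEntry::FileRecord& record) {
            return fs::exists(record.path) && hash_file(record.path) == record.hash;
        };

        if (!matches(entry.source))
            return false;
        for (const auto& input : entry.inputs)
            if (!matches(input))
                return false;
        for (const auto& output : entry.outputs)
            if (!matches(output))
                return false;
        return true;
    }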

Source Scanning

For the purposes of checking if a source file has been converted properly, we also want to know the list of source dependencies and their hashes. As an example, consider a Material file. The specific baked data it contains will depend on its Shader file: for example, the list of shader uniforms supplied by the Material, the list of texture bindings, and so on. It is possible for the Shader to be changed while the Material file is left untouched, yet the generated run-time Material file will still need to be regenerated.

Much of the above shouldn’t sound like black magic. This is all pretty similar to what happens when we build source code with build systems like gmake, MSBuild, or ninja. The source build toolchains try to detect which source files have changed and recompile them into object files when needed; they try to avoid recompiling when unnecessary, and they have to scan header files in C/C++ projects because a change to a header can affect the resulting object file.

Many source build tools use only timestamps for checking if an object file is out-of-date. This can be sufficient for many cases, but can cause a large degree of recompilation (or reconversion for an asset pipeline) in other cases. Consider in particular switching branches in source control or checking out a new working copy; the files may not change but can still end up with new modification time stamps and hence a lot of needless recompilation. For C and C++ compilation, tools like ccache can alleviate the problem, though some build systems solve this by using hashes instead of time stamps (or use time stamps as a first-pass optimization to avoid recomputing hashes needlessly).
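
That last trick is simple enough to sketch: compare time stamps first and only fall back to hashing when they differ. The recorded values would live alongside the hashes in the conversion database; the hash function is again a hypothetical stand-in:

    #include <cstdint>
    #include <filesystem>

    namespace fs = std::filesystem;

    // Returns true if the file's contents actually changed since they were recorded.
    bool file_changed(const fs::path& path,
                      fs::file_time_type recorded_mtime,
                      std::uint64_t recorded_hash,
                      std::uint64_t (*hash_file)(const fs::path&))
    {
        // Cheap first pass: an unchanged time stamp means we skip hashing entirely.
        if (fs::last_write_time(path) == recorded_mtime)
            return false;

        // The time stamp differs (branch switch, fresh checkout, etc.);
        // only the content hash decides whether reconversion is needed.
        return hash_file(path) != recorded_hash;
    }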

That all applies to resource conversion. The conversion database is essentially just the metadata used for storing hashes and detecting out-of-date conversions. It can also function as the dependency database or a component thereof, but that's a separate topic.

Finding Dependencies

Resource processing is also the time for generating the dependency databases we've been talking about. While processing a given file, we have already read the source off disk and inspected its contents, so we can find the dependent files. While processing the file, we're also generating the run-time resources, and we know what the dependencies for those resources will be.

The problem here is that loading and parsing files can be expensive. This is particularly true for some engines and less an issue for others; using our Material example from above, some engines require loading a Material into the engine in order to inspect dependencies. Unity is an example of such an engine (though one would rarely need to actually scan for dependencies since Unity does that already).

However, a developer who needs to scan for dependencies on their own will find such scanning troublesome in Unity, because the material and its textures and shader must be loaded into memory just to look for the file's dependencies. A developer could write a custom .material parser, which isn't too bad in Unity since they're just YAML files, but the schema presents some complexities.

Namely, it's difficult to know for a given Unity YAML file which nodes are dependencies and which aren't. Unity stores a lot of data in their files, including a lot of different resource identifiers that aren't actually dependencies. Code that understands the specific structure of .material files would need to be carefully written to find the real dependencies, and that code would be fragile, potentially breaking with any Unity upgrade. Plus, that code would not be applicable to .unity scene files or other Unity metadata.

Stepping away from Unity for a moment, let's consider a custom engine with its own custom formats. While we would be in control of our formats and avoid surprise breakages, we also want to avoid writing a custom parser for every possible resource type when the only thing we want to do is scan for dependencies; some resources don't actually need any useful conversion (the source and run-time formats may be identical), and some tools may need to scan and update dependencies much faster than a full conversion would allow.

The nice thing about having references be readable from the file without custom per-type code is that this scanning can more easily be written in multiple languages. Having to rewrite the scanner for every possible resource type in both C++ and C#, for instance, would be a lot of work. Rewriting the parsing logic for the core common format is a much smaller task, and that work is reused as new resource types are added. This is handy for projects that use multiple languages, as many games do with a C++ core and a C# tool suite.

Simplifying the Schema

A solution is available! Namely, we want to simplify the schemas of our custom file formats. For any structured meta-data in our files, we can use a common format. Consider YAML, JSON, or a binary representation thereof.

Consider for a moment the Swagger (OpenAPI) 3.0 specification. Swagger is a schema for specifying Web APIs as a collection of endpoints; each endpoint is a sub-schema itself. Swagger allows specifying these sub-schemas inline or referencing them in external files. Sounds a lot like some of our game resources! Swagger uses nodes of the form $ref: /path/to/file when referencing an external file.

Now let's consider JSON Schema. It is a specification for describing the schema of arbitrary JSON documents. Like Swagger, it uses fields of the form "$ref": "/path/to/file" for sub-schema references.

The commonality here is the idea of $ref. It's a keyword that denotes a field which references another document. Having this kind of field makes it trivial to find all references within one of these YAML or JSON files without really needing to know anything at all about the rest of the file's schema: just scan for any $ref field and register it as a dependency. Done!

Our game resources are a little more complex what with soft and hard dependencies, but the gist is the same: we can look for a particular key and know that it's a reference. If our files are stored in JSON, for example (or at least the structured portions thereof), we can look for fields of the form fieldName: {"$ref": "soft|hard", "path": "/path/to/file"}.
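
Here is a sketch of such a scanner, using the nlohmann/json library for illustration; it walks the whole document and treats any object carrying a $ref key as a reference, without knowing anything else about the schema:

    #include <nlohmann/json.hpp>
    #include <string>
    #include <vector>

    struct Reference
    {
        std::string kind;  // "soft" or "hard"
        std::string path;
    };

    // Recursively visit every object and array in the document; any object with a
    // "$ref" key is registered as a dependency, schema unknown and irrelevant.
    void collect_refs(const nlohmann::json& node, std::vector<Reference>& out)
    {
        if (node.is_object())
        {
            if (node.contains("$ref") && node.contains("path"))
                out.push_back({node["$ref"].get<std::string>(), node["path"].get<std::string>()});
            for (const auto& item : node.items())
                collect_refs(item.value(), out);
        }
        else if (node.is_array())
        {
            for (const auto& element : node)
                collect_refs(element, out);
        }
    }

    // Usage:
    //   auto doc = nlohmann::json::parse(R"({"texture":{"$ref":"hard","path":"/path/to/texture"}})");
    //   std::vector<Reference> refs;
    //   collect_refs(doc, refs);  // one hard reference to /path/to/texture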

Note this only works if only real references take this form. Looking back at the Unity example, we could scan their YAML files for fields of the form guid: a_guid_here, but this fails because there are many GUIDs stored in these YAML files and not all of them are (active) references. There’s no common schema that we can use to reliably know what is or is not a reference. For a custom engine’s custom formats, we want to be more regular.

Note that I’m not really saying that all files must be an identical type of file. We can have structured text files, structured binary files, and mixed structured and unstructured binary files. We can also acknowledge that some resources will never have references and never need any specialized analysis for dependencies (e.g., textures generally don’t depend on anything) so their format is less relevant.

Serialization and Engine References

In a previous article I brought up the concepts of serialization and custom resource reference types. That was all in support of normalized resource formats.

We want to be able to easily generate and parse these formats. We want to make sure that serialization code like serialize(writer, "texture", m_texture) in our Material serialization functions writes a normalized reference in a form like "texture": {"$ref": "hard", "path": "/path/to/texture"}, instead of an inscrutable form like "texture": "/path/to/texture".
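
A rough sketch of how that might look, using a hypothetical ResourceRef type and nlohmann::json standing in for whatever writer the engine's serialization system actually uses:

    #include <nlohmann/json.hpp>
    #include <string>

    struct ResourceRef
    {
        std::string path;
        bool hard = true;
    };

    // Plain values serialize as ordinary fields...
    void serialize(nlohmann::json& writer, const char* name, const std::string& value)
    {
        writer[name] = value;
    }

    // ...but resource references always serialize in the normalized form, so the
    // dependency scanner never needs to understand the surrounding schema.
    void serialize(nlohmann::json& writer, const char* name, const ResourceRef& ref)
    {
        writer[name] = {{"$ref", ref.hard ? "hard" : "soft"}, {"path", ref.path}};
    }

    // serialize(material, "texture", ResourceRef{"/path/to/texture", true}) yields
    //   "texture": {"$ref": "hard", "path": "/path/to/texture"}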

Run-time File Dependencies

My examples thus far have mostly looked like JSON fragments for convenience's sake. While JSON is an acceptable format for source resources, JSON (or any text format) is somewhat imperfect as a run-time resource format.

This is what I was getting at before; there’s no need for all files to be in the same format, just that we be able to inspect files and find things like resource references with minimal special processing.

Storing run-time resources in an optimized binary format is fine. We’ll want to make sure that format still clearly differentiates between any file references and other data. The run-time format might only store a 64-bit file name hash if such is the engine’s resource identifier type; so long as this run-time format can clearly distinguish between resource identifiers and other arbitrary 64-bit integers, everything will be fine.
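
One way to do that, sketched under entirely made-up layout assumptions, is to put a reference table of identifier hashes at the front of the file so a scanner can read the references without understanding the payload that follows:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct RuntimeResourceHeader
    {
        std::uint32_t magic = 0;            // format identifier, layout illustrative only
        std::uint32_t reference_count = 0;  // number of 64-bit identifiers that follow
    };

    // Assumed file layout:
    //   RuntimeResourceHeader | reference_count x uint64 identifier hashes | payload
    // The payload refers to these references by index; a dependency scanner reads
    // only the header and the table and never needs to understand the payload.
    std::vector<std::uint64_t> read_references(const std::vector<std::uint8_t>& file_bytes)
    {
        RuntimeResourceHeader header{};
        if (file_bytes.size() < sizeof(header))
            return {};
        std::memcpy(&header, file_bytes.data(), sizeof(header));

        std::vector<std::uint64_t> refs(header.reference_count);
        if (refs.empty() || file_bytes.size() < sizeof(header) + refs.size() * sizeof(std::uint64_t))
            return {};
        std::memcpy(refs.data(), file_bytes.data() + sizeof(header),
                    refs.size() * sizeof(std::uint64_t));
        return refs;
    }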

Ideally this structured format will be similar for any run-time optimized binary files, just to simplify the code that has to be written. This isn’t a hard rule, just a good goal to aim for.

Summary

Hopefully I’ve made the case for standardizing reference types in resource files.

Without the care and effort to make resources use a common format, either a lot more code must be written to scan for dependencies efficiently, or inefficient processes like full conversion will be required just to find them.

Using normalized reference formats makes it easier to rewrite scanning code in multiple languages, to handle file format upgrades and backwards compatibility, and to slot the approach into an in-engine serialization system.