Code Generation Strategies

Code generation is a time-honored technique where you take some high-level representation of a domain (like an API or storage model or process) and generate one or more artifacts from it - typically code, but often documentation or setup scripts too.

It's not always the right tool, but it can be a huge productivity booster if used properly, and so I'd like to cover some considerations today.

One of the last public projects where I did some of this work is the DirectX Shader Compiler (dxc for short), so I'll use it as a running example where possible.

Overview

Code generation follows a very simple process: Model -> Generator -> Artifact.

In slightly more sophisticated cases, the process can look like Model -> Processed Model -> Generator -> Artifact -> Post-Processing -> Final Artifact.
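To make that concrete, here's a minimal sketch of the simple form of the pipeline in Python; the model, names, and output file are all made up for illustration.

    # Minimal Model -> Generator -> Artifact pipeline; everything here is illustrative.
    model = [
        {"name": "Add", "operands": 2},
        {"name": "Negate", "operands": 1},
    ]

    def generate(model):
        # Generator: turn each model entry into one line of the artifact.
        lines = ["// Generated file - do not edit by hand."]
        for entry in model:
            lines.append(f"#define OP_{entry['name'].upper()}_ARITY {entry['operands']}")
        return "\n".join(lines) + "\n"

    # Artifact: write the generated text to disk.
    with open("ops.generated.h", "w") as f:
        f.write(generate(model))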

But First, Learn

The first piece of advice I have when working with code generators is not to start with them.

Instead, properly understand the problem you're trying to solve and work out a solution manually. You only need to do the manual implementation once (or once per variant you need to handle), but resist the temptation to jump straight to code generation. Doing it twice is even better, as it lets you zero in on what actually varies from one piece of data to another.

Your manual implementation gives you a tight loop for testing, as well as a known-good solution to contrast with the generated one when looking for errors or omissions.

Model

Models are domain-specific, so it's hard to speak generally.

There are many ways to represent a model, but whatever you choose, you'll want it to be easy to author, easy to review, and easy for the generator to consume.

For example, in dxc, there are two main sources of model information.

The Python script consists mostly of functions that populate "facts" that are easy to review.
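I won't reproduce the actual dxc declarations here, but a fact-populating model script might look something like this sketch; the class, names, and fields are hypothetical, not the real dxc ones.

    # Hypothetical fact-style model, not the real dxc declarations.
    class Intrinsic:
        def __init__(self, name, ret_type, params):
            self.name = name
            self.ret_type = ret_type
            self.params = params

    def populate_intrinsics(db):
        # Each call records one easy-to-review "fact" in the model database.
        db.append(Intrinsic("saturate", "float", ["value"]))
        db.append(Intrinsic("lerp", "float", ["a", "b", "t"]))
        db.append(Intrinsic("dot", "float", ["a", "b"]))

    db = []
    populate_intrinsics(db)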

The extra text file was taken as-is from a prior codebase for compatibility, to avoid regressions. This is more common - a plain text file with some domain-specific declarations.

In a different project, for example, C# files were used as a model for an object-oriented API with modern idioms.

Other common model formats include the usual suspects: XML, JSON, YAML, and relational databases (and/or their associated database schema).

Processed Model

Sometimes you can use your authored model as-is, and sometimes processing it further helps. If you do, it's very useful to be able to print out the processed model in its entirety, to help debug and avoid surprises.

In the dxc case, for example, the Python construction includes steps that use prior information to generate new information. For example, there is a step to gather the longest intrinsic function name, derived from what's declared explicitly.
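Continuing with the hypothetical model from earlier (not the real dxc code), such a step can be a single pass over the model, and it's also a natural place to dump the result:

    def compute_derived_facts(db):
        # Derived fact: the longest intrinsic name, computed from what was
        # declared explicitly rather than authored by hand.
        longest = max((i.name for i in db), key=len, default="")
        return {"longest_intrinsic_name": longest}

    derived = compute_derived_facts(db)
    print(derived)  # dump the processed model to help debug and avoid surprises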

This kind of processing can also be used to sanity-check your model. For example, dxc includes a step to check that there are no duplicate names in pass arguments.
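The same idea, sketched here as a duplicate-name check over the hypothetical intrinsics rather than dxc's actual pass arguments:

    def check_no_duplicate_names(db):
        # Fail loudly at generation time rather than emit a broken artifact.
        seen = set()
        for intrinsic in db:
            if intrinsic.name in seen:
                raise ValueError(f"duplicate intrinsic name: {intrinsic.name}")
            seen.add(intrinsic.name)

    check_no_duplicate_names(db)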

Generator

The code generator is the other primary component of the model / generator / artifact trio.

Generators will usually run once at build time, so they don't need to be especially fast. They shouldn't be too slow either, or they'll annoy developers and introduce enough friction that nobody wants to work with them. But in general, clarity and simplicity should trump performance concerns.

In terms of language selection, you'll often want to choose something that makes it easy to work with your model representation.

For dxc, Python was a good choice: the job needed a bit of file I/O and string processing, plus some runtime processing of basic rules into derived rules, and being interpreted made iterating very easy.
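As a sketch of what the generator end might look like - again building on the hypothetical model above, not dxc's real tables - the core of it is just string assembly and file I/O:

    def generate_header(db, derived):
        lines = [
            "// Generated file - do not edit by hand.",
            f"// Longest intrinsic name: {derived['longest_intrinsic_name']}",
            "enum class IntrinsicOp {",
        ]
        for intrinsic in db:
            lines.append(f"  {intrinsic.name},  // returns {intrinsic.ret_type}")
        lines.append("};")
        return "\n".join(lines) + "\n"

    with open("intrinsics.generated.h", "w") as f:
        f.write(generate_header(db, derived))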

If you're going to be working with source code as a model, often the availability of a parser will be your primary concern. So for example if you're using C# as your input, any .NET language that works with Roslyn would be a good choice. You do have to think about building this separately at that point though.

The level of maturity and sophistication you need from your generator will also matter. The bigger the project becomes, the more you'll want a mature, scalable toolset.

If you don't want to write your own script/program from scratch, there are also more special-purpose tools that might come in handy.
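Template engines are one such family of tools: the boilerplate lives in a template, and only the model-driven parts get filled in. Here's a small sketch using Jinja2, one popular option; the template itself is illustrative and reuses the hypothetical model from earlier.

    from jinja2 import Template  # pip install jinja2

    template = Template(
        "// Generated file - do not edit by hand.\n"
        "enum class IntrinsicOp {\n"
        "{% for i in intrinsics %}  {{ i.name }},\n{% endfor %}"
        "};\n"
    )
    print(template.render(intrinsics=db))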

Post-Processing

The downside of code generators is that while they're great at processing rules quickly, they can get messy as you teach them more and more corner cases and exceptions.

One approach that has worked well for me in the past is post-processing generated artifacts by patching files with good old diff and patch. I've covered the process of patching generated files before.

For example, if there's a change I'd like to see in the generated code, I'll create a patch for it and integrate it with the build system. I'll then file an item to go update the generator, but be immediately unblocked. When the generator learns about the new feature (if needed), I can go back and remove the patch.
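In the build, applying such a patch can be a single extra step after generation. Here's a hedged sketch using the standard patch tool; the file and patch names are made up for illustration.

    import subprocess

    # After the generator runs, apply a hand-maintained fixup on top of the
    # generated file; both paths here are illustrative.
    subprocess.run(
        ["patch", "build/intrinsics.generated.h", "patches/intrinsics-fixup.patch"],
        check=True,
    )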

Note that this sort of patching should be done very selectively; if you have a significant number of exceptions to your rules, you likely need to spend some time developing a more expressive model.

A different case is when you're simply pipelining tools. That's all fine - in fact, the most common use case is to feed the output to a compiler, which is a different kind of generator itself.

Final Artifact

At last, we got here. If all goes well, you have code that is clean, consistent, and generated from a very easy-to-understand description of your problem space.

Congratulations!

Let's talk about a couple of practical matters ...

Version Control

When it comes to version control, make sure you keep track not only of your source model files, but also of your generator (or at least which version you're using, if it's not part of your project) and your final artifacts.

Committing the final artifacts can be a bit controversial, but there are good reasons to do so - for instance, changes in the generated output show up as reviewable diffs, and consumers of the repository can build without installing the generator toolchain.

Build Integration

There are a few variations on how to integrate code generation into your build pipeline, if that's your scenario.

The least scalable approach, but still useful in the early stages, is to simply standardize generation with scripts and run them by hand.

For a more mature process, you'll want generation integrated into the build itself.

Locally, developers will typically build this as part of the project, using a script or command to specifically run code generation and to regenerate the patches that get committed as needed.
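If the final artifacts are committed, it's also worth a check (in CI, say) that the committed copy still matches what the generator produces. A sketch of that idea, with made-up paths and a hypothetical tools/generate.py entry point:

    import subprocess, sys
    from pathlib import Path

    # Regenerate into a scratch location, then compare with the committed copy.
    subprocess.run(
        [sys.executable, "tools/generate.py", "--out", "build/fresh.h"],
        check=True,
    )

    if Path("src/intrinsics.generated.h").read_text() != Path("build/fresh.h").read_text():
        sys.exit("Generated file is stale; rerun tools/generate.py and commit the result.")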

dxc took a somewhat different approach that is also worth calling out. Instead of keeping the final artifacts "on the side", they are often snippets that get replaced in-place in the code. For example, the hctdb_instrhelp.py file scans files for replacement comments and uses a helper CodeTags script to drive replacement in places like the DXIL opcode table.
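The general shape of that technique is easy to sketch: scan a file for begin/end marker comments and splice the generated text between them. The marker syntax and file name below are made up, not the actual CodeTags format.

    import re
    from pathlib import Path

    def replace_tagged_region(source, tag, generated):
        # Replace everything between the begin/end markers for `tag`,
        # keeping the markers so the file can be regenerated later.
        pattern = re.compile(
            rf"(// <gen:{tag}>\n).*?(// </gen:{tag}>)", re.DOTALL
        )
        return pattern.sub(lambda m: m.group(1) + generated + m.group(2), source)

    path = Path("DxilOpcodes.h")  # illustrative file name
    table = "  DXIL_OP(Add),\n  DXIL_OP(Negate),\n"  # would come from the generator
    path.write_text(replace_tagged_region(path.read_text(), "opcode-table", table))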

Happy code generation!

Tags: coding, design
