Code Generation Strategies

Code generation is a time-honored technique where you take some high-level representation of a domain (like an API or storage model or process) and generate one or more artifacts from it - typically code, but often documentation or setup scripts too.

It's not always the right tool, but it can be a huge productivity booster if used properly, and so I'd like to cover some considerations today.

One of the last public projects where I did some of this work is the DirectX Shader Compiler (dxc for short), so I'll use it as a running example where possible.

Overview

Code generation follows a very simple process: Model -> Generator -> Artifact.

In slightly more sophisticated cases, the process can look like Model -> Processed Model -> Generator -> Artifact -> Post-Processing -> Final Artifact.
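To make that concrete, here's a minimal sketch of the simple form of the pipeline in Python; the model, names, and output file are all made up for illustration.

    # Minimal Model -> Generator -> Artifact pipeline; everything here is illustrative.
    model = [
        {"name": "Add", "operands": 2},
        {"name": "Negate", "operands": 1},
    ]

    def generate(model):
        # Generator: turn each model entry into one line of the artifact.
        lines = ["// Generated file - do not edit by hand."]
        for entry in model:
            lines.append(f"#define OP_{entry['name'].upper()}_ARITY {entry['operands']}")
        return "\n".join(lines) + "\n"

    # Artifact: write the generated text to disk.
    with open("ops.generated.h", "w") as f:
        f.write(generate(model))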

But First, Learn

The first piece of advice I have when working with code generators is not to start with them.

Instead, properly understand the problem you're trying to solve and work out a solution manually. You only need to do the manual implementation once (or once per variant you need to handle), but resist the temptation to jump straight to code generation. Doing it twice is even better, as it lets you zero in on what actually varies from one piece of data to another.

Your manual implementation gives you a tight loop for testing, as well as a known-good solution to contrast with the generated one when looking for errors or omissions.

Model

Models are domain-specific, so it's hard to speak generally.

There are many ways to represent a model, but whatever you choose, you'll want it to be easy to author, easy to review, and easy for the generator to consume.

For example, in dxc, there are two main sources of model information.

The Python script consists mostly of functions that populate "facts" that are easy to review.
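I won't reproduce the actual dxc declarations here, but a fact-populating model script might look something like this sketch; the class, names, and fields are hypothetical, not the real dxc ones.

    # Hypothetical fact-style model, not the real dxc declarations.
    class Intrinsic:
        def __init__(self, name, ret_type, params):
            self.name = name
            self.ret_type = ret_type
            self.params = params

    def populate_intrinsics(db):
        # Each call records one easy-to-review "fact" in the model database.
        db.append(Intrinsic("saturate", "float", ["value"]))
        db.append(Intrinsic("lerp", "float", ["a", "b", "t"]))
        db.append(Intrinsic("dot", "float", ["a", "b"]))

    db = []
    populate_intrinsics(db)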

The extra text file was taken as-is from a prior codebase for compatibility, to avoid regressions. This is more common - a plain text file with some domain-specific declarations.

In a different project, for example, C# files were used as a model for an object-oriented API with modern idioms.

Other common model formats include the usual suspects: XML, JSON, YAML, and relational databases (and/or their associated database schema).

Processed Model

Sometimes you can use your authored model as-is, and sometimes processing it further helps. If you do, it's very useful to be able to print out the processed model in its entirety, to help debug and avoid surprises.

In the dxc case, for example, the Python construction includes steps that use prior information to generate new information. For example, there is a step to gather the longest intrinsic function name, derived from what's declared explicitly.
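Continuing with the hypothetical model from earlier (not the real dxc code), such a step can be a single pass over the model, and it's also a natural place to dump the result:

    def compute_derived_facts(db):
        # Derived fact: the longest intrinsic name, computed from what was
        # declared explicitly rather than authored by hand.
        longest = max((i.name for i in db), key=len, default="")
        return {"longest_intrinsic_name": longest}

    derived = compute_derived_facts(db)
    print(derived)  # dump the processed model to help debug and avoid surprises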

This kind of processing can also be used to sanity-check your model. For example, dxc includes a step to check that there are no duplicate names in pass arguments.
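The same idea, sketched here as a duplicate-name check over the hypothetical intrinsics rather than dxc's actual pass arguments:

    def check_no_duplicate_names(db):
        # Fail loudly at generation time rather than emit a broken artifact.
        seen = set()
        for intrinsic in db:
            if intrinsic.name in seen:
                raise ValueError(f"duplicate intrinsic name: {intrinsic.name}")
            seen.add(intrinsic.name)

    check_no_duplicate_names(db)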

Generator

The code generator is the other primary component of the model / generator / artifact trio.

Generators will usually run once at build time, so they don't need to be especially fast. They shouldn't be too slow either, or they'll annoy developers and introduce enough friction that nobody wants to work with them. But in general, clarity and simplicity should trump performance concerns.

In terms of language selection, you'll often want to choose something that makes it easy to work with your model representation.

For dxc, Python was a good choice: the job needed a bit of file I/O and string processing, plus some runtime processing of basic rules into derived rules, and being interpreted made iterating very easy.
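As a sketch of what the generator end might look like - again building on the hypothetical model above, not dxc's real tables - the core of it is just string assembly and file I/O:

    def generate_header(db, derived):
        lines = [
            "// Generated file - do not edit by hand.",
            f"// Longest intrinsic name: {derived['longest_intrinsic_name']}",
            "enum class IntrinsicOp {",
        ]
        for intrinsic in db:
            lines.append(f"  {intrinsic.name},  // returns {intrinsic.ret_type}")
        lines.append("};")
        return "\n".join(lines) + "\n"

    with open("intrinsics.generated.h", "w") as f:
        f.write(generate_header(db, derived))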

If you're going to be working with source code as a model, often the availability of a parser will be your primary concern. So for example if you're using C# as your input, any .NET language that works with Roslyn would be a good choice. You do have to think about building this separately at that point though.

The level of maturity and sophistication you need from your generator will also matter. The bigger the project becomes, the more you'll want a mature, scalable toolset.

If you don't want to write your own script/program from scratch, there are also more special-purpose tools that might come in handy.
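Template engines are one such family of tools: the boilerplate lives in a template, and only the model-driven parts get filled in. Here's a small sketch using Jinja2, one popular option; the template itself is illustrative and reuses the hypothetical model from earlier.

    from jinja2 import Template  # pip install jinja2

    template = Template(
        "// Generated file - do not edit by hand.\n"
        "enum class IntrinsicOp {\n"
        "{% for i in intrinsics %}  {{ i.name }},\n{% endfor %}"
        "};\n"
    )
    print(template.render(intrinsics=db))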

Post-Processing

The downside of code generators is that while they're great at processing rules quickly, they can get messy as you teach them more and more corner cases and exceptions.

One approach that has worked well for me in the past is post-processing generated artifacts by patching files with good old diff and patch. I've covered the process of patching generated files before.

For example, if there's a change I'd like to see in the generated code, I'll create a patch for it and integrate it with the build system. I'll then file an item to go update the generator, but be immediately unblocked. When the generator learns about the new feature (if needed), I can go back and remove the patch.
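In the build, applying such a patch can be a single extra step after generation. Here's a hedged sketch using the standard patch tool; the file and patch names are made up for illustration.

    import subprocess

    # After the generator runs, apply a hand-maintained fixup on top of the
    # generated file; both paths here are illustrative.
    subprocess.run(
        ["patch", "build/intrinsics.generated.h", "patches/intrinsics-fixup.patch"],
        check=True,
    )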

Note that this sort of patching should be done very selectively; if you have a significant number of exceptions to your rules, you likely need to spend some time developing a more expressive model.

A different case is when you're simply pipelining tools. That's all fine - in fact, the most common use case is to feed the output to a compiler, which is a different kind of generator itself.

Final Artifact

At last, we got here. If all goes well, you have code that is clean, consistent, and generated from a very easy-to-understand description of your problem space.

Congratulations!

Let's talk about a couple of practical matters ...

Version Control

When it comes to version control, make sure you keep track not only of your source model files, but also of your generator (or at least which version you're using, if it's not part of your project) and your final artifacts.

Committing the final artifacts can be a bit controversial, but there are good reasons to do so - for instance, changes in the generated output show up as reviewable diffs, and consumers of the repository can build without installing the generator toolchain.

Build Integration

There are a few variations on how to integrate code generation into your build pipeline, if that's your scenario.

The least scalable approach, but still useful in the early stages, is to simply standardize generation with scripts and run them by hand.

For a more mature process, you'll want generation integrated into the build itself.

Locally, developers will typically build this as part of the project, using a script or command to specifically run code generation and to regenerate the patches that get committed as needed.
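If the final artifacts are committed, it's also worth a check (in CI, say) that the committed copy still matches what the generator produces. A sketch of that idea, with made-up paths and a hypothetical tools/generate.py entry point:

    import subprocess, sys
    from pathlib import Path

    # Regenerate into a scratch location, then compare with the committed copy.
    subprocess.run(
        [sys.executable, "tools/generate.py", "--out", "build/fresh.h"],
        check=True,
    )

    if Path("src/intrinsics.generated.h").read_text() != Path("build/fresh.h").read_text():
        sys.exit("Generated file is stale; rerun tools/generate.py and commit the result.")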

dxc took a somewhat different approach that is also worth calling out. Instead of keeping the final artifacts "on the side", they are often snippets that get replaced in-place in the code. For example, the hctdb_instrhelp.py file scans files for replacement comments and uses a helper CodeTags script to drive replacement in places like the DXIL opcode table.
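The general shape of that technique is easy to sketch: scan a file for begin/end marker comments and splice the generated text between them. The marker syntax and file name below are made up, not the actual CodeTags format.

    import re
    from pathlib import Path

    def replace_tagged_region(source, tag, generated):
        # Replace everything between the begin/end markers for `tag`,
        # keeping the markers so the file can be regenerated later.
        pattern = re.compile(
            rf"(// <gen:{tag}>\n).*?(// </gen:{tag}>)", re.DOTALL
        )
        return pattern.sub(lambda m: m.group(1) + generated + m.group(2), source)

    path = Path("DxilOpcodes.h")  # illustrative file name
    table = "  DXIL_OP(Add),\n  DXIL_OP(Negate),\n"  # would come from the generator
    path.write_text(replace_tagged_region(path.read_text(), "opcode-table", table))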

Happy code generation!

Tags: coding, design
