The Ten-Factor CI Job

Introduction

The systems that build, test, validate, and deploy software are critically important to modern businesses, particularly with the rapid adoption of DevOps and Continuous Integration (CI) practices. Unfortunately, these CI systems themselves are rarely managed according to CI methodologies. Too often, they are operated with a break-fix mentality that is antithetical to performance-oriented systems thinking.

The output of the CI system is build artifacts to be tested and ultimately released into a production environment. Rapid and frequent releases of stable software can only be produced by a stable assembly line. Given that a CI outage can bring an otherwise productive group of engineers and developers to a screeching halt and destroy the reproducibility of previous runs, CI systems require the same level of care and attention as the software they are designed to deliver.

To operate a reliable CI system, each part of the system must rest on a firm foundation, following good principles at every step. These principles must be applied to every component; there's no such thing as "just a simple script". The most trivial script can drift out of compatibility over time and break the entire workflow.

The CI system is a software assembly pipeline. It’s composed of a sequence of jobs which take some input, process it to create derived artifacts, and pass these artifacts downstream. This mirrors the Unix tools design philosophy: programs which perform a single function and are easy to combine. Each step in a CI pipeline is like a Unix tool; it accepts a defined input, processes it efficiently, and produces an output.

In the world of software systems, the analog of a Unix tool is a 12-factor app. The 12-factor methodology was derived from the experience of Heroku's developers: after hosting millions of apps, Heroku became very prescriptive about how applications should be decomposed for maximum reliability and performance. In a nutshell, the 12 factors require each application to be broken down into distinct services and jobs, each of which:

  • Performs a specific function.
  • Is versioned, managed, and released independently.
  • Has fully declarative dependencies.
  • Is deployable into multiple environments without modification.
  • Can be easily linked up to other apps.

From a design standpoint, these characteristics are very similar to those of a CI pipeline job. This guide applies the relevant points of the 12-factor methodology to jobs in a CI pipeline.

Background

The contributors to this document have been involved in the creation and maintenance of many continuous integration pipelines and their associated jobs, including our work on the build and release system at Conjur.

This document synthesizes all of our experience and observations on a wide variety of approaches to enabling continuous integration at organizations with wildly different needs and compositions. Our focus is on the practices that improve the maintainability and usability of CI systems by encouraging participation from all stakeholders.

Our motivation is to raise awareness of some systematic problems we've seen in the implementation of continuous integration pipelines, to provide a shared vocabulary for discussing those problems, and to offer a set of broad conceptual solutions to those problems with accompanying terminology. The format is inspired by The Twelve Factor App.

Who should read this document?

Anyone working in software who writes tests or maintains continuous integration pipelines.

Differences between 12 factor apps and CI jobs

It’s worth discussing how CI jobs differ from 12-factor applications.

Application                     CI Job
-----------                     ------
Long-running process            Short-running process
Processes many transactions     Processes a single transaction
Returns response                Creates artifact

A 12-factor app typically uses HTTP, or a similar protocol. An incoming message contains request parameters; the parameters are processed, and a response is returned.

A CI job is almost always the combination of two things:

  1. A source code repository
  2. A small number (often zero) of input parameters, typically passed as environment variables

Source code is typically provided to the CI job by the CI framework, just as HTTP messages are parsed and handed to a web application by a web server.
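Concretely, a job invocation often reduces to little more than a checked-out repository plus a handful of environment variables. The sketch below assumes a hypothetical ci/test.sh script and illustrative variable names:

  # The CI framework has already checked the repository out into the workspace.
  # A small number of parameters arrive as environment variables (names are illustrative).
  export GIT_BRANCH="main"
  export RUN_SLOW_TESTS="false"
  ./ci/test.sh    # the job itself: a script versioned in the repository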

The Ten Factors

I. Codebase

One codebase tracked in revision control, many deploys

II. Dependencies

Explicitly declare and isolate dependencies

III. Config

Store config in the environment

IV. Backing Services

Treat backing services as attached resources

V. Build, release, run

Strictly separate build and run stages

VI. Processes

Execute the app as one or more stateless processes

VII. Disposability

Maximize robustness with fast startup and graceful shutdown

VIII. Dev/prod parity

Keep development, staging, and production as similar as possible

IX. Logs

Treat logs as event streams

X. Audit

Write build steps and artifacts to an audit system independent of any single CI component

I. Codebase

One codebase tracked in revision control, many deploys

This principle seems obvious to any seasoned developer. And yet... Do you have shell scripts pasted into a Jenkins server? What about ad-hoc jobs? Those little edit boxes that encourage you to paste in code are nefarious traps that undermine the robustness of your CI system.

Even if the code in those edit boxes is technically under revision control, how is it versioned? Can you re-create any build that you ran in the past? Or is it likely that some little edit box has been changed in an untrackable way? Do you have a catch-all “CI” scripts repository that is used by many jobs? This is a no-no: combining code for different jobs into a single repo makes it impossible to version them independently.

Test code should be versioned alongside the code it is testing.

Job configuration should also be tracked in source control. Committing to a repository on every job config change is good; defining the jobs as code is better [1]. Setting up job configurations by hand in user interfaces is error-prone, and a history of changes is important for rebuilding past artifacts.
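For example, the only thing pasted into a CI tool's edit box should be a single call into versioned code. This is a sketch; the script path and its contents are illustrative, not prescriptive:

  # The entire contents of the CI tool's edit box -- one command:
  ./ci/build.sh

  # ci/build.sh, versioned alongside the product, holds the real job logic:
  #!/bin/bash -eu
  bundle install
  bundle exec rake test

Because the real logic lives in ci/build.sh, every change to the job is a commit that can be reviewed, reverted, and matched to the build it produced.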


  • Only paste a single command into code boxes.
  • Document the operation of the job.
  • Store the job configuration in a source code repository.
[1] For example, if you're using Jenkins, use the Job DSL Plugin or a project like DotCI to define your jobs.

II. Dependencies

Explicitly declare and isolate dependencies

Ah yes. Dependencies. CI jobs have all kinds of dependencies:

  • Languages like Python and Ruby
  • Tools like Docker, Vagrant, Selenium, and AWS
  • Backing services like PostgreSQL

Without the right dependencies, the CI tools break. Worse yet, dependencies generally drift over time. When you install a dependency a year from now (after a CI server crashes, or you’re rebuilding or updating it), you’ll likely get something quite different from what you get today. So it’s not enough to carefully manage the CI code; you have to manage the dependencies just as carefully. Otherwise, the result is a system that works today, but:

  1. Won’t work tomorrow
  2. Makes non-reproducible artifacts

The tools for specifying dependencies are getting better, but it’s still very easy to leave a dependency underspecified and then experience a system failure when it’s unexpectedly upgraded in a backwards-incompatible way.

The only way to completely lock down the dependencies of a full system (e.g. a Linux machine) is to capture an image and tie the build job to that specific image ID. This used to be quite difficult; thanks to Docker, it’s now a lot easier. Containers are great for build jobs because:

  1. The image has all of its dependencies built in: OS, libraries, application code, everything.
  2. The image is locked to a specific version (SHA). It’s guaranteed to work the same in the future as it does today.

Just keep in mind that container technology is not a panacea. For example, you can’t containerize your Windows and OS X build jobs (yet). When you can't use a container, fall back on a traditional virtualized image. If you can't do that either, manage the build machine as strictly as you can using configuration management tools.

Sometimes dependencies cannot be locked to a specific version; for example, when you are writing a library. If you pin your dependencies to specific versions, you force the library's users to use those versions as well, which is often not desirable or even possible. In this case, pin each dependency to a major version and run your library's tests on a schedule to ensure that updated dependencies do not break functionality.
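As a sketch of running a build inside an image pinned by digest (the image name, digest placeholder, and commands are illustrative):

  # Pin the build environment to an exact image, not a floating tag.
  # Replace the placeholder with the digest of your own build image.
  docker run --rm \
    -v "$PWD:/src" -w /src \
    ruby@sha256:<digest-of-your-build-image> \
    bash -c "bundle install --deployment && bundle exec rake test"

A floating tag like ruby:latest can silently change underneath you; a digest cannot.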


  • Explicitly declare all dependencies.
  • Dependencies must be locked to specific versions (e.g. Gemfile.lock).
  • Run builds on strictly controlled containers or virtual machines.
  • Build servers (bare metal or base images) should have the minimum possible set of installed packages and configuration.

III. Config

Store config in the environment

Externalize the configuration of the job to the environment. This makes the job maximally flexible, ensures that the job is not tied to a specific file arrangement, and makes it easy to move secrets out of the source code.

This includes sensitive configuration. Secrets should not be checked into source control or entered into your CI system without encryption. Ephemeral or one-time credentials are best. Externalizing your secrets ensures that the code is not tied to a specific environment or user, prevents accidental leakage of secrets, and ensures that jobs can be run only by specifically privileged individuals and systems.
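A minimal sketch, assuming hypothetical variable names, of a job that reads its configuration from the environment and fails fast when something is missing:

  # Fail immediately if required configuration is absent, rather than partway through the job.
  : "${ARTIFACT_BUCKET:?ARTIFACT_BUCKET must be set}"
  : "${AWS_SECRET_ACCESS_KEY:?AWS_SECRET_ACCESS_KEY must be set}"   # injected by the CI system, never committed
  echo "Publishing release artifacts to ${ARTIFACT_BUCKET}" >&2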


  • Don’t rely on hard-coded paths to configuration files unless the whole file can be safely committed to the source control repo.
  • Externalize secrets out of the source code and out of images.

IV. Backing Services

Treat backing services as attached resources

Build jobs may depend on backing services such as a database, message queue, or caching service. Jobs should be sufficiently decoupled from these backing services that the physical location of the service (local or remote) does not matter to the job. This enables backing services to be swapped out and reconfigured without requiring modification, re-testing, or re-deployment of the job.

Things get really interesting when your CI system is building and testing a service-oriented application. Then, for acceptance testing, services will have dependencies on other upstream services. If your services have hard-coded assumptions about the location of upstream services, it makes it very hard to verify code along different development streams and branches.

Mocking out calls to backend services is an option to be considered carefully. Mocking is fine for unit tests, but smoke tests should also be run to ensure the real backend behaves as expected. The guidelines above apply to those smoke tests as well.
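As a sketch (DATABASE_URL is an illustrative variable name, and the commands assume a Ruby project), the job takes the service location from its configuration and behaves the same whether the database is local or remote:

  # Default to a local database for developers; CI overrides DATABASE_URL with a remote instance.
  DATABASE_URL="${DATABASE_URL:-postgres://postgres@localhost:5432/app_test}"
  psql "$DATABASE_URL" -c 'SELECT 1' >/dev/null    # smoke-check that the service is reachable
  DATABASE_URL="$DATABASE_URL" bundle exec rake test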


  • A job should work equally well with local or remote backing services.
  • The location of backing services should be provided to the job as configuration.
  • Any secrets needed to connect to a backing service should be provided to the job as configuration.

V. Build, Release, Run

Strictly separate build and run stages

Testing is an integral part of building, releasing and running software.

The jobs that make up the CI system are software. Therefore, they should be coded, tested, packaged, and released. In addition, job code must be meaningfully versioned, because the requirements of the product being built change over time.

It’s important that, as the job code changes in response to the product code, the association between older product code and older job code is not lost. Otherwise, it becomes very difficult to re-build historical packages and to effectively test and deliver hot fixes.
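One common arrangement, sketched below with hypothetical names, is to version the job code in the same repository as the product, so that checking out a historical tag also checks out the job code that built it:

  # Rebuilding a historical release: the same tag pins both product code and job code.
  git clone https://example.com/product.git
  cd product
  git checkout v1.4.2     # hypothetical release tag
  ./ci/release.sh         # the job code exactly as it existed at v1.4.2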


  • Create a release artifact for the job code.
  • Job code should have test cases, and a build and release process of its own.
  • Maintain version relationships between the product code and the job code.

VI. Processes

Execute the app as one or more stateless processes

Stateful CI jobs are a big problem. It’s very common for a job to rely on some specific filesystem configuration or for a sequence of jobs to use a common filesystem to pass information down the pipeline.

Not only are stateful jobs fragile, they are also not scalable. When build jobs are written with the assumption that they are operating on a shared filesystem, it’s impossible to scale the CI system out across a pool of worker machines.

In practice, this means that the CI server becomes more and more stressed until a breakdown ensues. The build team tries to scale out the CI server to multiple machines, only to find that the jobs won’t run because they are stateful.
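A sketch of passing artifacts downstream explicitly instead of through a shared workspace (the bucket name and paths are placeholders; BUILD_NUMBER stands in for whatever build identifier your CI system provides):

  # Upstream job: publish the artifact to external storage, keyed by build number.
  tar czf build-output.tar.gz dist/
  aws s3 cp build-output.tar.gz "s3://example-artifacts/${BUILD_NUMBER}/build-output.tar.gz"

  # Downstream job: fetch by the same key, assuming nothing about the local filesystem.
  aws s3 cp "s3://example-artifacts/${BUILD_NUMBER}/build-output.tar.gz" .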


  • Job code should not rely on the file system to persist data across runs.
  • Job code should be portable across build nodes.

VII. Disposability

Maximize robustness with fast startup and graceful shutdown

Jobs can sometimes take a long time to run. Track job durations so you can take action when new changes cause an unexpected spike in build time.

It’s common for a build manager to realize that a build job is going to fail and terminate it prematurely. When this happens, any resources created for the job should be released. Cloud instances should always be terminated, and Docker containers should be run with the --rm flag.
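A sketch of releasing resources even when the job is killed early (container and script names are illustrative):

  cleanup() {
    docker rm -f test-postgres >/dev/null 2>&1 || true   # remove the database container
    # terminate any cloud instances the job started here as well
  }
  trap cleanup EXIT
  trap 'exit 1' INT TERM    # convert signals into a normal exit so the cleanup runs

  docker run -d --name test-postgres postgres:13
  ./ci/test.sh              # if this fails or the job is aborted, cleanup still runs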


  • Job code should handle shutdown gracefully and not leave orphaned resources.

VIII. Dev/Prod Parity

Keep development, staging, and production as similar as possible

This principle is very important for CI, and very often overlooked. The most common violation is job code that will only run on the CI system, which makes it very difficult for developers to run, test, and modify that code.

Automation is often mistakenly identified as the objective of CI / CD. Actually, the objectives are reliability and predictability, followed by speed. Automation is a technique that can help improve predictability and speed, but it’s far too easy to build a system that’s automated yet unreliable. When automation breaks down (and it will, frequently), it’s important that:

  1. Troubleshooting is efficient
  2. Manual workarounds can be applied to keep important code moving through the pipeline, without compromising the quality of the output.

For these reasons, there shouldn’t be anything special about the CI environment that can’t be easily reproduced on a developer’s laptop (or on a cloud server).

One of the trickiest aspects of this is build credentials, aka secrets. For example, suppose a build job pushes an artifact to an S3 bucket for official releases. If the credential for that bucket is tied to the build machine, then there’s no way to produce an official release from any other environment when the build environment is broken or the job starts to fail.
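A sketch of the same entry point running on CI and on a developer laptop (the image name and script are hypothetical); the only environment-specific pieces are the injected credentials:

  # On CI or on a laptop: pull the published build image and run the same job code.
  docker pull example/builder:1.2.3
  docker run --rm \
    -v "$PWD:/src" -w /src \
    -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
    example/builder:1.2.3 ./ci/release.sh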


  • Base environments and dependencies used by the CI system should be easy for developers to provision and use.
  • Trusted developers should have access to secrets used by the CI system, in a controlled and audited way.
  • Manually built artifacts should be valid input for downstream automated jobs.
  • The CI system should be built from images that can be pulled and run by developers without the need to set up a complex build environment.

IX. Logs

Treat logs as event streams

Robust log capture is pretty well handled by standard CI systems.

This is largely because CI pipeline jobs are often shell-based, and shells have robust support for the standard log streams: stdout and stderr.

When a build job is a more complex piece of software, such as a Java program, it should log verbosely in order to assist with troubleshooting.

Finally, if you need to keep build logs for historical reasons, store them externally to your CI system.
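For example, keeping results on stdout and diagnostics on stderr lets downstream steps consume the output without scraping log noise (script and file names are illustrative):

  echo "INFO: packaging took 42s" >&2    # diagnostic, goes to stderr
  echo "dist/app-1.2.3.tar.gz"           # the result, goes to stdout

  # A downstream step can then capture just the result:
  ARTIFACT="$(./ci/package.sh 2>>package.log)"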


  • Build jobs should log verbosely.
  • Send command results to stdout and log messages to stderr.

X. Audit

Write build steps and artifacts to an audit system independent of any single CI component

The audit is the definitive record of the CI process.

Trusted developers can perform CI / CD operations manually, as long as they generate the necessary audit records.

The audit record provides detailed provenance for artifacts.
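A minimal sketch of writing an audit record to a store outside the CI server; the endpoint and field names are invented for illustration:

  # Collect metadata about this run.
  GIT_SHA="$(git rev-parse HEAD)"
  TIMESTAMP="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  # Post it to an audit store that outlives any single CI component.
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"job\": \"product-release\", \"git_sha\": \"$GIT_SHA\", \"artifact\": \"dist/app-1.2.3.tar.gz\", \"triggered_by\": \"$USER\", \"timestamp\": \"$TIMESTAMP\"}" \
    https://audit.example.com/api/records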


  • Write CI job metadata to an audit record, not just a log file.
  • Support manual / override workflows in addition to automation.