Deployments

We run many different services and applications, of varying maturity and technology. However, we strive for commonality across our release and deployment processes.

Philosophy

The following philosophy applies to all controlled environments (production, staging, on-demand, etc.). Ideally, the same process should apply to all controlled environments for a project.

Any release process must satisfy the following criteria:

It must be documented by a “Runbook”.
It must use artifacts produced by a fully-tested continuous integration cycle.
It must support monitoring, verification, and rollback.
It must have an appropriate audit trail covering what happened during release.
It must produce releases that are backwards-compatible.
It must facilitate regular release - where “regular” means “many times each day”.
It must be operable by any developer - no gatekeepers, and no specialist knowledge.
Before a release to production can happen, it must be approved by another engineer (to ensure we never work alone).
It must scale to multi-region deployments.

Many of these are typically satisfied by basing the release process around Hercules, our ChatOps tool, through the #releases channel in Slack. Wherever possible, use Hercules for your release process. However, other release processes are viable as long as they satisfy the above criteria.

Runbooks

Each project must contain a runbook detailing:

Prerequisites required for deploying the service
Instructions on how to deploy, monitor, verify, and roll back the service

The easiest way to find deployment runbooks is from our development homepage.

Artifacts

All deployment artifacts must be generated by continuous integration.

No locally-built or otherwise-generated artifact should ever be deployed to any controlled environment.
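One way to enforce this rule is for the deploy step to refuse any artifact whose checksum was not recorded by CI. The sketch below is illustrative only: the manifest format (a mapping from artifact name to SHA-256 digest), the artifact name, and the function names are assumptions, not a description of our CI tooling.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def is_ci_artifact(name: str, data: bytes, ci_manifest: dict[str, str]) -> bool:
    """True only if CI recorded exactly these bytes under this name.

    `ci_manifest` is a hypothetical {artifact name: sha256 digest} mapping
    that the CI job would publish alongside the artifact.
    """
    expected = ci_manifest.get(name)
    return expected is not None and expected == sha256_hex(data)


# Example data: stands in for an artifact built and recorded by CI.
ci_build = b"contents of service-1.4.2.tar.gz"
manifest = {"service-1.4.2.tar.gz": sha256_hex(ci_build)}
```

A locally built file fails the check even if it reuses the artifact name, because its digest does not match the one CI recorded.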

Method of deployment

The method of deployment is service-dependent, so it must be described in detail in the appropriate runbook.

Wherever possible, releases should be performed via “Hercules”.

Approvals

Production deployment should never be conducted alone.

For this reason, every deployment process requires an explicit approval by another engineer immediately preceding release.
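As a minimal sketch of how a release tool might enforce this, the check below requires an approval from a different engineer that is still recent. The 15-minute freshness window and all names are assumptions made for illustration; this page only says the approval must immediately precede the release.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed freshness window; not a documented policy value.
MAX_APPROVAL_AGE = timedelta(minutes=15)


@dataclass(frozen=True)
class Approval:
    approver: str
    approved_at: datetime


def can_release(requester: str, approval: Optional[Approval], now: datetime) -> bool:
    """Allow a release only with a fresh approval from another engineer."""
    if approval is None:
        return False  # no approval at all
    if approval.approver == requester:
        return False  # self-approval means working alone
    age = now - approval.approved_at
    return timedelta(0) <= age <= MAX_APPROVAL_AGE
```

Passing `now` explicitly keeps the check deterministic and easy to test.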

Audit-trails

We must maintain an audit trail of releases, so that we can identify issues.

Production deployments should announce their actions in the Slack #releases channel.

Where possible, deployments should be registered with monitoring tools (e.g. Instana).
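To make "what happened during release" concrete, the sketch below builds one audit entry and renders it as a one-line channel message. The field names and message layout are assumptions, and the code deliberately stops short of sending anything: how the message reaches Slack or a monitoring tool is left to the actual release tooling.

```python
import json
from datetime import datetime, timezone


def audit_record(service: str, artifact: str, environment: str,
                 released_by: str, approved_by: str, action: str,
                 at: datetime) -> dict:
    """Build one audit entry; `action` is e.g. "deploy" or "rollback"."""
    return {
        "service": service,
        "artifact": artifact,
        "environment": environment,
        "released_by": released_by,
        "approved_by": approved_by,
        "action": action,
        # Store timestamps in UTC so entries from all regions sort together.
        "at": at.astimezone(timezone.utc).isoformat(),
    }


def channel_message(record: dict) -> str:
    """Render the entry as a single line suitable for a chat channel."""
    return (f"[{record['environment']}] {record['action']} of "
            f"{record['service']} {record['artifact']} by "
            f"{record['released_by']} (approved by {record['approved_by']}) "
            f"at {record['at']}")


# Hypothetical example values.
record = audit_record("billing", "billing-1.4.2", "production", "alice",
                      "bob", "deploy",
                      datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(json.dumps(record))
print(channel_message(record))
```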

Backwards-compatible

Deployments are typically rolled out (or rolled back) across multiple servers, so cannot be considered atomic.

Therefore, all deployments must be able to co-exist with the current live version.
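A common way to achieve this coexistence is for readers to tolerate data written by either version. The sketch below assumes a hypothetical order payload where the new version adds an optional "currency" field: missing fields get a default and unknown fields are ignored. The field names and the default are invented for illustration.

```python
def parse_order(payload: dict) -> dict:
    """Read an order written by either the old or the new version."""
    return {
        "id": payload["id"],
        "amount": payload["amount"],
        # Hypothetical field added by the new version; older writers omit it,
        # so fall back to an assumed default rather than failing.
        "currency": payload.get("currency", "USD"),
    }


old_payload = {"id": 1, "amount": 250}
new_payload = {"id": 2, "amount": 300, "currency": "EUR", "note": "ignored"}
```

Because unknown keys such as "note" are dropped rather than rejected, servers still running the old reader keep working while the new version rolls out, and the reverse holds during a rollback.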

Verification

A release must follow at least the following stages, each with appropriate verification:

Prior to release, code must have passed appropriate code, product, and UX standards
Release must receive approval from an engineer
Release the artifact to all servers
Post-release, verify the release is functioning correctly; if not, roll back

Ideally, we should include a “canary” stage. Here, the release process should be:

Prior to release, code must have passed appropriate code, product, and UX standards
Release must receive approval from an engineer
Release an artifact to a canary
Verify the canary release is functioning correctly; if not, roll back
Release the artifact to all servers
Post-release, verify the release is functioning correctly; if not, roll back
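The canary stages above can be sketched as a small control flow. This is an outline under stated assumptions, not our release tooling: `deploy` and `verify` are hypothetical callables supplied by the service, the target names "canary" and "all" are invented, "all" is taken to include the canary, and the standards checks and approval are assumed to have happened already.

```python
from typing import Callable


def release_with_canary(artifact: str, previous: str,
                        deploy: Callable[[str, str], None],
                        verify: Callable[[str], bool]) -> str:
    """Release `artifact` via a canary, rolling back to `previous` on failure."""
    deploy("canary", artifact)
    if not verify("canary"):
        deploy("canary", previous)  # roll back the canary only
        return "rolled back canary"

    deploy("all", artifact)
    if not verify("all"):
        deploy("all", previous)  # roll back every server
        return "rolled back all"

    return "released"
```

Rollback here is just another deploy of an existing artifact, so no build is needed at any point in the flow.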

The canary stage must deploy the artifact into the production environment. However, the canary process may take different forms:

In a multi-tenant system, the canary may be an internal-only tenancy, used for verification before rolling out to other tenancies.
The canary may serve a proportion of live traffic.
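For the second form, one possible way to send a fixed proportion of live traffic to the canary is to hash a stable request key into a bucket. The sketch below is an assumption about how this could be done, not a description of our routing layer; the choice of key (for example a user or tenant identifier) is hypothetical.

```python
import hashlib


def routes_to_canary(request_key: str, canary_percent: int) -> bool:
    """Deterministically send roughly `canary_percent`% of keys to the canary."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent
```

Hashing a stable key, rather than picking at random per request, keeps each user on the same version for the duration of the canary, which makes verification easier to reason about.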

Verification of each step must confirm that the artifact is functioning correctly. The process must be documented in the appropriate runbook. Wherever possible, verification should be automated with appropriate monitoring.

We should take care that our verification processes do not unduly affect customer analytics - verify against internal data wherever possible.

Rollback

The deployment must be able to roll back, leaving the environment in its previous state. Specific instructions on how to revert the environment should be in the service runbook.

The crucial element here is that rollback is based on artifacts, and must not require an additional build. We should remember that we may need to issue rollback deployments at any time, including when our source-control and build services are offline.
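As an illustration of artifact-based rollback, the sketch below picks the rollback target from a record of previously deployed artifacts, without touching source control or the build service. The history format (artifact identifiers ordered oldest to newest, with the live one last) is an assumption made for this example.

```python
from typing import Optional


def previous_artifact(history: list[str]) -> Optional[str]:
    """Return the most recent artifact that differs from the live one.

    `history` is ordered oldest to newest; the last entry is currently live.
    Returns None when there is nothing to roll back to.
    """
    if len(history) < 2:
        return None
    current = history[-1]
    for artifact in reversed(history[:-1]):
        if artifact != current:
            return artifact
    return None
```

The returned identifier would then be handed to the normal deploy step, so a rollback uses the same path as any other release.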

A note on continuous deployment

Continuous delivery and continuous deployment are very appealing options, but they take a lot of work to achieve reliably. The principles outlined above are a prerequisite for successful continuous deployment, so projects are welcome to implement it only if they already satisfy the requirements of artifact-based deployment.