Rolling Updates for Zero Downtime and Easy Deployment

The Typical Deployment Problem

Let's say we made a multiplayer browser game running on 20 servers, with a capacity of 100 players each. As players connect, we send them to whichever server has room. With our 20 servers we can handle up to 2000 concurrent users (CCU), not bad!
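A minimal sketch of that assignment rule, assuming each server is tracked as a plain object. The `currentPlayers` and `maxPlayers` field names mirror the example data later in this article; `createFleet`, `totalCapacity`, and `assignPlayer` are hypothetical helpers made up for illustration.

```javascript
// Hypothetical in-memory fleet: serverCount servers with maxPlayers slots each.
function createFleet(serverCount, maxPlayers) {
    const servers = []
    for (let i = 0; i < serverCount; i++) {
        servers.push({ id: i + 1, currentPlayers: 0, maxPlayers: maxPlayers })
    }
    return servers
}

// Total capacity across all servers.
function totalCapacity(servers) {
    return servers.reduce((sum, s) => sum + s.maxPlayers, 0)
}

// Send a connecting player to the first server that has room.
// Returns the chosen server, or null when every server is full.
function assignPlayer(servers) {
    const server = servers.find(s => s.currentPlayers < s.maxPlayers)
    if (!server) return null
    server.currentPlayers++
    return server
}

const fleet = createFleet(20, 100)
console.log(totalCapacity(fleet)) // 2000
```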

Now let's say we add a feature or fix a bug in our game. What happens next?

Many large companies have a cycle just like this, though they do it a little bit more professionally with "planned downtime." Automated deployment (which has become an industry unto itself) can help us with that last bullet point by speeding up the patch time. Automated deployment is valuable but it only solves one part of the problem. In the end we're using a hammer when what we needed was a new architect.

Decoupling the Website from the Game Client

The real problem in the typical deployment above is that the website has one version of the game client, which only works when it connects to a specific version of the game server. The solution is to break the 1:1 association between the website and the game client. You might be saying to yourself, "but the game client *is* the website." Consider this design instead:

So if we have server v2.0.0, and someone tries to play on it, we load client v2.0.0.

The game client and the website are no longer the same thing. The website becomes a loader of game clients (as well as fulfilling all its other jobs).

If we patch 5 out of our 20 servers to v3.5.0, any player that tries to play on those servers will load client v3.5.0. Meanwhile, the 2.0.0 servers are still online with players. A player can disconnect from the v3.5.0 server and come play on a v2.0.0 server -- nothing is broken.
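To make the mixed-version state concrete, here is a small sketch. It assumes server entries carry a `version` field and that client builds follow the `/js/game-v<version>.js` naming used by the loader code later in this article; `clientScriptFor` and `clientScriptsInUse` are hypothetical names.

```javascript
// Hypothetical helper: the script path for a server's matching client build.
function clientScriptFor(server) {
    return '/js/game-v' + server.version + '.js'
}

// Every client build the website has to keep hosting while servers
// on different versions are online at the same time.
function clientScriptsInUse(servers) {
    return [...new Set(servers.map(clientScriptFor))].sort()
}

// 20 servers: 5 already patched to v3.5.0, 15 still on v2.0.0.
const fleet = []
for (let i = 0; i < 20; i++) {
    fleet.push({ url: i + 1, version: i < 5 ? '3.5.0' : '2.0.0' })
}

console.log(clientScriptsInUse(fleet))
// [ '/js/game-v2.0.0.js', '/js/game-v3.5.0.js' ]
```

Under these assumptions, once the last v2.0.0 server has been patched, its client script drops out of the list and can be retired.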

Uploading a new game client no longer breaks existing servers, and patching a single server no longer commits us to a deployment binge that only ends when everything has been patched. It is now also possible to test a patch in production by only applying it to a single server (yes!!!).
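One way to take advantage of this is to patch a single server first and roll out to the rest in groups afterwards. The sketch below only plans the order; `planRollout` and its `batchSize` parameter are hypothetical and not taken from the live product.

```javascript
// Split a fleet into patch batches: one test server first, then fixed-size groups.
// batchSize is a hypothetical knob chosen by whoever runs the rollout.
function planRollout(servers, batchSize) {
    if (batchSize < 1) throw new RangeError('batchSize must be at least 1')
    if (servers.length === 0) return []
    const batches = [[servers[0]]]
    for (let i = 1; i < servers.length; i += batchSize) {
        batches.push(servers.slice(i, i + batchSize))
    }
    return batches
}

const serverIds = Array.from({ length: 20 }, (_, i) => i + 1)
const plan = planRollout(serverIds, 5)
console.log(plan.map(batch => batch.length)) // [ 1, 5, 5, 5, 4 ]
```

Because unpatched servers keep working with their own client version, there is no deadline for moving from one batch to the next.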

Example Implementation

One size does not fit all. Below are some snippets showing key points. These are hypothetical, though I did mostly copy and paste them from a live product and removed some of the extra data for clarity.

A game instance registering itself with the master server:

this.connector = new TLSClient({
    cert: fs.readFileSync(config.MASTER_SERVER_CERT),
    host: config.MASTER_HOST,
    reconnectInterval: 2000,
    port: config.MASTER_PORT,
    password: config.MASTER_SERVER_API_PASSWORD
})

this.connector.on('authenticated', () => {
    this.connector.send({
        version: config.GAME_VERSION, // e.g. "2.0.0"
        mode: "idle",
        url: 1,
        currentPlayers: 0,
        maxPlayers: 50
    })
})

An article explaining how to create and use a master server is in the works. In the meantime, I have made the underlying tech open source: https://github.com/timetocode/tls-json.

The results of querying the master server's HTTP JSON API:

[
    {
        "ip": "::ffff:123.123.123.123",
        "subdomain": "us-west-1",
        "version": "2.0.0",
        "mode": "idle",
        "url": 1,
        "currentPlayers": 8,
        "maxPlayers": 50
    },
    {
        "ip": "::ffff:123.123.123.123",
        "subdomain": "us-west-1",
        "version": "2.0.0",
        "mode": "idle",
        "url": 2,
        "currentPlayers": 21,
        "maxPlayers": 50
    },
    {
        "ip": "::ffff:123.123.123.124",
        "subdomain": "us-west-2",
        "version": "2.0.0",
        "mode": "idle",
        "url": 1,
        "currentPlayers": 44,
        "maxPlayers": 50
    }
]

The frontend website code using the server data above to dynamically load a game client:

function loadScript(src, callback) {
    let ready = false
    let scriptEle = document.createElement('script')
    scriptEle.type = 'text/javascript'
    scriptEle.src = src
    scriptEle.onload = scriptEle.onreadystatechange = function () {
        if (!ready && (!this.readyState || this.readyState === 'complete')) {
            ready = true
            callback(scriptEle)
        }
    }
    let tagEle = document.getElementsByTagName('script')[0]
    tagEle.parentNode.insertBefore(scriptEle, tagEle)
}

function startGameClientAndConnect(server) {
    let canvas = document.createElement('canvas')
    canvas.id = 'main-canvas'
    document.body.appendChild(canvas)

    loadScript('/js/game-v' + server.version + '.js', () => {
        // this script produces a 'window.gameClient'
        window.gameClient.connect('wss://' + server.subdomain + '.example.com/' + server.url)
    })
}

The above script, if passed the first server from the master API, would load https://example.com/js/game-v2.0.0.js and then attempt to connect to wss://us-west-1.example.com/1.
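The snippets above do not show how the website decides which server to pass in. Here is one possible sketch: `chooseServer` and its "emptiest server with room" policy are assumptions for illustration, and `urlsFor` simply repeats the URL construction from `startGameClientAndConnect` so it can be checked without a browser.

```javascript
// Pick the emptiest server that still has room; null if all are full.
function chooseServer(servers) {
    const open = servers.filter(s => s.currentPlayers < s.maxPlayers)
    if (open.length === 0) return null
    return open.reduce((best, s) => (s.currentPlayers < best.currentPlayers ? s : best))
}

// The two URLs the loader derives from a master server entry.
function urlsFor(server) {
    return {
        script: '/js/game-v' + server.version + '.js',
        socket: 'wss://' + server.subdomain + '.example.com/' + server.url
    }
}

// Trimmed copy of the master server response shown earlier.
const servers = [
    { subdomain: 'us-west-1', version: '2.0.0', url: 1, currentPlayers: 8, maxPlayers: 50 },
    { subdomain: 'us-west-1', version: '2.0.0', url: 2, currentPlayers: 21, maxPlayers: 50 },
    { subdomain: 'us-west-2', version: '2.0.0', url: 1, currentPlayers: 44, maxPlayers: 50 }
]

const urls = urlsFor(chooseServer(servers))
console.log(urls.script) // /js/game-v2.0.0.js
console.log(urls.socket) // wss://us-west-1.example.com/1
```

In the browser, the chosen entry would be handed to `startGameClientAndConnect(server)` instead of being logged.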

All of the pieces can be varied as needed by a game, but the key concepts above will achieve the decoupled architecture. In summary, the specific features are 1) a master server that holds a list of running instances, 2) instances that connect to the master server and share their game version, and 3) a game website that can load varying game client versions (and does so when the player joins a specific server).