Rolling Updates for Zero Downtime and Easy Deployment

The Typical Deployment Problem

Let's say we made a multiplayer browser game running on 20 servers, with a capacity of 100 players each. As players connect, we send them to whichever server has room. With our 20 servers we can handle up to 2,000 concurrent users (CCU). Not bad!

Now let's say we add a feature or fix a bug in our game. What happens next?

If we upload the new game client, players may find that the game no longer works because the new game client doesn't work quite right with the existing game servers.
If we upload the new game server, players may find that the game no longer works because their outdated game client doesn't work quite right with the new game servers.
If we don't rename the new game client or employ cache busting, browsers will keep using the cached, outdated client, which will attempt to connect to our new game servers and leave players unable to play.
At some point, all of the players have refreshed their page and received the new game client, and all of the servers have been patched, and everything goes back to normal.
If we discover a bug in our new game client or our new game server, we're forced to either fix it on the spot and repatch everything or begin reverting everything, and either option can take a couple of hours.
By the time we've done 20 servers it has been 2-3 hours of focused work with a partially or wholly degraded production environment.

Many large companies have a cycle just like this, though they do it a little bit more professionally with "planned downtime." Automated deployment (which has become an industry unto itself) can help us with that last point by speeding up the patch time. Automated deployment is valuable, but it only solves one part of the problem. In the end we're using a hammer when what we needed was a new architect.

Decoupling the Website from the Game Client

The real problem in the typical deployment above is that the website has one version of the game client, which only works when it connects to a specific version of the game server. The solution is to break the 1:1 association between the website and the game client. You might be saying to yourself, "but the game client *is* the website." Consider this design instead:

we maintain a list of all game servers on a master server
when a new game server turns on, it contacts the master server
the master server records the ip and the *version* of the new game server
our website obtains the server list from our master server
when a server is selected, we load the matching game client

So if we have server v2.0.0, and someone tries to play on it, we load client v2.0.0.
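The master server's bookkeeping described above can be sketched roughly as follows. This is a minimal in-memory sketch, not the live implementation: `ServerRegistry` and its method names are made up for illustration, and a real master server would receive this data over an authenticated connection rather than direct function calls.

```javascript
// Hypothetical in-memory registry for the master server.
// The class and method names are illustrative assumptions.
class ServerRegistry {
    constructor() {
        // key: "ip/url", value: the state the instance last reported
        this.instances = new Map()
    }

    // Called when a game instance turns on and reports itself.
    register(ip, info) {
        const key = ip + '/' + info.url
        this.instances.set(key, { ip, ...info })
        return key
    }

    // Called when an instance disconnects, e.g. while it is being patched.
    remove(key) {
        return this.instances.delete(key)
    }

    // The list the website would obtain from the master server.
    list() {
        return Array.from(this.instances.values())
    }
}

const registry = new ServerRegistry()
const oldKey = registry.register('123.123.123.123', { version: '2.0.0', url: 1 })
registry.register('123.123.123.124', { version: '3.5.0', url: 1 })

// Each entry carries the version the website needs to pick a client.
console.log(registry.list().map(s => s.ip + ' runs v' + s.version))
```

Because the version is stored per instance, nothing in the registry cares whether the fleet is uniform or mixed.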

The game client and the website are no longer the same thing. The website becomes a loader of game clients (as well as fulfilling all its other jobs).

If we patch 5 out of our 20 servers to v3.5.0, any player that tries to play on those servers will load client v3.5.0. Meanwhile, the v2.0.0 servers are still online with players. A player can disconnect from the v3.5.0 server and come play on a v2.0.0 server -- nothing is broken.
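To make the mixed fleet concrete, here is a small sketch of the 5-of-20 scenario. The helper names are hypothetical, and the bundle path assumes client files are named by version (the same scheme the loader later in this article uses).

```javascript
// Hypothetical helpers for reasoning about a partially patched fleet.

// Path of the client bundle matching a server's version
// (assumes bundles are named game-v<version>.js).
function clientScriptFor(server) {
    return '/js/game-v' + server.version + '.js'
}

// Count instances per version, e.g. to see how far a rolling update has progressed.
function countByVersion(servers) {
    const counts = {}
    for (const server of servers) {
        counts[server.version] = (counts[server.version] || 0) + 1
    }
    return counts
}

// A fleet of 20 servers, the first 5 of which have been patched.
const fleet = []
for (let i = 0; i < 20; i++) {
    fleet.push({ url: i + 1, version: i < 5 ? '3.5.0' : '2.0.0' })
}

console.log(countByVersion(fleet))
console.log(clientScriptFor(fleet[0]))  // a patched server
console.log(clientScriptFor(fleet[19])) // an unpatched server
```

Both client versions stay available side by side, so the update can stop, continue, or be reverted at any point.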

Uploading a new game client no longer breaks existing servers, and patching a single server no longer commits us to a deployment binge that only ends when everything has been patched. It is now also possible to test a patch in production by only applying it to a single server (yes!!!).

Example Implementation

One size does not fit all. Below are some snippets showing key points. These are hypothetical, though I did mostly copy and paste them from a live product and remove some of the extra data for clarity.

A game instance registering itself with the master server:

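// Assumes fs (Node's built-in module), config, and TLSClient
// (see the tls-json link below) are already in scope.
this.connector = new TLSClient({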
    cert: fs.readFileSync(config.MASTER_SERVER_CERT),
    host: config.MASTER_HOST,
    reconnectInterval: 2000,
    port: config.MASTER_PORT,
    password: config.MASTER_SERVER_API_PASSWORD
})

this.connector.on('authenticated', () => {
    this.connector.send({
        version: config.GAME_VERSION, // e.g. "2.0.0"
        mode: "idle",
        url: 1,
        currentPlayers: 0,
        maxPlayers: 50
    })
})

An article is in the works explaining how to create and use a master server, though I have made the underlying tech open source: https://github.com/timetocode/tls-json.

The results of querying the master server's HTTP JSON API:

[
    {
        "ip": "::ffff:123.123.123.123",
        "subdomain": "us-west-1",
        "version": "2.0.0",
        "mode": "idle",
        "url": 1,
        "currentPlayers": 8,
        "maxPlayers": 50
    },
    {
        "ip": "::ffff:123.123.123.123",
        "subdomain": "us-west-1",
        "version": "2.0.0",
        "mode": "idle",
        "url": 2,
        "currentPlayers": 21,
        "maxPlayers": 50
    },
    {
        "ip": "::ffff:123.123.123.124",
        "subdomain": "us-west-2",
        "version": "2.0.0",
        "mode": "idle",
        "url": 1,
        "currentPlayers": 44,
        "maxPlayers": 50
    }
]
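For completeness, an endpoint that serves a list like this could look something like the sketch below. It uses only Node's built-in `http` module; the `/servers` path, the `createListApi` name, and the `getServers` callback are assumptions for illustration and are not taken from the live product.

```javascript
const http = require('http')

// Hypothetical: getServers is any function returning the current
// array of registered instances.
function createListApi(getServers) {
    return http.createServer((req, res) => {
        if (req.method === 'GET' && req.url === '/servers') {
            res.writeHead(200, { 'Content-Type': 'application/json' })
            res.end(JSON.stringify(getServers()))
        } else {
            res.writeHead(404)
            res.end()
        }
    })
}

const api = createListApi(() => [
    { version: '2.0.0', url: 1 }
])

// To actually serve requests, call listen with a port of your choosing:
// api.listen(8080)
```

The list is computed on every request, so the website always sees whichever instances are registered at that moment.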

The frontend website code using the server data above to dynamically load a game client:

function loadScript(src, callback) {
    let ready = false
    let scriptEle = document.createElement('script')
    scriptEle.type = 'text/javascript'
    scriptEle.src = src
    scriptEle.onload = scriptEle.onreadystatechange = function () {
        if (!ready && (!this.readyState || this.readyState === 'complete')) {
            ready = true
            callback(scriptEle)
        }
    }
    let tagEle = document.getElementsByTagName('script')[0]
    tagEle.parentNode.insertBefore(scriptEle, tagEle)
}

function startGameClientAndConnect(server) {
    let canvas = document.createElement('canvas')
    canvas.id = 'main-canvas'
    document.body.appendChild(canvas)

    loadScript('/js/game-v' + server.version + '.js', () => {
        // this script produces a 'window.gameClient'
        window.gameClient.connect('wss://' + server.subdomain + '.example.com/' + server.url)
    })
}

The above script, if passed the first server from the master API, would load https://example.com/js/game-v2.0.0.js and then attempt to connect to wss://us-west-1.example.com/1.
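What remains is choosing which server to pass in. Below is a minimal sketch that takes the first server with room, in the spirit of "send them to whichever server has room." The `pickServer` and `joinAnyServer` names and the `MASTER_API_URL` value are placeholders of mine; the real endpoint and selection policy will differ per game.

```javascript
// Hypothetical helper: return the first server that still has room,
// or null when every server is full.
function pickServer(servers) {
    const server = servers.find(s => s.currentPlayers < s.maxPlayers)
    return server || null
}

// Placeholder URL; the real master API endpoint is not shown in this article.
const MASTER_API_URL = 'https://example.com/api/servers'

// Fetch the list, pick a server, and hand it to the loader defined above.
async function joinAnyServer() {
    const response = await fetch(MASTER_API_URL)
    const servers = await response.json()
    const server = pickServer(servers)
    if (server) {
        startGameClientAndConnect(server)
    }
}

// Trying the selection logic on sample data:
const sample = [
    { subdomain: 'us-west-1', version: '2.0.0', url: 1, currentPlayers: 50, maxPlayers: 50 },
    { subdomain: 'us-west-1', version: '2.0.0', url: 2, currentPlayers: 21, maxPlayers: 50 }
]
console.log(pickServer(sample))
```

Since the chosen server object carries its own version, the selection code never needs to know which client versions exist.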

All of the pieces can be varied as needed by a game, but the above key concepts will achieve the decoupled architecture. In summary, the specific features are 1) a master server that holds a list of running instances, 2) instances that connect to the master server and share their game version, and 3) a game website that can load varying game client versions (and does, when the player joins a specific server).