Shifting traffic during a regional failure


Traffic failover, or simply failover, refers to the automatic process of directing all or a portion of network traffic from one region’s service cluster to another region’s service cluster when the first region suffers network degradation. The regions are typically geographically separated, but failover can also occur in other scenarios. Because traffic flows to a new location, users can still access the service. They will likely experience some added latency, but continual uptime is worth the trade-off.

In this guide, we will cover how to set up failover between an edge and an example service. We will first install the Greymatter bridge proxy that connects the two regions, then apply the failover and health checking configurations on the edge-to-service upstream. By the end of the guide, traffic directed at your service will be shifted to the failover cluster whenever Greymatter detects that the service is unhealthy.

Prerequisites

  • Two Greymatter 1.8.2+ installations

  • A tenant project

  • A deployed service in one Greymatter mesh with an identical service running in another mesh

1. Create the bridge proxy

We utilize a proxy to bridge the network gap between meshes. By doing so, we can easily set robust security and networking policies between both meshes. Failover traffic will flow through this bridge into the next region.

1a. Generate the bridge proxy configuration

Enter your tenant project directory and run from the root:

greymatter init bridge -n <project namespace>

This command will initialize your bridge proxy Kubernetes manifest and GSL file.

1b. Edit the metrics options

Open the newly created GSL configuration file. In the primary listener block, add the following struct:

metrics_options: {
  impersonate_proxy: [{
    name:      "<name of the service with failover>"
    namespace: "<namespace of the service>"
  }]
}

This option allows the bridge to impersonate the service for health checking purposes.
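For example, if the service that will fail over is defined in GSL as catalog in the apps namespace (both names hypothetical), the block would look like:

metrics_options: {
  impersonate_proxy: [{
    // Hypothetical names; use your service's GSL name and namespace.
    name:      "catalog"
    namespace: "apps"
  }]
}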

1c. Connect the proxy to the failover mesh

Still inside the primary listener block, add a route match for incoming requests:

routes: {
  "<route path to match>": {
    prefix_rewrite: "<url path expected by the second mesh>"
  }
}

The prefix_rewrite field lets you change the URL path sent to the second mesh. The rewritten path should end in a trailing slash to avoid redirects. If the second mesh ingress can route the request without any changes, drop the field.
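As a sketch, suppose requests arrive at /catalog and the second mesh expects /services/catalog/ (both paths hypothetical):

routes: {
  "/catalog": {
    // Rewrite to the path the second mesh ingress expects,
    // with a trailing slash to avoid redirects.
    prefix_rewrite: "/services/catalog/"
  }
}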

Inside the route block, add an upstream pointing at the ingress of the second mesh:

upstreams: {
  "<upstream name>": {
    gsl.#Upstream
    instances: [{
      host: "<address of the second mesh ingress>"
      port: <port of the second mesh ingress>
    }]
  }
}
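A filled-in sketch, assuming the second mesh's edge ingress is reachable at edge.mesh-two.example.com on port 10808 (both values hypothetical; use your own ingress address):

upstreams: {
  "mesh-two-ingress": {
    gsl.#Upstream
    instances: [{
      // Hypothetical address of the second mesh's edge ingress.
      host: "edge.mesh-two.example.com"
      port: 10808
    }]
  }
}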

Before moving on, make sure to set all the necessary security and networking policies on the bridge.

Dry run sync to catch any errors:

greymatter sync --dry-run

Commit and push your updates, then deploy the bridge proxy by running:

kubectl apply -f k8s/bridge.yaml

2. Add failover options to the service

Now that the bridge is in place, we can point the service at it for failover. Locate the edge block inside the GSL file for the service that needs to fail over to another region. Inside the upstream definition, add the following block, changing the name and namespace if you departed from the defaults:

failover_instances: [
	{
		name:      "bridge"
		namespace: context.globals.namespace
	},
]

Since this is a list, you can add additional bridge proxies as failover points. Greymatter prioritizes the targets in list order: earlier entries receive failover traffic first. To override this ordering, each instance element exposes a priority field, where a higher number translates to a lower priority.
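For example, a hedged sketch with two hypothetical bridges, where bridge-west is preferred over bridge-east despite list order:

failover_instances: [
	{
		name:      "bridge-east"
		namespace: context.globals.namespace
		priority:  2 // higher number, lower priority: tried second
	},
	{
		name:      "bridge-west"
		namespace: context.globals.namespace
		priority:  1 // tried first
	},
]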

3. Add active health checks to the service

Failover relies on health checking to determine when to begin shifting traffic. This type of instance health checking is related to, but distinct from, the service health checks that power Greymatter Sense. There are two types of instance health checking: active and passive. We’ll use active checks in this guide, but real deployments will likely demand a mix of both.

Within the same block that contains the failover_instances field, add a new field:

health_checks: [{
	interval_msec:       1000
	timeout_msec:        1000
	unhealthy_threshold: 1
	healthy_threshold:   1
	health_checker: http_health_check: {
		path: "<url path called to check service health>"
	}
}]

The path field should point to a valid URL path handled by your service. Greymatter will use the response code returned by calling the service at that path to determine service health. This URL should end in a trailing slash, unless you have disabled trailing slash redirects, because 3xx status codes are not considered successful.
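As an illustration, a less aggressive configuration that polls every 10 seconds and requires three consecutive failures before marking an instance unhealthy might look like this (the /health/ path is hypothetical):

health_checks: [{
	interval_msec:       10000 // poll every 10 seconds
	timeout_msec:        2000  // fail the check if no response within 2 seconds
	unhealthy_threshold: 3     // three consecutive failures mark an instance unhealthy
	healthy_threshold:   2     // two consecutive successes mark it healthy again
	health_checker: http_health_check: {
		path: "/health/" // hypothetical; trailing slash avoids 3xx redirects
	}
}]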

4. Verify failover

Failover traffic shifting occurs across a continuum that begins once fewer than roughly 71% of service instances are healthy (that is, when about a third of the instances are failing). To ensure this process works, we can simulate a network disruption by scaling the service deployment to zero.
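To make the continuum concrete, assume the shift follows an Envoy-style overprovisioning calculation with a factor of 1.4, which is where the 1/1.4 ≈ 71% figure comes from (an assumption about the underlying mechanics, not something this guide depends on). A service with 10 instances of which only 5 are healthy would then keep min(100%, 50% × 1.4) = 70% of its traffic in-region and send the remaining 30% through the bridge; once 0 instances are healthy, all traffic fails over.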

Open the logs for the failover region’s service and for the in-cluster service. Make a request to the in-cluster service to observe normal traffic. Now scale down all the in-cluster instances and continue making requests. You may receive errors for the first few requests, but after a short time the unhealthy instances will be pruned and replaced by instances in the failover region. You can verify that traffic has shifted by looking for new requests in the failover region’s service log.
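A hedged sketch of the simulation, assuming the deployment is named example-service and both clusters are configured as kubectl contexts (all names hypothetical):

# Simulate the outage by removing every in-cluster instance.
kubectl scale deployment example-service --replicas=0 -n <project namespace>

# Exercise the route through the edge; early requests may error
# until the unhealthy instances are pruned.
curl -i https://<edge address>/<route path to match>

# Watch the failover region's service log for the shifted traffic.
kubectl --context <failover cluster context> logs -f deploy/example-service -n <project namespace>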

Conclusion

Congratulations, you have successfully configured cross-mesh failover. To add failover to other services, perform the same steps, but add new routes to the existing bridge proxy instead of creating a new one. Although we limited failover to the edge-to-service connection, you can also set failover on the upstream definition of a service-to-service connection.

Additional Reference

Failover configuration is set on the Upstream definition. It occurs between instances of the upstream, not between upstreams. The upstream also requires some form of health checking, either active or passive, so that instances can be marked unhealthy. Because the control plane cannot perform service discovery across meshes, tenants need to deploy a bridge proxy.

Listener

metrics_options

impersonate_proxy - a list of proxies to impersonate for metrics check-ins. Used in failover bridge proxies.

impersonate_proxy

name - the GSL name of the proxy to impersonate.

namespace - the GSL namespace of the proxy to impersonate.

Upstream

failover_instances: [...#FailoverSchema]

A list of structs detailing instances to fail over to. Each element receives a priority equal to one plus its index. Optional.

FailoverSchema

A failover instance.

name - Name of the failover instance. Must match the instance’s GSL name.

namespace - Namespace of the failover instance. Must match the instance’s GSL namespace.

priority - The priority class of the instance. A higher number corresponds to a lower priority. Optional. Defaults to one plus the element’s index in the list.
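Putting these fields together, here is a minimal sketch of an upstream configured for failover with active health checking (service name, host, and path are hypothetical; in practice, discovery may populate instances):

upstreams: {
  "catalog": {
    gsl.#Upstream
    instances: [{
      host: "catalog.apps.svc.cluster.local" // hypothetical in-cluster address
      port: 8080
    }]
    failover_instances: [{
      name:      "bridge" // the bridge proxy from step 1
      namespace: context.globals.namespace
    }]
    health_checks: [{
      interval_msec:       1000
      timeout_msec:        1000
      unhealthy_threshold: 1
      healthy_threshold:   1
      health_checker: http_health_check: {
        path: "/health/" // hypothetical health endpoint
      }
    }]
  }
}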

