Couchbase: Failover and Recovery

Note: In this blog post, I assume that you have basic knowledge of Couchbase.

Hello everyone, welcome to another Couchbase blog post. I highly recommend reading my Couchbase: Rebalance blog post before reading this post.

There are scenarios in which we have to remove nodes from our cluster for maintenance, or in which a node becomes unresponsive and either we or Couchbase itself have to take it out of the cluster.

Let's start with how node removal works. This is how Couchbase describes what removal is:

"Node removal allows a node to be taken out of a cluster in a highly controlled fashion, using rebalance to redistribute data and indexes among available nodes."

So basically, we are taking a node out of the cluster. What does that mean? It means that at the end of the removal process the cluster has fewer resources, so the remaining nodes may have to store more vBuckets.

Buckets can have a maximum of 3 replicas. For the Data Service, keeping n replicas requires at least n + 1 nodes (n is the replica number), because the active copy and each replica must live on different nodes. If the cluster still has at least n + 1 Data Service nodes after a node is removed, Couchbase keeps the configured replica number after the rebalance.

Assume that we have a bucket with two replicas (n = 2) and remove nodes one after another. (Numbers are approximate for the sake of simplicity.)

Failover-2-v2 (1).gif

What happened here? At first, we have 40.000 active vBuckets and 80.000 replica vBuckets (1:2) on four nodes, which satisfies our minimum of n + 1 = 3 nodes. After the removal of Node 2, we still have 40.000 active vBuckets and 80.000 replica vBuckets in total (1:2). The number of nodes (three) still meets the n + 1 minimum, so the rebalance increases the number of active and replica vBuckets on each remaining node to keep both replicas. The remaining nodes now have to maintain more vBuckets. Now let's remove Node 4 and see what happens:

Failover-3-v2 (1).gif

Now, things have changed. With only two nodes left, we can no longer meet the n + 1 node minimum for two replicas in the Data Service, so only one replica can be kept. We now have 40.000 active vBuckets and 40.000 replica vBuckets (1:1).
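The arithmetic behind the two animations can be sketched in a few lines of Python. This is only a simplified model of the n + 1 rule described above, not Couchbase code; the function names are made up for this post, and the 40.000 figure is the same approximate number used in the example.

```python
def effective_replicas(configured_replicas, data_nodes):
    """Replicas the cluster can actually keep: every copy needs its own node."""
    return max(0, min(configured_replicas, data_nodes - 1))


def vbucket_totals(active, configured_replicas, data_nodes):
    """Return the cluster-wide (active, replica) vBucket totals."""
    return active, active * effective_replicas(configured_replicas, data_nodes)


# Same scenario as above: two configured replicas, nodes removed one by one.
for nodes in (4, 3, 2):
    active, replica = vbucket_totals(40_000, configured_replicas=2, data_nodes=nodes)
    print(f"{nodes} nodes: {active} active, {replica} replica")
```

With four or three nodes the model keeps both replicas (1:2); with two nodes it falls back to a single replica (1:1), which matches what we saw after removing Node 4.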

This is what happens in node removals in a nutshell. Let's continue with failover.

Failover

At the beginning of this blog post, we said that there are scenarios in which we have to take a node out of the cluster quickly. That is what failover is for. The Couchbase documentation describes failover as:

"Failover is a process whereby a node can be taken out of a Couchbase cluster with speed."

Failover has two types: graceful and hard.

Let's start with graceful. Graceful failover is specific to nodes that run the Data Service. It is called graceful because Couchbase removes the node in a controlled way, so there is no downtime. The process makes the corresponding replica vBuckets on other nodes active and then removes the Data Service node. The cluster can't start a graceful failover by itself. It has to be triggered manually.

Hard failover is necessary when a node becomes unreachable. If the node runs the Data Service, the process makes replica vBuckets active and removes the node, just like graceful failover. Hard failover can be triggered manually, but it can also be triggered automatically by the Cluster Manager. This is called automatic failover. Automatic failover can react to three kinds of failure: node failure, disk read/write failure, and group failure.

After a failover, whether graceful or hard, the ratio of active to replica vBuckets is out of balance, so a rebalance should be triggered.
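To see where that imbalance comes from, here is a toy Python model of replica promotion. It is a sketch under simplifying assumptions: the four-entry vBucket map, the node names, and the helper functions are hypothetical, and the real Cluster Manager does much more than this.

```python
from collections import Counter

# Hypothetical toy map: vBucket id -> [active node, replica node]
vbucket_map = {
    0: ["node1", "node2"],
    1: ["node2", "node3"],
    2: ["node3", "node1"],
    3: ["node1", "node3"],
}


def fail_over(vmap, failed_node):
    """Drop failed_node from every chain; the next copy in the chain becomes active."""
    return {vb: [n for n in chain if n != failed_node] for vb, chain in vmap.items()}


def count_copies(vmap):
    """Count active and replica vBuckets per node."""
    active = Counter(chain[0] for chain in vmap.values() if chain)
    replica = Counter(n for chain in vmap.values() for n in chain[1:])
    return active, replica


after = fail_over(vbucket_map, "node2")
active, replica = count_copies(after)
print("active per node: ", dict(active))
print("replica per node:", dict(replica))
```

Before the failover the toy cluster holds four active and four replica vBuckets. Afterwards it still has four active vBuckets (vBucket 1 was promoted on node3), but only two replicas remain, so vBuckets 0 and 1 have no replica until a rebalance creates new ones.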

Recovery

There are two possibilities after failover. The first is to remove the node from the cluster completely with a rebalance. The second is to recover the node and add it back to the cluster with a rebalance.

There are two types of recovery: Delta Recovery and Full Recovery.

Delta Recovery

Delta Recovery keeps the data that is already on the node, loads it into memory, and resynchronizes it with the rest of the cluster. The recovery process does not delete any vBuckets.

Recovery-2-v2.gif

Delta Recovery only works when both the node and the cluster are healthy and the node is in a failed-over state. Even when these conditions are satisfied, problems can occur while Delta Recovery is running, and in that case Full Recovery can take over instead. There are many possible reasons for this. For example, the node could have been hard failed over and marked for removal, or the configuration could have changed while the recovery was in progress.

Full Recovery

Full Recovery is far simpler than Delta Recovery. The process removes all vBuckets and documents from the node and then assigns it a new set of vBuckets and documents.

Recovery-1-v2.gif

If the node has GSI Indexes, they are left unmodified during the rebalance in both recovery processes.

With Delta Recovery, the data is already on the node, so there is less network traffic than with Full Recovery. On the other hand, Delta Recovery can need a significant amount of memory, and it could exceed the bucket memory quota.
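If you prefer scripting over the web console, the recovery type can also be chosen through the Couchbase REST API before the rebalance. The sketch below only builds the two requests and does not send them. The host, the otpNode names, and the helper function names are placeholders of mine, authentication is left out, and the endpoint and parameter names (/controller/setRecoveryType with otpNode and recoveryType, /controller/rebalance with knownNodes and ejectedNodes) should be checked against the REST reference for your Couchbase version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def set_recovery_type_request(base_url, otp_node, recovery_type):
    """Build (but do not send) the request that marks a failed-over node for recovery."""
    if recovery_type not in ("delta", "full"):
        raise ValueError("recovery_type must be 'delta' or 'full'")
    body = urlencode({"otpNode": otp_node, "recoveryType": recovery_type}).encode()
    return Request(f"{base_url}/controller/setRecoveryType", data=body, method="POST")


def rebalance_request(base_url, known_nodes, ejected_nodes=()):
    """Build (but do not send) the rebalance request that finishes recovery or removal."""
    body = urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(ejected_nodes),
    }).encode()
    return Request(f"{base_url}/controller/rebalance", data=body, method="POST")


# Placeholder cluster address and node names, for illustration only.
base = "http://10.0.0.1:8091"
recover = set_recovery_type_request(base, "ns_1@10.0.0.2", "delta")
rebalance = rebalance_request(base, ["ns_1@10.0.0.1", "ns_1@10.0.0.2"])
print(recover.get_method(), recover.full_url)
print(rebalance.get_method(), rebalance.full_url)
```

To remove the failed-over node instead of recovering it, the same rebalance helper can be called with that node listed in ejected_nodes. Sending the requests would additionally need administrator credentials, for example via urllib.request.urlopen with a basic-auth header.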

Thank you for reading.

May the force be with you!