Disaster Recovery for System Groups

A GridGain 9 cluster includes two system RAFT groups, the Cluster Management Group (CMG) and the Metastorage Group (MG), both of which are essential for the cluster’s normal operation.

Disaster recovery operations on system RAFT groups are used to recover from permanent majority loss. When a system RAFT group loses its majority, it becomes unavailable. When CMG is unavailable, the cluster remains available with limitations: it can still process most operations, but it cannot join new nodes, start or restart existing nodes, or start building new indexes. When MG is unavailable, the cluster becomes unusable; it cannot handle even GET/PUT/SQL requests.

Majority loss shows up in the cluster logs, either in the console or in the rotated log files. When a RAFT group becomes unavailable, the logs contain messages such as "Send with retry timed out [retryCount = 11, groupId = cmg_group]" or "Send with retry timed out [retryCount = 11, groupId = metastorage_group]".
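As a rough illustration, the following Python sketch scans log lines for the timeout message quoted above and reports which system group appears to be affected. The function name and the regular expression are illustrative assumptions based only on the example messages; real log lines may carry timestamps or other prefixes, and the retry count may differ.

```python
import re

# Pattern derived from the example messages above (an assumption, not an
# official log format specification). The retry count is allowed to vary.
TIMEOUT_RE = re.compile(
    r"Send with retry timed out \[retryCount = (\d+), groupId = (\w+)\]"
)


def find_unavailable_groups(lines):
    """Return the set of RAFT group IDs named in retry-timeout messages."""
    groups = set()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            groups.add(match.group(2))
    return groups


if __name__ == "__main__":
    sample = [
        "Send with retry timed out [retryCount = 11, groupId = cmg_group].",
        "Some unrelated log line",
    ]
    print(find_unavailable_groups(sample))
```

A result containing cmg_group points at CMG, and metastorage_group points at MG.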

Another indicator that CMG is down is a node that does not start after a restart command. In this case, the log shows "Local CMG state recovered, starting the CMG", but this message is not followed by "Successfully joined the cluster".

If a node tries to start when CMG is available but MG is not, the log shows "Metastorage info on start", but this message is not followed by "Performing MetaStorage recovery".
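The two startup symptoms above follow the same pattern: a marker message appears and the expected follow-up message never does. A minimal Python sketch of that check is shown below. The helper names and return values are hypothetical; only the four message strings come from the text above, and substring matching is assumed to be sufficient.

```python
# Marker and follow-up messages taken from the descriptions above.
CMG_MARKER = "Local CMG state recovered, starting the CMG"
CMG_FOLLOWUP = "Successfully joined the cluster"
MG_MARKER = "Metastorage info on start"
MG_FOLLOWUP = "Performing MetaStorage recovery"


def marker_without_followup(lines, marker, followup):
    """Return True if the latest marker message has no follow-up after it."""
    pending = False
    for line in lines:
        if marker in line:
            pending = True
        elif followup in line:
            pending = False
    return pending


def diagnose_startup(lines):
    """Return "cmg", "mg", or None depending on which group looks unavailable."""
    if marker_without_followup(lines, CMG_MARKER, CMG_FOLLOWUP):
        return "cmg"
    if marker_without_followup(lines, MG_MARKER, MG_FOLLOWUP):
        return "mg"
    return None
```

The CMG check runs first because, per the text above, the MG symptom is only observed when CMG is available.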

Cluster Management Group

If CMG loses majority:

  1. Restart CMG nodes to restore the lost majority.

  2. If the above fails, forcefully assign a new majority using the following CLI command (manually or via REST): recovery cluster reset --url=<node-url> --cluster-management-group=<new-cmg-nodes>.

    The command is sent to the node indicated by the --url parameter. That node must be one of the nodes listed in new-cmg-nodes. It becomes the Repair Conductor and initiates the reset procedure.

The above procedure might fail for the following reasons:

  • Some of the nodes specified in new-cmg-nodes are not in the physical topology.

  • The Repair Conductor does not have all the information it needs to start the procedure.
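Some of these conditions can be checked before the command is sent. The Python sketch below validates the requested CMG node list against a known physical topology, checks that the Repair Conductor is part of the new group, and then assembles the CLI arguments. The function name, its parameters, the example node names and URL, and the comma-separated format of the node list are all assumptions made for illustration; how the physical topology is obtained is left out.

```python
def build_cmg_reset_args(node_url, conductor, new_cmg_nodes, physical_topology):
    """Build arguments for the CMG reset command after basic sanity checks.

    node_url          -- URL of the node that becomes the Repair Conductor
    conductor         -- name of that node (hypothetical parameter)
    new_cmg_nodes     -- names of the nodes that should form the new CMG
    physical_topology -- names of the nodes currently in the physical topology
    """
    missing = sorted(set(new_cmg_nodes) - set(physical_topology))
    if missing:
        raise ValueError(f"Nodes not in the physical topology: {missing}")
    if conductor not in new_cmg_nodes:
        raise ValueError("The Repair Conductor must be one of the new CMG nodes")
    # Assumption: the node list is passed as a comma-separated value.
    return [
        "recovery", "cluster", "reset",
        f"--url={node_url}",
        f"--cluster-management-group={','.join(new_cmg_nodes)}",
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with real node names and a real URL.
    args = build_cmg_reset_args(
        "http://localhost:10300", "node1", ["node1", "node2"],
        ["node1", "node2", "node3"],
    )
    print(" ".join(args))
```

A check like this cannot tell whether the Repair Conductor has all the information it needs, so the command can still fail for that reason.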