Disaster Recovery for System Groups
A GridGain 9 cluster includes two system RAFT groups, both of which are essential for the cluster’s normal operation: the Cluster Management Group (CMG) and the Metastorage Group (MG).
You perform disaster recovery operations on system RAFT groups to recover from permanent majority loss. When a system RAFT group loses its majority, it becomes unavailable. When the CMG is unavailable, the cluster itself remains available with limitations: it can still process most operations, but it cannot join new nodes, start or restart existing nodes, or start building new indexes. When the MG is unavailable, the cluster becomes unusable; it cannot handle even GET/PUT/SQL requests.
Majority loss is reported in the cluster logs, either in the console output or in the rotated log files. When a RAFT group becomes unavailable, the log shows messages like:
Send with retry timed out [retryCount = 11, groupId = cmg_group]
or
Send with retry timed out [retryCount = 11, groupId = metastorage_group]
Another indicator that the CMG is down is a node that does not start after a restart command. In this case, the log contains the Local CMG state recovered, starting the CMG message that is not followed by Successfully joined the cluster.
If a node tries to start when the CMG is available but the MG is not, the log contains the Metastorage info on start message that is not followed by Performing MetaStorage recovery.
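If you are not sure which group lost its majority, you can scan the node logs for the messages above. The following sketch assumes a POSIX shell and an example log location; adjust the path to wherever your installation writes its log files.
grep -E "Send with retry timed out .*groupId = (cmg_group|metastorage_group)" /var/log/gridgain/*.log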
Cluster Management Group
If the CMG loses majority:
- Restart the CMG nodes to restore the lost majority.
- If that fails, forcefully assign a new majority using the following CLI command (manually or via REST); see the example invocations after this list:
  recovery cluster reset --url=<node-url> --cluster-management-group=<new-cmg-nodes>
  The command is sent to the node indicated by the --url parameter, which must belong to the new-cmg-nodes RAFT group. This node becomes the Repair Conductor, and it initiates the reset procedure. The procedure might fail for the following reasons:
  - Some of the nodes specified in new-cmg-nodes are not in the physical topology.
  - The Repair Conductor does not have all the information it needs to start the procedure.
- If some nodes were down or were unavailable due to a network partition (and hence did not participate in the repair):
  - Start these nodes (or restore network connectivity and restart them).
  - Migrate these nodes to the repaired cluster using the following CLI command (manually or via REST):
    recovery cluster migrate --old-cluster-url=<url-of-old-cluster-node> --new-cluster-url=<url-of-new-cluster-node>
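For illustration, a CMG repair followed by a migration might look like the sketch below. The node names (node1, node2, node3, node4), the default REST port (10300), and the comma-separated format of the node list are assumptions; substitute the values from your own topology.
recovery cluster reset --url=http://node1:10300 --cluster-management-group=node1,node2,node3
recovery cluster migrate --old-cluster-url=http://node4:10300 --new-cluster-url=http://node1:10300
Here node1 is assumed to be one of the new CMG nodes, so it can act as the Repair Conductor, and node4 stands for a node that missed the repair and is being migrated to the repaired cluster.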
Metastorage Group
If the MG loses majority:
- Restart the MG nodes (or at least their RAFT nodes inside the GridGain nodes).
- If that fails:
  - Make sure that every node that can be started has started and joined the cluster.
  - Forcefully assign a new majority using the following CLI command (manually or via REST); see the example invocation after this list:
    recovery cluster reset --url=<existing-node-url> [--cluster-management-group=<new-cmg-nodes>] --metastorage-replication-factor=N
    N is the requested number of voting RAFT nodes in the MG after the repair. If you omit --cluster-management-group, the command takes the current set of CMG voting members from the CMG leader; if the CMG is unavailable, the command fails. The command is sent to the node specified by --url. This node becomes the Repair Conductor, and it initiates the reset procedure. If the Repair Conductor fails to repair the MG, you have to repeat the procedure manually (there is no failover).
- If some nodes were down or were unavailable due to a network partition (and hence did not participate in the repair):
  - Start these nodes (or restore network connectivity and restart them).
  - Migrate these nodes to the repaired cluster using the following CLI command (manually or via REST):
    recovery cluster migrate --old-cluster-url=<url-of-old-cluster-node> --new-cluster-url=<url-of-new-cluster-node>
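As an illustration, an MG repair followed by a migration might look like the sketch below. The node names, the default REST port (10300), and the replication factor of 3 are placeholder assumptions; substitute the values from your own cluster.
recovery cluster reset --url=http://node1:10300 --metastorage-replication-factor=3
recovery cluster migrate --old-cluster-url=http://node5:10300 --new-cluster-url=http://node1:10300
Here --cluster-management-group is omitted, so the current CMG voting members are taken from the CMG leader, and node5 stands for a node that did not participate in the repair.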