GridGain Developers Hub

Disaster Recovery for Data Partitions

You perform disaster recovery operations to recover from situation when data operations on your GridGain cluster nodes become unfeasible because GridGain cannot guarantee data consistency. In such cases, you need to either return data to a consistent state or declare the current state consistent.

Disaster Scenarios and Recovery Instructions

Minority Offline

Minority refers to less than half of the number of replicas configured for a distribution zone (DZ). For example, of DZ1 is configured with 2 replicas and DZ2 - with 3 replicas, losing a single GridGain node is a majority loss for DZ1 and a minority loss for DZ2.

You may discover that one or more of your cluster nodes are offline in a number of ways, including the ignite recovery partition-states CLI command with the --global option, which would show Read-only partition, Degraded partition, or Unavailable partition for offline nodes.

Once a minority offline status has been discovered:

  1. Command the system to bring the offline node(s) online.

    The system attempts to bring the indicated nodes online. Possible outcomes are:

    • Nodes return online in time (before scale-down timeout) with valid data. The system replicates the missing data (if any)using either log replication or the full state transfer procedure.

    • Nodes return online in time but without data. The system replicates the data using the full state transfer procedure.

    • A node does not return online before scale-down timeout. The system distributes a replica to a new node and starts the rebalance procedure.

    • Node return online with inconsistent data - see steps 4 and 5.

  2. Run the ignite recovery partition-states CLI command on the relevant zones/nodes/partitions to verify Partition States.

  3. If the state is Healthy or Available partition, consider the recovery completed.

  4. If the state is Broken:

    1. Restart your GrigGain node or the relevant partitions using the ignite recovery restart-partitions CLI command.

    2. Rerun the ignite recovery partition-states CLI command.

  5. If the the partition state is Read-only partition, Degraded partition, or Unavailable partition, reset the relevant partitions using the ignite recovery reset-partitions CLI command.

Majority Offline

Majority refers to half (or more) of the number of replicas configured for a distribution zone (DZ). For example, of DZ1 is configured with 2 replicas and DZ2 - with 3 replicas, losing a single GridGain node is a majority loss for DZ1 and a minority loss for DZ2.

You may discover that one or more of your cluster nodes are offline in a number of ways, including the ignite recovery partition-states CLI command with the --global option, which would show Read-only partition, Degraded partition, or Unavailable partition for offline nodes.

If the node(s) that remain(s) online include the primary replica, the partition becomes Read-only partition (see Global Partition States); all the data is available for reading until lease expires. If the node(s) that remain(s) online do not include the primary replica, the partition becomes Unavailable partition (see Global Partition States).

Once a majority offline status has been discovered:

  1. Command the system to bring the offline nodes online.

    The system attempts to bring the indicated nodes online. Possible outcomes are:

    • Nodes return online in time (before scale-down timeout) with valid data. A leader is elected, the system replicates the missing data (if any)using either log replication or the full state transfer procedure, and a leaseholder is elected

    • Nodes return online with inconsistent data - see steps 4 and 5.

  2. Run the ignite recovery partition-states CLI command on the relevant zones/nodes/partitions to verify Partition States.

  3. If the state is Healthy or Available partition, consider the recovery completed.

  4. If the state is Broken:

    1. Restart your GrigGain nodes or the relevant partitions using the ignite recovery restart-partitions CLI command.

    2. Rerun the ignite recovery partition-states CLI command.

  5. If the the partition state is Read-only partition, Degraded partition, or Unavailable partition, reset the relevant partitions using the ignite recovery reset-partitions CLI command.

In the Majority Offline scenario, you would typically lose part of the data. For example, if you reset partition A while partition B was in the Available partition state, you would lose:

  • The latest data from A that has been restored using reset-partitions

  • Some of the latest data from B, which had been inserted into it in a transaction that had also inserted data into A

Partition Loss

In this scenario, in addition to having Majority Offline, you lose all replicas of a partition, e.g., partition A. This causes a loss of all the data from partition A once you run the ignite recovery reset-partitions CLI command, as well possibly a loss of some of the recent updates in other partitions.

Try bringing the nodes back online as described in the Majority Offline scenario.

Partition States

This section describes the data partition states that define the partition availability and readiness for utilization.

Local Partition States

Local partition state is a local property of a replica, storage, state machine, etc., associated with the partition.

  • Healthy - a state machine is running with no issues.

  • Initializing - a node is online, but the corresponding RAFT group has not completed its initialization yet.

  • Snapshot installation - a full state transfer is taking place. Once it has finished, the partition will become healthy or catching-up. Before that, data cannot be read, and log replication is on pause.

  • Catching-up - a node is in the process of replicating data from the leader, and its data is slightly in the past. More specifically, node has not replicated the tail of the log that corresponds to 100 log entries.

  • Broken - the state machine experiences issues (likely as a result of an exception). Some data might be unavailable for reading, and the log cannot be replicated. This state will not be changed automatically - it requires intervention.

Global Partition States

Global partition state is a global property of a partition that specifies its apparent functionality from user’s point of view.

  • Available partition - a healthy partition that can process read and write requests. Implies that all peers are healthy at the moment.

  • Read-only partition - a partition that can process read requests but not the write requests. There is no healthy majority. However, there is at least one alive (healthy/catch-up) peer that can process historical read-only queries.

  • Unavailable partition - a partition that cannot process any requests.

  • Degraded partition - a partition that is available to the user, but is at a higher risk of having issues than other partitions. For example, one of the group’s peers is offline. There is still a majority, but the backup factor is low.