GridGain Developers Hub

Point-in-Time Recovery

Continuous archiving for Point-in-Time Recovery (PITR) lets you restore a cluster to any previous point in time. In other words, with PITR you can roll the data in the cluster back to the state it was in at a chosen moment.

When PITR is enabled, the cluster continually records all operations that modify the data to the write-ahead log (WAL). Recovery consists of two stages: first, a full snapshot is restored; then all operations recorded in the WAL between the time the snapshot was taken and the required moment are applied. This brings the cluster to the state it was in at the specified moment.

[Figure: PITR timeline with three snapshots and the WAL archives between them]

In the figure above, three snapshots were created during cluster operation, and we want to restore the cluster to a specific moment between point 2 and point 3. In this case, GridGain restores the most recent full snapshot taken before that moment (snapshot 2) and then applies the operations from WAL Archive 2, recreating the state of the cluster at the given moment.

Because PITR replays operations starting from the latest snapshot taken before the target time, the longer the period between that snapshot and the point you want to restore to, the more operations must be reapplied and the longer the recovery takes. For this reason, create snapshots on a regular basis. Regular snapshots split the lifetime of the cluster into smaller periods, with each snapshot serving as the starting point for recovery to any time in the period that follows it.
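The choice of the starting snapshot can be pictured with a small shell sketch. It is a conceptual model only, not GridGain's implementation, since GridGain performs this step internally during a restore. The function name pick_base_snapshot and the sample timestamps are hypothetical; the timestamps use the yyyy-MM-dd-HH:mm:ss.SSS layout, which orders chronologically when compared as plain text.

```shell
#!/bin/sh
# Conceptual model only; GridGain selects the snapshot internally.
# Usage: pick_base_snapshot TARGET SNAPSHOT_TIME...
# Prints the latest snapshot time that is not later than TARGET,
# or nothing if every snapshot was taken after TARGET.
pick_base_snapshot() {
    target="$1"
    shift
    # Appending "" forces string comparison; fixed-width timestamps
    # compare chronologically as strings.
    printf '%s\n' "$@" |
        LC_ALL=C awk -v t="$target" '
            { s = $0 "" }
            s <= (t "") && s > best { best = s }
            END { if (best != "") print best }'
}

# Hypothetical daily snapshots; the target falls between the second and third.
pick_base_snapshot "2024-01-02-12:00:00.000" \
    "2024-01-01-00:00:00.000" \
    "2024-01-02-00:00:00.000" \
    "2024-01-03-00:00:00.000"
```

With these inputs the sketch prints 2024-01-02-00:00:00.000: the second snapshot is the base, and only the operations recorded after it and up to the target would need to be replayed.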

Write-ahead Log and Continuous Archiving

The WAL keeps track of all operations that were performed on the data. Log files contain operations for a fixed period of time. However, if PITR is enabled, GridGain keeps all WAL files permanently, archiving them in a directory specified in DataStorageConfiguration. This process is known as continuous archiving. For more information about WAL files and performance, see Keep WALs Separate.

If continuous archiving causes the WAL archive to grow beyond the maxWalArchiveSize and minWalArchiveSize values (see persistence configuration properties), self-cleanup of the archive might prevent you from returning to the exact point in time you need. To work around this limitation, you can do one (or both) of the following:

  • Configure the maxWalArchiveSize and minWalArchiveSize values based on WAL statistics from your specific environment. The goal of this empirical tuning is to balance recovery capability (i.e., PITR) against the disk space available for your WAL archive.

  • Configure your snapshot mechanism to save snapshots to an external location that does not have these size limitations, and configure PITR to look for data in that location.
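As a rough aid for the first option, the arithmetic behind such an estimate can be sketched in shell. This is an illustration under stated assumptions, not a sizing rule from GridGain: the function name estimate_wal_archive_mb, the WAL growth rate of 500 MB per hour, the 24-hour interval between snapshots, and the 20% headroom are all hypothetical, and the growth rate has to come from measurements in your own environment.

```shell
#!/bin/sh
# Usage: estimate_wal_archive_mb RATE_MB_PER_HOUR HOURS HEADROOM_PERCENT
# Prints the archive space, in MB, needed to keep HOURS of WAL at the
# measured growth rate, plus a safety margin. Integer arithmetic only.
estimate_wal_archive_mb() {
    rate="$1"
    hours="$2"
    headroom="$3"
    echo $(( $rate * $hours * (100 + $headroom) / 100 ))
}

# Hypothetical figures: 500 MB of WAL per hour, 24 hours between
# snapshots, 20% headroom.
estimate_wal_archive_mb 500 24 20
```

With these inputs the sketch prints 14400 (MB). Such a figure could serve as a starting point when choosing the archive size limits; check the unit each property expects before using it.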

Data Consistency

To ensure data consistency, transactions that have not finished by the recovery point are disregarded. Similarly, if a series of dependent transactions was in progress at the recovery point, all transactions from the series are ignored and the recovery point is shifted to the moment before the series began. This means that with point-in-time recovery the cluster is restored to the latest consistent state prior to the given point.

Requirements

To use PITR, make sure your server and cluster configuration meets the following requirements.

Time Synchronization

All machines running the cluster nodes must be configured to synchronize time via NTP.

Storage Size

When PITR is enabled, the WAL segments will not be automatically deleted. It is, therefore, crucial to make sure that each node has enough disk space.
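A minimal way to keep an eye on free space is sketched below. It relies only on the standard df utility; the function names free_kb and has_free_space, the 10 GB threshold, and the use of the current directory are hypothetical placeholders. In practice you would point the check at the directories that hold the WAL archive and the snapshots on each node.

```shell
#!/bin/sh
# Usage: free_kb DIR
# Prints the free space, in kilobytes, on the filesystem that holds DIR.
free_kb() {
    # -P forces one line per filesystem; -k reports 1024-byte units.
    # Assumes the filesystem name contains no spaces.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Usage: has_free_space DIR MIN_KB
# Succeeds only if at least MIN_KB kilobytes are free under DIR.
has_free_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Hypothetical check: warn when less than 10 GB is free in the
# current directory.
if has_free_space . 10485760; then
    echo "enough space"
else
    echo "low disk space" >&2
fi
```

A check like this can be run from any scheduler alongside the snapshot schedules described below.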

Consider the following points as general guidelines for managing disk space when PITR is enabled.

Schedule Periodic Snapshot Creation

Snapshots should be created periodically to reduce the number of changes accumulated between snapshots and, with it, the time a recovery operation takes. You can use the Snapshots Management Tool (or any other scheduler) to schedule snapshot creation.

The following command sets up a schedule that creates a full snapshot every day at 00:00.

snapshot-utility.sh schedule -command=create -name="snapshot creation schedule" -full_frequency=daily
snapshot-utility.bat schedule -command=create -name="snapshot creation schedule" -full_frequency=daily

Move or Delete Old Snapshots Regularly

Because snapshots and WAL files take up a significant amount of disk space, make sure you regularly remove the snapshots you no longer need. Snapshots can be moved or deleted using the Snapshots Management Tool.

To remove a specific snapshot, execute the following command:

snapshot-utility.sh delete -id=snapshot_id
snapshot-utility.bat delete -id=snapshot_id

To create a snapshot deletion schedule, use the following command:

snapshot-utility.sh schedule -command=delete -name="snapshot deletion schedule" -ttl=5d -frequency=hourly
snapshot-utility.bat schedule -command=delete -name="snapshot deletion schedule" -ttl=5d -frequency=hourly

This schedule will execute a snapshot deletion command every hour; each command will delete any snapshots that are older than 5 days at the time the command is executed.
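The effect of the -ttl option can be illustrated with a small model. It does not call the Snapshots Management Tool; the function name expired_snapshots, the snapshot IDs, and the timestamps are hypothetical. The sketch also assumes that a snapshot exactly 5 days old is not yet treated as expired, which the tool itself may handle differently.

```shell
#!/bin/sh
# Usage: expired_snapshots NOW_EPOCH TTL_DAYS < list
# Reads "snapshot_id creation_epoch_seconds" lines from standard input
# and prints the IDs of snapshots older than TTL_DAYS at NOW_EPOCH.
expired_snapshots() {
    awk -v now="$1" -v ttl_days="$2" '
        NF == 2 && (now - $2) > ttl_days * 86400 { print $1 }'
}

# Hypothetical data: "now" is 1700000000; the snapshots are
# 6, 5, and 1 days old.
expired_snapshots 1700000000 5 <<'EOF'
snap-a 1699481600
snap-b 1699568000
snap-c 1699913600
EOF
```

Only snap-a is printed. Each hourly run re-evaluates the ages, so a snapshot is removed by the first run that finds it past the 5-day limit.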

Functional Limitations

Please consider the following limitations before using PITR in a production environment.

  • PITR is not supported with caches that have disk page compression enabled. Look for an exception like: "Failed to start cache because disk page compression is enabled."

  • When PITR is enabled, you cannot create snapshots with a subset of caches. You can only create snapshots with all the caches stored in the cluster.

  • Dynamic caches created within one group of caches will be lost if they are not saved in a full snapshot. In other words, a dynamically created cache can be restored only at a point in time after it has been saved in a full snapshot.

  • If you manually remove a snapshot, PITR may fail. Use the provided tools to manage snapshots.

  • You will not be able to move or delete the final snapshot using the Snapshots Management Tool.

  • Because PITR always requires a snapshot to be available, a full snapshot is automatically created during the cluster activation. This first snapshot must be preserved at all times.

  • If you delete a snapshot using the Snapshots Management Tool and then restore the cluster to any time after that snapshot, an earlier snapshot will be used as the starting point.

Enabling Point-in-Time Recovery

To enable continuous archiving for point-in-time recovery, enable snapshots and set the pointInTimeRecoveryEnabled property using control.sh. If the property is not set, the cluster takes the value from the coordinator’s configuration and saves it.

control.sh --property set --name 'pointInTimeRecoveryEnabled' --val 'true'
control.bat --property set --name 'pointInTimeRecoveryEnabled' --val 'true'

Recovering to Point in Time

To restore the cluster to a specific point in time, use the restore command in the Snapshots Management Tool, and specify the -to parameter. The time must be specified in yyyy-MM-dd-HH:mm:ss.SSS format.
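The sketch below shows one way to check a timestamp against that format before building the command line. The function name is_pitr_timestamp and the target value are hypothetical. The sketch assumes that -to takes its value in the same -name=value form as the other options on this page, and it leaves out any other options the restore command may require. The command is printed rather than executed so that it can be reviewed first.

```shell
#!/bin/sh
# Usage: is_pitr_timestamp VALUE
# Succeeds if VALUE has the yyyy-MM-dd-HH:mm:ss.SSS layout. Only the
# layout is checked, not whether the date itself is valid.
is_pitr_timestamp() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}$'
}

# Hypothetical target moment.
target="2024-01-15-13:05:00.000"

if is_pitr_timestamp "$target"; then
    # Assumption: -to uses the -name=value form seen elsewhere on this page.
    echo "snapshot-utility.sh restore -to=$target"
else
    echo "unexpected timestamp format: $target" >&2
fi
```

On Windows, the same arguments would be passed to snapshot-utility.bat.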