Data Partitioning
Data partitioning is a method of subdividing large sets of data into smaller chunks and distributing them between all server nodes in a balanced manner.
How Data is Partitioned
When the table is created, it is always assigned to a distribution zone. Based on the distribution zone parameters, data is separated into PARTITIONS
parts, called partitions, stored REPLICAS
times across the cluster. Each partition is identified by a number from a limited set (0 to 24 by default). Each individual copy of a partition is called a replica, and is stored on separate nodes, if possible. GridGain uses the Fair partition distribution algorithm. It means that it stores information on partition distribution and uses this information for assigning new partitions. This information is preserved in cluster metastorage, and is recalculated only when necessary.
Once partitions and all replicas are created, they are distributed across the available cluster nodes that are included in the distribution zone following the DATA_NODES_FILTER
parameter and according to the partition distribution algorithm. Thus, each key is mapped to a list of nodes owning the corresponding partition and is stored on those nodes. When data is added, it is distributed evenly between partitions.
The nodes store partitions in the folder specified in the ignite.system.partitionsBasePath
, or in the work/partitions
folder. The nodes also store partition-specific RAFT logs in the ignite.system.partitionsLogPath
with the information on RAFT elections and consensus.
Default Partition Number
When creating a distribution zone, you have an option to manually set the number of partitions with the PARTITIONS
parameter, for example:
CREATE ZONE IF NOT EXISTS exampleZone WITH STORAGE_PROFILES='default',PARTITIONS=10
Otherwise, GridGain will automatically calculate the recommended number of partitions:
dataNodesCount * coresOnNode * 2 / replicas
Primary Replicas and Leases
Once the partitions are distributed on the nodes, GridGain forms replication groups for all partitions of the table, and each group elects its leader. To linearize writes to partitions, GridGain designates one replica of each partition as a primary replica.
To designate a primary replica, GridGain uses a process of granting a lease. Leases are granted by the lease placement driver, and signify the node that houses the primary replica, called a lease holder. Once the lease is granted, information about it is written to the metastorage, and provided to all nodes in the cluster. Usually, the primary replica will be the same as replication group leader.
Granted leases are valid for a short period of time and are extended every couple of seconds, preserving the continuity of each lease. A lease cannot be revoked until it expires. In exceptional situations (for example, when primary replica is unable to serve as primary anymore, leaseholder node goes offline, replication group is inoperable, etc.) the placement driver waits for the lease to expire and then initiates the negotiation of the new one.
Only the primary replica can handle operations of read-write transactions. Other replicas of the partition can be read from by using read-only transactions.
If a new replica is chosen to receive the lease, it first makes sure it is up-to-date with its replication group by stored data. In scenarios where replication group is no longer operable (for example, a node unexpectedly leaves the cluster and the group loses majority), it follows the disaster recovery procedure, and you may need to reset the partitions manually.
Version Storage
As new data is written to the partition, GridGain does not immediately delete old one. Instead, GridGain stores old keys in a version chain within the same partition.
Older key versions can only be accessed by read-only transactions, while up-to-date version can be accessed by any transactions.
Older key versions are kept until the low watermark point is reached. By default, low watermark is 600000 ms, and it can be changed in cluster configuration. Increasing data availability time will mean that old key versions are stored and available for longer, however storing them may require extra storage, depending on cluster load.
In a similar manner, dropped tables are also not removed from disk until the low watermark point, however you can no longer write to these tables. Read-only transactions that try to get data from these tables will succeed if they read data at timestamp before the table was dropped, and will delay the low watermark point if it is necessary to complete the transaction.
Once the low watermark is reached, old versions of data are considered garbage and will be cleaned up by garbage collector during the next cleanup. This data may or may not be available, as garbage collection is not an immediate process. If a transaction was already started before the low watermark was reached, the required data will be kept available until the end of transaction even if the garbage collection happens. Additionally, GridGain checks that old data is not required anywhere on the cluster before cleaning up the data.
Distribution Reset
Partition Rebalance
When the cluster size changes, GridGain waits for the timeout specified in the DATA_NODES_AUTO_ADJUST_SCALE_UP
or DATA_NODES_AUTO_ADJUST_SCALE_DOWN
distribution zone properties, and then redistributes partitions according to partition distribution algorithm and transfers data to make it up-to-date with the replication group. This process is called data rebalance.
© 2024 GridGain Systems, Inc. All Rights Reserved. Privacy Policy | Legal Notices. GridGain® is a registered trademark of GridGain Systems, Inc.
Apache, Apache Ignite, the Apache feather and the Apache Ignite logo are either registered trademarks or trademarks of The Apache Software Foundation.