Handling Exceptions
This section outlines basic exceptions that can be generated by Ignite and GridGain, and explains how to set up and use the critical failures handler.
Handling Ignite/GridGain Exceptions
Exceptions supported by the Ignite API and actions you can take related to these exceptions are described below. Please see the Javadoc throws clause for checked exceptions.
| Exception | Description | Action | Runtime exception |
|---|---|---|---|
| `CacheInvalidStateException` | Thrown when you try to perform an operation on a cache in which some partitions have been lost. Depending on the partition loss policy configured for the cache, this exception is thrown on read and/or write operations. See Partition Loss Policy for details. | Reset lost partitions. You may want to restore the data by returning the nodes that caused the partition loss to the cluster. | Yes |
| `IgniteException` | Indicates an error condition in the cluster. | Operation failed. Exit from the method. | Yes |
| `IgniteClientDisconnectedException` | Thrown by the Ignite API when a client node gets disconnected from the cluster. Thrown from cache operations, the compute API, and data structures. | Wait and use retry logic. | Yes |
| `IgniteAuthenticationException` | Thrown when there is either a node authentication failure or a security authentication failure. | Operation failed. Exit from the method. | No |
| `CacheException` | Can be thrown from cache operations. | Check the exception message for the action to be taken. | Yes |
| `IgniteDeploymentException` | Thrown when the Ignite API fails to deploy a job or task on a node. Thrown from the compute API. | Operation failed. Exit from the method. | Yes |
| `IgniteInterruptedException` | Used to wrap the standard `InterruptedException` into a runtime exception. | Retry after clearing the interrupted flag. | Yes |
| `IgniteSpiException` | Thrown by the various SPI implementations. | Operation failed. Exit from the method. | Yes |
| `IgniteSQLException` | Thrown when there is a SQL query processing error. This exception also provides query-specific error codes. | Operation failed. Exit from the method. | Yes |
| `IgniteAccessControlException` | Thrown when there is an authentication/authorization failure. | Operation failed. Exit from the method. | No |
| `IgniteCacheRestartingException` | Thrown from the Ignite cache API if a cache is restarting. | Wait and use retry logic. | Yes |
| `IgniteFutureTimeoutException` | Thrown when a future computation times out. | Either increase the timeout limit or exit from the method. | Yes |
| `IgniteFutureCancelledException` | Thrown when a future computation cannot be retrieved because it was cancelled. | Use retry logic. | Yes |
| `IgniteIllegalStateException` | Indicates that the Ignite instance is in an invalid state for the requested operation. | Operation failed. Exit from the method. | Yes |
| `IgniteNeedReconnectException` | Indicates that a node should try to reconnect to the cluster. | Use retry logic. | No |
| `IgniteDataIntegrityViolationException` | Thrown if a data integrity violation is found. | Operation failed. Exit from the method. | Yes |
| `IgniteOutOfMemoryException` | Thrown when the system does not have enough memory to process Ignite operations. Thrown from cache operations. | Operation failed. Exit from the method. | Yes |
| `TransactionOptimisticException` | Thrown when a transaction fails optimistically. | Use retry logic. | No |
| `TransactionRollbackException` | Thrown when a transaction has been automatically rolled back. | Use retry logic. | No |
| `TransactionTimeoutException` | Thrown when a transaction times out. | Use retry logic. | No |
| `ClusterTopologyException` | Indicates an error with the cluster topology (e.g., a crashed node). Thrown from the compute and events APIs. | Wait on the future and use retry logic. | Yes |
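Several of the actions above amount to "wait and use retry logic." The following self-contained sketch shows one way to structure such retries generically; the `retry` helper, the attempt count, and the backoff delay are illustrative assumptions, not part of the Ignite API. In a real application, the task body would be a cache or compute operation and the caught exception one of the retryable types listed above (e.g., `IgniteClientDisconnectedException` or `TransactionOptimisticException`).

```java
import java.util.concurrent.Callable;

public class RetryExample {
    // Hypothetical helper: runs the task up to maxAttempts times,
    // sleeping between attempts, and rethrows the last failure.
    static <T> T retry(Callable<T> task, int maxAttempts, long delayMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            }
            catch (Exception e) {
                // In Ignite code, catch only the retryable exception types here.
                last = e;
                Thread.sleep(delayMillis);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // A task that fails twice, then succeeds - stands in for a cache operation.
        String result = retry(() -> {
            if (++calls[0] < 3)
                throw new RuntimeException("transient failure");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

For exceptions marked "Operation failed. Exit from the method," retrying is pointless; propagate the exception instead.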
Critical Failures Handling
GridGain is a robust and fault-tolerant system. But in the real world, unpredictable issues arise that can affect the state of an individual node or even the whole cluster. Such issues can be detected at runtime and handled accordingly using a preconfigured critical failure handler.
Critical Failures
The following failures are treated as critical:
- System critical errors (e.g., `OutOfMemoryError`).
- Unintentional system worker termination (e.g., due to an unhandled exception).
- System workers hanging.
- Cluster nodes segmentation.
A system critical error is an error which leads to the system’s inoperability. For example:
- File I/O errors: usually an `IOException` thrown by a file read/write operation. This can happen when Ignite native persistence is enabled (e.g., when no space is left on the device or on a device error), and also in in-memory mode, because GridGain uses disk storage to keep some metadata (e.g., when the file descriptor limit is exceeded or file access is prohibited).
- Out-of-memory error: the GridGain memory management system fails to allocate more space (`IgniteOutOfMemoryException`).
- Out-of-memory error: a cluster node runs out of Java heap (`OutOfMemoryError`).
Failures Handling
When GridGain detects a critical failure, it handles the failure according to a preconfigured failure handler. The failure handler can be configured as follows:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="failureHandler">
<bean class="org.apache.ignite.failure.StopNodeFailureHandler"/>
</property>
</bean>
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setFailureHandler(new StopNodeFailureHandler());
Ignite ignite = Ignition.start(cfg);
GridGain supports the following failure handlers:
| Class | Description |
|---|---|
| `NoOpFailureHandler` | Ignores any failures. Useful for testing and debugging. |
| `RestartProcessFailureHandler` | A specific implementation that can be used only with `ignite.sh`/`ignite.bat`. |
| `StopNodeFailureHandler` | Stops the node in case of critical errors by calling the `Ignition.stop(true)` or `Ignition.stop(nodeName, true)` method. |
| `StopNodeOrHaltFailureHandler` | This is the default handler, which tries to stop the node. If the node can't be stopped, the handler terminates the JVM process. |
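As a sketch of how a handler with non-default behavior is wired in code: `StopNodeOrHaltFailureHandler` accepts a `tryStop` flag and a stop timeout in its constructor. The 10-second value below is an illustrative choice, not a recommendation; this is a configuration fragment that assumes Ignite on the classpath.

```java
IgniteConfiguration cfg = new IgniteConfiguration();

// Try a graceful stop first; if the node has not stopped within
// 10 seconds, terminate the JVM process.
cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 10_000));

Ignite ignite = Ignition.start(cfg);
```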
Critical Workers Health Check
GridGain has a number of internal workers that are essential for the cluster to function correctly. If one of them is terminated, the node can become inoperative.
The following system workers are considered mission critical:
- Discovery worker: discovery events handling.
- TCP communication worker: peer-to-peer communication between nodes.
- Exchange worker: partition map exchange.
- Workers of the system's striped pool.
- Data streamer striped pool workers.
- Timeout worker: timeouts handling.
- Checkpoint thread: checkpointing in Ignite persistence.
- WAL workers: write-ahead logging, segment archiving, and compression.
- Expiration worker: TTL-based expiration.
- NIO workers: base networking.
GridGain has an internal mechanism for verifying that critical workers are operational.
Each worker is regularly checked to confirm that it is alive and updating its heartbeat timestamp.
If a worker is not alive and updating, the worker is regarded as blocked and GridGain will print a message to the log file.
You can set the period of inactivity via the `IgniteConfiguration.systemWorkerBlockedTimeout` property.
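The property can also be set programmatically; the 30-second value below is an illustrative choice, not a default. This is a configuration fragment that assumes Ignite on the classpath.

```java
IgniteConfiguration cfg = new IgniteConfiguration();

// Consider a system worker blocked if its heartbeat timestamp
// has not been updated for 30 seconds.
cfg.setSystemWorkerBlockedTimeout(30_000);
```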
Even though GridGain considers an unresponsive system worker to be a critical error, it doesn’t handle this situation automatically, other than printing out a message to the log file.
If you want a failure handler to react to unresponsive system workers of all types, clear the `ignoredFailureTypes` property of the handler as shown below:
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="systemWorkerBlockedTimeout" value="#{60 * 60 * 1000}"/>
<property name="failureHandler">
<bean class="org.apache.ignite.failure.StopNodeFailureHandler">
            <!-- Enable this handler to react to unresponsive critical workers. -->
<property name="ignoredFailureTypes">
<list>
</list>
</property>
</bean>
</property>
</bean>
StopNodeFailureHandler failureHandler = new StopNodeFailureHandler();
failureHandler.setIgnoredFailureTypes(Collections.emptySet());
IgniteConfiguration cfg = new IgniteConfiguration().setFailureHandler(failureHandler);
Ignite ignite = Ignition.start(cfg);