Troubleshooting and Debugging
This article covers some common tips and tricks for debugging and troubleshooting GridGain and Ignite deployments.
Debugging Tools: Consistency Check Command
The ./control.sh|bat utility includes a set of consistency check commands that help verify internal data consistency invariants.
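For example, the idle_verify command computes and compares partition hashes across the cluster and reports caches whose primary and backup copies have diverged (run it against an idle cluster, as concurrent updates can produce false positives):
control.sh --cache idle_verify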
Persistence Files Disappear on Restart
On some systems, the default location for Ignite persistence files might be under a temp folder. This can lead to situations where persistence files are removed by the operating system whenever a node process is restarted. To avoid this:
- Ensure that the WARN logging level is enabled for GridGain. You will see a warning if the persistence files are written to the temporary directory.
- Change the location of all persistence files using the DataStorageConfiguration APIs, such as setStoragePath(…), setWalPath(…), and setWalArchivePath(…), as shown in the sketch below.
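A minimal sketch of relocating all persistence files through DataStorageConfiguration; the /opt/gridgain/… paths are illustrative and should be replaced with durable locations on your file system:
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

DataStorageConfiguration storageCfg = new DataStorageConfiguration();

// Move checkpointed data, WAL, and WAL archive out of the default (possibly temp) location.
storageCfg.setStoragePath("/opt/gridgain/storage");
storageCfg.setWalPath("/opt/gridgain/wal");
storageCfg.setWalArchivePath("/opt/gridgain/wal-archive");

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setDataStorageConfiguration(storageCfg);

Ignite ignite = Ignition.start(cfg);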
Too Many Thin Clients Connect to Cluster
In some environments, the cluster may encounter a memory issue with an especially large number of clients. For example, if GridGain accepts a lot of client connections and then has to run a memory-intensive operation, the node may run out of memory. To avoid this:
- Track the client.connector.ActiveSessionsCount metric to make sure you are not getting more connections than necessary.
- Use Java Metrics to keep track of memory usage on the node.
- Increase the amount of direct memory by setting the MaxDirectMemorySize JVM parameter. The specific memory requirement depends heavily on the number of clients and the load they generate.
If the metrics show that you are running low on memory, use the maxConnectionCnt thin client configuration parameter to limit the number of client connections.
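A minimal sketch of applying such a limit on the server side. Note that setMaxConnectionCnt(…) is assumed here to be the setter backing the maxConnectionCnt parameter; verify the exact method name against your GridGain version:
import org.apache.ignite.configuration.ClientConnectorConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

ClientConnectorConfiguration connectorCfg = new ClientConnectorConfiguration();

// Assumed setter for maxConnectionCnt: reject thin client connections beyond 100.
connectorCfg.setMaxConnectionCnt(100);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setClientConnectorConfiguration(connectorCfg);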
Cluster Does Not Start After Field Type Changes
When developing your application, you may need to change the type of a custom object’s field. For instance, let’s say you have object A with field A.range of int type, and then you decide to change the type of A.range to long right in the source code. When you do this, the cluster or the application will fail to restart because GridGain doesn’t support field/column type changes.
You can use the experimental meta command to remove the stored metadata for a type from the cluster. If you are not sure what to remove specifically, use the list subcommand to list all metadata types. You can specify the type to remove by ID or by name, and also specify an output folder to store a backup in. After you do so, GridGain will treat the column as a new type and continue to operate normally.
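For example, to see which types are currently stored before removing anything:
control.sh --meta list
The remove subcommand accepts a type ID or name and an optional backup location: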
control.sh --meta remove [--typeId <typeId>] [--typeName <typeName>] [--out <fileName>]
control.bat --meta remove [--typeId <typeId>] [--typeName <typeName>] [--out <fileName>]
The meta command is not intended for normal cluster operation, and the user is responsible for fulfilling the conditions for proper execution:
- Data of the removed type must not be stored in caches;
- No other operations should be performed with the type. For example, you should not delete metadata while creating a new object.
You can also, in development, go into the file system and remove the following directories: marshaller/, db/, and wal/, located in the GridGain working directory (db and wal might be located elsewhere if you have redefined their location). This achieves a similar result to running the meta command, but is less targeted.
In production, we still recommend adding a new field with a different name to your object model and removing the old one. This operation is fully supported. Alternatively, the ALTER TABLE command can be used to add new columns or remove existing ones at run time.
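If you prefer the SQL route, the sketch below shows the idea. The table and column names (Person, range, range_long) are illustrative, the SQL_PUBLIC_PERSON cache name assumes a table created via SQL in the PUBLIC schema, and existing data is not migrated automatically:
import org.apache.ignite.cache.query.SqlFieldsQuery;

// Add a replacement column with the new type, then drop the old one.
ignite.cache("SQL_PUBLIC_PERSON").query(
    new SqlFieldsQuery("ALTER TABLE Person ADD COLUMN range_long BIGINT")).getAll();
ignite.cache("SQL_PUBLIC_PERSON").query(
    new SqlFieldsQuery("ALTER TABLE Person DROP COLUMN range")).getAll();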
Saving WAL Data to Disk on Corruption
The normal way to deal with data corruption is to use maintenance mode to resolve the corruption issue and return to normal operation. Sometimes this may lead to data loss, for example when the index file is restored to a state that does not account for the WAL. You can enable the IGNITE_DUMP_PERSISTENCE_FILES_ON_DATA_CORRUPTION system property to save all stored data to the {GRIDGAIN_HOME}/db/dump folder when corruption is detected.
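Like other Ignite system properties, it can be enabled with a JVM argument when starting the node:
-DIGNITE_DUMP_PERSISTENCE_FILES_ON_DATA_CORRUPTION=true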
Debugging GC Issues
This section contains information that may be helpful when you need to debug and troubleshoot issues related to Java heap usage or GC pauses.
Heap Dumps
You can configure the JVM to dump the heap automatically when an OutOfMemoryError occurs. This helps if the root cause of the error is not clear, as the dump provides a deeper look at the heap state at the moment of failure:
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/path/to/heapdump
-XX:+ExitOnOutOfMemoryError
Detailed GC Logs
In order to capture detailed information about GC-related activity, make sure the settings below are configured in the JVM settings of your cluster nodes. Note that the two logging syntaxes target different JVM generations: -Xlog is the unified logging syntax introduced in Java 9, while the -XX:+PrintGC* and -Xloggc flags apply to Java 8 and earlier.
Common flags:
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintFlagsFinal
-XX:+ScavengeBeforeFullGC
Java 9 and later:
-Xlog:gc*,safepoint:/path/to/gc/logs/gc.log:time,uptime,level,tags:filecount=10,filesize=10M
Java 8 and earlier:
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M
-Xloggc:/path/to/gc/logs/gc.log
Replace /path/to/gc/logs/ with an actual path on your file system.
In addition, for the G1 collector on Java 8 and earlier, set the property below. It provides many additional details that are purposefully not included in the -XX:+PrintGCDetails output:
-XX:+PrintAdaptiveSizePolicy
Performance Analysis With Flight Recorder
In cases when you need to debug performance or memory issues, you can use Java Flight Recorder to continuously collect low-level runtime statistics, enabling after-the-fact incident analysis. To enable Java Flight Recorder, use the following settings:
-XX:+FlightRecorder
-XX:+UnlockDiagnosticVMOptions
-XX:+DebugNonSafepoints
To start recording the state on a particular GridGain node, use the following command:
jcmd <PID> JFR.start name=<recording_name> duration=60s filename=/var/recording/recording.jfr settings=profile
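The recording stops automatically after the configured duration; to stop a named recording earlier, use the matching JFR.stop command:
jcmd <PID> JFR.stop name=<recording_name>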
For Flight Recorder details, refer to Oracle’s official documentation.
JVM Pauses
Occasionally you may see a warning message about the JVM being paused for too long. This can happen during bulk loading, for example.
Adjusting the IGNITE_JVM_PAUSE_DETECTOR_THRESHOLD timeout setting may give the process time to finish without generating the warning. You can set the threshold via an environment variable, pass it as a JVM argument (-DIGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000), or pass it as a parameter to ignite.sh (-J-DIGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000). The value is in milliseconds.
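For example, to raise the threshold to 5 seconds via an environment variable before launching the node:
export IGNITE_JVM_PAUSE_DETECTOR_THRESHOLD=5000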
Client Node Fails to Start Before Server Node
In environments where clusters need to be brought up often, the expected behavior is to start the server node first and then have client nodes connect to it. If done in reverse, it may cause issues because the server is still starting when the client tries to get data from it. You can implement a manual readiness check by creating an AtomicLong data structure on the server node after loading completes, and checking for it from the client nodes:
Here is an example of server-side code:
...
// Loading is complete; create the atomic long and set its value to 1.
ignite.atomicLong("myAtomic", 1, true);
...
And the code below delays the client initialization until the value is retrieved:
while (true) {
    // Try to get "myAtomic" and check its value.
    IgniteAtomicLong atomicLong = ignite.atomicLong("myAtomic", 0, false);

    if (atomicLong != null && atomicLong.get() == 1) {
        // Initialization is complete.
        break;
    }

    // Not ready yet; wait and retry.
    Thread.sleep(1000);
}
Uneven Data Distribution
The default GridGain affinity function does not guarantee even data distribution. As a result, sometimes large clusters may encounter uneven data distribution across the nodes. For example, when a node leaves a 20-node cluster, some nodes may receive 10% of total cluster data, while others will not receive any additional data. This may cause performance issues with nodes that suddenly handle more data than expected.
In most cases, this can be remedied by increasing the number of partitions in your cluster. GridGain aims to reduce overhead caused by rebalancing data, so having smaller partitions means that data can be spread more evenly even in these scenarios.
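A minimal sketch of raising the partition count through the cache’s affinity function; the cache name and the count of 2048 (double the default of 1024) are illustrative. Note that the partition count must be chosen before the cache is created; it cannot be changed for an existing cache:
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");

// The second argument is the partition count (default is 1024).
cacheCfg.setAffinity(new RendezvousAffinityFunction(false, 2048));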
If you are already in a situation of unfavorable data distribution, you can also force GridGain to redistribute data off a node by changing its consistent ID. This triggers the rebalance process, usually resulting in a more even distribution.