Command Line Tool
Ignite provides a simple command line tool, ignite-tf, that lets a user control how a TensorFlow cluster is built on top of an Apache Ignite cluster. The tool supports the following commands.
Start Command
The start command starts a new TensorFlow cluster on top of an Apache Ignite cluster for the specified cache and then starts training (specified by JOB_DIR, JOB_CMD, and JOB_ARGS). Once everything is started, Apache Ignite maintains all processes and automatically restarts them if any of them fails. The output of the start command is the output of training.
Usage: ignite-tf start [-hV] [-c=<cfg>] CACHE_NAME JOB_DIR JOB_CMD [JOB_ARGS…]
Starts a new TensorFlow cluster and attaches to user script process.
CACHE_NAME: Upstream cache name.
JOB_DIR: Job folder (or zip archive).
JOB_CMD: Job command.
[JOB_ARGS…]: Job arguments.
-c, --config=<cfg>: Apache Ignite client configuration.
-h, --help: Show this help message and exit.
-V, --version: Print version information and exit.
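As an illustration of the synopsis above, the following Python sketch assembles an argument list for the start command. The helper build_start_command is hypothetical and not part of Ignite, and the cache name, job folder, script name, and configuration path are placeholder values. Nothing is executed; the sketch only shows the order of the arguments.

```python
import shlex


def build_start_command(cache_name, job_dir, job_cmd, job_args=(), config=None):
    """Assemble argv for `ignite-tf start` following the usage synopsis."""
    argv = ["ignite-tf", "start"]
    if config is not None:
        # Optional Apache Ignite client configuration (-c / --config=<cfg>).
        argv.append(f"--config={config}")
    # Positional arguments, in the order given by the synopsis.
    argv += [cache_name, job_dir, job_cmd]
    argv += list(job_args)
    return argv


# Hypothetical values, for illustration only.
start_argv = build_start_command(
    cache_name="TEST_DATA",
    job_dir="/opt/jobs/mnist",
    job_cmd="python3",
    job_args=["train.py"],
    config="client.xml",
)

# Print the command as it would be typed in a shell.
print(shlex.join(start_argv))
```

To actually run the command, the list could be passed to subprocess.run, provided ignite-tf is on the PATH.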
Internally, the start command performs the following procedure:
- Determine the placement of partitions for the specified cache.
- According to the partition placement, start workers on the appropriate nodes.
- Start the training code on a random node in the cluster with a TF_CONFIG that contains information about worker placement.
- Route the output of training to the output of the start command.
- In case of failure, stop everything and start again from the first step.
- If training completes successfully, stop everything.
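The first three steps of this procedure can be sketched in Python as follows. This is not Ignite's implementation: the partition map, node names, addresses, and helper functions are all hypothetical, and the sketch assumes one worker per node that owns at least one partition. The JSON layout follows TensorFlow's documented TF_CONFIG convention (a cluster entry and a task entry). The exact contents Ignite writes, including the task entry given to the training code, are not specified here.

```python
import json

# Hypothetical partition placement: partition number -> node that owns it.
partition_to_node = {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-c"}

# Hypothetical host:port on which a worker would listen on each node.
node_address = {
    "node-a": "10.0.0.1:2222",
    "node-b": "10.0.0.2:2222",
    "node-c": "10.0.0.3:2222",
}


def plan_workers(placement):
    """Return the nodes that hold at least one partition, one worker each (assumption)."""
    return sorted(set(placement.values()))


def build_tf_config(worker_nodes, addresses, task_type, task_index):
    """Build a TF_CONFIG value in TensorFlow's cluster/task JSON layout."""
    return json.dumps({
        "cluster": {"worker": [addresses[n] for n in worker_nodes]},
        # The task entry is an assumption made for this illustration.
        "task": {"type": task_type, "index": task_index},
    })


workers = plan_workers(partition_to_node)
tf_config = build_tf_config(workers, node_address, "worker", 0)
print(tf_config)
```

A training script would typically read this value from the TF_CONFIG environment variable, for example with json.loads(os.environ["TF_CONFIG"]).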
Stop Command
The stop command stops the specified TensorFlow cluster and the corresponding training.
Usage: ignite-tf stop [-hV] [-c=<cfg>] CLUSTER_ID
Stops a running TensorFlow cluster.
CLUSTER_ID: Cluster identifier.
-c, --config=<cfg>: Apache Ignite client configuration.
-h, --help: Show this help message and exit.
-V, --version: Print version information and exit.
Attach Command
The attach command attaches to the specified training and routes the output of this training to the output of the attach command.
Usage: ignite-tf attach [-hV] [-c=<cfg>] CLUSTER_ID
Attaches to running TensorFlow cluster (user script process).
CLUSTER_ID: Cluster identifier.
-c, --config=<cfg>: Apache Ignite client configuration.
-h, --help: Show this help message and exit.
-V, --version: Print version information and exit.
Ps Command
The ps command prints the identifiers of all running TensorFlow clusters.
Usage: ignite-tf ps [-hV] [-c=<cfg>]
Prints identifiers of all running TensorFlow clusters.
-c, --config=<cfg>: Apache Ignite client configuration.
-h, --help: Show this help message and exit.
-V, --version: Print version information and exit.
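Because ps prints cluster identifiers and stop takes a CLUSTER_ID, the two commands can be combined to shut down every running cluster. The Python sketch below shows one way this might look. It assumes that ps prints one identifier per line, which this page does not specify; the sample output and identifiers are made up, and both helpers are hypothetical. No command is executed.

```python
def parse_cluster_ids(ps_output):
    """Extract cluster identifiers, assuming one identifier per non-empty line."""
    return [line.strip() for line in ps_output.splitlines() if line.strip()]


def build_stop_commands(cluster_ids, config=None):
    """Build one `ignite-tf stop` argv per cluster identifier."""
    commands = []
    for cluster_id in cluster_ids:
        argv = ["ignite-tf", "stop"]
        if config is not None:
            # Optional Apache Ignite client configuration (-c / --config=<cfg>).
            argv.append(f"--config={config}")
        argv.append(cluster_id)
        commands.append(argv)
    return commands


# Made-up sample of what `ignite-tf ps` might print (format is an assumption).
sample_ps_output = """
1b2c3d4e-0000-0000-0000-000000000001
1b2c3d4e-0000-0000-0000-000000000002
"""

cluster_ids = parse_cluster_ids(sample_ps_output)
stop_commands = build_stop_commands(cluster_ids, config="client.xml")
for argv in stop_commands:
    print(" ".join(argv))
```

The same pattern applies to attach: replacing "stop" with "attach" yields the command that follows the output of a given cluster.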
Cluster Manager
Apache Ignite has a complex infrastructure that maintains a TensorFlow cluster. A quick overview of this is shown in the following diagram: