In case you hadn’t noticed, this year’s annual Spark conference is, for the first time, the Spark+AI Summit. That Spark and AI belong together is predictable even without… using AI to figure it out. But there’s only one way to add continuous learning to Spark+AI, to make AI learn and adapt to new information in near real-time the way a person does. It’s not the AllSpark, the cube that creates Transformers in the movies. It’s in-memory computing.
If you’re interested in learning how in-memory computing implements continuous learning for machine and deep learning with Spark, come visit us at the Spark+AI Summit. You can also read about in-memory computing and Spark here, or download a technology note on the topic. But here’s a brief explanation of the problem and how in-memory computing solves it.
Most companies process streaming data to identify certain patterns and respond in “real time.” They do it to monitor and improve the customer experience or the operations of connected devices (the Industrial Internet of Things, or IIoT), manage SLAs across order-to-cash processes, detect cyber-attacks and fraud, manage IT systems, or enforce regulatory compliance.
Spark is great as a general-purpose stream processing engine for turning streaming data into understandable events. But it doesn’t have a built-in rules or decision-making engine. If you’re trying to replace decisions traditionally made by people, you could code that logic yourself. However, several limitations in Spark mean that approach takes a lot of code, and it still won’t be able to do continuous learning.
The first big challenge is data storage. Every data processing engine needs data management. When Hadoop first emerged, you had MapReduce for processing and HDFS for storage. Spark has since supplanted MapReduce as the de facto processing engine for Hadoop. So what’s the best storage for Spark? Remember that Spark does everything in memory using RDDs, DataFrames, or Datasets, while HDFS is a disk-based file system that is very slow compared to in-memory computing. Most projects to date have relied on a lot of custom code plus databases, HDFS, or some other disk-based mechanism, and that has created a performance bottleneck.
The second big challenge is saving and sharing state. Spark doesn’t save or share state easily; RDDs are read-only by default, for example. That may not seem like a big deal until you try to watch for patterns: to track any behavior over time, you have to save state and share it across jobs. The way developers have traditionally done that, again, is to write state out to disk or a database.
The third big challenge is the network. Calculations, machine learning, and other general data processing all require a lot of data. Unless you have a good distributed data management system, that data has to move over the network every time it needs to reach the processing.
With these limitations, machine or deep learning can’t be continuous or even respond to the latest data. You might be waiting for terabytes or petabytes to be moved into specialized hardware before you can kick off the training. A dedicated 10GigE network moves roughly 1.25 GB per second, so shifting 3.6TB takes about an hour … if you’re lucky. Then the learning itself might take several additional hours, because the infrastructure doesn’t scale well enough to easily process petabytes.
The good news is that streaming analytics and continuous learning can be implemented much more easily with in-memory computing, and in particular with Apache Ignite and GridGain, the commercially supported version of Ignite. Apache Ignite is another top-level Apache project. It is actually among the top five Apache projects measured by number of commits and mailing-list activity, with twice as many commits as Spark in 2017.
Ignite has the broadest in-memory data management support for Spark. It eliminates the first bottleneck, the disk, by managing all data in memory for Spark and exposing it through Spark APIs – RDDs, DataFrames, even an in-memory HDFS – that give direct access to Ignite’s/GridGain’s in-memory data grid (IMDG). It eliminates the second bottleneck, sharing state, by providing mutable (writable) RDD and DataFrame APIs that can easily be shared across Spark jobs.
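To make that concrete, here’s a minimal sketch of sharing writable state between Spark jobs through an Ignite cache. It assumes Ignite 2.x’s ignite-spark module (Scala API); the configuration file path and the cache name are hypothetical placeholders.

```scala
// Minimal sketch: a writable, shared key-value RDD backed by an Ignite cache.
// Assumes the ignite-spark module is on the classpath; the XML config path and
// the "sharedState" cache name are hypothetical placeholders.
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

object SharedStateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ignite-shared-state").getOrCreate()
    val sc = spark.sparkContext

    // IgniteContext starts (or connects to) Ignite nodes alongside the Spark executors.
    val ic = new IgniteContext(sc, "config/example-ignite.xml")

    // A mutable RDD view over the Ignite cache "sharedState".
    val state = ic.fromCache[String, Double]("sharedState")

    // One Spark job writes state...
    state.savePairs(sc.parallelize(1 to 1000).map(i => (s"sensor-$i", i * 0.1)))

    // ...and another job (or another Spark application on the same cluster) reads it back.
    val hot = state.filter { case (_, value) => value > 50.0 }.count()
    println(s"Sensors above threshold: $hot")

    ic.close(false)
    spark.stop()
  }
}
```

Because the cache lives in the Ignite cluster rather than inside a single Spark application, a second job or application sees the same state without re-reading it from disk.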
Ignite also eliminates the third bottleneck, the network, by design. Ignite nodes can be installed on every machine running Spark jobs, and configurable data affinity distributes the right data to the right nodes. That’s possible because Ignite was built for large-scale distributed in-memory computing, where the only way to eliminate the network bottleneck is to not move the data over the network in the first place.
Ignite implements a concept called massively parallel processing (MPP) on the Ignite Compute Grid. MPP lets Ignite perform large-scale in-memory processing in ways that Spark cannot, and the results are easy for Spark to consume. Ignite supports user-defined partitioning of data across the cluster that puts all the data a given computation needs on the same machine. The Compute Grid then distributes custom Java, .NET, or C++ code across the cluster, executes it locally on each node, and collects and returns the results as if it were a single operation.
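Here’s a rough sketch of that collocated, move-the-code-to-the-data pattern using Ignite’s compute API. It assumes Ignite 2.x; the cache name, key, and configuration path are hypothetical placeholders.

```scala
// Minimal sketch: run a job on the node that owns a particular key, so the
// computation is collocated with the data instead of pulling data over the network.
// The cache name, key, and config path are hypothetical placeholders.
import org.apache.ignite.{Ignite, Ignition}
import org.apache.ignite.lang.IgniteRunnable

// A serializable job that reads its input from the node-local copy of the cache.
class LocalPeekJob(cacheName: String, key: String) extends IgniteRunnable {
  override def run(): Unit = {
    val localCache = Ignition.localIgnite().cache[String, Double](cacheName)
    println(s"Value of $key on this node: ${localCache.localPeek(key)}")
  }
}

object AffinityComputeSketch {
  def main(args: Array[String]): Unit = {
    val ignite: Ignite = Ignition.start("config/example-ignite.xml")
    val metrics = ignite.getOrCreateCache[String, Double]("metrics")
    metrics.put("customer-42", 99.5)

    // affinityRun routes the job to whichever node owns the "customer-42" key,
    // so the work executes next to the data.
    ignite.compute().affinityRun("metrics", "customer-42", new LocalPeekJob("metrics", "customer-42"))

    ignite.close()
  }
}
```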
Ignite provides a broad, integrated set of distributed MPP implementations, including its distributed SQL engine and its machine and deep learning library, that are easily invoked from Spark. The existing integration between Spark and Ignite, for example, takes the Spark SQL used with DataFrames and selectively adds in Ignite SQL to accelerate certain operations. This can improve performance by up to 1,000x compared to using Spark SQL or a traditional database’s SQL operations.
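As an illustration of how that looks from the Spark side, here’s a minimal sketch that reads an Ignite SQL table as a Spark DataFrame so filters can be handled by Ignite’s distributed SQL engine. It assumes the ignite-spark DataFrame integration; the PERSON table and the config path are hypothetical placeholders.

```scala
// Minimal sketch: expose an Ignite SQL table to Spark as a DataFrame so that
// operations such as filters can be pushed down to Ignite's distributed SQL engine.
// The PERSON table and the config path are hypothetical placeholders.
import org.apache.ignite.spark.IgniteDataFrameSettings._
import org.apache.spark.sql.SparkSession

object IgniteSqlDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ignite-sql-dataframe").getOrCreate()

    val persons = spark.read
      .format(FORMAT_IGNITE)                                    // Ignite data source for Spark SQL
      .option(OPTION_CONFIG_FILE, "config/example-ignite.xml")  // hypothetical Ignite config
      .option(OPTION_TABLE, "PERSON")                           // hypothetical Ignite SQL table
      .load()

    // With the Ignite data source, this filter can run inside Ignite's SQL engine
    // rather than by scanning all of the rows in Spark.
    persons.filter("AGE >= 21").show()

    spark.stop()
  }
}
```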
The Ignite Continuous Learning Framework is what makes big data analytics and in-place continual learning for real-time responsiveness and automation a reality. It is built on MPP-style machine and deep learning algorithms that run in memory, collocated with the data. This approach can deliver near real-time continuous learning against petabytes of data for two reasons. First, the data doesn’t need to be moved over the network via ETL or some other method before the machine learning algorithms run; they run in place. Second, Ignite is designed for horizontal, linear scalability with in-memory performance.
Ignite provides several standard machine learning algorithms optimized for MPP, including linear and multilinear regression, k-means clustering, decision trees, and k-NN classification and regression. It also includes a multilayer perceptron for deep learning. Developers can also build and deploy their own machine learning algorithms across the cluster using the Compute Grid.
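As a rough sketch of what invoking one of these trainers looks like, the example below runs Ignite’s k-means clustering directly against data held in an Ignite cache. It assumes the Ignite 2.7-era ML API (class and method names have shifted between releases, so check them against your version); the cache layout and config path are hypothetical placeholders.

```scala
// Rough sketch: distributed k-means over an Ignite cache, collocated with the data.
// Assumes the Ignite 2.7-era ML API; the cache layout (value(0) = label, rest = features)
// and the config path are hypothetical placeholders.
import org.apache.ignite.Ignition
import org.apache.ignite.ml.clustering.kmeans.KMeansTrainer
import org.apache.ignite.ml.math.functions.IgniteBiFunction
import org.apache.ignite.ml.math.primitives.vector.{Vector, VectorUtils}

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val ignite = Ignition.start("config/example-ignite.xml")
    val points = ignite.getOrCreateCache[Integer, Array[Double]]("points")

    // A tiny toy dataset: the first element is a label, the rest are features.
    points.put(1, Array(0.0, 1.0, 1.0))
    points.put(2, Array(0.0, 1.2, 0.9))
    points.put(3, Array(1.0, 8.0, 8.1))
    points.put(4, Array(1.0, 8.2, 7.9))

    val features: IgniteBiFunction[Integer, Array[Double], Vector] =
      (_: Integer, v: Array[Double]) => VectorUtils.of(v.drop(1): _*)
    val labels: IgniteBiFunction[Integer, Array[Double], java.lang.Double] =
      (_: Integer, v: Array[Double]) => java.lang.Double.valueOf(v(0))

    // The trainer runs map-reduce style across the cache partitions, so the
    // learning happens where the data already lives.
    val model = new KMeansTrainer().fit(ignite, points, features, labels)
    println(s"Trained model: $model")

    ignite.close()
  }
}
```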
As the Spark+AI Summit name implies, if you’re not already using machine learning with Spark, you should be. And if you’re not using in-memory computing with Spark in some form to help simplify data ingestion and access, data preparation, or data storage, you should be. So get started.
You can read more on gridgain.com about Ignite and GridGain’s support for Spark, download the technology note on Ignite and Spark, or just come by the GridGain booth at the Spark+AI Summit if you want to learn more.