If you’re not interested in John Cleese, just listen to Akmal Chaudhri explain how machine and deep learning work with Apache Ignite. But if you really want to understand the problem before diving into the details, I recommend you learn from John Cleese.
Many years ago, long before machine learning but long after Lisp was invented, John Cleese made a big impression on me at a conference. It wasn’t just because I almost ran into him, looked up and up (he’s tall) and said “hi”, to which he responded “Oh! Hi!” It’s because he spoke at that conference about the Tortoise Brain and the Hare Brain. While he had gotten some of his ideas from a book by Guy Claxton, it was his own interpretation that made its mark.
He pointed out that we have two brains: the tortoise brain, which takes time to learn, and the hare brain, which acts without thinking. That is exactly how machine learning and streaming work today. Machine learning and deep learning are the tortoise side of the digital brain: the side that learns, that builds the models that drive action. Stream processing technologies like Apache Spark are the hare side of the digital brain: the side that acts without having to think.
Now here’s the problem. The two sides of our brain work together continuously. When we do our jobs, we’re constantly learning and constantly reacting based on the latest information. With a digital brain, machine learning and stream processing are not enough on their own. Something needs to allow machine and deep learning to run continuously on the latest real-time streaming data, and then allow Spark and other streaming technologies to act on the results.
The brain stem connecting the two, and the memory they share, is in-memory computing. In-memory computing is what makes continuous learning and action possible.
The first challenge is that machine and deep learning traditionally are not real-time. First, you need to gather a lot of data to train machine learning models, and moving data takes time; a lot of time. You can move 3.6TB of data over a dedicated corporate network … in an hour, if you’re lucky. Once you have the data, it takes time to run the models and learn, and traditionally dedicated machine learning infrastructure does not scale well enough for real-time performance.
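To see why that transfer-time figure matters, here is a quick back-of-the-envelope check (my own arithmetic, not from the original): moving 3.6TB in one hour requires a sustained rate of about 8 Gbit/s, which is close to saturating a dedicated 10 Gbit/s link.

```python
# Back-of-the-envelope check: what sustained throughput does it take
# to move 3.6 TB in one hour?

TB = 10**12                 # bytes per terabyte (decimal)
data_bytes = 3.6 * TB       # 3.6 TB of training data
seconds = 3600              # one hour

# bytes -> bits, divide by time, express in Gbit/s
throughput_gbps = data_bytes * 8 / seconds / 10**9
print(f"Required sustained throughput: {throughput_gbps:.1f} Gbit/s")
# Prints: Required sustained throughput: 8.0 Gbit/s
```

In other words, even a fully dedicated 10 GbE link barely meets the claim, which is why repeatedly copying training data around is a real bottleneck.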
With the right in-memory computing, you no longer face either of these problems. You can run machine and deep learning in place, where the data resides, so you don’t have to move it, and you can run it against even petabytes of data with near-real-time performance by scaling horizontally and running in memory. Apache Ignite is the only in-memory computing platform that includes machine and deep learning algorithms optimized to collocate the processing with the data in memory across a distributed cluster.
The second challenge is getting the continuously changing results into Spark or other streaming technologies so they can act in real time on the latest results. In-memory computing helps here by acting as the in-memory data management layer for Apache Spark. Apache Ignite has the broadest integration with Apache Spark, including integration at the RDD, DataFrame, and HDFS levels.
Now that you’ve learned the basics of the tortoise brain and hare brain, learn the details from Akmal on machine and deep learning with Apache Ignite.