The GridGain Data Lake Accelerator, released today, is an in-memory solution for digital businesses that need to enrich operational data with historical data stored in data lakes to improve real-time analytics and decision automation.
A data lake is a system or repository of data stored in its natural format, usually object blobs or files. It typically serves as a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning.
The GridGain Data Lake Accelerator is available for use with the GridGain Enterprise Edition and GridGain Ultimate Edition. A free 30-day trial of the GridGain Data Lake Accelerator is available from the GridGain Downloads page.
The GridGain Data Lake Accelerator speeds up data lake access by providing bi-directional integration with Apache™ Hadoop®. This integration brings the historical data into the same in-memory computing layer as the operational data, enabling real-time analytics and computing on the combined data to drive real-time business processes. It uses the GridGain Unified API and the native Apache Spark™ connector to power real-time HTAP (hybrid transactional/analytical processing), in which transactions and analytics are performed on the same operational dataset.
Typical use cases for the GridGain Data Lake Accelerator include using historical data to enrich real-time data streams, calculating thresholds for real-time operational triggers from historical trends, and displaying historical and real-time data together in operational dashboards. For example, a transportation company might be collecting a continuous stream of data from its vehicle engines.
The data is ingested, processed and analyzed, then stored in a data lake, with only the most recent data retained in the operational data store. When an anomalous reading in the live data triggers an alert for a particular engine, the system needs to analyze that engine's historical data alongside the live data to identify the root cause of the problem.
An infrastructure powered by GridGain’s in-memory computing platform, Kafka, Spark and Hadoop makes this possible. Apache Kafka feeds the live streaming data to the GridGain in-memory computing platform and to the Hadoop data lake. Spark retrieves the required historical data from the data lake and delivers it to the in-memory computing platform. The GridGain in-memory computing platform maintains the combined dataset in memory and runs real-time queries across it. The result is deep and immediate insight into the causes of the anomalous reading.
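As a rough illustration of the engine example, the sketch below derives an alert threshold from historical readings and then merges historical and recent readings into one timeline for root-cause analysis. It uses plain in-memory Python lists as stand-ins for the data lake and the operational store and does not call any GridGain, Kafka, Spark or Hadoop API. The `Reading` type, the `temperature` field, the mean-plus-three-standard-deviations rule and all sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class Reading:
    """Hypothetical engine reading; field names are illustrative only."""
    engine_id: str
    timestamp: int      # assumed: seconds since some reference point
    temperature: float  # assumed sensor value


def alert_threshold(history, k=3.0):
    """Upper alert threshold: historical mean plus k sample standard deviations."""
    values = [r.temperature for r in history]
    return mean(values) + k * stdev(values)


def anomalies(recent, threshold):
    """Recent readings whose value exceeds the threshold."""
    return [r for r in recent if r.temperature > threshold]


def combined_view(history, recent, engine_id):
    """Merge historical and recent readings for one engine, ordered by time."""
    rows = [r for r in list(history) + list(recent) if r.engine_id == engine_id]
    return sorted(rows, key=lambda r: r.timestamp)


# Stand-in for historical data retrieved from the data lake (made-up values).
history = [
    Reading("E1", t, temp)
    for t, temp in enumerate([90.0, 92.0, 91.0, 89.0, 93.0])
]

# Stand-in for the most recent data in the operational store (made-up values).
recent = [Reading("E1", 5, 91.5), Reading("E1", 6, 120.0)]

threshold = alert_threshold(history)            # trigger derived from historical trend
flagged = anomalies(recent, threshold)          # live readings that would raise an alert
timeline = combined_view(history, recent, "E1") # combined data for root-cause analysis
```

In a real deployment the two lists would be replaced by data held in the in-memory computing layer, and the threshold rule would be whatever the operations team considers appropriate for the sensor in question.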