Back in the day, when today’s parents were software engineers, whenever you built large-scale systems you had to size everything. For a while many people forgot, except for the few in banking, or the people who rewrote Sabre Systems and other high-volume systems. But then data volume, variety and velocity took off.
Now you HAVE to think about HOW you improve performance and scalability because in many cases just adding a “cache-aside” pattern will not work.
We increasingly need more data to do a simple task, and if we try to move that much data over the network, we will fail. At HomeAway, which includes VRBO in case you don’t know them, Chris Berry did some simple math that he shared at the In-Memory Computing Summit.
HomeAway was running as many as 2300 batches of 200 calculations per second to calculate prices (daily rates) for homes, 12 billion per day overall. Each calculation required 250 KB of data. If you do the math like Chris did, that’s 115 gigabytes (GB) of data per second at peak loads. That’s why Chris couldn’t even consider a traditional cache like Redis or Memcached. As he put it, you can’t violate the laws of physics, or in this case, the laws of the network.
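The arithmetic behind that 115 GB figure can be reproduced in a few lines. This is only a sketch of the numbers quoted above; the constant and function names are made up for illustration, and 250 KB is treated as 250,000 bytes (decimal units).

```python
# Figures quoted in the article for HomeAway's peak pricing load.
BATCHES_PER_SEC = 2300
CALCS_PER_BATCH = 200
BYTES_PER_CALC = 250_000  # 250 KB per calculation, decimal units (assumption)


def peak_bytes_per_sec(batches_per_sec: int, calcs_per_batch: int, bytes_per_calc: int) -> int:
    """Data that must reach the computation every second at peak."""
    return batches_per_sec * calcs_per_batch * bytes_per_calc


peak = peak_bytes_per_sec(BATCHES_PER_SEC, CALCS_PER_BATCH, BYTES_PER_CALC)
print(f"{peak / 1e9:.0f} GB/s")  # 115 GB/s
```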
How bad was the bottleneck? Today, a 10 Gigabit (10GbE, pronounced “10 Gig E”) network is the largest network you might see in a company, even though it was standardized back in 2002. It can move 1 gigabyte/sec peak if you’re very lucky and don’t have collisions. A 1 gigabit (1GbE) network, which might move 100 MB/sec, is more common; it was standardized in 1999. Some companies still have 100 megabit (100MbE) networks deployed. Each batch of 200 calculations needs 200 x 250 KB = 50 MB of data. So, if HomeAway were to move all this data across the network each time they needed to do a calculation, they might be able to handle 2 of those 2300 batches per second on a dedicated 1GbE network, or 20 batches on a 10GbE network.
But that means they’d still need over 100 10GbE networks, or more than 10 of the newer 100GbE networks, to handle peak loads, which is just not feasible. And this is why traditional caches like Redis and Memcached fail. They help offload reads from an existing database by caching. But any client that wants to use the cache when it’s configured as a grid or cluster still needs to fetch the data across the network to bring it to the computation.
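Here is a rough sketch of that network math. The 1GbE and 10GbE throughput values are the optimistic effective rates used above, not line rates, and the 100GbE value is an assumption that simply scales the 10GbE figure by ten.

```python
BATCH_BYTES = 200 * 250_000              # 50 MB of data per batch
PEAK_BYTES_PER_SEC = 2300 * BATCH_BYTES  # 115 GB/s at peak

# Effective throughput in bytes/sec. The 100GbE entry is an extrapolation.
LINK_BYTES_PER_SEC = {
    "1GbE": 100_000_000,
    "10GbE": 1_000_000_000,
    "100GbE": 10_000_000_000,
}


def batches_per_link(link: str) -> int:
    """Whole batches per second that one dedicated link can carry."""
    return LINK_BYTES_PER_SEC[link] // BATCH_BYTES


def links_needed(link: str) -> int:
    """Dedicated links required to carry the peak load (rounded up)."""
    capacity = LINK_BYTES_PER_SEC[link]
    return -(-PEAK_BYTES_PER_SEC // capacity)  # integer ceiling division


for name in LINK_BYTES_PER_SEC:
    print(name, batches_per_link(name), "batches/sec,", links_needed(name), "links needed")
```

With these assumptions a 10GbE link carries 20 of the 2300 batches, so you would need 115 of them, or 12 links at the assumed 100GbE rate.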
What made much more sense was to collocate data and processing, that is, to move the processing and the data onto the same machine. That’s what Apache Ignite does, and that’s one reason why HomeAway chose GridGain, the enterprise-ready version of Apache Ignite.
But isn’t this an extreme case? Do you really have this problem? Even if you only have 1/100th the network needs of HomeAway, you STILL have a bottleneck, even on the newer networks. That is especially true the moment you start to automate decisions in real time, which is really what HomeAway started to do with its calculations. So if you’re having a performance problem, don’t just assume it’s your database getting overloaded. Figure out what kind of performance problem you have! It might be a network issue!
First, figure out how much data you need to cache. Determine how much of the data involved in the performance issues is touched at least daily. In the case of HomeAway, they needed ½ TB. Do the math:
- How many requests/sec or calculations/sec do you need to do?
- What data is used for those calculations?
- Round up for the worst case, and look forward 2-3 years, since what you’re building needs to last for at least a few years.
If you’re not going to store the cache on the same machine as the computation, or if you have a dedicated cluster or grid for your cache, then this calculation will tell you if you have, or when you will have, a network problem. Also, make sure you look across your future caching needs for all your applications for the next 2-3 years. If you start caching the same data in different caches for different applications, you will effectively multiply your network loads by the number of apps using that data.
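The sizing exercise above can be sketched as a small projection. Everything here is hypothetical: the function names, the compound growth model, and the example figures (2,000 requests/sec, 100 KB per request, 50% yearly growth) are illustrations, not HomeAway’s numbers.

```python
from typing import Optional


def projected_load(requests_per_sec: float, bytes_per_request: float,
                   annual_growth: float, years: int, apps_sharing_data: int = 1) -> float:
    """Bytes/sec crossing the network after `years` of compound growth.

    Each app that caches the same data separately multiplies the load.
    """
    growth = (1 + annual_growth) ** years
    return requests_per_sec * bytes_per_request * growth * apps_sharing_data


def first_year_over_capacity(requests_per_sec: float, bytes_per_request: float,
                             annual_growth: float, link_bytes_per_sec: float,
                             apps_sharing_data: int = 1, horizon_years: int = 3) -> Optional[int]:
    """First year (0 = today) in which the load exceeds the link, or None."""
    for year in range(horizon_years + 1):
        load = projected_load(requests_per_sec, bytes_per_request,
                              annual_growth, year, apps_sharing_data)
        if load > link_bytes_per_sec:
            return year
    return None


TEN_GBE = 1_000_000_000  # optimistic effective bytes/sec, as used in the article

print(first_year_over_capacity(2000, 100_000, 0.5, TEN_GBE))                       # None
print(first_year_over_capacity(2000, 100_000, 0.5, TEN_GBE, apps_sharing_data=2))  # 3
```

In this made-up case one application fits on a 10GbE link for three years, but a second application caching the same data pushes the load past the link in year 3.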
If it’s a network problem, then you will need to collocate a subset of your computing with the data across a cluster of machines. Redis does not collocate computing with the data. It’s a cache, even when it’s distributed for scale, and it’s really hard to partition data across its cluster to try to minimize traffic.
If you still want to use something like Redis, then count the number of network trips you’ll need. With Redis as a “cache-aside” cache, the application does an initial load into the cache and then fetches the data back when it needs it (2 trips). Each refresh and subsequent use is another two trips. When you’re collocated, the only trips are the initial write and subsequent updates. If you have frequently updated data, you may be in a lot of trouble.
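To make the trip counting concrete, here is a deliberately simplified model. It assumes every load into a remote cache and every read from it is one network trip, and that collocated reads cost nothing on the network. The function names and the example workload are hypothetical.

```python
def cache_aside_trips(updates: int, reads: int) -> int:
    """Trips with a remote cache-aside cache.

    Every load into the cache (the initial one plus one refresh per update)
    and every read by the computation crosses the network.
    """
    loads = 1 + updates
    return loads + reads


def collocated_trips(updates: int) -> int:
    """Trips when computation runs next to the data.

    Only the initial write and later updates cross the network;
    reads are local.
    """
    return 1 + updates


updates, reads = 10, 1_000
print(cache_aside_trips(updates, reads))  # 1011
print(collocated_trips(updates))          # 11
```

Multiply each count by the size of the data being moved to turn trips into bytes on the wire.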
If you don’t know which applications will cause the problem over the next 2-3 years, go dig into any digital business or customer experience initiatives, especially those that are trying to deliver an improved experience or outcome in real-time. I guarantee you will find a network problem.
Let me explain it this way. Take your existing data about your customers that your employees or systems use to make decisions “in batch” and add it up. Now take the amount of data your company wants to add about the customer. Take that total data and divide it by the number of customers to get the data per customer. Now find out how many customers you handle during peak loads, and multiply that by the amount of data per customer.
Starting to see the problem? Every 1 GB you want to use takes a second just to go across a dedicated 10GbE network if you’re lucky, and adds at least a second of delay in a customer experience, which is annoying to most consumers.
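The per-customer exercise above can be written out as follows. The data volumes and customer counts are invented purely for illustration, and the 1 GB/sec link speed is the optimistic 10GbE figure used earlier.

```python
def data_per_customer(existing_bytes: float, planned_bytes: float, customers: int) -> float:
    """Average bytes held per customer once the planned data is added."""
    return (existing_bytes + planned_bytes) / customers


def peak_transfer_seconds(per_customer_bytes: float, peak_customers: int,
                          link_bytes_per_sec: float = 1_000_000_000) -> float:
    """Seconds needed to move the peak working set across one dedicated link."""
    return per_customer_bytes * peak_customers / link_bytes_per_sec


# Hypothetical company: 2 TB today, 3 TB more planned, 10 million customers.
per_customer = data_per_customer(2e12, 3e12, 10_000_000)
print(per_customer)                                 # 500000.0 bytes, i.e. 500 KB each
print(peak_transfer_seconds(per_customer, 20_000))  # 10.0 seconds for 20,000 customers at peak
```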
With Apache Ignite, the data only moves across the network twice each time it’s updated: once from the app making the update to the node in the Ignite cluster, then once to the underlying database. It takes even fewer trips if GridGain is collocated with the application or database, or if GridGain native persistence is used. That’s because Ignite is a read-through/write-behind cache that sits directly in the line of all data flow and holds the most up-to-date version of the data in memory. Ignite then collocates the processing with the data, so each time a calculation is requested, that calculation happens on the same machine as the data. In the case of large joins, Ignite has already allocated data based on data affinity to ensure the data is local.
For a more detailed analysis, you can read the GridGain® and Redis® feature comparison.
That’s how HomeAway made this bottleneck that was 100x the size of their network … GoAway so that you can have a Home Away from Home.