Splunk, AWS, and the Battle for Burst Balance

By: Karl Cepull | Senior Director, Operational Intelligence

Splunk and AWS: two of the most widely adopted tools of our time. Splunk gives you fantastic insight into your company’s data at an incredible pace. AWS offers an affordable alternative to on-premises or even other cloud environments. Used together, they make one of the best combinations for showing the value in your data. But there are many systems that need to work together to make all of this happen.

In AWS, the Elastic Block Store (EBS) offering gives you several storage options for your Splunk servers. There are multiple volume types that you can use, such as “io1”, “gp2”, and others. The “gp2” volume type is perhaps the most common one, particularly because it is usually the cheapest. However, when using this volume type, you need to be aware of Burst Balance.

Burst Balance can be a wonderful system. At its core, Burst Balance allows your volume’s disk IOPS to burst higher when needed, without you having to pay for guaranteed IOPS all of the time (as you do with the “io1” volume type). What are IOPS? The term stands for Input/Output Operations Per Second, and represents the number of reads and writes that can occur each second. Allowing the IOPS to burst can come in handy when there is a spike in traffic to your Splunk Heavy Forwarder or Indexer, for example. However, this system has a downside that can bring a busy volume to a near standstill!

Burst Balance works on a “credit” system. Every second, the volume earns 3 credits for every GB of configured size. For example, if the volume is 100 GB, you would earn 300 credits every second. These credits are then spent on reads and writes: 1 credit for each read or write. When the volume isn’t being used heavily, it stores up these credits (up to a cap of 5.4 million), and when the volume gets a spike of traffic, the stored credits are used to handle the spike.
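The credit arithmetic above can be sketched in a few lines of Python. This is a simplified model rather than AWS's actual implementation: it uses only the 3 credits/GB/second earn rate and the 5.4 million cap described here, and the function names are made up for illustration.

```python
# Simplified model of the gp2 credit bucket described above.
# Assumption: only the earn rate and the cap from the text are modeled.
CREDITS_PER_GB_PER_SECOND = 3
MAX_CREDITS = 5_400_000


def credits_earned_per_second(size_gb):
    """Credits a volume of the given size earns each second."""
    return CREDITS_PER_GB_PER_SECOND * size_gb


def step_balance(balance, size_gb, io_ops):
    """Advance the credit balance by one second of activity.

    io_ops is the number of reads and writes performed in that second;
    each one costs a single credit.
    """
    balance += credits_earned_per_second(size_gb) - io_ops
    # The balance can never go below zero or above the cap.
    return max(0, min(MAX_CREDITS, balance))


# A 100 GB volume starts with a full bucket and sees 1,000 I/O
# operations per second for one minute.
balance = MAX_CREDITS
for _ in range(60):
    balance = step_balance(balance, size_gb=100, io_ops=1000)

print(credits_earned_per_second(100))  # 300
print(balance)                         # 5358000
```

In this run the volume earns 300 credits and spends 1,000 each second, so one minute of heavy traffic costs a net 42,000 credits.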

However, if your volume is constantly busy, or sees frequent spikes, you may not earn credits quickly enough to keep up with the number of reads and writes. Using the example above, if you averaged more than 300 reads and writes per second, you would spend credits faster than you earn them. What happens when you run out of credits? The volume is throttled to its baseline rate, the same 3 IOPS per GB at which it earns credits, and anything beyond that has to wait. On a busy Splunk indexer, that can look as though the volume has stopped, and all you can do is wait for the load to drop. That can be a very bad thing, so it is something you need to avoid!
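To get a feel for how quickly a busy volume can run dry, here is a rough back-of-the-envelope calculation in Python. It assumes a constant average load and the earn rate and cap given above; the function name is hypothetical.

```python
def seconds_until_empty(size_gb, avg_iops, balance=5_400_000):
    """Estimate how long a credit balance lasts under a steady load.

    Returns None when the volume earns credits at least as fast as it
    spends them, in which case the balance never runs out.
    """
    earn_rate = 3 * size_gb  # credits earned per second
    if avg_iops <= earn_rate:
        return None
    # Net credits lost each second is the load minus the earn rate.
    return balance / (avg_iops - earn_rate)


# A 100 GB volume with a full bucket under a sustained 3,000 IOPS.
print(seconds_until_empty(100, 3000))  # 2000.0
# The same volume at exactly its earn rate never drains.
print(seconds_until_empty(100, 300))   # None
```

Under these assumptions, a full bucket on a 100 GB volume lasts about 33 minutes at a sustained 3,000 reads and writes per second.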

The good news is that AWS has tools that you can use to monitor your Burst Balance and alert you if it gets low. You can use CloudWatch to monitor the Burst Balance percentage and set up an alert on it. One way to view the Burst Balance percentage is to click on the volume in the AWS console, then go to the Monitoring tab. One of the metrics is the Burst Balance percentage, and you can click on it to see a larger view.

In the example graph, the Burst Balance has been at 100% for most of the last 24 hours, with the exception of around 9pm on 3/19, where it dropped briefly to about 95% before returning to 100%. You can also set up an alarm to alert you if the Burst Balance percentage drops below a certain threshold.
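Such an alarm can also be created from code. The sketch below uses boto3's CloudWatch `put_metric_alarm` call against the `BurstBalance` metric in the `AWS/EBS` namespace. The alarm name, the 20% threshold, the volume ID, and the SNS topic ARN are all placeholders chosen for illustration, and the call itself needs boto3 installed and AWS credentials configured.

```python
def burst_balance_alarm_params(volume_id, threshold_percent, sns_topic_arn):
    """Build the arguments for a CloudWatch alarm on a volume's Burst Balance."""
    return {
        "AlarmName": f"ebs-burst-balance-low-{volume_id}",  # hypothetical name
        "Namespace": "AWS/EBS",
        "MetricName": "BurstBalance",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "Statistic": "Average",
        "Period": 300,             # evaluate 5-minute averages
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch."""
    import boto3  # imported here so the helper above works without boto3
    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder volume ID and SNS topic; substitute your own values.
params = burst_balance_alarm_params(
    volume_id="vol-0123456789abcdef0",
    threshold_percent=20,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
)
# create_alarm(params)  # uncomment to create the alarm in your account
```

The threshold is a judgment call: set it high enough that you have time to react before the balance reaches zero.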

So, what can you do if the Burst Balance is constantly dipping dangerously low (or running out!)? There are three main solutions:

  1. You can switch to another volume type that doesn’t use the Burst Balance mechanism, such as the “io1” volume type. That volume type has guaranteed, consistent IOPS, so you don’t need to worry about “running out”. However, it is around twice the cost of the “gp2” volume type, so your storage costs could double.
  2. Since the rate at which you earn Burst Balance credits is based on the size of the volume (3 credits/GB/second), you will earn credits faster if you increase the size of the volume. For example, if you increase the size of the volume by 20%, you will earn credits 20% faster. If you are coming up short, but only by a little, this may be the easiest and most cost-effective option, even if you don’t actually need the additional storage space.
  3. You can modify your volume usage patterns to either reduce the number of reads and writes, or perhaps reduce the spikes and spread out the traffic more evenly throughout the day. That way, you have a better chance that you will have enough credits when needed. This may not be an easy thing to do, however.
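For the second option, the size you need follows directly from the earn rate. The Python sketch below is a rough estimate, assuming a steady average load and only the 3 credits/GB/second rule described above; the function name is made up for illustration.

```python
import math


def min_size_gb_for_iops(avg_iops):
    """Smallest volume size (in GB) whose earn rate covers a steady load."""
    return math.ceil(avg_iops / 3)


# A volume averaging 350 reads and writes per second needs at least
# 117 GB to earn credits as fast as it spends them.
print(min_size_gb_for_iops(350))  # 117
# At 300 per second, the 100 GB volume from the earlier example breaks even.
print(min_size_gb_for_iops(300))  # 100
```

Real traffic is rarely steady, so treat the result as a lower bound and leave some headroom for spikes.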

In summary, AWS’s Burst Balance mechanism is a very creative and useful way to give you performance when you need it, without having to pay for it when you don’t. However, if you are not aware of how it works and how it could impact your environment, it can suddenly become a crippling issue. It pays to understand how this works, how to monitor and alert on it, and options to avoid the problem. This will help to ensure your Splunk environment stays running even in peak periods.
