Implementing an EBS snapshot schedule seems like an easy, no-brainer task. In general, the process goes like this:
Pick some AWS tags for the EBS data you want to protect. At a minimum, the tags will probably describe retention and backup frequency
Find a lifecycle policy that meets your criteria, or define a new one
Add your EBS volume (or entire EC2 instance) tags to the snapshot lifecycle policy
Go to lunch…
Not particularly challenging from an implementation perspective.
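As a rough illustration of the steps above, here is a minimal sketch using the Data Lifecycle Manager client in boto3. The tag key and value, schedule name, start time, and role ARN are all hypothetical placeholders, and the retention count of 4 is just an example; treat this as a starting point under those assumptions rather than a recommended policy.

```python
def build_policy_details(tag_key, tag_value, retain_count=4,
                         interval_hours=24, start_time="03:00"):
    """Build the PolicyDetails structure for a tag-targeted snapshot policy.

    All argument values are illustrative assumptions, not recommendations.
    """
    return {
        "ResourceTypes": ["VOLUME"],  # use ["INSTANCE"] to target whole EC2 instances
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [
            {
                "Name": "example-schedule",  # hypothetical schedule name
                "CopyTags": True,
                "CreateRule": {
                    "Interval": interval_hours,
                    "IntervalUnit": "HOURS",
                    "Times": [start_time],
                },
                "RetainRule": {"Count": retain_count},
            }
        ],
    }


def create_policy(execution_role_arn, policy_details):
    """Create the lifecycle policy; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    dlm = boto3.client("dlm")
    response = dlm.create_lifecycle_policy(
        ExecutionRoleArn=execution_role_arn,  # hypothetical role, supplied by the caller
        Description="Example tag-based EBS snapshot policy",
        State="ENABLED",
        PolicyDetails=policy_details,
    )
    return response["PolicyId"]
```

Any volume carrying the matching tag would then be picked up by the schedule.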
Also, it isn’t particularly surprising to most people that the first snapshot is going to transfer a full backup of your data, and that you’ll be charged for it.
But there are hidden costs lurking here. These costs don’t show up immediately, and they can be especially damaging if you are trying to forecast your AWS EBS costs over any time horizon that the finance guys with their Gregorian calendar are happy with… like months, quarters, and years.
Consider this Basic Test
You can perform this test in your own environment and get immediate results if you wish.
Provision an empty EBS volume and put a filesystem of your choosing on it; which one doesn’t matter. As an example, let us make this volume 10 GB in size.
Fill up your EBS volume with some junk data. We don’t really care how… just make sure it is 100% full. You can ‘dd’ the blocks, copy files… whatever you like.
Afterwards, go ahead and delete that junk data… all of it… you should have nothing there, 0% full.
Take a snapshot of your empty EBS volume with no data on it. If you recover this snapshot, you will see an empty filesystem with no restorable data
Question: How much are you going to pay for that?
Answer: Full price for 10 GB of storage of EBS snapshot data
But wait… there is no recoverable data at all in this snapshot… how can that be?
Well, the key to understanding this is the method that AWS uses to create snapshots. AWS has a mechanism to track changed blocks, and it has no knowledge of the underlying filesystem that you are using. AWS only knows that a block has changed, and whenever that happens the block is added to the list of blocks that need to be protected when you take a snapshot. For AWS, taking a snapshot quite simply means copying these changed blocks to a special EBS snapshot bucket that represents your volume… which you pay for.
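To make the mismatch concrete, here is a toy model of the basic test. It is a deliberate simplification and not how AWS implements change tracking: the class name, the block granularity, and the idea of tracking written blocks in a set are all assumptions made for illustration.

```python
class ToyVolume:
    """Toy block device: the filesystem view and the block-layer view diverge."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.files = {}        # filesystem view: file name -> block indices
        self.written = set()   # block-layer view: every block ever written

    def write_file(self, name, blocks):
        blocks = list(blocks)
        self.files[name] = blocks
        self.written.update(blocks)

    def delete_file(self, name):
        # Deleting only removes the filesystem's reference to the blocks.
        # In this model the block layer is never told they are free.
        del self.files[name]

    def filesystem_used_blocks(self):
        return sum(len(blocks) for blocks in self.files.values())

    def snapshot_blocks(self):
        # A snapshot captures every block the block layer has seen written.
        return len(self.written)


vol = ToyVolume(num_blocks=10)
vol.write_file("junk", range(10))   # fill the volume to 100%
vol.delete_file("junk")             # filesystem is now 0% full
print(vol.filesystem_used_blocks(), vol.snapshot_blocks())
```

The filesystem reports nothing in use, while the snapshot still covers all ten blocks.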
Example of Active Database Churning on a Flat File
Below, we simulate a different real-world scenario where an active database is churning on a flat file in a spacious filesystem.
The specifics are:
3 TB EBS volume on a Linux host, partitioned with an ext4 filesystem
Daily block change of 750 GB, generated by continuously overwriting a single file with ‘dd’, i.e. the filesystem is always 25% full
The provisioned 3 TB EBS volume costs 3000 GB * $0.10 / GB-month / 30 days = $10 a day, shown as the constant grey line in the chart. So, let us start with what one might assume the snapshot costs are going to be, i.e. the dashed simple-assumption line in the chart: you keep 4 snapshots, and each snapshot is 25% of the total volume size.
Assuming your change rate, expressed as a fraction of the volume size, is α and the number of snapshots you keep is β, then a simple assumption is that you can multiply the EBS snapshot cost for the entire volume by α * β to come up with a total cost for the snapshot lifecycle policy. In this case, the simple assumption would estimate the cost as,
(4 * 0.25) * 3000 GB * $0.05 / GB-month / 30 days = $5 / day.
But when you run the test, you find that you actually end up paying,
((1 - 0.25) + 4 * 0.25) * 3000 GB * $0.05 / GB-month / 30 days = $8.75 / day.
In general, the correct initial term in parentheses at steady state is ((1 - α) + β * α), which accounts for all of the dead space on the volume that ends up being tracked even though it is not part of the active filesystem. Note that in this case, not properly accounting for the dead space results in a cost that is a whopping 75% more than the simple assumption!
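The two estimates can be checked with a few lines of Python. This is only a sketch of the arithmetic in this example; the function and parameter names are made up here, and the $0.05 / GB-month price and 30-day month are the same assumptions used in the figures.

```python
def simple_fraction(alpha, beta):
    """Simple assumption: each of the beta snapshots costs alpha of the volume."""
    return alpha * beta


def steady_state_fraction(alpha, beta):
    """Steady state: dead space (1 - alpha) plus beta snapshots of size alpha."""
    return (1 - alpha) + beta * alpha


def daily_cost(fraction, volume_gb, price_per_gb_month, days_per_month=30):
    """Daily snapshot cost for a given fraction of the volume being billed."""
    return fraction * volume_gb * price_per_gb_month / days_per_month


alpha, beta = 0.25, 4
simple = daily_cost(simple_fraction(alpha, beta), 3000, 0.05)
actual = daily_cost(steady_state_fraction(alpha, beta), 3000, 0.05)
print(f"simple: ${simple:.2f}/day, steady state: ${actual:.2f}/day")
```

Plugging in your own volume size, change rate, and retention count gives a quick sanity check on a forecast before it goes to finance.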
Furthermore, imagine an unsuspecting engineer making the simple assumption on a 25% used filesystem like this one, but with a very low change rate α. It will take much longer to reach steady state, and the gap between the forecast and the actual EBS snapshot bill can skyrocket by the time that steady state is reached!
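A rough way to see the low-change-rate effect is sketched below. It leans on a worst-case assumption that is mine rather than anything AWS documents: every day's writes land on blocks that have never been written before, so dead space grows by the full daily change until the whole volume has been touched. The function names are hypothetical.

```python
def days_to_steady_state(volume_gb, used_gb, daily_change_gb):
    """Days until every block on the volume has been written at least once.

    Worst-case assumption: each day's writes land entirely on blocks
    that were never written before.
    """
    untouched_gb = volume_gb - used_gb
    if untouched_gb <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-untouched_gb // daily_change_gb)


def underestimate_factor(alpha, beta):
    """Steady-state cost divided by the simple alpha * beta estimate."""
    return ((1 - alpha) + beta * alpha) / (alpha * beta)


# 3000 GB volume, 25% used, 1% daily change (30 GB), 4 snapshots kept
print(days_to_steady_state(3000, 750, 30))
print(underestimate_factor(0.01, 4))
```

Under that assumption, a 1% daily change rate takes about 75 days to expose the full cost, at which point the bill is roughly 26 times the simple estimate. Real filesystems reuse freed blocks to varying degrees, so the actual ramp may be slower.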
How to Remediate the Cost
So, as your systems get used, you end up paying more and more for this graveyard of filesystem dead space. In the secondary storage world, they use things like compression and deduplication to minimize the impact of this. However, that isn’t something available to you in the one-low-price-fits-all world of an AWS solution. As the filesystem churns, leaving garbage in old blocks and reallocating data elsewhere, the long-term impact is that you are going to pay for a fully provisioned volume’s worth of snapshot data in S3, whether or not that data is anything useful from the snapshot perspective you forecasted for.
Fortunately, the driving force for minimizing this effect aligns fully with reducing the provisioned cost of the volume in the first place. Here are some strategies to consider to minimize the cost implications of this effect:
When you provision a volume, don’t do it blindly: make sure that there isn’t a lot of wasted space on the filesystem, but do give yourself the requisite space to grow
Don’t partition EBS storage to include multiple filesystems unless it is absolutely required
Configure an alerting mechanism for filesystems on EBS storage so that you are informed when a filesystem is short on disk space, giving you time to grow it manually or, preferably, programmatically
Recognize that other data protection strategies exist that can be exploited to reduce cost if appropriate, such as file level backup
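For the alerting strategy in the list above, a minimal sketch of the check itself might look like this, using only the Python standard library. The 80% threshold and the function names are assumptions; how the alert is delivered (cron, a monitoring agent, a CloudWatch custom metric) is left open.

```python
import shutil


def filesystem_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report no capacity
    return usage.used / usage.total


def needs_attention(path, threshold=0.80):
    """True when the filesystem is at or above the threshold (assumed 80%)."""
    return filesystem_usage_fraction(path) >= threshold


if needs_attention("."):
    print("Filesystem is running short on space; consider growing the volume.")
```

Provisioning tightly and alerting early keeps the untouched part of the volume, and therefore the eventual dead space, as small as possible.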