Azure Blob Storage Lifecycle Management

Synopsis

I’ve been working through some cost optimization in our cloud environment and am currently focused on object storage tiers and lifecycle management, so I wanted to post my collected thoughts and observations. You can find more in-depth analysis in the sources listed at the bottom of the post.

Storage Tiers

Azure Blob Storage has four access tiers: Hot, Cool, Cold and Archive. Hot, Cool and Cold are all online tiers, meaning the data is available instantly. Archive is an offline tier, and retrieving a blob from it can take several hours.

Lifecycle Management and Access Patterns

Performance characteristics across Hot, Cool and Cold are very similar, so the right tier for your objects is largely determined by access patterns: how frequently the data will be accessed, how quickly it needs to be available, and how long you plan to keep it. Cool, Cold and Archive all have minimum retention periods: 30 days for Cool, 90 days for Cold and 180 days for Archive. Deleting objects, or moving them to another tier, before the end of that retention period carries a prorated early deletion penalty. Cost has two components: how much it costs to store objects, and how much read/write operations cost. Hot has the highest storage cost but the cheapest read/write operations, whereas Archive has the cheapest storage but the most expensive read/write operations.
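To make the early deletion penalty concrete, here is a minimal sketch of the proration described above. The function name and the per-GB rate are my own illustrative assumptions, not Azure's published pricing; the minimum retention periods (30/90/180 days) are the ones discussed above.

```python
def early_deletion_penalty(gb, per_gb_month, days_stored, min_days):
    """Prorated early deletion charge: you pay storage for the days
    remaining in the tier's minimum retention period."""
    remaining_days = max(0, min_days - days_stored)
    return gb * per_gb_month * remaining_days / 30

# Example: 10 TiB moved out of Cold after 30 of its 90 minimum days,
# at an assumed (illustrative) Cold rate of $0.0045/GB/month.
penalty = early_deletion_penalty(10 * 1024, 0.0045, days_stored=30, min_days=90)
print(f"${penalty:,.2f}")
```

Once an object has aged past the tier's minimum retention period, the penalty drops to zero and it can be deleted or re-tiered freely.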

Migrating Across Tiers and the Default Tier

Migrating objects between tiers costs money via read/write operations, and that cost can be significant. For example, I’ve seen storage accounts with a default tier of Cool where migrating the objects to Cold would save $300k/year, but the move itself would cost approximately $200k, meaning you’d need roughly three quarters just to break even. For this reason it’s extremely important that the default storage tier of a storage account is appropriate for your use case. The default tier can only be set at the storage account level, not the container level, so if different containers have different use cases, it can be worthwhile to keep them in separate storage accounts with different default tiers.

At the biotech company I currently work for, we store very large amounts of genetic sequencing data, which we have to retain for something like 7 years per FDA regulations. Once this data is written to object storage, it’s accessed by a pipeline that produces clinical results, and sometime in the not so distant future those results are statistically validated by a biostats team. Because this data is infrequently accessed but needs to remain available, Cool is our default tier, with a lifecycle management policy that might move objects to Archive at 180 days or 1 year.
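A lifecycle policy like the one described above can be expressed as a JSON management policy on the storage account. This is a sketch of the standard policy shape; the rule name, the prefix filter, and the 180-day threshold are illustrative assumptions for the sequencing-data scenario, not our actual production policy.

```json
{
  "rules": [
    {
      "enabled": true,
      "name": "archive-old-sequencing-data",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToArchive": { "daysAfterModificationGreaterThan": 180 }
          }
        },
        "filters": {
          "blobTypes": [ "blockBlob" ],
          "prefixMatch": [ "sequencing-data/" ]
        }
      }
    }
  ]
}
```

Note that once blobs land in Archive under a rule like this, the 180-day Archive minimum retention period applies before they can be deleted or rehydrated without an early deletion penalty.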

Sources:
https://learn.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
https://learn.microsoft.com/en-us/azure/storage/blobs/access-tiers-best-practices