An abundance of options is not always a blessing. Sometimes it just creates confusion. Consider the options that present themselves if you want to configure a mission-critical SQL Server infrastructure for high availability (HA) and disaster recovery (DR) on Azure. You could run SQL Server on multiple virtual machines (VMs) configured as a Windows Failover Cluster Instance with the VMs spread across Fault Domains or across multiple Azure Availability Zones. You could use a storage solution such as Premium File Share, Azure Shared Disk or Storage Spaces Direct; or, you could use a third-party tool to create a SANless cluster solution. And what about the Always On Availability Group option built into SQL Server? What about Azure’s Site Recovery service?
While all of these options are designed to increase availability, not all these options will provide you with true high availability — which is defined as the ability to access your application and data 99.99% of the time. Some — but not all — of these options can provide disaster recovery support, which involves building out an infrastructure in a geographically distinct region that can be brought online quickly if the entirety of your primary infrastructure goes offline.
HA via FCI
A traditional approach to configuring SQL Server for HA involves creating a Windows Failover Cluster Instance (FCI). In an on-premises deployment, this would involve creating a cluster of at least two servers running an instance of SQL Server — ideally in two separate data centers to eliminate any single points of failure — plus a third server or perhaps a file share witness to ensure your FCI cluster quorum.
In Azure, you’ll do the same thing, just using VMs instead of standalone servers, with each VM in your cluster running in a distinct Azure Availability Zone (think of an AZ as a distinct cloud data center). By locating each node in the FCI in a different AZ, the entire AZ housing the VM actively running SQL Server could go offline — but the intelligence in the FCI would then bring the instance of SQL Server in the other AZ online to take over for the VM that is no longer available. From the point of view of a VM running SQL Server, this approach eliminates the single point of failure issue that would loom over any configuration in which all VMs are running in the same AZ.
Conceptually, that’s sound, but running an FCI in the cloud — any cloud — raises issues relating to storage that are not present in an on-premises deployment. In an on-premises FCI, you might choose to use a shared storage system of some kind — a SAN or a NAS system — that all nodes in the FCI could access. If the active node goes down, a backup node could be activated and it would access the SAN or NAS to carry on the business of reading and writing to the SQL Server database on the shared storage system. But until relatively recently, it was not possible to share storage in that same way in Azure (or in AWS or Google Cloud Services, for that matter).
Premium File Share
Azure’s Premium File Share now provides an option to configure shared, zone-redundant storage. Effectively this creates a storage system that exists in two separate AZs. If one AZ goes offline, taking down the active VM and its local instance of Premium File Share, the FCI will fail over to the second AZ, where a standby VM can step in and begin interacting with its own local instance of Premium File Share, which contains a mirror image of all the data that had been in storage in the AZ that has gone offline.
In theory, configuring an FCI using a Zone-Redundant Premium File Share can provide you with the 99.99% SQL Server availability that you’re seeking in a HA configuration. However, relying on a Zone-Redundant Premium File Share incurs a feature/function cost that may be unacceptable to some organizations: Certain storage features cannot be used in a configuration using a Zone-Redundant Premium File share, including disk bursting, read-only host caching, and Azure Disk Encryption. If your organization intends to use any of these features in its SQL Server deployment, you’ll want to choose another path than one involving a Zone-Redundant Premium File Share.
Azure Shared Disk
An alternative to Azure’s Zone-Redundant Premium File Share is Azure Shared Disks, which is an Azure managed disks feature that enables you to attach a managed disk to multiple VMs simultaneously. Used within an FCI, an Azure Shared Disk would appear to offer the same kind of shared storage resource that one might have used in an on-premises SAN. Yet Azure Shared Disk has many of the same limitations as Azure Premium File Share. Disk busting and read-only host caching is unavailable if maxShares is set to >1 (which it naturally would be in an FCI). Azure Site Recovery and Azure Disk Encryption are also unavailable. Azure Shared Disk is only available for ultra disks, premium SSDs, and standard SSDs, and there are additional limitations depending on which one you use.
Finally, though you can configure an FCI so that the VMs are located in separate AZs, you can only attach an Azure Shared Disk solution to VMs in a single region. This makes it possible, if the aforementioned limitations work for you, to configure a true HA solution in Azure using Azure Shared Disk, but not one that you can extend to support DR (which would involve sharing or replicating data to a separate region).
Storage Spaces Direct
What about an FCI that uses Azure’s Storage Spaces Direct option? Storage Spaces Direct takes a different approach to shared storage solution, one more like that found in a traditional SAN or NAS configuration. The nodes in an FCI can each access the storage in a Storage Spaces Direct configuration, so if one node goes dark, another can continue to access the data in storage. You can also take advantage of disk bursting, read-only host caching, and Azure Disk Encryption — eliminating the limitations inherent in an FCI that uses Zone-redundant Premium Share.
However — and you knew there was a “however” coming — the essential downside of an FCI that uses Storage Spaces Direct is that you cannot configure storage across multiple AZs. Without the ability to distribute storage across multiple AZs, you cannot configure a SQL Server solution that will ensure true HA. That single point of failure will always exist, even if there are multiple VMs within the AZ that could take over if the primary VM were to go dark for any reason. If the whole AZ were to go dark for some reason — and that does happen — then your entire SQL Server infrastructure would go offline. Given that the acceptable level of downtime in a true HA configuration is less than five minutes per month, and given the complexities of bringing an entire data center back online, you could be looking at far greater periods of downtime if you peg your HA goals on Storage Spaces Direct.
Another FCI configuration for HA involves the use of third-party SANless Clustering software. In a SANless Cluster, each VM in the FCI is configured with its own local storage system. Nothing is shared in the manner of Azure Premium File Share or Storage Spaces Direct. Instead, the SANless Clustering software uses synchronous, block-level replication to ensure that any data written to storage attached to the primary VM is also written to storage attached to one or more secondary VMs in separate AZs (or asynchronously to a VM in a remote region as a DR option). Synchronous data replication ensures that the data on the secondary FCI nodes in the same region always mirrors the data on the active/primary node. If the active VM goes dark — or if the entire AZ where the node resides goes dark — the FCI will fail over to a secondary cluster node where SQL Server can begin working with the up-to-date data in its local storage system and continue to provide production support for your organization and/or its customers.
If you’re using a SANless Clustering solution for DR as well as HA, failover to the DR infrastructure is manual and there’s a risk that the data in storage on the DR infrastructure might not always be perfectly in sync with the data on the active VM (due to lag time in asynchronous replication), but rarely will it be more than a few seconds out of sync. That difference is more acceptable in a DR solution: Better to be able to bring a DR infrastructure online quickly, even with a slight loss of data, than to have no data and no infrastructure to bring online if the data centers in an entire region are brought down.
And, yes, that does happen as well.
The downside to SANless Clustering? It involves using third-party software rather than a Microsoft-native solution, so you’ll have to purchase a license for each VM in your FCI to take advantage of SANless Clustering. Though the SANless Clustering software will work with any edition of SQL Server from 2008 forward and though it integrates very cleanly into Windows Server FCI management software, it’s still another piece of software that you’ll need to learn how to configure and manage.
Non-FCI Approaches to HA
There are two approaches to HA on the list of options at the top that do not involve creating an FCI. In fact, one is a HA solution that is built into SQL Server itself: SQL Server Availability Groups (AGs). An AG solution operates somewhat like an FCI with SANless Clustering: From within the AG management console, you’ll identify two (or more) VMs, each with their own local storage and their own local copy of SQL Server (which in an AG is always active). The AG software then replicates any writes to the primary SQL Server to the SQL Server database in storage on the secondary VMs — which can be located in separate AZs to eliminate single points of failure and ensure HA.
Like a SANless Cluster, the AG software can replicate data from the primary to the secondary SQL Server databases in a synchronous manner, so even in high performance production settings, the AG software can replicate data very quickly. Should the primary VM running SQL Server go down, one of the live secondaries can begin acting as the new primary instance of SQL Server. The SQL Server database is identical to the one that had been active on the primary VM, so that secondary instance can easily step in and support the organization’s SQL Server needs.
An AG can also be used to replicate SQL Server databases asynchronously to infrastructure in a remote region, which makes it viable as a DR solution as well.
But as with all the solutions we’ve examined so far, there are some gotchas. As a feature of SQL Server, AG only replicates data associated with SQL Server — so no other data residing in storage on the primary VM is replicated to the secondary VMs. Moreover, AG only replicates user-named SQL Server databases. That’s an important consideration because system-created databases, such as those monitoring users, jobs, passwords and the like, are not replicated to the secondary VMs. You’ll need to find another replication solution to ensure that your secondary VMs have copies of those databases.
The other gotcha with an AG is that the true power of an AG depends on the edition of SQL Server you’re using. SQL Server Standard Edition can support replication of only one SQL Server database to one secondary VM. If you need to replicate more than a single database or you want to replicate one or more databases to more than a single secondary VM, you’ll need to use SQL Server Enterprise Edition, even if you have no other reason to use SQL Server Enterprise Edition. The gotcha there is price: SQL Server Enterprise Edition is far more expensive than SQL Server Standard edition, and you’ll need a license for every vCPU core — not just of every VM — in your configuration. The fundamental HA features are all present if you take the AG path, but it may cost a lot of money to achieve the assurance you seek.
And what about Azure Site Recovery? Azure Site Recovery sounds like an ideal solution: It replicates your VMs and data from one AZ to another in a separate region. If your primary VM goes dark, you can turn to the VM in the secondary region and use it to meet your SQL Server needs. The problem here, though, is response time. Azure Site Recovery is actually a DR solution, not a HA solution. The SLA associated with the service promises a recovery time objective (RTO) — that is, the amount of time it takes to bring your remote infrastructure online after an unplanned outage — of two hours. That may be perfectly acceptable in the face of a catastrophe that decimates the data centers in an entire region, but that RTO is inconsistent with the goals of a HA configuration. In a HA configuration designed for true 99.99% application availability, the SLAs guarantee less than five minutes of unplanned downtime per month and an RTO measured in seconds, not hours.
The Bottom Line
When it comes to configuring SQL Server for HA on Azure, you have options, but each option is qualified in one way or another. Using an FCI with Storage Spaces Direct won’t provide true HA because Storage Spaces Direct must be housed in a single AZ. An FCI with Zone-Redundant Shared Storage or Azure Shared Disks might work if you don’t need the performance and security features that these options don’t support. An FCI configured as a SANless Cluster would provide the HA you seek (and DR support as well), but you must be willing to license and manage a non-Microsoft solution. Similarly, an AG approach to HA would ensure the HA of your user-named SQL Server databases, but only your user-named SQL Server databases. And, if you need to replicate more than one database (or to replicate one or more databases to more than one secondary VM (to a remote DR infrastructure for example), you’ll need to use the more expensive SQL Server Enterprise Edition — which may be a costly pill to swallow if you have no other reason to use the Enterprise edition.
Feature image via Pixabay.