High Availability
Notes on OCI Architect Associate & Professional

To design a high availability architecture, consider three key elements: redundancy, monitoring, and failover.

  • Redundancy means that multiple components can perform the same task. The problem of a single point of failure is eliminated because redundant components can take over a task performed by a component that has failed.
  • Monitoring means checking whether or not a component is working properly.
  • Failover is the process by which a secondary component becomes primary when the primary component fails.
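
The three elements above can be sketched together in a few lines of Python. This is a minimal illustration, not an OCI API: the `Component` class, its `healthy` flag, and the function names are all hypothetical, and a real health check would probe the component rather than read a field.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    name: str
    healthy: bool = True  # stand-in for the result of a real health probe


def is_healthy(component: Component) -> bool:
    """Monitoring: report whether a component is working properly."""
    return component.healthy


def select_primary(components: List[Component]) -> Optional[Component]:
    """Failover: return the first healthy component in priority order."""
    for component in components:
        if is_healthy(component):
            return component
    return None  # every redundant component has failed


# Redundancy: two components can perform the same task.
pool = [Component("node-a"), Component("node-b")]
print(select_primary(pool).name)  # node-a

pool[0].healthy = False  # simulate failure of the primary
print(select_primary(pool).name)  # node-b
```

Because `node-b` can do the same job, the failure of `node-a` is no longer a single point of failure; the monitoring check is what triggers the change of primary.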

Measuring HA

Plan High Availability for Compute Instances

To plan for high availability of your compute instances, the key design strategies you should consider are:

  • Eliminating single points of failure by properly leveraging fault domains and availability domains.
  • Using monitoring, instance pools, and load balancers.
  • Ensuring that your design protects both the data availability and integrity of your compute instances.
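
As a rough illustration of the first strategy, the sketch below spreads a tier of instances over fault domains in round-robin order, so that losing one domain takes out only part of the tier. The instance names and the `spread_instances` helper are hypothetical; in practice an instance pool's placement configuration would do this for you. The same function works for a list of availability domains.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, List


def spread_instances(instances: List[str], domains: List[str]) -> Dict[str, str]:
    """Assign each instance to a domain in round-robin order."""
    if not domains:
        raise ValueError("at least one domain is required")
    return dict(zip(instances, cycle(domains)))


# Each availability domain has three fault domains; instance names are made up.
fault_domains = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]
web_tier = [f"web-{i}" for i in range(1, 6)]

placement = spread_instances(web_tier, fault_domains)
per_domain = Counter(placement.values())
print(per_domain["FAULT-DOMAIN-1"])  # 2
print(per_domain["FAULT-DOMAIN-3"])  # 1
```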

Distribute Instances Across Fault Domains

Distribute Instances Across Availability Domains

Ensure High Availability and Integrity of Your Data

  • Block Volume Summary
  • Volume Durability
  • Volume Replication

Plan High Availability for Network Resources

To plan for high availability of your network resources, the key design strategies you should consider are:

  • Determine the right size of your network's subnets.
  • Plan high availability configurations for these key components: Load Balancers, IPSec VPN Connections, and FastConnect Circuits.

Determine the Right Size of Subnets

Plan High Availability for Load Balancers

Understand FastConnect and VPN High Availability Design

  • Plan for regularly scheduled maintenance by Oracle, your provider, or your own organization.
  • Avoid single points of failure, even if you are planning to use multiple interfaces for availability. High availability connections require redundant hardware, even when connecting from the same physical location.
  • Consider a dual provider approach to ensure network diversity when selecting FastConnect providers.
  • Provision sufficient network capacity to ensure that the failure of one network connection doesn’t overwhelm and degrade redundant connections.
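
The capacity point in the last bullet comes down to simple arithmetic: after the largest link fails, the remaining links must still carry peak demand. A minimal sketch, with made-up link sizes and a hypothetical helper name:

```python
from typing import List


def survives_single_failure(link_capacities_gbps: List[float],
                            peak_demand_gbps: float) -> bool:
    """Return True if peak demand still fits after losing the largest link."""
    if len(link_capacities_gbps) < 2:
        return False  # a single link is itself a single point of failure
    remaining = sum(link_capacities_gbps) - max(link_capacities_gbps)
    return remaining >= peak_demand_gbps


print(survives_single_failure([10, 10], peak_demand_gbps=8))   # True
print(survives_single_failure([10, 10], peak_demand_gbps=14))  # False
```

In the second call the two links together handle 14 Gbps comfortably, but a single surviving 10 Gbps link would be overwhelmed, which is exactly the degraded-redundancy case to avoid.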

Plan High Availability for IPSec VPN Connections

For instance, the following diagram shows two networks in separate geographical areas that each connect to Oracle Cloud Infrastructure. Each area has a single on-premises router, so two IPSec VPN connections can be created. Note that each IPSec VPN connection has two static routes: one for the CIDR of the particular geographical area, and a broad 0.0.0.0/0 static route.

In one scenario, the CPE 1 router in the preceding diagram goes down. If Subnet 1 and Subnet 2 can communicate with each other, the VCN is still able to access the systems in Subnet 1 because of the 0.0.0.0/0 static route that goes to CPE 2. The following diagram illustrates this scenario:

In another scenario, you add a new geographical area with Subnet 3 and connect it to Subnet 2. You would add a route rule to your VCN’s route table for Subnet 3 so that the VCN can reach systems in Subnet 3 without creating a new VPN connection because of the 0.0.0.0/0 static route that goes to CPE 2. The following diagram illustrates this scenario:
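
The first scenario can be modelled as a longest-prefix match over the static routes whose CPE is still reachable. This is a simplified sketch and not OCI's routing implementation: the CIDRs are invented, `pick_cpe` is a hypothetical helper, and ties between equal-length prefixes are broken by list order purely for simplicity.

```python
import ipaddress
from typing import List, Optional, Set, Tuple

Route = Tuple[str, str]  # (destination CIDR, CPE name)


def pick_cpe(dest_ip: str, routes: List[Route], cpes_up: Set[str]) -> Optional[str]:
    """Longest-prefix match over routes whose CPE is still reachable."""
    address = ipaddress.ip_address(dest_ip)
    best_len, best_cpe = -1, None
    for cidr, cpe in routes:
        network = ipaddress.ip_network(cidr)
        if cpe in cpes_up and address in network and network.prefixlen > best_len:
            best_len, best_cpe = network.prefixlen, cpe
    return best_cpe


# Hypothetical CIDRs: Subnet 1 sits behind CPE 1, Subnet 2 behind CPE 2.
routes = [
    ("10.1.0.0/16", "CPE 1"), ("0.0.0.0/0", "CPE 1"),
    ("10.2.0.0/16", "CPE 2"), ("0.0.0.0/0", "CPE 2"),
]

host_in_subnet_1 = "10.1.5.20"
print(pick_cpe(host_in_subnet_1, routes, {"CPE 1", "CPE 2"}))  # CPE 1
print(pick_cpe(host_in_subnet_1, routes, {"CPE 2"}))           # CPE 2
```

With CPE 1 down, only the broad 0.0.0.0/0 route through CPE 2 still covers the host in Subnet 1, so traffic takes that path (and then relies on Subnet 2 being able to reach Subnet 1 on premises).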

Plan High Availability for FastConnect Circuits

To avoid a single point of failure for FastConnect, consider the following redundancy options:

  • Multiple FastConnect locations within each metro area
  • Multiple routers in each FastConnect location
  • Multiple physical circuits in each FastConnect location

Oracle handles the redundancy of the routers and physical circuits in the FastConnect locations. In your network design with FastConnect, we recommend considering the following redundancy configurations for your high availability requirements:

  • Availability domain redundancy: Connect to any FastConnect location and access services located in any availability domain within a region. This configuration provides availability domain resiliency via multiple POPs per region. Peering connections terminate on routers in the POP.
  • Data center location redundancy: Connect at two different FastConnect locations per region.
  • Router redundancy: Connect to two different routers per FastConnect location.
  • Circuit redundancy: Have multiple physical connections at any of the FastConnect locations. Each of these circuits can have multiple physical links in an aggregated interface/LAG, which adds another level of redundancy.
  • Partner/provider redundancy: Connect to the FastConnect locations by using single or multiple partners.
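
One way to review a design against this list is to record each circuit's location, router, and provider, then check which levels of redundancy the set actually achieves. The sketch below is illustrative only; the `Circuit` fields and all names in the example are hypothetical and do not correspond to an OCI API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Circuit:
    location: str  # FastConnect location (data center)
    router: str    # router the circuit terminates on
    provider: str  # partner or provider carrying the circuit


def redundancy_report(circuits: List[Circuit]) -> Dict[str, bool]:
    """Report which redundancy levels a set of circuits satisfies."""
    return {
        "circuit": len(circuits) >= 2,
        "router": len({(c.location, c.router) for c in circuits}) >= 2,
        "location": len({c.location for c in circuits}) >= 2,
        "provider": len({c.provider for c in circuits}) >= 2,
    }


design = [
    Circuit(location="dc-east", router="r1", provider="partner-a"),
    Circuit(location="dc-east", router="r2", provider="partner-a"),
]
print(redundancy_report(design))
# {'circuit': True, 'router': True, 'location': False, 'provider': False}
```

This example design survives a circuit or router failure but not the loss of the whole location or of the single partner.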

For the Oracle provider scenario, we recommend that you set up redundant circuits at two different FastConnect locations, through either the same provider or different providers. This configuration gives you redundancy at both the circuit level and the data center level. The following diagram illustrates a FastConnect connection with two virtual circuits and two different FastConnect locations:

Oracle’s FastConnect partners have redundant links to the Oracle network. As a customer of the partner, you should have redundant links to the partner’s network. These connections should be on different routers, both in your network and in the partner’s network. When you provision virtual circuits, provision them across your multiple provider links.

This diagram illustrates these redundant connections:

Some additional configuration strategies you should consider are to:

  • Avoid Impact During Planned Maintenance
  • Continuously Test Redundant Paths

Use Both IPSec VPN and FastConnect

When you set up both an IPSec VPN connection and FastConnect virtual circuits to the same DRG, remember that the IPSec VPN uses static routes but FastConnect uses BGP. Oracle Cloud Infrastructure advertises a route for each of your VCN’s subnets over the FastConnect virtual circuit BGP session, and overrides the default route selection behavior to prefer BGP routes over static routes if a static route overlaps with a route advertised by your on-premises network. The following diagram illustrates this configuration:
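
The preference described above can be sketched as a small route chooser. This is only a model of the stated rule, not OCI's actual selection algorithm: it assumes that "overlaps" means both routes cover the destination, and the route tuples, connection names, and `choose_route` helper are hypothetical.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[str, str, str]  # (destination CIDR, "bgp" or "static", connection)


def choose_route(dest_ip: str, routes: List[Route]) -> Optional[Route]:
    """Among routes covering dest_ip, prefer BGP; then the longest prefix."""
    address = ipaddress.ip_address(dest_ip)
    matching = [r for r in routes if address in ipaddress.ip_network(r[0])]
    if not matching:
        return None
    return max(
        matching,
        key=lambda r: (r[1] == "bgp", ipaddress.ip_network(r[0]).prefixlen),
    )


routes_to_on_prem = [
    ("10.0.0.0/16", "static", "IPSec VPN"),
    ("10.0.0.0/16", "bgp", "FastConnect"),
    ("192.168.0.0/24", "static", "IPSec VPN"),
]
print(choose_route("10.0.4.7", routes_to_on_prem)[2])     # FastConnect
print(choose_route("192.168.0.9", routes_to_on_prem)[2])  # IPSec VPN
```

Where only the static route covers a destination, the VPN is still used, which is what makes it a workable backup path for FastConnect.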

Plan High Availability for Storage

Understand Oracle Cloud Infrastructure Storage Services

  • Block Volume
  • Object Storage
  • File Storage

Understand Best Practices for the Storage Layer

To achieve high availability and durability for your architecture, you should follow these best practices when configuring your storage layer.

  • Use Object Storage to back up application data. Data is stored redundantly across multiple storage servers across multiple availability domains. Data integrity is actively monitored by using checksums, and corrupt data is detected and automatically repaired. Any loss in data redundancy is automatically detected and corrected, without any customer impact.
  • Use Block Volume policy-based backups to perform automatic, scheduled backups and retain them based on a backup policy. Consistently backing up your data allows you to adhere to your data compliance and regulatory requirements.
  • If you need an immediate, point-in-time, direct disk-to-disk copy of your block volume, use the Block Volume cloning feature. Volume cloning is different from snapshots because there is no copy-on-write or dependency on the source volume. No backup is involved. The clone operation is immediate, and the cloned volume becomes available for use right after the clone operation is initiated. You can attach and use the cloned volume as a regular volume as soon as its state changes to available.
  • If you need to safeguard data against accidental or malicious modifications by an untested or untrusted application, use a block volume with a read-only attachment. A read-only attachment marks a volume as read-only, so the data in the volume is not mutable. You can also use read-only attachments when you have multiple Compute instances that access the same volume for read-only purposes. For example, the instances might be running a web front end that serves static product catalog information to clients.
  • When your workload requires highly available shared storage with file semantics, and you need built-in encryption and snapshots for data protection, use File Storage. File Storage uses the industry-standard Network File System (NFS) file access protocol and can be accessed concurrently by thousands of Compute instances. File Storage can provide high-performance and resilient data protection for your applications. The File Storage service runs locally within one availability domain. Within an availability domain, File Storage uses synchronous replication and high availability failover to keep your data safe and available.
  • If your application needs high availability across multiple availability domains, use GlusterFS on top of the Block Volume service.
  • Plan and size your storage capacity by considering future growth needs.
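
To make the checksum idea from the first bullet concrete, here is a toy model of detecting a corrupt replica and repairing it from an intact one. It does not describe how Object Storage is implemented internally; the use of SHA-256, the replica names, and the `repair_replicas` helper are assumptions made for illustration.

```python
import hashlib
from typing import Dict


def checksum(data: bytes) -> str:
    """Content fingerprint used to detect corruption."""
    return hashlib.sha256(data).hexdigest()


def repair_replicas(replicas: Dict[str, bytes], expected: str) -> Dict[str, bytes]:
    """Replace any replica whose checksum is wrong with a verified copy."""
    good = next((d for d in replicas.values() if checksum(d) == expected), None)
    if good is None:
        raise RuntimeError("no intact replica left to repair from")
    return {
        name: (data if checksum(data) == expected else good)
        for name, data in replicas.items()
    }


original = b"application backup"
expected = checksum(original)
replicas = {"server-1": original, "server-2": b"applicatXon backup", "server-3": original}

repaired = repair_replicas(replicas, expected)
print(all(checksum(d) == expected for d in repaired.values()))  # True
```

Repair is only possible while at least one intact copy exists, which is why redundancy and integrity monitoring go together.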

Plan High Availability for Databases

To plan for high availability of your databases, the key design strategies you should consider are:

  • Use these key tools: Exadata Database systems, 2-Node RAC DB Systems, and Data Guard.
  • Configure your CPU and storage layer to scale automatically.

Use Exadata Database Systems

Use 2-Node RAC DB Systems

You can configure the Database service to automatically back up to Oracle Cloud Infrastructure Object Storage. The following diagram shows the deployment of a 2-node RAC DB System to support the high availability of a three-tier web application:

Use Data Guard

You can perform the following actions with a Data Guard configuration to support high availability:

  • Switchover: Reverses the primary and standby database roles. Each database continues to participate in the Data Guard association in its new role. A switchover ensures no data loss. You can use a switchover before you perform planned maintenance on the primary database.
  • Failover: Transitions the standby database into the primary role after the existing primary database fails or becomes unreachable. A failover might result in some data loss when you use Maximum Performance protection mode.
  • Reinstate: Reinstates a database into the standby role in a Data Guard association. You can use the reinstate command to return a failed database to service after correcting the cause of the failure.
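
The three actions amount to role transitions between the two databases in the association. The sketch below models only those transitions; it is not Data Guard syntax or an OCI API, and the `Database` class, the "disabled" role label, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    role: str  # "primary", "standby", or "disabled" (hypothetical label)
    available: bool = True


def switchover(primary: Database, standby: Database) -> None:
    """Planned role reversal; both databases stay in the association."""
    if not (primary.available and standby.available):
        raise RuntimeError("switchover requires both databases to be available")
    primary.role, standby.role = "standby", "primary"


def failover(primary: Database, standby: Database) -> None:
    """Promote the standby after the primary fails or becomes unreachable."""
    if primary.available:
        raise RuntimeError("primary is still available; use switchover instead")
    standby.role = "primary"
    primary.role = "disabled"  # out of service until it is reinstated


def reinstate(failed: Database) -> None:
    """Return a repaired database to the association as a standby."""
    failed.available = True
    failed.role = "standby"


db1, db2 = Database("db1", "primary"), Database("db2", "standby")
db1.available = False          # unplanned outage of the primary
failover(db1, db2)
reinstate(db1)                 # after the cause of the failure is fixed
print(db1.role, db2.role)      # standby primary
```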

Scale CPU and Storage Automatically