What is High Availability and does your project need it?
Insights from the specialists at hosting provider Aéza

High Availability is often misunderstood as "a system that never fails." In practice, such a system doesn't exist. Any infrastructure will eventually face failures: servers can go down, networks can fail, disks can break, or even entire data centers can become unavailable.
The real idea behind HA is different — the system must continue to work even when one of its components fails.
Therefore, High Availability is an architectural approach: a system is designed from the ground up to survive failures and recover automatically.
How High Availability is Implemented in Hosting
In standard hosting, high availability is not automatically included. The provider supplies the infrastructure — servers, network, and data centers. The client is typically the one who builds the fault-tolerant architecture.
For a service to keep running during failures, a project can utilize:
multiple application servers
load balancers (Nginx, HAProxy)
database replication
container orchestrators (e.g., k3s or Kubernetes)
In this model, the hosting provider gives you the infrastructure on which you can build a fault-tolerant system.
Simply put: the hosting provider supplies the servers, but High Availability is the architecture that the project builds itself.
High Availability (HA) is an architectural approach to system design aimed at ensuring the maximum possible availability of services and minimizing downtime during infrastructure component failures.
In the context of distributed systems, High Availability is achieved by eliminating single points of failure (SPOF) and implementing mechanisms for automatic service recovery during failures.
Key Principles of Building an HA Architecture
1. Redundancy
Critical system components are duplicated: application servers, databases, network devices, and storage. If one node fails, a backup automatically takes over its functions.
Example: Multiple instances of a backend service run behind a load balancer (e.g., Nginx, HAProxy, or a cloud Load Balancer). If one instance fails, traffic is automatically distributed to the remaining ones.
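The failover behavior described above can be sketched in a few lines. This is a minimal illustration, not how Nginx or HAProxy are actually implemented; the backend names and the health-status dict are invented for the example.

```python
import itertools

# Hypothetical backend pool; "True" means the instance is healthy.
backends = {"app-1": True, "app-2": True, "app-3": True}

def pick_backend(rotation):
    """Return the next healthy backend, skipping failed instances."""
    for _ in range(len(backends)):
        name = next(rotation)
        if backends[name]:
            return name
    raise RuntimeError("no healthy backends left")

rotation = itertools.cycle(backends)
backends["app-2"] = False          # simulate one instance failing
picks = [pick_backend(rotation) for _ in range(4)]
print(picks)                       # app-2 never appears in the rotation
```

Real balancers detect failure via active or passive health checks rather than a flag set by hand, but the effect is the same: traffic flows only to instances that respond.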
2. Failover Mechanisms
The system must be able to automatically switch to backup resources when a failure is detected.
Example: A PostgreSQL database cluster using Patroni or built-in replication. If the primary node fails, one of the replicas automatically becomes the new primary.
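The promotion logic can be sketched as follows. This is loosely modeled on what a tool like Patroni automates for PostgreSQL; the node names, liveness flags, and replication-lag values are invented for illustration.

```python
# Hypothetical cluster state: one primary, two replicas.
cluster = {
    "pg-1": {"role": "primary", "alive": False, "lag_bytes": 0},
    "pg-2": {"role": "replica", "alive": True,  "lag_bytes": 128},
    "pg-3": {"role": "replica", "alive": True,  "lag_bytes": 16},
}

def failover(cluster):
    """If the primary is down, promote the healthy replica with the least lag."""
    primary = next(n for n, s in cluster.items() if s["role"] == "primary")
    if cluster[primary]["alive"]:
        return primary                      # primary is fine, nothing to do
    candidates = [n for n, s in cluster.items()
                  if s["role"] == "replica" and s["alive"]]
    new_primary = min(candidates, key=lambda n: cluster[n]["lag_bytes"])
    cluster[new_primary]["role"] = "primary"
    cluster[primary]["role"] = "failed"
    return new_primary

print(failover(cluster))  # promotes pg-3: alive and with the smallest lag
```

Choosing the least-lagged replica minimizes the amount of recent data that may be lost on promotion, which is exactly the trade-off RPO describes.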
3. Health Checks and Monitoring
Infrastructure components regularly check each other's status. When a failure is detected, the system initiates recovery procedures or redistributes the load.
Example: Kubernetes performs liveness and readiness probes. If a container stops responding, the orchestrator automatically restarts the pod or creates a new one.
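The restart-on-repeated-failure pattern behind liveness probes can be sketched like this. The service dict, probe results, and threshold are invented; Kubernetes itself uses configurable `failureThreshold` and probe periods.

```python
FAILURE_THRESHOLD = 3   # consecutive failed checks before a restart

def reconcile(service, restarts):
    """Restart the service once consecutive failures hit the threshold."""
    if service["consecutive_failures"] >= FAILURE_THRESHOLD:
        service["consecutive_failures"] = 0   # simulate a restart
        restarts += 1
    return restarts

svc = {"name": "api", "consecutive_failures": 0}
restarts = 0
# Simulated probe results: one success, three failures, one success.
for check_ok in [True, False, False, False, True]:
    svc["consecutive_failures"] = 0 if check_ok else svc["consecutive_failures"] + 1
    restarts = reconcile(svc, restarts)

print(restarts)  # one restart, triggered after three consecutive failures
```

Requiring several consecutive failures before acting avoids restarting a service because of a single transient network glitch.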
4. Horizontal Scaling
Instead of increasing the power of a single server, the system scales by adding new instances of services.
Example: An Auto Scaling Group in AWS automatically launches additional application instances during load spikes.
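The scaling decision itself is simple arithmetic, sketched below in the spirit of a target-tracking policy. The target CPU value and instance bounds are invented numbers, not AWS defaults.

```python
import math

TARGET_CPU = 60.0   # desired average CPU percent per instance (assumed)

def desired_instances(current, avg_cpu, min_n=2, max_n=10):
    """Scale the instance count so average CPU approaches the target."""
    wanted = math.ceil(current * avg_cpu / TARGET_CPU)
    return max(min_n, min(max_n, wanted))

print(desired_instances(current=3, avg_cpu=90.0))   # load spike: scale out to 5
print(desired_instances(current=3, avg_cpu=20.0))   # idle: scale in to the minimum
```

Clamping between a minimum and maximum keeps the group from scaling to zero during quiet periods or running away during a traffic anomaly.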
5. Data Replication
To ensure data availability, replication mechanisms are used to store copies of data on multiple nodes.
Example: PostgreSQL primary-replica replication or multi-node database clusters (e.g., Cassandra, CockroachDB).
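At its core, primary-replica replication means every write accepted by the primary is propagated to each replica. The sketch below uses in-memory dicts as stand-in nodes; real systems stream a write-ahead log and usually replicate asynchronously.

```python
# Hypothetical nodes: one primary and two replicas, as plain dicts.
primary = {}
replicas = [{}, {}]

def write(key, value):
    """Apply the write on the primary, then replicate it to every replica."""
    primary[key] = value
    for replica in replicas:        # asynchronous in real systems
        replica[key] = value

write("user:42", {"name": "Alice"})
# Reads can now be served from any node; every copy holds the same value.
print(all(r.get("user:42") == primary["user:42"] for r in replicas))
```

The availability benefit is that reads survive the loss of any single node, and a replica already holds the data needed for promotion.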
6. Geo-redundancy
To enhance fault tolerance, the infrastructure can be distributed across multiple physical locations.
In the context of hosting, this means using servers located in different data centers or cities. If one site becomes unavailable (e.g., due to a network incident or hardware problems), the service can continue operating on servers in another location.
Example: An application is deployed on two servers in different data centers of a hosting provider. Traffic is distributed via DNS balancing or an external balancer. If one site becomes unavailable, users continue connecting to the remaining server.
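Failover between locations reduces to picking the first reachable site from an ordered list, as in this sketch. The data-center names and reachability flags are invented; real setups rely on DNS health checks or an external balancer rather than an in-process list.

```python
# Hypothetical sites in preference order.
sites = [
    {"name": "dc-falkenstein", "reachable": True},
    {"name": "dc-amsterdam",   "reachable": True},
]

def resolve(sites):
    """Return the first reachable site, emulating failover between locations."""
    for site in sites:
        if site["reachable"]:
            return site["name"]
    raise RuntimeError("all sites are down")

sites[0]["reachable"] = False       # simulate a data-center outage
print(resolve(sites))               # traffic moves to dc-amsterdam
```

One caveat worth knowing: DNS-based failover is limited by record TTLs, so clients may keep connecting to the dead site until their cached records expire.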
Availability Metrics
The level of High Availability is typically expressed through SLAs (Service Level Agreements) and the uptime metric:
99.9% — ~8.8 hours of downtime per year
99.99% — ~53 minutes of downtime per year
99.999% — ~5.3 minutes of downtime per year
Each additional "nine" requires a significantly more complex architecture and increases infrastructure costs.
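The downtime figures above follow directly from the uptime percentage, as this small calculation shows:

```python
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes in a non-leap year

def downtime_minutes(uptime_percent):
    """Minutes per year a service may be down while still meeting the SLA."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes(nines):.1f} min/year")
# 99.9%  -> 525.6 min/year (~8.8 hours)
# 99.99% -> ~52.6 min/year
# 99.999% -> ~5.3 min/year
```

Each extra nine shrinks the error budget tenfold, which is why every additional nine costs disproportionately more to achieve.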
When High Availability is Truly Necessary
HA architecture is justified in systems where downtime directly impacts business processes:
financial services and payment systems
SaaS platforms with a large user base
high-load APIs
e-commerce during peak sales
24/7 critical systems
Example: If a payment service is unavailable for even a few minutes, it can lead to direct financial losses and SLA violations with clients.
When High Availability Might Be Overkill
Not every project requires a full-fledged HA architecture. It is often excessive for:
MVPs or early-stage startups
internal corporate tools
pet projects
low-traffic services
systems where planned downtime is acceptable
Example: A small internal service for automating team tasks can run perfectly well on a single server with regular backups. The cost of maintaining a full HA infrastructure would be significantly higher than the potential damage from a brief downtime.
Anti-Patterns in Building High Availability
1. "HA" without eliminating Single Points of Failure
Sometimes systems are called highly available even though critical components still exist in a single instance.
Example: Multiple application servers run behind a single load balancer. If that load balancer itself fails, the entire service becomes unavailable.
2. Replication without a Thought-Out Failover Process
Having database replicas doesn't inherently guarantee high availability if the switchover process is manual or takes a long time.
Example: PostgreSQL has replicas, but when the primary fails, manual administrator intervention is required to promote one of them.
3. Synchronous Dependencies Between Services
Tight coupling between services can lead to cascading failures.
Example: If service A synchronously depends on service B, and B depends on C, the failure of C can degrade the entire chain.
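The arithmetic behind this anti-pattern is worth making explicit: when A synchronously calls B and B calls C, the chain is only up when all three are up, so their availabilities multiply. The figures below are illustrative, not measurements.

```python
# Assumed per-service availability: each service alone offers "three nines".
availability = {"A": 0.999, "B": 0.999, "C": 0.999}

chain = 1.0
for service, a in availability.items():
    chain *= a          # the chain works only if every link works

print(f"{chain:.5f}")   # ~0.99700: three chained three-nines services
                        # together deliver noticeably less than three nines
```

This is why HA designs favor loose coupling, queues, and graceful degradation over long chains of synchronous calls.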
4. Lack of Failure Testing
An architecture might look fault-tolerant on paper but fail in real-world scenarios.
Example: A failover mechanism is configured but never tested. During an actual primary node failure, the system doesn't switch over correctly.
5. Excessive Infrastructure Complexity
Sometimes teams implement complex HA solutions that significantly increase operational risks and maintenance costs.
Example: A small, low-traffic service is deployed in a multi-region architecture with several clusters, even though business requirements don't necessitate it.
Conclusion
High Availability is not a specific technology, but a collection of architectural decisions aimed at minimizing the impact of failures and the time the system needs to recover from them.
The decision to implement HA should be based on business requirements, the acceptable Recovery Time Objective (RTO), and the acceptable Recovery Point Objective (RPO).