Contents

HA Cluster Notes and Application Design

When designing systems with higher traffic, you will eventually encounter cluster-related issues.

Cluster

A collection of one or more machines (nodes) with three different purposes:

Load Balancing

Allows multiple machines to share tasks as evenly as possible, accelerating application execution.

High Availability (HA)

For high availability and redundancy, if one machine suddenly fails, others can take over.

High Performance Computing

High-performance/parallel computing systems, abbreviated as HPC clusters, combine the hardware of multiple machines to increase computing power, used to solve tasks that a single machine cannot handle.

HA Operating Modes

There are many types, such as N+1, N+M, … But the most common is a two-node cluster. A two-node cluster has two operating modes:

  • Active-Passive
  • Active-Active

Active-Passive (AP)

A master-slave design. Under normal circumstances, only the master (Active) provides the service. When the master (Active) encounters a problem, the slave (Passive) takes over. Once the master (Active) recovers, it switches back, and the master (Active) continues to handle the service.

Advantages:

  • Fast fail-over speed.
  • Relatively simple design and configuration.

Disadvantages:

  • Cannot perform load balancing simultaneously, wasting some hardware resources.

Active-Active (AA)

Both machines simultaneously run their own independent services (both are Active), and also provide mutual redundancy (acting as the other’s Passive). When one machine encounters a problem, the other takes over its service.

Advantages:

  • Neither machine is idle during normal operation, resulting in high operational efficiency.

Disadvantages:

  • The machine’s load increases after fail-over, leading to slower performance.
  • Relatively complex design and configuration.

Application Design

  • There needs to be a relatively simple way to start, stop, force-stop services, and check the current status of services.
    => When designing the application, there should be a command-line interface or script to achieve this.
    => Services on both machines should be able to know each other’s status and be able to start or stop in case of an accident.
  • Shared storage is required, and the application should record its state as meticulously as possible to shared storage.
    => This ensures nothing is lost when switching between the two machines.
  • It should be possible to restart another node and restore it to the state before the failure occurred.
    => Restoring to the pre-failure state can be done using the state saved to shared storage.
  • When the application crashes, the data stored on shared storage must not be corrupted.
    => The other side needs to use it.

Remark

  • Consider scenarios that occur during application upgrades.
  • Some SQL or NoSQL databases inherently support these types of configurations, which can be adopted to reduce a lot of trouble.