
HA Cluster Notes and Application Design

·430 words·3 mins
Kuo-Hsiu (Kourtney) Lee
Author
Software developer. Writing about systems, architecture, software design, and technologies.

When you design systems for high traffic, you will eventually run into cluster-related issues.

Cluster
#

A cluster is a collection of one or more machines (nodes). Clusters generally serve one of three purposes:

Load Balancing
#

Distributes tasks across multiple machines as evenly as possible, so the application as a whole can handle more work in less time.
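
As an illustration of sharing tasks evenly, here is a minimal round-robin sketch. Round-robin is only one possible strategy, and the `RoundRobinBalancer` class and the node names are hypothetical, not taken from any particular load balancer.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation so each gets an equal share."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        # cycle() repeats the node list forever in the same order.
        self._rotation = cycle(list(nodes))

    def pick(self):
        # Return the next node in the rotation.
        return next(self._rotation)


# Hypothetical node names, used only for this demonstration.
balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
assignments = Counter(balancer.pick() for _ in range(9))
print(assignments)  # each of the three nodes is picked 3 times
```

Real load balancers usually also take node health and current load into account; this sketch deliberately ignores both.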

High Availability (HA)
#

Provides redundancy: if one machine suddenly fails, the others take over so the service stays available.

High Performance Computing
#

High-performance (parallel) computing clusters, abbreviated as HPC clusters, combine the hardware of multiple machines to increase computing power. They are used to solve tasks that a single machine cannot handle.

HA Operating Modes
#

There are many types of HA cluster (N+1, N+M, …), but the most common is a two-node cluster. A two-node cluster has two operating modes:

  • Active-Passive
  • Active-Active

Active-Passive (AP)
#

A master-slave design. Under normal circumstances, only the master (Active) node provides the service. When the master fails, the slave (Passive) node takes over. Once the master recovers, the service switches back to it and the master resumes handling requests.

Advantages:

  • Fast fail-over.
  • Relatively simple design and configuration.

Disadvantages:

  • The passive machine sits idle during normal operation, so the cluster cannot load-balance at the same time and some hardware is wasted.
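
The fail-over and switch-back behaviour described for Active-Passive can be sketched as a small selection rule. This is a simplified model, not a real cluster manager: the `ActivePassiveCluster` class and node names are hypothetical, and health is reported by hand where a real cluster would rely on heartbeats or health checks.

```python
class ActivePassiveCluster:
    """Two-node cluster in which the master serves whenever it is healthy."""

    def __init__(self, master, slave):
        self.master = master
        self.slave = slave
        self.healthy = {master: True, slave: True}

    def report_health(self, node, is_healthy):
        # Assumption: in practice this would be driven by a heartbeat
        # or health check rather than called manually.
        self.healthy[node] = is_healthy

    def serving_node(self):
        # The master is always preferred, so its recovery implies
        # an automatic switch back.
        if self.healthy[self.master]:
            return self.master
        if self.healthy[self.slave]:
            return self.slave
        return None  # both nodes are down: the service is unavailable


cluster = ActivePassiveCluster("node-a", "node-b")
print(cluster.serving_node())           # node-a
cluster.report_health("node-a", False)  # the master fails
print(cluster.serving_node())           # node-b
cluster.report_health("node-a", True)   # the master recovers
print(cluster.serving_node())           # node-a
```

Because `serving_node()` always prefers the master, only one node serves at a time, which is exactly why the passive node's capacity goes unused in normal operation.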

Active-Active (AA)
#

Both machines simultaneously run their own independent services (both are Active), and also provide mutual redundancy (acting as the other’s Passive). When one machine encounters a problem, the other takes over its service.

Advantages:

  • Neither machine is idle during normal operation, resulting in high operational efficiency.

Disadvantages:

  • After fail-over, the surviving machine carries both workloads, so performance degrades.
  • Relatively complex design and configuration.
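
A rough sketch of the Active-Active idea follows: each service has a home node, and the peer picks it up when that node fails. The `ActiveActiveCluster` class, the service names, and the node names are all hypothetical and exist only for illustration.

```python
class ActiveActiveCluster:
    """Each node runs its own service and backs up the other node's."""

    def __init__(self, home):
        # home maps each service to the node that normally runs it.
        self.home = dict(home)
        self.nodes = set(self.home.values())
        self.failed = set()

    def fail(self, node):
        self.failed.add(node)

    def recover(self, node):
        self.failed.discard(node)

    def placement(self):
        """Return {node: [services]} for the nodes that are still up."""
        alive = sorted(self.nodes - self.failed)
        result = {node: [] for node in alive}
        for service, node in sorted(self.home.items()):
            if node in result:
                result[node].append(service)   # normal operation
            elif alive:
                result[alive[0]].append(service)  # fail-over to the peer
        return result


cluster = ActiveActiveCluster({"web": "node-a", "db": "node-b"})
print(cluster.placement())  # {'node-a': ['web'], 'node-b': ['db']}
cluster.fail("node-a")
print(cluster.placement())  # {'node-b': ['db', 'web']}
```

After the failure, `node-b` carries both services, which is the extra load listed among the disadvantages.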

Application Design
#

  • There needs to be a relatively simple way to start, stop, and force-stop services, and to check their current status.
    => When designing the application, provide a command-line interface or script for these operations.
    => The services on each machine should be able to learn the other machine’s status and to start or stop accordingly when a failure occurs.
  • Shared storage is required, and the application should record its state to shared storage in as much detail as possible.
    => This ensures nothing is lost when the service switches between the two machines.
  • It should be possible to restart the service on the other node and restore it to the state before the failure occurred.
    => Restoring the pre-failure state can be done from the state saved to shared storage.
  • When the application crashes, the data on shared storage must not be corrupted.
    => The other node needs that data when it takes over.
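
One common way to meet the last three requirements is to write state to a temporary file and then rename it over the old one, so a crash leaves either the old state or the new state, never a half-written file. The sketch below assumes the state fits in a small JSON document. The `save_state` and `load_state` helpers and the file names are hypothetical, and a temporary directory stands in for the shared storage mount. Note that `os.replace` is atomic on local POSIX filesystems; whether a given shared or network filesystem gives the same guarantee has to be checked separately.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state so that a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays
    # on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap in the new file in one step
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path, default=None):
    """Read the last saved state, for example on the node taking over."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default


# Demo: a temporary directory stands in for the shared storage mount.
with tempfile.TemporaryDirectory() as shared:
    state_file = os.path.join(shared, "service-state.json")
    save_state(state_file, {"last_processed_job": 41})
    save_state(state_file, {"last_processed_job": 42})
    restored = load_state(state_file)
    missing = load_state(os.path.join(shared, "absent.json"), default={})

print(restored)  # {'last_processed_job': 42}
```

The node that takes over would call `load_state` on start-up and resume from the last recorded state.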

Remark
#

  • Plan for the scenarios that arise during application upgrades.
  • Some SQL and NoSQL databases support these configurations natively; adopting them can save a lot of trouble.