Failover is the process of automatically switching to a standby or redundant computer server, system, hardware component, or network if the active one fails or experiences an abnormal termination. This process is crucial in maintaining continuous availability and high reliability in computer networks.
Systems designers often build failover capability into servers and networks so that operations can continue if a component fails. At the server level, failover automation typically relies on a "heartbeat" system that connects two servers, either through a dedicated cable (for example, between RS-232 serial ports) or over a network connection. As long as a regular "pulse" or "heartbeat" continues between the primary server and the secondary server, the secondary server remains on standby. If the heartbeat is interrupted, the secondary server treats this as a failure of the primary and automatically takes over, bringing its own systems online. Some more sophisticated failover systems run all servers concurrently and redistribute the workload among the remaining servers if one fails. A third server holding spare components may also be kept available so that failed parts can be replaced immediately, preventing downtime.
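The heartbeat check described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular product's implementation: the class name `StandbyMonitor`, its methods, and the three-second timeout are assumptions made for the example, and timestamps are passed in explicitly so that the logic can be followed without a real network or clock.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Standby server's view of the primary's heartbeat (hypothetical names)."""

    timeout: float               # seconds of silence tolerated before takeover
    last_heartbeat: float = 0.0  # time at which the last pulse was received
    active: bool = False         # True once the standby has taken over

    def record_heartbeat(self, now: float) -> None:
        # Called whenever a pulse arrives from the primary server.
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        # Take over if the primary has been silent for longer than the timeout.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


monitor = StandbyMonitor(timeout=3.0)
monitor.record_heartbeat(now=1.0)
print(monitor.check(now=2.0))  # False: the primary was heard 1 second ago
monitor.record_heartbeat(now=2.0)
print(monitor.check(now=6.0))  # True: 4 seconds of silence exceeds the timeout
```

In a real deployment the timeout is a trade-off: a short value detects failures quickly but risks a spurious takeover when a pulse is merely delayed.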
Failover systems can also be configured to require human intervention. In this setup, known as "automated with manual approval," the failover procedure itself runs automatically, but only after a person has approved it. Once failover has completed, the system resumes normal operation, ideally with no interruption noticeable to users.
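One way to picture the "automated with manual approval" arrangement is as a small state machine in which failure detection only queues a request and an operator's approval triggers the takeover. The sketch below assumes this interpretation; the names `State`, `ApprovalGatedFailover`, `on_primary_failure`, and `approve` are hypothetical.

```python
from enum import Enum


class State(Enum):
    STANDBY = "standby"
    AWAITING_APPROVAL = "awaiting_approval"
    ACTIVE = "active"


class ApprovalGatedFailover:
    """Failover that is detected automatically but gated on an operator."""

    def __init__(self) -> None:
        self.state = State.STANDBY

    def on_primary_failure(self) -> None:
        # Detection is automatic, but it only queues a request for approval.
        if self.state is State.STANDBY:
            self.state = State.AWAITING_APPROVAL

    def approve(self) -> None:
        # After approval, the takeover proceeds without further human steps.
        if self.state is not State.AWAITING_APPROVAL:
            raise RuntimeError("no failover is waiting for approval")
        self.state = State.ACTIVE


controller = ApprovalGatedFailover()
controller.on_primary_failure()
print(controller.state.value)  # awaiting_approval
controller.approve()
print(controller.state.value)  # active
```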
Failback is the reverse process: the system, component, or service that failed is restored to working condition and resumes its original role, and the standby system that took over during failover returns to standby status.
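The relationship between failover and failback can be illustrated with a two-node pair that tracks which node is currently serving. This is a simplified sketch under assumed names (`ServerPair`, `failover`, `failback`, and the node labels are all hypothetical), and it leaves out the data resynchronization that a real failback usually requires.

```python
class ServerPair:
    """Tracks which of two nodes is serving (hypothetical illustration)."""

    def __init__(self, primary: str, standby: str) -> None:
        self.primary = primary
        self.standby = standby
        self.active = primary  # the primary serves during normal operation

    def failover(self) -> None:
        # The standby takes over the primary's role.
        self.active = self.standby

    def failback(self, primary_healthy: bool) -> None:
        # Return service to the primary only once it is working again.
        if not primary_healthy:
            raise RuntimeError("primary has not been restored yet")
        self.active = self.primary


pair = ServerPair(primary="node-a", standby="node-b")
pair.failover()
print(pair.active)  # node-b
pair.failback(primary_healthy=True)
print(pair.active)  # node-a
```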
Virtualization technology has made failover practices less dependent on physical hardware. Through migration, a running virtual machine can move from one physical host to another with minimal or no service disruption.
Failover and failback are used extensively with Microsoft SQL Server databases. Here, a SQL Server Failover Cluster Instance (FCI) is configured on top of a Windows Server Failover Cluster (WSFC). In this setup, SQL Server groups and resources can be moved, manually or automatically, to a second node if the first node encounters problems. Once the problems are resolved or maintenance is complete, a failback operation returns operations to the original node.