Page 124 - DCAP103_Principle of operating system

Unit 4: Process Management-III



            A bad system call should never be able to take down the kernel. An RTOS should, therefore,
            employ opaque handles for kernel objects. It should also validate the parameters to all system
            calls.
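            The two ideas above, opaque handles and parameter validation, can be sketched as a toy
            model. This is only an illustration under stated assumptions, not the API of any real
            RTOS: the class Kernel, the calls sem_create and sem_post, and the error codes are all
            hypothetical names. Callers receive an integer handle and never a reference to the
            kernel object, and every call checks its arguments before touching kernel state.

```python
import secrets


class Kernel:
    """Toy kernel (hypothetical) that exposes opaque handles, never object references."""

    E_OK = 0
    E_BAD_HANDLE = -1
    E_BAD_PARAM = -2

    def __init__(self):
        # handle -> kernel object; this table is never handed to callers
        self._objects = {}

    def sem_create(self, initial_count):
        # Validate the parameter before any kernel state is modified.
        if (not isinstance(initial_count, int)
                or isinstance(initial_count, bool)
                or initial_count < 0):
            return self.E_BAD_PARAM, None
        # Pick an unused random handle so callers cannot guess internal layout.
        handle = secrets.randbits(32)
        while handle in self._objects:
            handle = secrets.randbits(32)
        self._objects[handle] = {"type": "sem", "count": initial_count}
        return self.E_OK, handle

    def sem_post(self, handle):
        # A bad handle yields an error code instead of corrupting the kernel.
        obj = self._objects.get(handle) if isinstance(handle, int) else None
        if obj is None or obj["type"] != "sem":
            return self.E_BAD_HANDLE
        obj["count"] += 1
        return self.E_OK
```

            In this sketch a caller that passes a stale or made-up handle simply gets E_BAD_HANDLE
            back, and the kernel's own tables are left untouched.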

            4.4.3 Fault Tolerance and High Availability
            Even the best software has latent bugs. As applications become more complex, performing more
            functions for a software-hungry world, the number of bugs in fielded systems will continue to
            rise. System designers must, therefore, plan for failures and employ fault recovery techniques.
            Of course, the effect of fault recovery is application-dependent: a user interface can restart itself
            in the face of a fault, but a flight-control system probably cannot. One way to do fault recovery is
            to have a supervisor thread in an address space all its own. When a thread faults (for example,
            due to a stack overflow), the kernel should provide some mechanism whereby notification can
            be sent to the supervisor thread. If necessary, the supervisor can then make a system call to
            close down the faulted thread, or the entire process, and restart it. The supervisor might also
            be hooked into a software “watchdog” setup, whereby thread deadlocks and starvation can be
            detected as well.
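            The supervisor pattern described above can be sketched roughly as follows. This is a
            simplified model, not a real RTOS mechanism: the functions run_task and supervise are
            hypothetical, a Python exception stands in for a hardware fault such as a stack
            overflow, and a queue stands in for the kernel's notification channel. The sketch also
            runs everything in one process, so it does not model the separate address space that
            protects a real supervisor from the faulting thread.

```python
import queue
import threading


def run_task(name, body, fault_queue):
    """Run body(); report a clean exit or a fault to the supervisor."""
    try:
        body()
        fault_queue.put((name, None))      # clean exit, no fault
    except Exception as exc:               # stands in for a kernel fault trap
        fault_queue.put((name, exc))


def supervise(name, body, max_restarts=3):
    """Start the task and restart it after each fault; return the restart count."""
    fault_queue = queue.Queue()
    restarts = 0
    while True:
        worker = threading.Thread(target=run_task,
                                  args=(name, body, fault_queue))
        worker.start()
        _, fault = fault_queue.get()       # block until the task reports
        worker.join()
        if fault is None:
            return restarts                # task finished without faulting
        if restarts == max_restarts:
            raise RuntimeError(f"{name} keeps faulting: {fault!r}")
        restarts += 1                      # close down and restart the task
```

            The restart limit (max_restarts) is an added assumption: without some bound, a task
            that faults on every run would be restarted forever, which is rarely the desired
            recovery policy.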
            In many critical systems, high availability is assured by employing multiple redundant nodes in
            the system. In such a system, the kernel running on a redundant node must have the ability to
            detect a failure in one of the operating nodes. One method is to provide a built-in heartbeat in the
            interprocessor message passing mechanism of the RTOS. Upon system startup, a communications
            channel is opened between the redundant nodes and each of the operating nodes. During normal
            operation, the redundant nodes continually receive heartbeat messages from the operating nodes.
            If the heartbeat fails to arrive, the redundant node can take control automatically.
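            A minimal sketch of the heartbeat check on the redundant node might look like this. The
            class HeartbeatMonitor, its method names, and the timeout value are assumptions made for
            illustration; a real RTOS would build this into its message-passing layer. Timestamps
            are passed in explicitly so that the logic can be followed without a real clock.

```python
class HeartbeatMonitor:
    """Runs on the redundant node; tracks the last heartbeat from each operating node."""

    def __init__(self, nodes, timeout, start=0.0):
        self.timeout = timeout
        # Treat channel-open time as the first "heartbeat" from every node.
        self.last_seen = {node: start for node in nodes}
        self.taken_over = set()

    def on_heartbeat(self, node, now):
        # Called whenever a heartbeat message arrives from an operating node.
        self.last_seen[node] = now

    def check(self, now):
        """Return nodes whose heartbeat is newly overdue and mark them taken over."""
        failed = [node for node, seen in self.last_seen.items()
                  if now - seen > self.timeout and node not in self.taken_over]
        self.taken_over.update(failed)
        return failed


# Usage: node B stops sending heartbeats after t = 0.5
monitor = HeartbeatMonitor(["A", "B"], timeout=1.0)
monitor.on_heartbeat("A", 0.5)
monitor.on_heartbeat("B", 0.5)
monitor.on_heartbeat("A", 1.4)
overdue = monitor.check(2.0)   # only B has been silent for more than 1.0
```

            Choosing the timeout is a trade-off: a short timeout detects failures quickly but may
            trigger a takeover when a heartbeat is merely delayed, while a long timeout avoids false
            takeovers at the cost of slower recovery.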

                              Figure 4.3: Redundancy via System Heartbeats

            [Figure: two nodes labeled "Active" and one node labeled "Redundant".]









                                             LOVELY PROFESSIONAL UNIVERSITY                                   117