Essay about Failures in a Distributed System

A distributed system is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, so that all components work together to perform a single set of related tasks. Thanks to the combined capabilities of its distributed components, a distributed system can be much larger and more powerful than a combination of stand-alone systems. But this is not easy: for a distributed system to be useful, it must be reliable. This is a difficult goal to achieve because of the complexity of the interactions between simultaneously running components. A distributed system must have the following characteristics: …
Four types of failures that can occur in a distributed system are:

* Halting failures: A component simply stops. There is no way to detect the failure except by timeout: the component either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. A computer freezing is a halting failure.
* Omission failures: A message fails to be sent or received, primarily because of a lack of buffer space, so the message is discarded with no notification to either the sender or the receiver. This can happen when routers become overloaded.
* Network failures: A network link breaks.
* Timing failures: A temporal property of the system is violated. For example, clocks on different computers that are used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period.

Timing failures - there is suspicion of a possible failure related to the motherboard. This can result from a specific message strongly implicating the motherboard in some sort of erratic system behavior. It may also be that the motherboard probably is not the problem, but that we want to rule it out as a possible cause. Since the motherboard is where all the other components meet and connect, a bad motherboard can affect virtually any other part of the PC. For this reason the motherboard must often be checked to ensure it is working properly, even if it is unlikely to be the cause of whatever is
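
The timeout-based detection of halting failures described above can be illustrated with a small sketch. It is not taken from the essay: the names `HeartbeatMonitor`, `FakeClock`, `record_heartbeat`, and `suspected_failures`, as well as the timeout value, are hypothetical and chosen only for illustration. The monitor remembers when each component last sent an "I'm alive" message and reports any component that has been silent for longer than the timeout.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat from each component (hypothetical helper)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        # The clock is injectable so the sketch can be exercised without sleeping.
        self.clock = clock
        self.last_seen = {}

    def record_heartbeat(self, component_id):
        """Note that an "I'm alive" message just arrived from component_id."""
        self.last_seen[component_id] = self.clock()

    def suspected_failures(self):
        """Return the sorted ids of components silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            component_id
            for component_id, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout_seconds
        )


class FakeClock:
    """Manually advanced clock, used only to demonstrate the monitor."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


if __name__ == "__main__":
    clock = FakeClock()
    monitor = HeartbeatMonitor(timeout_seconds=3.0, clock=clock)
    monitor.record_heartbeat("node-a")
    monitor.record_heartbeat("node-b")
    clock.advance(2.0)
    monitor.record_heartbeat("node-a")  # node-b stays silent from here on
    clock.advance(2.0)
    print(monitor.suspected_failures())
```

Note that a timeout only raises suspicion. From the monitor's point of view, a halted component looks the same as one whose heartbeats were dropped (an omission failure), cut off by a broken link (a network failure), or merely delayed (a timing failure), so the choice of timeout trades detection speed against false alarms.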
