HW4: Chapters 11 & 12

11.4. What is the common characteristic of all architectural styles that are geared to supporting software fault tolerance?

Diversity. This is due to the logic that a redundant/diverse system will be able to survive one portion failing, by having a fallback which can take the same input without failing in the same way. These are often designed with multiple layers or types of structure behind the diversity, such as n-version, three-channel, etc. Diversity is the best way (after writing good code) to improve a software’s ability to function reliably despite faults.

11.7. It has been suggested that the control software for a radiation therapy machine, used to treat patients with cancer, should be implemented using N-version programming. Comment on whether or not you think this is a good suggestion.

N-versioning essentially means that there is some number N such that that number of completely independent versions of the target specification are created, then run concurrently, with an overseer which determines which one is most correct on output. The upsides of such a design is that any individual failure on any individual version can be ignored or outvoted by the other versions. Also, since they are generally diverse, it is unlikely that they all fail in the exact same manner, reducing the likelihood of having a complete failure.

However, the downsides are that the resources for development need to be split among parallel development teams; 3 bad systems doing the same thing as 1 good system could mean that the correct result gets outvoted, rather than the other way around. Similarly, 3 good systems simply means that 1/3 the total progress on the project has been achieved, and the hardware costs to support the triply developed software means that costs stay higher in the long run.

This type of system is essential for softwares which can harm significant numbers of people on failure; airplanes and nuclear power plants being a perfect example. However, a radiation therapy machine may or may not be at the same scale. One could argue that, if a large number of hospitals buy the machine, a large number of patients utilize the machine before a fault is discovered, and the fault leads to death, then it would globally cause more harm than a single airplane failure, which generally only has to crash once or twice before the product line is pulled and fixed. From this perspective, the radiation therapy machine absolutely requires n-version programming – moreso than ‘more visible’ systems like airplanes.

11.9 Explain why you should explicitly handle all exceptions in a system that is intended to have a high level of availability.

Exceptions are the most easily quantified and fixed of bugs; they have custom error messages in most languages, and are easily detected and fixed for that very reason. Because of this, there is basically no excuse for not handling exceptions properly. They can make recovering from certain faults (especially data type faults) extremely easy, and explicitly defining the course of action to the software for each type of fault in each possible context is a very solid manner of making your software resistant to common issues.

12.5. A train protection system automatically applies the brakes of a train if the speed limit for a segment of track is exceeded, or if the train enters a track segment that is currently signaled with a red light (i.e., the segment should not be entered). There are two critical-safety requirements for this train protection system:

1: The train shall not enter a segment of track that is signaled with a red light.

2: The train shall not exceed the specified speed limit for a section of track.

Assuming that the signal status and the speed limit for the track segment are transmitted to on-board software on the train before it enters the track segment, propose five possible functional system requirements for the onboard software that may be generated from the system safety requirements.

The system must have redundant sensors and diverse signal interpretation, so that an error cannot prevent the train from receiving the speed/signal status
The system must have some detection (sensors + analysis) for track conditions, in case of ice/snow/rain which would lower the safe speed of train operation below the local speed limit
The system needs safe, reliable default values in case of being unable to receive the speed limit of the next track segment.
The system needs to stop the train (assume red light) in case of being unable to receive the ‘red light’ status of the next track segment.
The system needs access to emergency brakes and sensors in place to engage them in order to prevent a head-on collision in case of detecting something on the tracks
The system needs to handle speed changes such that when raising the speed limit, the train does not accelerate before crossing into the new segment; but when lowering the speed limit, the train has decelerated to the new limit before crossing into the new segment.
The system needs tracking of the status of hardware (especially battery, fuel, engine functions, brake functions, sensors, etc.); it must alert management when components are worn down or when it has reached a recommended maintenance stage, and give severe warnings (including refusing to embark) in case of essential components not responding to the system/ responding as broken.
The system needs a method of communicating with the outside world in order to give live updating of any issues that may occur;
The system must be able to receive an ’emergency stop’ command from outside (including from other trains)
If the train stops for any reason, it must broadcast an ’emergency stop’ command to the nearby track to prevent other trains from potentially crashing into it the system should have verbose logging in case of audit / error