HW4: Chapters 11 & 12

29 Aug 2019

11.4. What is the common characteristic of all architectural styles that are geared to supporting software fault tolerance?

Diversity. This is due to the logic that a redundant/diverse system will be able to survive one portion failing, by having a fallback which can take the same input without failing in the same way. These are often designed with multiple layers or types of structure behind the diversity, such as n-version, three-channel, etc. Diversity is the best way (after writing good code) to improve a software’s ability to function reliably despite faults.

11.7. It has been suggested that the control software for a radiation therapy machine, used to treat patients with cancer, should be implemented using N-version programming. Comment on whether or not you think this is a good suggestion.

N-versioning essentially means that there is some number N such that that number of completely independent versions of the target specification are created, then run concurrently, with an overseer which determines which one is most correct on output. The upsides of such a design is that any individual failure on any individual version can be ignored or outvoted by the other versions. Also, since they are generally diverse, it is unlikely that they all fail in the exact same manner, reducing the likelihood of having a complete failure.

However, the downsides are that the resources for development need to be split among parallel development teams; 3 bad systems doing the same thing as 1 good system could mean that the correct result gets outvoted, rather than the other way around. Similarly, 3 good systems simply means that 1/3 the total progress on the project has been achieved, and the hardware costs to support the triply developed software means that costs stay higher in the long run.

This type of system is essential for softwares which can harm significant numbers of people on failure; airplanes and nuclear power plants being a perfect example. However, a radiation therapy machine may or may not be at the same scale. One could argue that, if a large number of hospitals buy the machine, a large number of patients utilize the machine before a fault is discovered, and the fault leads to death, then it would globally cause more harm than a single airplane failure, which generally only has to crash once or twice before the product line is pulled and fixed. From this perspective, the radiation therapy machine absolutely requires n-version programming – moreso than ‘more visible’ systems like airplanes.

11.9 Explain why you should explicitly handle all exceptions in a system that is intended to have a high level of availability.

Exceptions are the most easily quantified and fixed of bugs; they have custom error messages in most languages, and are easily detected and fixed for that very reason. Because of this, there is basically no excuse for not handling exceptions properly. They can make recovering from certain faults (especially data type faults) extremely easy, and explicitly defining the course of action to the software for each type of fault in each possible context is a very solid manner of making your software resistant to common issues.

12.5. A train protection system automatically applies the brakes of a train if the speed limit for a segment of track is exceeded, or if the train enters a track segment that is currently signaled with a red light (i.e., the segment should not be entered). There are two critical-safety requirements for this train protection system:

1: The train shall not enter a segment of track that is signaled with a red light.

2: The train shall not exceed the specified speed limit for a section of track.

Assuming that the signal status and the speed limit for the track segment are transmitted to on-board software on the train before it enters the track segment, propose five possible functional system requirements for the onboard software that may be generated from the system safety requirements.