Code Verification

Friday, 14 June 2013 12:41
Railway Signalling Railway Signalling

Thales Austria asked Reivertech to perform a software verification of a new Java system used to control real-time rail signalling systems due to go into the field in early 2013. Essentially, the task required an in-depth code review of the Thales software by an independent consultant with knowledge of both rail systems and the Java language.

However, this was no code review in the classic sense. Reivertech was not requested to check the code's suitability of purpose, because that was the job of another external consultant (the so-called validator). What Reivertech had to do was ensure that the code obeyed the constraints of real-time rail signalling, essentially:

  • Disallowing use of the heap
  • Suppression of garbage collection
  • Code Diversity through dual-channel programming

Disallowing use of the heap sounds insane in a Java context, right? Well, as it happens, it isn't too hard once you understand that what is meant by that is: allocate and initialize your data structures and, once complete, tell other components you are ready for operation. From that point, allocate no more heap memory. This is not as hard as it sounds, but it does require being super-disciplined regarding things like iterators (i.e. you can't create any) and using ring buffers for things like event handling (all entries are pre-allocated and then essentially reused).

Performing the Verification

The easy part was using code metric tools like Findbugs and PMD to check for the obvious violations. A custom suite of Findbugs rules had been implemented to check for the most insidious violations, so that meant that the work consisted of determining the violations in any given release candidate, cataloguing them and writing checklists of how to deal with them.

More difficult was the code review part, which meant reviewing all code modules to be used in the operational phase of the product and determining whether the code was using the heap or violating other real-time rules. Using reflection was, for example, a big no-no. It was also possible to automatically check whether certain modules invoked the heap by writing a custom JUnit runner which invoked implementation code and asserted that the heap had not been touched when the tests completed.

Much of the review was performed in pairs, with both a Reivertech consultant and a Thales developer.

Suppressing Garbage Collection

Why suppress garbage collection? The simple reason is that the JVM used required preemptive garbage collection, i.e. the garbage collector would simply interrupt other operations when it needed to run. For various reasons, it was not possible to use a JVM whose garbage collector did not preempt other threads. As the GC might conceivably need, say, 500ms to complete, this was unacceptable for a real-time transport system where half a second between a signal being told to go to red and actually being able to do so might be the difference between a controlled stop of a train and a collision with the potential for fatalities.

The theory was (and is) that if no heapspace is ever allocated, then memory usage will be static in the system, and the garbage collector would never run, at least not once the system had entered its main operating phase, which might last months or even years between restarts.

Code Diversity

It is a fact of life that hardware systems fail from time to time. Indeed, there is a statistical probability that any system will fail eventually, through obsolescence or other factors. There is even the possibility of an electromagnetic interference causing a single bit to flip somewhere in RAM which can cause a computation to produce erroneous results. To combat this, Thales uses the principle of code diversity, which means not only that there are several software components working in parallel (and which must produce the same results in order to ensure that such an anomaly has not occurred) but that these components be implemented with diverse code modules and algorithms.

Results

After about 4 weeks, Reivertech produced a set of interim results, which were acted upon. Of course, the very act of implementing our recommendations meant that, technically speaking, a re-verification was necessary. Fortunately, the delta of code to reinspect was small and, indeed, no re-verification was necessary after 2-3 such iterations. The  software was then passed on to the external validator for further assessment. Our job was done!