Test more and better: the Boeing Starliner mishap analysis is complete
Boeing Starliner during post-flight servicing. Photo: NASA/Bill Ingalls
Nearly Lost Twice
In NASA terminology, the mission was classified as a "high visibility close call", that is, a near loss with wide media coverage. The next level up is a mishap with loss of the spacecraft and, possibly, loss of life. This status is quite rare. The last time it was assigned was in 2013, when astronaut Luca Parmitano nearly drowned inside his spacesuit during a spacewalk because of a clogged filter in the water cooling system.
The first bug showed itself 31 minutes after launch. The spacecraft did not perform the planned maneuver to move from its initial orbit onto a trajectory toward the ISS. The mission control center (MCC) tried to correct the situation, but, as luck would have it, communication problems got in the way, and as a result Starliner ended up in an orbit unsuitable for approaching the ISS and with its propellant largely spent. Because of an error in the code, the spacecraft synchronized its flight timer with the launch vehicle not at the moment the countdown began, but 11 hours before launch. As a result, the on-board computer believed the spacecraft was in a different phase of flight than it actually was.
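The effect of a wrongly chosen timer epoch can be shown with a minimal sketch. This is not Starliner's flight software: the phase names, the phase offsets, the function names and the example date are all assumptions made up for illustration. Only the 11-hour offset and the 31-minute mark come from the text above. For simplicity, the sketch treats the intended synchronization point as a single reference moment.

```python
from datetime import datetime, timedelta

# Hypothetical phase schedule: (phase name, offset from the timer epoch at which it starts).
# The names and offsets are illustrative only.
PHASE_SCHEDULE = [
    ("ascent", timedelta(0)),
    ("orbit_insertion_burn", timedelta(minutes=31)),
    ("later_orbital_operations", timedelta(hours=2)),
]


def mission_elapsed_time(now: datetime, epoch: datetime) -> timedelta:
    """Time elapsed since the moment the timer was synchronized to."""
    return now - epoch


def current_phase(met: timedelta) -> str:
    """Return the latest phase whose start offset has already passed."""
    phase = "prelaunch"
    for name, start in PHASE_SCHEDULE:
        if met >= start:
            phase = name
    return phase


# Arbitrary example date, not the real launch time.
intended_epoch = datetime(2000, 1, 1, 12, 0, 0)
# The faulty epoch is taken 11 hours too early, as described in the text.
faulty_epoch = intended_epoch - timedelta(hours=11)

# The moment at which the first maneuver should have happened.
now = intended_epoch + timedelta(minutes=31)

phase_with_intended_epoch = current_phase(mission_elapsed_time(now, intended_epoch))
phase_with_faulty_epoch = current_phase(mission_elapsed_time(now, faulty_epoch))

print(phase_with_intended_epoch)  # orbit_insertion_burn
print(phase_with_faulty_epoch)    # later_orbital_operations
```

With the faulty epoch the computed elapsed time is 11 hours 31 minutes instead of 31 minutes, so the phase lookup skips past the maneuver that should be starting. An integration test that feeds the timer the same synchronization signal the real launch vehicle provides is the kind of check that could expose such a mismatch.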
Separation of the Starliner spacecraft's modules. Frame from a Boeing video
The second bug did not get a chance to show itself. Because of the first problem, NASA and Boeing experts began to comb through the code, asking “have we missed anything else?” And, as it turned out, not in vain. During landing, after the braking maneuver, the spacecraft had to separate into a descent vehicle and a service module (shown in the illustration above; almost all spacecraft go through a similar procedure: for example, Soyuz splits into three modules, and Crew Dragon jettisons its trunk section before braking). After separation, the service module had to perform a maneuver to move away from the descent vehicle, but because of an error in the code, the procedure was passed incorrectly to the controller in charge of the process. As a result, the service module could have hit the descent vehicle and damaged it.
The third problem was less critical, but it caused ground personnel a lot of grief. Throughout the mission, the spacecraft had trouble communicating with the ground, which made it hard to control from the MCC and, on a crewed flight, would have made it difficult to talk to the astronauts.
The two critical problems, either of which would have led to the loss of the spacecraft had the MCC not intervened, were introduced at the design and development stage and slipped through numerous checks at the testing stage. Both could have been detected during testing, and Boeing's processes could and should have found and fixed them.
What to do?
The full report contains proprietary and trade secret information, so NASA published only a general overview, which is still very interesting.
21 recommendations relate directly to testing. First of all, integration testing must be improved at both the hardware and the software level. I would add that errors not caught during integration testing still account for a large share of the causes of space accidents. Further, before each flight there should be a “dress rehearsal” that involves as much flight hardware as possible, its behavior and limitations should be analyzed, and measures should be taken to detect gaps in the simulations.
10 recommendations concern requirements, but in fact they also relate to testing. Requirements with multiple conditions should be analyzed more thoroughly, and decision coverage should be increased. Decision coverage measures whether tests exercise every outcome of each decision (branch) in the program code. Let me remind you that 100% decision coverage implies 100% statement coverage, but not vice versa.
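The difference between the two coverage criteria can be seen in a small sketch. The function `clamp_to_zero` and the `decision_outcomes` set are hypothetical and exist only for this illustration. A real project would use a coverage tool rather than manual instrumentation.

```python
# Instrumentation for the illustration only: records which outcomes the decision has taken.
decision_outcomes = set()


def clamp_to_zero(x: float) -> float:
    """Replace negative values with zero; the `if` has no `else` branch."""
    is_negative = x < 0
    decision_outcomes.add(is_negative)
    if is_negative:
        x = 0
    return x


# A single negative input executes every statement in the function ...
assert clamp_to_zero(-5) == 0
outcomes_after_first_test = set(decision_outcomes)  # {True}

# ... but the False outcome of the decision is only exercised by a second test.
assert clamp_to_zero(3) == 3
outcomes_after_second_test = set(decision_outcomes)  # {True, False}
```

After the first test, statement coverage is already 100%, yet the decision has only ever evaluated to true. The path on which the `if` body is skipped has never run, and a bug on that path would go unnoticed.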
35 recommendations are aimed at improving processes. From what exactly they propose to improve, the discovered problems can be reconstructed. Strengthening code review and test data should fix the problem that errors in the code were not noticed either while the code was being written (in code review) or during testing (the test data was evidently insufficient). Greater involvement of experts in safety-critical areas should close competency gaps. And the proposed changes to the documentation of the decision-making boards should correct the situation in which flaws in development and testing went unnoticed or were given too low a priority to be fixed.
7 recommendations are code fixes that will eliminate the bugs in flight time tracking and in the service module separation procedure, and will make the antenna selection algorithm more reliable.
The last 7 recommendations concern organizational structure and hardware. The organizational structure for safety reporting will change (evidently so that a message like “we have an important problem to solve here” gets through more reliably), the external audit should be improved, and an additional filter will be added to the spacecraft's design to protect it from out-of-band interference.
Although there is nothing to celebrate in the story of this troubled flight, it will help improve the processes for building space hardware and make flights safer. Of course, it is a shame to let bugs reach production that could and should have been found in testing long before the flight. Nowadays, first test flights are expected to confirm the correctness of the decisions made rather than to uncover undetected problems. The test flight was very instructive, but also very expensive. Boeing is now obliged to carry out another test launch at its own expense to confirm that the spacecraft is safe and fit for flight. Its exact date is not yet known; for now it is planned for November 2020.