E/E Architectures on the Way to Level 4
by Wilfried Steiner, Director TTTech Labs
In an SAE Level 4 automated driving system, the driver trusts that the vehicle is driving safely and, consequently, may not pay attention to the driving situation. The driver could be texting, working, or even sleeping. Since critical failures must not occur in these systems, a suitable safety architecture is required.
The human trust in a machine must be met by designing the L4 automated driving system as an ultra-highly reliable system, i.e., a system that fails no more frequently than, say, 10^-8 failures per hour, which is about once every ten thousand years. Such a low failure rate ensures a high probability that a driver will never experience a failure of the automated driving system. Achieving this is a challenge, because components such as hardware, chips, and software – the building blocks of these systems – fail much more frequently. However, there is a rich body of engineering practice and scientific literature on how to construct ultra-highly reliable systems from less reliable components. For example, aerospace flight control systems are required to fail even less frequently.
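The arithmetic behind these reliability targets can be made concrete with a short sketch. The component failure rate below is an illustrative assumption, chosen only to show how redundancy closes the gap between component and system reliability (assuming independent failures):

```python
# Illustrative reliability arithmetic; the single-unit failure rate is an
# assumed value, not a figure from the architecture described here.

HOURS_PER_YEAR = 24 * 365

def mtbf_years(failure_rate_per_hour: float) -> float:
    """Mean time between failures, in years, for a given hourly failure rate."""
    return 1.0 / (failure_rate_per_hour * HOURS_PER_YEAR)

single_unit = 1e-5                 # assumed failure rate of one component (per hour)
redundant_pair = single_unit ** 2  # both independent units fail in the same hour

print(f"single unit MTBF:    {mtbf_years(single_unit):,.1f} years")
print(f"redundant pair MTBF: {mtbf_years(redundant_pair):,.0f} years")
print(f"10^-8/h target MTBF: {mtbf_years(1e-8):,.0f} years")
```

A single component failing every decade or so is far from the target, while a pair of independent redundant components already exceeds it, which is why decomposition into redundant fault-containment units is the central design instrument.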
Ultra-highly reliable systems can only be constructed by appropriately decomposing the overall system into subsystems and fault-containment units with a sufficient level of redundancy. Today’s automotive E/E (electric/electronic) architectures often do not follow such a systematic top-down design approach for ultra-highly reliable systems but are instead driven by decomposition into functional domains. Subsystems and fault-containment units are often retrofitted into a finished system decomposition in a bottom-up manner. This latter approach entails a high risk that the reliability goals cannot be met and that “architectural band-aids” must be introduced, typically resulting in dangerous emergent behavior as a side effect. In this paper we review design challenges and propose a conceptual architecture for ultra-highly reliable L4 automated driving systems.
The purpose of a safety architecture
Safety, like security, is a binary property of the overall system: an autonomous car is acceptably safe, or it is not. When we develop the system, this safety property translates into top-level requirements that guide the implementation. This is typically a stepwise process in which multiple layers of requirements and designs are derived, so that we can verify and validate, step by step, that design and implementation satisfy the respective requirements. This requirements-driven development process has been proven to result in safe systems in adjacent industries, like aerospace, and it is also a common approach in the automotive domain. If safety were considered only late in system development, it would have to be introduced retrospectively by adding one safety band-aid over another, with the risk that major safety gaps remain and adequate system safety is not achieved.
Safety standards and risk awareness vary around the world. In our society, the demand for safety is high. Annual accident statistics give an indication of the current societally acceptable fatality rate. Whether an autonomous car is acceptably safe once it causes fewer fatalities than human drivers or only when it is several orders of magnitude better is a topic of ongoing debate. We think that the systems we design should be at least a hundred times better.
It is a key requirement that an autonomous car continue to operate safely even when some part of it fails. This system property is called “fail-operational”. Designing such a fail-operational system is immensely difficult because there are many different parts that may fail in many different ways; the number of things that can go wrong is enormous. The only way to manage this enormous failure space is to adhere to a safety architecture. The safety architecture defines so-called fault-containment units (FCUs), i.e., parts of the system that fail as a whole, and also defines the interactions between these FCUs. Already at this level of abstraction, the right safety architecture can ensure that the failure of any single FCU will not cause a complete system failure and that the system will remain operational.
Fail-operational systems are common in the aerospace domain, which allows us to learn from established safety architectures. However, the functional complexity of an autonomous car is unprecedented even in the aerospace domain. Although autopilot systems do exist in aerospace, self-driving cars have a much more complex task to solve, simply because of the many more dynamic objects found on a street compared to the sky. Moreover, aerospace autopilot systems must be monitored by trained pilots during critical flight phases, like landing, which means that these systems typically do not exceed the comparable Level 2 of automated driving.
During cruise, autopilot systems might be compared to Level 3 systems: the pilot does not need to monitor the autopilot but must be present in the cockpit and be ready to take over control within a couple of seconds. On the positive side, since a self-driving car operates on the ground, it can quickly enter a safe space, such as an emergency lane, once a failure is detected.
A Safety Architecture Proposal for Self-Driving Cars
Kopetz has proposed a safety architecture for self-driving cars (Kopetz, 2020), depicted in Figure 3. Automated driving systems that follow this architecture can safely replace the human driver. Quite literally, these systems generate the output that would otherwise be produced by the human driver: essentially the setpoints for the car’s speed and direction. Moreover, these systems continue to operate even in the presence of failures.
The architecture distinguishes four subsystems:
- Computer Controlled Driving Subsystem (CCDSS)
- Monitoring Subsystem (MSS)
- Critical Event Handling Subsystem (CEHSS)
- Fault-Tolerant Decision Subsystem (FTDSS)
Both the CCDSS and the CEHSS periodically produce output used to determine the behavior of the self-driving car. The MSS monitors the output of both the CCDSS and the CEHSS. The FTDSS receives the CCDSS and CEHSS outputs as well as the MSS monitoring verdicts. In the absence of failures, the CCDSS provides its output to the FTDSS, which has that output checked by the MSS. In this failure-free scenario, the MSS approves the CCDSS output and informs the FTDSS, which then forwards the MSS-approved output to the receivers (e.g., the actuators). A simple protocol in the receivers selects one output per cycle.
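The per-cycle decision described above can be summarized in a few lines. The function and type names below are illustrative assumptions for this sketch, not identifiers from the published architecture:

```python
# Conceptual sketch of one FTDSS decision cycle; names and the Verdict
# encoding are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class Setpoints:
    speed: float      # target vehicle speed
    direction: float  # target steering angle

def ftdss_select(ccdss_out: Setpoints,
                 cehss_out: Setpoints,
                 mss_verdict: Verdict) -> Setpoints:
    """Forward the CCDSS output if the MSS approved it; otherwise fall
    back to the output of the critical event handler (CEHSS)."""
    if mss_verdict is Verdict.APPROVED:
        return ccdss_out
    return cehss_out

# Failure-free cycle: the MSS approves, so the CCDSS output is forwarded.
out = ftdss_select(Setpoints(30.0, 0.0), Setpoints(0.0, 0.0), Verdict.APPROVED)
```

The essential point is that the FTDSS never invents an output of its own: it only arbitrates between the two periodically produced candidates.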
The CCDSS, the MSS, and the CEHSS each form one fault-containment unit (FCU). This means that if some portion of a subsystem fails, we consider the complete subsystem faulty. For example, if the CCDSS were realized as a stand-alone electronic control unit (ECU) containing a special-purpose chip for object recognition, then in the event of this chip failing, the complete ECU is considered faulty.
A safety architecture must also define the FCUs’ failure behavior. In our case, the CCDSS, MSS, and CEHSS can fail arbitrarily: such a faulty subsystem may send any sequence of messages on its interface to the FTDSS. The FTDSS itself is composed of two FCUs with restricted failure behavior, which can be established by common fail-safe technologies like lock-step mechanisms.
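The lock-step idea mentioned above can be sketched functionally: two replicas compute the same function on the same input, and the unit emits an output only when both agree. In real ECUs the comparison happens in hardware at the instruction or bus level; this simplified sketch, with illustrative names, only shows how agreement-or-silence turns an arbitrary internal fault into a detectable omission:

```python
# Minimal functional sketch of a lock-step self-checking pair; in practice
# the comparison is done in hardware, not in application code.
from typing import Callable, Optional

def lockstep_step(replica_a: Callable[[int], int],
                  replica_b: Callable[[int], int],
                  x: int) -> Optional[int]:
    """Emit the computed value only if both replicas agree; otherwise
    stay silent (return None), i.e., fail in a restricted way."""
    a, b = replica_a(x), replica_b(x)
    return a if a == b else None

healthy = lockstep_step(lambda v: v + 1, lambda v: v + 1, 41)  # agreement: 42
faulty = lockstep_step(lambda v: v + 1, lambda v: v - 1, 41)   # mismatch: silence
```

This restricted (fail-silent) behavior is what allows the rest of the architecture to treat an FTDSS FCU failure as a simple omission rather than an arbitrary fault.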
Even in this rather simple architecture, there are many different failure scenarios. In some scenarios, for example, the CCDSS fails to produce a safe output and the MSS detects this failure; the FTDSS then forwards the output of the CEHSS rather than that of the CCDSS. Because there are so many failure scenarios, manual inspection of them all is cumbersome and error-prone: scenarios can easily be overlooked or interpreted incorrectly. Thus, we have used model checking to exhaustively explore all possible failure scenarios, an approach we introduced almost twenty years ago in the context of network protocol verification (Steiner et al., 2004).
Using exhaustive fault simulation, we showed that a system implementing the Kopetz architecture is guaranteed to satisfy the following properties as long as at most one FCU fails:
- consistency: all non-faulty receivers will use the same values and these values will be safe;
- termination: all non-faulty receivers will use some values at the end of each cycle; and
- non-triviality: all non-faulty receivers will only use output provided by the FTDSS FCUs as values.
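The flavor of such an exhaustive fault simulation can be conveyed with a toy model. The model below is a deliberately simplified, assumption-laden sketch (it is not the model-checking setup of the cited work): each FCU either behaves or fails, arbitrary faults are explored over their possible misbehaviors, FTDSS FCUs fail silently, and the properties are checked for every single-fault scenario:

```python
# Toy exhaustive single-fault simulation of the four-subsystem architecture.
# All modeling choices here are illustrative simplifications.
import itertools

FCUS = ["CCDSS", "MSS", "CEHSS", "FTDSS_A", "FTDSS_B"]

def cycle(faulty, ccdss_bad_value, mss_bad_verdict):
    """Simulate one cycle; the *_bad_* flags enumerate the possible
    misbehaviors of an arbitrarily faulty CCDSS or MSS."""
    ccdss = "unsafe" if (faulty == "CCDSS" and ccdss_bad_value) else "safe"
    cehss = "unsafe" if faulty == "CEHSS" else "safe"  # fallback output
    # A healthy MSS approves exactly the safe outputs; a faulty MSS is arbitrary.
    approved = mss_bad_verdict if faulty == "MSS" else (ccdss == "safe")
    chosen = ccdss if approved else cehss
    # Fail-silent FTDSS FCUs: a faulty one delivers nothing this cycle.
    delivered = [chosen if faulty != f else None for f in ("FTDSS_A", "FTDSS_B")]
    return [d for d in delivered if d is not None]

for faulty in [None] + FCUS:
    for bad_value, bad_verdict in itertools.product([False, True], repeat=2):
        received = cycle(faulty, bad_value, bad_verdict)
        assert received, "termination: some output arrives every cycle"
        assert len(set(received)) == 1, "consistency: receivers agree"
        # A faulty CEHSS output is never used, because under a single fault
        # the CCDSS is then healthy and its output is approved.
        assert received[0] == "safe", "consistency: the used value is safe"
```

Even this toy model already covers dozens of scenario combinations, which hints at why manual inspection does not scale and exhaustive, tool-supported exploration is needed.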
It is worth noting that the proposed architecture enables implementation of a simple protocol in the receivers to pick one of the potentially many outputs from the different FTDSS FCUs per cycle. It is simple in the sense that the receivers do not need to exchange information among themselves to achieve the consistency property defined above. Typically, two rounds of communication would be required, but the structure of the architecture simulates these two rounds: a first round of communication from the CCDSS, MSS, and CEHSS to the FTDSS FCUs, and a second round from the FTDSS FCUs to the receivers. In that regard, the architecture shares similarities with NASA’s ROBUS architecture (Miner et al., 2002).
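Because the FTDSS FCUs fail only in a restricted (fail-silent) way, a receiver can apply a very simple per-cycle rule. The sketch below is one plausible such rule under that assumption; in a real system the validity check would be, e.g., a CRC or range check:

```python
# Hedged sketch of a receiver-side selection rule: per cycle, take the
# first output actually delivered by any FTDSS FCU. This works without
# receiver-to-receiver communication only because faulty FTDSS FCUs are
# assumed fail-silent, so all delivered outputs are identical.
from typing import Optional, Sequence

def select_output(fcu_outputs: Sequence[Optional[bytes]]) -> Optional[bytes]:
    """Return the first delivered output of the cycle, or None if no
    FTDSS FCU delivered one (the receiver then holds a safe state)."""
    for out in fcu_outputs:
        if out is not None:
            return out
    return None
```

If the FTDSS FCUs could fail arbitrarily, this first-wins rule would not suffice and the receivers would indeed need an agreement round among themselves.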
The Way Forward: Standardized Safety Architectures
Safety standards can have two roles: giving guidance on how to build a safe system and enabling comparison of different solutions. In particular, a standardized safety architecture can define a minimum quality level and thereby prevent safety shortcuts. Of course, this will only be the case if the standard is indeed accepted and adopted by the automotive industry. In response to these safety requirements, the Kopetz architecture has been proposed in the context of The Autonomous (https://www.the-autonomous.com/resources), an open platform shaping the future of safe autonomous mobility by building an ecosystem of players involved in safe autonomous mobility development to align on relevant safety subjects.
Kopetz, Hermann. “An Architecture for Driving Automation.” 2020.
Steiner, Wilfried, John Rushby, Maria Sorea, and Holger Pfeifer. “Model Checking a Fault-Tolerant Startup Algorithm: From Design Exploration to Exhaustive Fault Simulation.” In: International Conference on Dependable Systems and Networks, pp. 189-198. IEEE, 2004.
Miner, Paul S., Mahyar Malekpour, and Wilfredo Torres. “A Conceptual Design for a Reliable Optical Bus (ROBUS).” In: Proceedings of the 21st Digital Avionics Systems Conference, vol. 2, pp. 13D3-13D3. IEEE, 2002.