Recently, we have seen evidence in our shop of a disturbing fault trend that merits wider discussion in the avionics community. My coworkers and I happen to be involved in a great deal of test equipment overhaul. We also supply parts for that purpose. As a result, we see a cross-section of equipment problems from all over the world, which gives us a chance to extract useful trends and patterns.
One pattern is the steadily increasing number of failures of programmable read-only memories (PROMs) and erasable PROMs (EPROMs) in equipment typically made since the 1980s. We are used to seeing failures in earlier programmable devices, especially 25xx-series EPROMs and early PROMs that were not technologically mature or stable. The new pattern of failures involves mature and standardized parts like 27xx-series EPROMs and corresponding PROMs, programmed by top-of-the-line personnel and used in equipment by major manufacturers.
Data loss within the EPROM can cause degraded performance and catastrophic equipment failure. It may be progressive, too, leading to increasingly broad failure over time. Significantly, these parts had been operated under easy temperature and stress conditions. They had been handled expertly at all stages of assembly and use, and had been optically protected, if ultraviolet (UV) erasable. Still, they failed, and within a window of less than 20 years. What does that mean for parts operating in a much harsher environment?
I have long been skeptical of more recent electrically erasable programmable read-only memory (EEPROM) technology as a stable, long-term data storage device in avionics systems. I’ve seen far too many cycle-, static- and temperature-induced failures to be confident in this widely used technology. However, these failing EPROMs and PROMs, from mainstream makers of mature parts, supposedly are much more reliable and stable than EEPROMs. Yet none of the three technologies is proving to be reliable over time. It is significant that in these systems, neither the central processing units (CPUs), nor the logic, passive or discrete parts are failing. The programmable parts are what fail.
The implications for avionics manufacturers are significant. Many airborne systems use firmware programming in EPROMs to hold both data and executable instructions. In addition, more and more systems use field-programmable gate array (FPGA) and programmable array logic (PAL) parts to replace logic configurations. Most of this technology is based on EPROM cell designs and has the same potential for data loss, which will be expressed as a change in logic function.
We all have considered part failures in our analysis of systems. But it is much harder, if not impossible, to factor in unspecific or random data loss and its implications for system operation. Only a few weak defenses against this problem exist, other than periodic replacement. But they are well worth incorporating into any design, even though they are not perfect.
EPROM or PROM data can be summarized in a checksum (a total of all contents), and this value can be stored in a specific location. At program execution, the checksum can be recomputed from the contents and compared with the stored value. If the two values do not agree, the operation can be halted and the equipment flagged. Unfortunately, this technique fails if the control portion that performs the check is itself flawed or if other failures prevent the check from running, but it can be a useful tool to catch degraded data or program areas. It cannot restore system operation, but it can block operation of a known faulty system.
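As a rough illustration of the checksum test described above, the following Python sketch appends a 16-bit additive checksum to a memory image and verifies it later. The function names, the 16-bit width and the layout (checksum stored big-endian in the last two bytes) are assumptions made for this example, not a description of any particular product's firmware.

```python
CHECKSUM_BYTES = 2  # assumed width: 16-bit checksum stored at the end of the image


def additive_checksum(data: bytes) -> int:
    """Total of all byte values, truncated to 16 bits."""
    return sum(data) % (1 << (8 * CHECKSUM_BYTES))


def build_image(payload: bytes) -> bytes:
    """Return the payload with its checksum appended (hypothetical layout)."""
    return payload + additive_checksum(payload).to_bytes(CHECKSUM_BYTES, "big")


def image_is_intact(image: bytes) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    if len(image) < CHECKSUM_BYTES:
        return False
    payload = image[:-CHECKSUM_BYTES]
    stored = int.from_bytes(image[-CHECKSUM_BYTES:], "big")
    return additive_checksum(payload) == stored


if __name__ == "__main__":
    image = build_image(bytes(range(64)))
    print("fresh image intact:", image_is_intact(image))

    # Simulate a single bit lost in one storage cell.
    damaged = bytearray(image)
    damaged[10] ^= 0x04
    print("damaged image intact:", image_is_intact(bytes(damaged)))
```

A plain total will miss some faults, such as two changes that cancel each other out, so a cyclic redundancy check (CRC) is a common, stronger alternative where code space allows.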
Redundant data storage and selection algorithms can improve the chances of good data. But, again, a single fault in a control instruction can prevent this from working, so it may not prove useful. When line replaceable units (LRUs) come in for periodic overhaul, a checksum test of each EPROM using a good device programmer would be worthwhile, especially if the unit has no internal checksum test.
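One possible form of the redundant storage and selection idea mentioned above is to keep three copies of the data and take a bit-by-bit majority vote. The Python sketch below shows that approach. The three-copy arrangement and the function names are assumptions for illustration only, and the voting code itself remains exposed to the control-fault weakness already noted.

```python
def majority_vote(copy_a: bytes, copy_b: bytes, copy_c: bytes) -> bytes:
    """Return the bitwise majority of three equal-length copies."""
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("all three copies must be the same length")
    # A result bit is 1 when at least two of the three copies have it set.
    return bytes(
        (a & b) | (a & c) | (b & c)
        for a, b, c in zip(copy_a, copy_b, copy_c)
    )


def count_disagreements(copy_a: bytes, copy_b: bytes, copy_c: bytes) -> int:
    """Count byte positions where the three copies are not identical."""
    return sum(
        1 for a, b, c in zip(copy_a, copy_b, copy_c) if not (a == b == c)
    )


if __name__ == "__main__":
    good = bytes([0x12, 0x34, 0x56, 0x78])

    # Simulate unrelated bit losses in two of the three copies.
    first = bytearray(good)
    first[0] ^= 0x10
    second = bytearray(good)
    second[2] ^= 0x01

    voted = majority_vote(bytes(first), bytes(second), good)
    print("recovered:", voted == good)
    print("positions to flag:", count_disagreements(bytes(first), bytes(second), good))
```

Voting recovers the data only while no two copies are damaged in the same bit position, so any disagreement is worth logging as an early warning that the part is degrading.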
The thought to take away from this column is a simple one: silicon digital storage has content and quality problems that inevitably will emerge. Under avionics conditions of elevated or reduced temperatures and of radiation (high altitude or in space), these devices will fail more often, and the problem cannot be safely ignored in a system’s design and operation.
There are also transient problems with silicon data storage, including:
Specific pattern sensitivity,
Upset by static discharge or power supply fluctuations,
Momentary or permanent faults induced by ionizing radiation, and
Total or partial loss of program storage caused by imperfect prior erasure or programming.
As integrated circuit features become finer in pitch, these failure modes become much more pronounced. This can become aggravating, as chip makers often shrink parts without advising customers or changing part numbers. The failed parts can’t be rebuilt to restore equipment operation because the source files no longer exist.
Walter Shawlee 2 may be reached by e-mail at [email protected].