
Why 10% of Firefox Crashes Come From Cosmic Rays

Mozilla traced 10% of unexplained Firefox crashes to single-bit memory errors caused by cosmic rays. Here's why this matters more than you think.

software-engineering, reliability, hardware, firefox, cosmic-rays

Mozilla published a fascinating analysis this week. After years of investigating unexplained Firefox crashes that don't correspond to any software bug, they've concluded that roughly 10% of these mystery crashes are caused by single-bit errors in RAM. Bitflips. Caused, in many cases, by cosmic rays.

Yes. Subatomic particles from outer space are crashing your browser.

This sounds absurd. It's also completely real and has significant implications for how we think about software reliability, especially as we build increasingly complex systems that run on consumer hardware.

How a Cosmic Ray Crashes Your Browser

Let me explain the physics briefly, because it's genuinely wild.

Cosmic rays are high-energy particles that originate from supernovae, black holes, and other violent cosmic events. They travel across the universe at nearly the speed of light. When they hit Earth's atmosphere, they create showers of secondary particles, including neutrons.

These neutrons occasionally strike silicon atoms in your computer's memory chips. When a neutron hits a silicon atom at the right angle with enough energy, it can knock electrons loose. If this happens in a memory cell that's storing a bit of data, it can flip that bit from a 0 to a 1 or vice versa.

One bit. One flip. And suddenly the number 42 becomes 170. A pointer that referenced a valid memory address now points to garbage. A boolean that was true is now false. A branch that should have gone left goes right.
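In code, a single-event upset is nothing more exotic than an XOR with a one-bit mask. A minimal sketch (the helper function is mine, purely for illustration):

```python
def flip_bit(value: int, bit: int) -> int:
    """Invert one bit of an integer, as a stray neutron might in DRAM."""
    return value ^ (1 << bit)

print(flip_bit(42, 7))  # 170: 0b00101010 becomes 0b10101010
print(flip_bit(1, 0))   # 0: a boolean that was true is now false
```

Note that flipping the same bit twice restores the original value, which is exactly why this corruption is so hard to catch: nothing about the flipped value looks broken.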

The program doesn't know this happened. There's no error. No exception. The hardware didn't report a fault. From the software's perspective, a variable just silently changed its value for no reason.

If that bit was in a region of memory that doesn't matter much (like a pixel in an image buffer), you might never notice. If it was in a critical data structure, a function pointer, or a security check, the result ranges from a crash to a security vulnerability.

The Scale of the Problem

How often does this happen? More than you'd think.

The standard estimate for consumer DRAM is about one bitflip per gigabyte per month under normal conditions. Google published research showing even higher rates in their data centers: a 2009 study measured error rates of 25,000 to 70,000 FIT per megabit, where one FIT is one failure per billion device-hours of operation. That's not negligible.

Your laptop has 16 or 32 GB of RAM. Running 24/7, that's potentially dozens of bitflips per month. Most of them will land in unused memory or data that's overwritten before it matters. But some percentage will hit live data structures in running programs.
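The arithmetic is worth making concrete. A back-of-envelope sketch, assuming the rule-of-thumb lab rate above and treating a FIT as one failure per billion device-hours (the constants are illustrative, not measurements):

```python
HOURS_PER_MONTH = 730   # ~24 * 365 / 12
MBIT_PER_GB = 8 * 1024  # 8,192 megabits per gigabyte

def flips_per_month(gb_ram: float, flips_per_gb_month: float = 1.0) -> float:
    """Expected bitflips per month at the rule-of-thumb lab rate."""
    return gb_ram * flips_per_gb_month

def flips_per_month_fit(gb_ram: float, fit_per_mbit: float) -> float:
    """Same estimate from a FIT rate (one FIT = one failure per 10^9 device-hours)."""
    return gb_ram * MBIT_PER_GB * fit_per_mbit * HOURS_PER_MONTH / 1e9

print(flips_per_month(16))             # 16.0 -- "dozens per month" territory
print(flips_per_month_fit(16, 25000))  # ~2392 -- field-measured rates are far higher
```

The gap between the two numbers is the gap between lab estimates and what large-scale field studies actually observe.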

Firefox is one of the most commonly running programs on consumer machines. It uses substantial amounts of RAM. Run the numbers and it makes sense that a meaningful percentage of unexplained crashes trace back to hardware bit errors.

Mozilla's analysis was clever. They looked at crash dumps where the program state was internally inconsistent in ways that no software bug could produce. Values that were impossible given the code path that led to them. Memory corruption patterns that didn't match any known exploit or buffer overflow. They cross-referenced these with memory hardware reports and altitude data (cosmic ray intensity increases with altitude, so users at higher elevations should experience more bitflips).

The correlation held. Users at higher altitudes had disproportionately more of these mystery crashes. The pattern was consistent with cosmic ray-induced bitflips.

Why ECC Memory Matters

There's a solution to this problem. It's called ECC (Error-Correcting Code) memory. ECC RAM adds extra bits to each memory word that allow the hardware to detect and correct single-bit errors automatically. A cosmic ray flips a bit, the ECC logic catches it, corrects it, and the software never knows anything happened.
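Real ECC DIMMs typically implement a SECDED code (single-error correct, double-error detect) over 64-bit words; the classic Hamming(7,4) code shows the same mechanism at toy scale. A sketch, not production ECC logic:

```python
def encode(nibble: int) -> list[int]:
    """Hamming(7,4): data bits at positions 3,5,6,7; parity bits at 1,2,4."""
    c = [0] * 8  # index 0 unused; positions 1..7
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]  # covers positions whose index has bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]  # covers positions whose index has bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]  # covers positions whose index has bit 2 set
    return c[1:]

def decode(bits: list[int]) -> int:
    """Correct any single-bit error, then recover the 4 data bits."""
    c = [0] + list(bits)
    syndrome = (c[1] ^ c[3] ^ c[5] ^ c[7]) \
             | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1 \
             | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2
    if syndrome:                # a nonzero syndrome names the flipped position
        c[syndrome] ^= 1
    return c[3] | c[5] << 1 | c[6] << 2 | c[7] << 3

word = encode(0b1011)
word[4] ^= 1                    # a "cosmic ray" flips the bit at position 5
assert decode(word) == 0b1011   # the ECC logic corrects it transparently
```

The syndrome computation is the whole trick: each parity bit covers a different subset of positions, so the pattern of parity failures spells out the address of the flipped bit.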

ECC memory is standard in servers and data centers. It's standard in workstations. It's slowly becoming more common in higher-end consumer hardware. But most laptops and desktops still ship with non-ECC RAM.

The reason is cost and marketing. ECC memory is slightly more expensive. It requires memory controllers that support it (Intel has historically limited ECC to their Xeon line, though AMD has been more permissive). Consumers don't know to ask for it. PC manufacturers don't advertise it.

This drives me nuts. We're running increasingly complex software, making increasingly important decisions based on computer output, and the underlying hardware can silently corrupt data at a rate of multiple times per month. We know the solution. We've known it for decades. We just don't deploy it in consumer hardware because the market doesn't demand it.

Linus Torvalds has been ranting about this for years. He called the industry's failure to adopt ECC memory universally "one of the absolute worst decisions the industry has made." He's right.

The Software Reliability Angle

Beyond ECC, Mozilla's findings raise uncomfortable questions about software reliability in general.

We test software against bugs in the code. We test against invalid inputs. We test against network failures and disk errors. We almost never test against random bit corruption in memory. Because we assume the hardware is reliable. It mostly is. But "mostly" is doing a lot of work in that sentence.

For mission-critical systems (avionics, medical devices, financial trading), this is well-understood. These systems use radiation-hardened chips, ECC everywhere, redundant computation, and voting algorithms to handle hardware faults. The techniques exist. They're just not applied to consumer software.
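Voting is the simplest of these techniques to sketch. A toy triple-modular-redundancy wrapper, with the caveat that re-running on one machine only masks transient faults; real avionics run three physical units in parallel and vote across them:

```python
from collections import Counter

def vote(run, *args):
    """Triple modular redundancy: execute three times, majority wins.
    Masks one corrupted result; raises if no two replicas agree."""
    results = [run(*args) for _ in range(3)]
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica failed")
    return value

print(vote(lambda x: x * 2, 21))  # 42
```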

Should they be? For a browser, probably not. A Firefox crash is annoying, not dangerous. But what about AI models running inference on consumer hardware? What about self-driving car systems? What about medical AI processing diagnostic images?

A single bitflip in a neural network's weights during inference could change a prediction. The model would produce a different output with full confidence, and nobody would know the hardware had corrupted it. For a chatbot, that's a weird response. For a medical diagnostic, that's a potential misdiagnosis.
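The fragility is easy to demonstrate with floating point, where the damage depends entirely on which bit gets hit. A sketch using Python's struct module to mimic a float32's in-memory layout:

```python
import struct

def flip_float32_bit(x: float, bit: int) -> float:
    """Flip one bit of a value's float32 representation."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out

print(flip_float32_bit(0.5, 0))   # ~0.50000006: a mantissa flip barely matters
print(flip_float32_bit(0.5, 30))  # ~1.7e38: an exponent flip is catastrophic
```

A weight nudged in its mantissa is noise; a weight launched to 10^38 can swamp an entire layer's output, and the model reports its answer with the same confidence either way.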

What I Think About When I Think About Cosmic Rays

There's something humbling about this. We build these incredibly complex software systems. Millions of lines of code. Sophisticated algorithms. Rigorous testing. And then a particle that has been traveling through space for millions of years, from an exploding star in a galaxy we'll never see, smacks into a silicon atom in your laptop and crashes your browser.

We tend to think of computers as deterministic machines. Input determines output. Same code, same input, same result. Bitflips break that assumption. Your computer is a physical object in a physical universe, subject to physical processes that software can't control or predict.

For most purposes, this doesn't change how we write software. The bitflip rate is low enough that the vast majority of computation is reliable. But at scale (millions of users, billions of compute hours), these rare events become statistical certainties. Some percentage of your crashes will always be hardware. Some percentage of your bugs will be unfixable in software.

The pragmatic takeaway: if you're running anything that matters, use ECC memory. If you're building anything safety-critical, assume the hardware will corrupt data and design accordingly. And if Firefox crashes for no apparent reason, maybe don't blame the software team. A star exploded, and your browser caught the shrapnel.