Chips are the brains of everything electronic. And like anything generated by software programs and high-tech factories, they can have flaws. Intel learned this the hard way today as it announced that it had a bug in a companion chip for its popular Sandy Bridge graphics-processor chips for PCs. The mistake will cost the company $300 million in lost sales and $700 million in repairs — making it the biggest such problem the company has ever seen.

It’s certainly a huge shock to the supply chain for personal computers, and it shows just how vulnerable the electronics industry is, given its reliance on a small number of very large suppliers. If this had happened to any other company in the supply chain, it could have caused some serious disruption. But Intel has been able to figure out the problem and recover quickly from it. Still, it raises the question: Why are these chip flaws so deadly?

One reason is Intel’s sheer size. The company makes about 80 percent of the processors used in personal computers. It has three major chip factories manufacturing its mainstay chips at any given time, and is in the process of adding a fourth. Those factories can churn out chips by the millions per quarter. If there is a mistake in the chip design and it isn’t caught early, the flaw can propagate through tens of millions of chips very quickly. The sheer volume of shipments is the reason why this $1 billion flaw is the biggest in Intel’s history, compared to the $400 million Pentium bug in 1994.
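The arithmetic behind the $1 billion figure can be sketched in a few lines of Python. This is only an illustration: the dollar amounts are the ones quoted in this article, and the variable and function names are made up for the example.

```python
# Figures quoted in the article, in millions of US dollars.
LOST_SALES_M = 300        # lost sales from the delay
REPAIR_COST_M = 700       # repair and replacement costs
PENTIUM_BUG_1994_M = 400  # cost of the 1994 Pentium bug


def total_cost_millions(lost_sales: int, repairs: int) -> int:
    """Add the two cost components reported for the chipset flaw."""
    return lost_sales + repairs


total = total_cost_millions(LOST_SALES_M, REPAIR_COST_M)
ratio = total / PENTIUM_BUG_1994_M

print(f"Total: ${total} million, {ratio:.1f}x the 1994 Pentium bug")
# prints: Total: $1000 million, 2.5x the 1994 Pentium bug
```

By these numbers, the new flaw costs two and a half times what the Pentium bug did.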

Intel dodged a bullet in a big way here. The flaw is in a chipset, a companion chip that handles input-output functions in a PC. It is made in one of the company’s five-year-old factories, not one of its brand-new ones. The chipset goes with Sandy Bridge, which combines microprocessor and graphics functions in a single chip. Sandy Bridge has been hugely successful for Intel and has been designed into more than 500 computer designs. Luckily, the company caught the problem relatively early. By comparison, the Pentium chip had already been shipping for some time before a math bug was brought to Intel’s attention back in 1994.

Stephen Smith, an Intel vice president, said in an interview that about 8 million of the flawed chips have already been shipped to PC manufacturer customers and perhaps only 500,000 are in the hands of users. That’s actually less than a day’s worth of PC shipments, in the grand scheme of things.

Another reason that chip flaws can be disastrous is that it takes so long to make each chip from beginning to end. The typical process takes about two to three months, starting with a bare circular wafer of silicon. Chip makers build structures on top of the wafers by adding and erasing layers. The chips are baked at high temperatures, cleaned, and washed. Then they are shipped to another assembly factory where they are sliced into individual chips and packaged. The process involves dozens of steps.

Smith said that Intel lucked out in another sense. The flaw can be fixed in the upper metal layers of the process. That means Intel can make the change to chips that are already partway through the factory, altering only some of the last layers added in the process, and still get those chips shipping to customers within about four weeks.

“When we looked at the issue and the engineers understood the problem, we saw we could make an easy fix in the upper metal layers, rather than deal with changes to the entire production line,” Smith said.

Since there are hundreds of millions or sometimes billions of electronic components on each multi-layer chip, the complexity of today’s chips is enormous. A typical design will look like a multi-level city on the scale of Manhattan. Each chip takes the work of hundreds of skilled engineers, each working with software that automates much of the design process.

In this case, Smith said there was a circuit-level error in a couple of the connections between a few transistors, which are the basic building blocks of electronic circuitry. Intel discovered the flaw fairly quickly, after receiving returned units from a PC maker that saw failures in some machines shipped to customers.

Anandtech described the flaw in more detail here: “The problem in the chipset was traced back to a transistor in the 3 gigabit per second PLL clocking tree. The aforementioned transistor has a very thin gate oxide, which allows you to turn it on with a very low voltage. Unfortunately in this case Intel biased the transistor with too high of a voltage, resulting in higher than expected leakage current. Depending on the physical characteristics of the transistor the leakage current here can increase over time which can ultimately result in this failure on the 3Gbps ports. The fact that the 3Gbps and 6Gbps circuits have their own independent clocking trees is what ensures that this problem is limited to only ports 2 – 5 off the controller.”

Sometimes, changing software will enable programs to get around the flaw. But that still takes a lot of work, since it means software companies have to issue patches for every piece of software that runs on the chips. Some types of chips allow for software updates as well, making changes to the microcode that runs on a chip or to the BIOS (startup software) of a PC already out in the field. But that’s not the case with the Intel chip set, code-named Cougar Point.

A chip set serves as a traffic cop for the PC, handling tasks that the main Sandy Bridge chip doesn’t, such as input-output functions. In this case, the Cougar Point chip set had a flaw. The chip set’s ability to handle Serial-ATA (SATA) ports could potentially degrade over time. That would mess up hard disk drives or DVD drives in the PC. Since those systems are critical, Intel had to fix the hardware.

To fix the problem, Intel’s engineers have to fix the design with their automated design software. They then simulate the new design to make sure they don’t create any new errors. Then the circuit design has to be re-translated to a physical layout, which is like going from a construction blueprint to an actual guide for a tool handler. After that translation is done and checked for errors, the changes have to propagate through the production system, including changing the masks, or templates, that are used to print patterns on the surfaces of chips. Intel had to change some of its masks here, but it didn’t have to start over from scratch.

The reason that the delay is so expensive, amounting to $300 million in lost revenue for the quarter, is that Intel can’t ship its Sandy Bridge microprocessors — which have been designed into more than 500 computers — if it can’t ship the chip sets that go with them. Currently, Intel is making quad-core Sandy Bridge processors and the company is delaying the launch of its dual-core Sandy Bridge processors as it works through the fix. Also, Intel has to gradually ramp up production for a period of time before it starts running on all cylinders again.

The $700 million repair cost suggests there is a lot of inventory already in the hands of PC customers. But manufacturers have a few options in how they fix the problem. The Serial-ATA ports on the 6 Series chip set can degrade over time. Over three years, there’s a five percent chance a failure will happen. If a PC manufacturer uses only ports 0 and 1 in its machine designs, it can actually ship PCs with the flawed chips, since only ports 2 through 5 are subject to failure. If PC manufacturers want to replace their chip sets entirely, then that is where Intel incurs replacement costs.
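The choice facing a PC maker can be sketched as a simple check in Python. The port numbers and the five percent figure come from this article. The function names are hypothetical, and the failure estimate is a rough upper bound that assumes every unit in users' hands relies on an affected port, which the article does not state.

```python
# Port numbers from the article: ports 0 and 1 are unaffected,
# while ports 2 through 5 can degrade over time.
AFFECTED_PORTS = frozenset(range(2, 6))
FAILURE_PERCENT_3_YEARS = 5  # "five percent chance" over three years


def design_is_exposed(ports_in_use):
    """Return True if a (hypothetical) PC design wires up any affected port."""
    return any(port in AFFECTED_PORTS for port in ports_in_use)


def expected_failures(units, failure_percent=FAILURE_PERCENT_3_YEARS):
    """Rough upper bound, assuming every unit uses an affected port."""
    return units * failure_percent / 100


print(design_is_exposed([0, 1]))     # False: can ship with the flawed chip
print(design_is_exposed([0, 1, 2]))  # True: needs a replacement chip set
print(expected_failures(500_000))    # 25000.0
```

On those assumptions, the roughly 500,000 machines already with users would see at most about 25,000 failures over three years.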

It could have been much worse for Intel, given how many chips it ships. The company is lucky that the problem was discovered early. The bottom line is that fixing a chip design error is like remaking a stone carving, since changing hardware isn’t nearly so easy as fixing software.
