There is no fix for Intel’s crashing 13th and 14th Gen CPUs — any damage is permanent

mox@lemmy.sdf.org · 10 months ago

tal · edited 10 months ago

To put this another way, Intel had at least three serious failures that let the problem reach this level:

1. A manufacturing defect that led to the flawed CPUs being produced in the first place.
2. A QA failure to detect the flawed CPUs initially, or to quickly narrow down the likely and certain scope of the problem once the issue arose. Not to mention that a second generation of chips with the defect went out the door, I can only assume (and hope) without QA having identified that they were also affected.
3. A customer care failure, in that Intel did not promptly and publicly give customers information that it either had or should have had: the likely scope of the problem, mitigations, and, at least within some bounds of uncertainty, what Intel would do for affected customers (“if it can be proven that the problem is due to an Intel manufacturing defect on a given processor, for some definition of proven, Intel will provide a replacement processor”). A lot of customers spent a lot of time duplicating effort trying to diagnose and address the problem at their level, and kept buying and using the defective CPUs. It is almost certain that some of that was not necessary.

The manufacturing failure sucks, fine. But it happens. Intel’s pushing physical limits. I accept that this kind of thing is just one thing that occasionally happens when you do that. Obviously not great, but it happens. This was an especially bad defect, but it’s within the realm of what I can understand and accept. AMD just recalled an initial batch of new CPUs (albeit way, way earlier in the generation than Intel)…they dicked something up too.

I still don’t understand how the QA failure happened to the degree that it did. Like, yes, it was a hard problem to identify, since it was progressive degradation that took some time to arise, and there were a lot of reasons for other components to potentially be at fault. And CPUs are a fast moving market. Maybe you can’t run a new gen of CPU for weeks or months prior to shipping. But for Intel to not have identified, at least after release, that they had a problem with the 13th gen within certain parameters, and then to not have held up the 14th gen until it was definitely addressed, seems unfathomable to me. Like, does Intel not have a number of CPUs that they just keep hot and running to see if there are aging problems? Surely that has to be part of their QA process, right? I used to work for another PC component manufacturer and while I wasn’t involved in it, I know that they definitely did that as part of their QA process.

But as much as I think that that QA failure should not have happened, it pales in comparison to the customer care failure.

Like, there were Intel customers who kept building systems with components that Intel knew or should have known were defective. For a long time, Intel did not issue a public warning saying “we know that there is a problem with this product”. They did not pull known defective components from the market, which means that customers kept sinking money into them (and resources trying to diagnose and otherwise resolve the issues). Intel did not issue a public statement about the likely-affected components, even though they were probably in the best position to know. Again, they let customers keep building them into systems. They did not issue a statement as to what Intel would do (and I’m not saying that Intel has to conclusively determine that this is an Intel problem, but at least say “if this is shown to be an Intel defect, then we will provide a replacement for parts proven to be defective due to this cause”). They did not issue a statement telling Intel customers what to do to qualify for any such program. Those are all things that I am confident that Intel could have done much earlier and which would have substantially reduced how bad this incident was for their customers. Instead, their customers were left in isolation to try to figure out the problems individually and come up with mitigations themselves. In many cases, manufacturers of other parts were blamed, and money was spent buying components unnecessarily, or trying to run important services on components that Intel knew or should have known were potentially defective. Like, I expect Intel, whatever failures happen at the manufacturing or QA stages, to get the customer care done correctly. I expect that to happen even if Intel does not yet completely understand the scope of the problem or how it could be addressed. And they really did not.

toddestan@lemm.ee · 10 months ago

I’d argue there was a fourth serious failure, and that was Intel allowing the motherboard manufacturers to go nuts and run these chips way out of spec by default. Granted, ultimately it was the motherboard manufacturers that did it, but there’s really no excuse for what these motherboards were doing by default. Yes, I get the “K” chips are unlocked, but it should be up to the user to choose to overclock their CPU and how they want to go about it. To make matters worse, a lot of these motherboards didn’t even have an easy way to put things back into spec - it was up to you to go through all the settings one by one and set them correctly.