After some expensive failed experiments in building a dedicated MythBox out of a Mini-ITX form factor system, I decided what I really wanted was a quiet machine that would do everything. That is, a machine I could just leave on all the time, downloading, serving, processing, recording... anything. The problem I had in attempting to do this with my regular desktop machine was that it was (a) situated only a couple of metres from my bed, and (b) loud. Loud enough to make sleeping anywhere near it uncomfortable.
With the new goal to fix the problem of noise, I bought a new case on my way back from visiting my parents over the Christmas break: an Antec P180. I also bought a new, quiet PSU: an Antec TruePower Trio 550W.
As I work during the day (starting a new year in the lab), all of the following happened over several consecutive nights. I had to spend a couple of nights disassembling everything from the old case and reassembling everything in the new case. It wasn't until some days later that I was finally able to check not only whether my machine had survived the trip home (which, honestly, was not really anything to worry about), but also whether it all worked inside the new case. As you can imagine, I was pretty confident it would.
Spoiler: Filesystem corruption.
At first it didn't boot completely, but stalled partway through. Strange, I thought, that it would stall like it did. I tried a few times. Sometimes I could get as far as the KDE desktop. Sometimes it wouldn't quite make the login screen. Sometimes it didn't even get as far as booting X.
It looked like a problem getting data from the hard drive. Every attempt, it spoke of ATA command timeouts, Buffer I/O errors, and ATA abnormal status 0xD0. Buggered if I know what that means, though. I kept rebooting and retrying and rereading, to try to figure out just what in the hell was going wrong.
At some point something totally screwed up and scattered a bunch of erroneous bits over the 250 GB hard drive—or at least the main partition of it. (The hard drive also hosted a small WinXP partition for gaming.) Then Linux became unbootable.
You know that feeling you get when you lean on a chair just that little bit too far and it starts to fall? Just before you frantically re-balance yourself? “Oh, hell...”
At this point I figured the best immediate course of action was to leave the bloody thing turned off until I could figure out the next step. I used the base system from another project that didn't go so well, plugged the 250 GB hard drive in, and did what I could with fsck to salvage and back up everything onto a separate 400 GB hard drive (the one I had originally intended to store recorded TV on).
The fsck managed to recover many files but, unfortunately, it had no idea what those files were, what they should have been called, or where they should have been located in the filesystem hierarchy. In total I was left with a couple of hundred files all neatly numbered and tucked under /lost+found, including files that used to constitute /sbin/init.
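For anyone facing a similar pile of anonymous files, a first pass at sorting them can be sketched in a few lines of Python. This is an illustration, not what I actually ran: it peeks at the first few bytes of each file and compares them against a handful of well-known signatures. The function names and the short signature list are my own choices for the example, and the standard `file` utility does the same job far more thoroughly.

```python
from pathlib import Path

# A few well-known leading byte signatures ("magic numbers").
# Deliberately short; the `file` utility knows thousands more.
SIGNATURES = [
    (b"\x7fELF", "ELF binary"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF", "PDF document"),
    (b"\x1f\x8b", "gzip data"),
    (b"#!", "script"),
]


def guess_type(path):
    """Guess a file's type from its first bytes, or return 'unknown'."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def survey(directory):
    """Return {file name: guessed type} for the regular files in a directory."""
    return {
        entry.name: guess_type(entry)
        for entry in sorted(Path(directory).iterdir())
        if entry.is_file()
    }


if __name__ == "__main__":
    # Reading lost+found normally requires root.
    for name, kind in survey("/lost+found").items():
        print(f"{name}: {kind}")
```

Something that reports "ELF binary" is at least a candidate for having once been a program like /sbin/init, which narrows the search even if it doesn't give the name back.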
As you might imagine, I was not my usual cheery self at work for the next few days.
Now, it occurred to me as I was performing this salvaging process that if the hard drive itself had a fault then more than likely it would be throwing up new errors of some kind. But it was working, despite the corruption, without issues. The only difference with the hard drive here was that I had limited it to SATA 150 so that the dopey VIA SATA controller in this system would detect it at all. But I wouldn't have thought that that in itself would be enough to solve the problem.
It crossed my mind that it might be my Linux install doing something funky, but I ruled that out quickly. I don't remember upgrading anything in the intervening time, and the error messages looked to me to be very hardware-fault-esque.
As I could see it, the only thing left to blame was the motherboard, an ASUS A8N-SLI. So I blamed it. As far as I could tell, it was a hardware fault on the motherboard. Somehow something had stuffed up as it was either travelling or being moved from one case to another. Perhaps I shorted something accidentally? I try to be careful about these things, grounding myself, etc., but one can only do so much.
Having convinced myself the SATA controller on the motherboard was broken, I tracked down the warranty details. The board was just over a year old but, luckily, ASUS had a 3-year warranty on all their motherboards when I bought it. It was still a bit awkward, though, since I had originally bought it along with a bunch of other stuff for my own and my dad's machines, and he was the one with the receipt. I called him to get him to send a copy to me, but that would take a few days.
Added to this awkwardness was the fact that the shop that we bought this stuff from didn't operate in the same place anymore. Now they were further away. I called up their head office to chat to them about this, and they told me to just bring it to them. Okay, I thought, that's not too bad; it's somewhat of a drive, but I can still make that trip.
After a few more days (heading into the second week of this madness, now) I had received the receipt in the mail and was on my way to the shop to submit it for warranty. The guy there was friendly and took the board (and box and as many accessories that came with it as I could spare) without a hint of a problem.
Five days later I got a call on my mobile, in the middle of the day, from the manager of this shop telling me that there was nothing wrong with the motherboard. He added that Linux doesn't support SATA.
Now, let's just pause to consider this for a moment. This saga was entering its third week, and I had been without my usual capable system for over two weeks. I had filesystem corruption that I had managed to partially salvage using another motherboard. I had been using SATA and Linux together happily for over a year already. And now you're telling me that the motherboard's fine and Linux and SATA don't work together? ... Are you high?
I told him, in no uncertain terms, that what he had just told me was so much bovine waste. “I've been using SATA in Linux for a year already without problems!” He deflected, saying that he was just the manager and was only relaying what he had been told by the tech.
“Oh, and by the way,” he informed me, “we won't support a warranty on this anyway since you got it from a place that isn't part of our franchise anymore. You'll have to go take it to them.”
Never mind that I had phoned in four days before to check exactly this. I pretty clearly said where I had originally bought it from (while they were still trading under the name of this store) and explicitly asked if it was okay. I was told it would be fine. I asked if the store there was still open. I was told it wasn't anymore, and that it didn't matter because warranty claims go to these guys anyway. If there actually was going to be a problem, why the hell did this moron tell me that there wouldn't be? Why was I told to bring it to these “Linux doesn't support SATA natively” numbskulls, in a location out of my way, when they were in no position to warrant it?
So after that phone call I was too furious to do anything involving interactions with other people. I spent half an hour in the lab, by myself, quietly working on something completely unrelated to try to calm down a bit. Even then I was still fuming. I told my supervisor what had happened and that I was going to take a bit of time out to pick up the motherboard. (He'd never seen me so mad. A day or two later, he told me that I was almost shaking with anger and frustration, and he was being deliberately careful to not say anything that might possibly aggravate me further.)
One 90-minute round trip to the shop later...
They didn't charge me anything, though I knew they could have tried. I'm not sure if they forgot about it or just decided it was better to let it go than to piss me off any further.
So the machine was back home. Again I tried my hand at diagnosing the cause. This time I noticed that WinXP didn't seem to have as much of a problem with it as Linux did. Occasionally it wouldn't boot all the way, but usually it seemed okay. It's possible that my testing in Windows wasn't thorough enough, though; certainly it wasn't as thorough as with Linux.
I also noticed that limiting the drive to SATA 150, even while on the ASUS motherboard, seemed to help. Leaving it at SATA 150, it was, for the most part, running fine. Tired of fighting this crap, I began reinstalling Debian, and I was prepared to leave it at SATA 150 rather than 300. But, just in case there was some idea that I had missed, I put the issues to the folks on the Humbug (my local Unix/Linux user group at the time) mailing list.
I got a few suggestions, and they seemed to converge on investigating the power supply. Apparently many weird effects have been attributed to power supplies in the past, which I hadn't been aware of. And not just dodgy supplies, either; someone even mentioned a situation where a PSU, tested to be working fine, was somehow incompatible with the particular case it was intended to be housed in. (For some bizarre, unknown reason, it just didn't work right while in that case.)
So, with that in mind, I tried swapping the new power supply for my old one to see what happened. I couldn't imagine how the PSU could be causing errors like these, but I gave it a go anyway.
It worked. No problems. No errors. Booting and working just fine. Something about the spanking new mid-range power supply was causing problems that the old bog-standard PSU did not.
I did note that there was something different about this setup as compared to using the old power supply. The old supply had no dedicated SATA plugs, whereas the new one does, and so to use the old PSU I had to use Molex to SATA power adaptor cables. On the new supply, however, I had just plugged the SATA power straight to the dedicated SATA supply lines.
So, for one more time, I tried using the new power supply, but with the Molex–SATA adaptors instead of the dedicated SATA power. It worked. I was running at SATA 300 and without the ATA errors.
Some thorough testing confirmed it. Yep. That was it. That was the problem. The PSU's SATA power cables and the hard drives weren't playing nice. Finally!
Of course, now I wanted to know what the problem with the SATA supply was. I poked around the outputs of the SATA supply cables with a multimeter to see what I could see, and what I could see was nothing out of the ordinary. Everything was within spec. Maybe with a current draw it'd be clearer what was amiss. But as it stood, I couldn't work it out any further.
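For reference, "within spec" here means within the tolerance the ATX specification allows on the positive rails, which is ±5% of nominal on +3.3 V, +5 V and +12 V. A SATA power plug carries all three of those rails, whereas a Molex plug only carries +5 V and +12 V. A minimal sketch of the check in Python follows. The readings in it are made-up placeholders, not the values I measured, and the function names are just for illustration.

```python
# ATX allows +/-5% on the positive rails.
NOMINAL_RAILS = {"3.3V": 3.3, "5V": 5.0, "12V": 12.0}
TOLERANCE = 0.05


def rail_limits(nominal, tolerance=TOLERANCE):
    """Return the (lowest, highest) acceptable voltage for a rail."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def check_readings(readings):
    """Map each measured rail name to True if the reading is in tolerance."""
    results = {}
    for rail, measured in readings.items():
        low, high = rail_limits(NOMINAL_RAILS[rail])
        results[rail] = low <= measured <= high
    return results


if __name__ == "__main__":
    # Hypothetical multimeter readings, purely for illustration.
    example = {"3.3V": 3.31, "5V": 5.08, "12V": 11.3}
    for rail, ok in check_readings(example).items():
        low, high = rail_limits(NOMINAL_RAILS[rail])
        verdict = "ok" if ok else "OUT OF SPEC"
        print(f"{rail}: {example[rail]:.2f} V (allowed {low:.2f} to {high:.2f}) {verdict}")
```

A static check like this only covers the steady voltage an idle multimeter can see. It says nothing about ripple or brief dips under load, which is exactly the part I couldn't measure.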
Now it's taken me at least a few weeks to finally restore my machine to about where it was before this all happened. I was able to salvage most of my personal files from the original install that got corrupted, and what wasn't still there I had either a backup of, or it wasn't important enough to worry about. Since this happened, I've put VDR back on this machine, running as my digital TV receiver and PVR device, 24 hours a day. Like I had hoped to do.
I think there are a few morals to be learnt from this story. Some of these everyone should already know, but it's worth repeating:
- You can't trust support staff to know a whole lot about Linux. (Doesn't support SATA! Right.)
- You can't trust sales staff to know a whole lot about their own company. (Sure, bring it in! Even though you bought it at some other place which still exists but isn't part of our brand anymore.)
- Using the most appropriate plugs you're given doesn't guarantee that it'll work.
- Power supplies can cause some very strange behaviours.
Update 2007-09-01
I really am a sucker for punishment. Last weekend I got talked into believing that upgrading was a good idea... way too easily. You see, the small 40 mm fans in my system (one on the motherboard over the northbridge, and one in my video card) were making noise. Not loud noise, really, but loud enough to annoy me as I tried to sleep. Particularly the northbridge fan. I thought it would be cool if I had a motherboard without any fans. I knew such existed, but I was wary of doing anything about it, given my previous experience with installing computer parts.
Aside from that, I was considering my options for getting a new, bigger monitor, and upgrading my memory. But it appeared that the DDR RAM that I was after had become disproportionately expensive as everyone moved to DDR2 RAM. Someone suggested that it'd be plenty cheap to just buy new parts to support DDR2, and for some reason I was feeling risky. I took the prompt and picked out for myself some parts: a fanless motherboard that supports DDR2 RAM (ASUS M2N-SLI Deluxe), 2 GiB of DDR2-800 RAM (Kingston), a CPU (since, of course, the standard pins had changed since last time I bought a motherboard; I swear, they might as well just go back to soldering the CPU straight in... anyway: Athlon X2 4400+), a fanless video card (Gigabyte 8600GT), and a monitor (BenQ 20" widescreen, which was, perhaps not so surprisingly, the most expensive part).
After pulling out the old parts and putting in the new, things seemed to be going well. I ran it a little, used it to watch some TV, left it on overnight, and it survived okay. Things seemed to be going well. I shut the machine down in the morning since I hadn't gotten the network working for a silly reason (the interface I was trying to use as “eth1” was actually being called “eth3” by udev) and it would've been pointless to leave it running with no way to act as a server, and no TV shows worth recording. That evening I booted the machine up, did some web browsing, and then things started failing with segfaults, crashing, and a buildup of instability. Oh hell.
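As an aside on the eth1/eth3 mix-up: on Linux, every network interface shows up as an entry under /sys/class/net, with an `address` file holding its MAC address, so it's easy to see which name each card actually ended up with. Here is a minimal sketch in Python. The function name and the `sys_net` parameter are my own inventions for the example (the parameter is only there so it can be pointed at a test directory).

```python
from pathlib import Path


def interface_macs(sys_net="/sys/class/net"):
    """Return {interface name: MAC address} as read from sysfs."""
    macs = {}
    for entry in sorted(Path(sys_net).iterdir()):
        address_file = entry / "address"
        # Skip anything that isn't an interface directory with an address.
        if address_file.is_file():
            macs[entry.name] = address_file.read_text().strip().lower()
    return macs


if __name__ == "__main__":
    for name, mac in interface_macs().items():
        print(f"{name}: {mac}")
```

Matching the printed MAC addresses against the ones on the cards themselves shows which physical port is hiding behind which name.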
Without going into much detail (these posts already have far too much of that), I narrowed this problem down to use of the Cool'n'Quiet function of the CPU. After more desperate searching and probing, eventually I figured out that, even though the RAM was passing memtest86+ testing, the voltage setting for the RAM in the BIOS was defaulting to a value lower than what the RAM was actually designed for. It was being undervolted, and was made very unhappy when the CPU changed speed. So... no wonder there were segfaults.
Things seemed to be going well, again, once I pushed the voltage up to the correct level in the BIOS. Everything was running quieter than before the upgrade, and using less power, too. Happiness!
If I weren't such a masochist, one day I might learn to stop doing things like this to myself...
Comment to add? Send me a message: <brendon@quantumfurball.net>