SegFaults following a partition fill - cannot work out why - Linux, Unix, and *nix like Operating Systems forum at WebmasterWorld

Forum Moderators: bakedjake


SegFaults following a partition fill - cannot work out why

Any experience or help appreciated

AlexK

2:26 pm on Dec 7, 2004 (gmt 0)

(RH8)
I recently sent a large upload to /var (wrong partition - tired) and filled it! The mistake was easily rectified, but ever since, apache, mysql, ps (you name it) have been throwing segmentation faults, and the server is becoming defunct, as is my site.

The server has been rebooted, apache, mysql and php have all been reinstalled clean, yet the (new) apache log fills up with segfaults from children, and the mysql tables show constant index errors.

Could it be a hardware issue (and how would I confirm it?), or is there some other obvious software component to check out? The server is only 18 months old and in a colo. After a bumpy first 9 months (which seemed to be a config issue with mysql) it has run faultlessly, and now this.

I seem to be watching my site going down the toilet.

Can anyone draw a direct connection between the /var partition fill and the current experience of segfaults?

jollymcfats

4:48 pm on Dec 7, 2004 (gmt 0)

Have you done a REPAIR TABLE on all of your tables? They and/or their indexes were probably corrupted when the drive filled. That *may* also be the reason the apache children are segfaulting; perhaps the mysql client libraries are croaking due to the corruption.

AlexK

5:00 pm on Dec 7, 2004 (gmt 0)

Yup, both via php (scripts in the maintenance section of the site) and, when that failed on occasion, by ssh using 3 different varieties of myisamchk options, depending on how bad the corruption was.

In any case, every table was checked and/or repaired *before* bringing the site back up, for the very reason that you mentioned.

I am now far more familiar with mysql repair options than any sane person has any right to be.

jollymcfats

5:41 pm on Dec 7, 2004 (gmt 0)

Offhand, I can't think of any other lingering /var faults that wouldn't be fixed by a reboot & the init scripts. I think your best bet to get to the cause is to examine a core file or perhaps strace something likely to go down in flames.

But it might be easier to do a clean OS install. Sounds like you have recent practice reloading your software and data. ;)

AlexK

5:57 pm on Dec 7, 2004 (gmt 0)

Thanks for trying...

As for the OS reinstall, I agree, but have to wait for my colo hosts as I do not have a static IP, nor enough experience to launch into it myself.

I still have a lingering feeling that there is some obvious component to check out.

PS: checking the apache error-log shows 8 hours of uptime before the first child segfault, almost exactly, on each of 3 occasions.

jollymcfats

9:16 pm on Dec 7, 2004 (gmt 0)

Do you keep a swap file in /var? (check /etc/fstab)
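For anyone wanting to make that check mechanical, here is a minimal sketch in Python. It assumes the usual whitespace-separated fstab layout (device or file, mount point, type, options, ...) and the function names `swap_entries` and `swap_under` are made up for illustration.

```python
def swap_entries(fstab_text):
    """Return the first field (device or file) of every swap line in fstab text."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # The third field is the filesystem type; "swap" marks swap space.
        if len(fields) >= 3 and fields[2] == "swap":
            entries.append(fields[0])
    return entries


def swap_under(entries, mount_point="/var"):
    """Return only those swap entries that are paths below mount_point."""
    prefix = mount_point.rstrip("/") + "/"
    return [e for e in entries if e.startswith(prefix)]
```

Feeding it the contents of /etc/fstab, e.g. `swap_under(swap_entries(open("/etc/fstab").read()))`, gives an empty list when no swap file lives under /var. A swap partition such as /dev/sda3 is listed by `swap_entries` but never flagged by `swap_under`.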

mcavic

9:55 pm on Dec 7, 2004 (gmt 0)

The only time I've seen ps crash has been if there's a kernel issue, or the CPU is overheating (SCO Unix in both cases).

AlexK

1:18 am on Dec 8, 2004 (gmt 0)

"Do you keep a swap file in /var?"

swap is on /dev/sda3, /var is mounted on a different device.

"The only time I've seen ps crash has been if there's a kernel issue, or the CPU is overheating"

This is one clincher, is it not? It happened only once (with ps -aux) after the server was rebooted and has never repeated. At the time I put it down to the fact that I had logged in as root whilst my colo host was also logged in as root, and got out fast.

Hardware or software? That is the question.

Unfortunately, now that I need a fast response from my colo host I'm not getting it, so thanks for the responses here.

jollymcfats

1:24 am on Dec 8, 2004 (gmt 0)

I suppose I'll add that I've seen ps crash when memory has gone bad, or swap has bad blocks. (Hence the swap file question.)

One thing you might try is doing an rpm verify to check for random corruption. It's unlikely to find anything in /var, but if some system library is corrupted it should turn up.

Personally I'd try the strace. If there's some consistent file or library being accessed before the segv it will show up.
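Once a trace has been saved to a file, picking out what was opened just before the crash can be scripted. The following is only a rough sketch: it assumes the trace contains `open(...)` or `openat(...)` lines with the path in double quotes and a line containing `--- SIGSEGV` where the signal arrives, which is how strace output typically looks. The function name `files_before_segv` is invented here.

```python
import re
from collections import deque

# Matches the quoted path in open("...") or openat(AT_FDCWD, "...") calls.
OPEN_RE = re.compile(r'\bopen(?:at)?\([^"]*"([^"]+)"')


def files_before_segv(trace_lines, keep=5):
    """Return up to `keep` paths opened most recently before the first SIGSEGV.

    Returns an empty list if no SIGSEGV line is found in the trace.
    """
    recent = deque(maxlen=keep)
    for line in trace_lines:
        if "--- SIGSEGV" in line:
            return list(recent)
        match = OPEN_RE.search(line)
        if match:
            recent.append(match.group(1))
    return []
```

Running this over several traces and comparing the lists would show whether the same file or library keeps turning up. When a trace mixes several processes, the list mixes their opens too, so it is a hint rather than proof.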

AlexK

5:36 am on Dec 8, 2004 (gmt 0)

"One thing you might try is doing an rpm verify"

The colo host did a

rpm -Va

earlier this morning with no obvious suspects. He also mailed me that he will get to the colo on Thurs and upgrade the OS (hurrah!).

After much research, my prime suspect is the rh8 apache-prefork-mpm/php combo, which php themselves state is not to be used in a production environment [uk.php.net]. Since restarting apache seemed to give 8 hours grace (shades of Windows, huh?) I did an

apachectl graceful

but this did not help much:

[Wed Dec 08 02:14:56 2004] [notice] Graceful restart requested, doing restart
...
[Wed Dec 08 02:14:58 2004] [notice] Apache/2.0.40 (Red Hat Linux) configured -- resuming normal operations
[Wed Dec 08 02:55:04 2004] [notice] child pid 1995 exit signal Segmentation fault (11)

...so have done yet another stop/start, which is even more Windows-like.
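The "hours of grace" figure can be measured from the error log rather than eyeballed. Below is a small sketch that assumes log lines shaped like the excerpt above (`[timestamp] [level] message`) and an English locale for the day and month names. The function name `uptime_before_segfaults` is hypothetical.

```python
import re
from datetime import datetime

# [Wed Dec 08 02:14:58 2004] [notice] message...
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<msg>.*)$")
TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # assumes English day/month names


def uptime_before_segfaults(log_lines):
    """For each (re)start, return the time until the first child segfault after it."""
    gaps = []
    started = None
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip elided or unparseable lines
        stamp = datetime.strptime(match.group("ts"), TS_FORMAT)
        message = match.group("msg")
        if "resuming normal operations" in message:
            started = stamp
        elif "Segmentation fault" in message and started is not None:
            gaps.append(stamp - started)
            started = None  # only count the first segfault after each start
    return gaps
```

Applied to the four lines quoted above, it reports a single gap of 40 minutes 6 seconds between the graceful restart completing and the first child segfault.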

Once again, the responses from yourselves have helped enormously, so thank you. With little sleep, no food, no isp-response and a Google PR [checkpagerank.com] for the site of 0, I was getting mighty desperate (desparate?).

killroy

9:43 am on Dec 9, 2004 (gmt 0)

The closest thing in my own experience (running Apache on XP Pro!): is the hard disk an oldish IBM GXP? I have one of those left, and every time it fills up by accident, it goes bonkers, until I extract the data, reformat and put everything back, then it's fine again...

AlexK

2:09 pm on Dec 9, 2004 (gmt 0)

Just got a report from the front...

Lance (the main man from the ISP) is at the colo and using memchecker from a CD to test the memory (2 x 512 MB sticks of Crucial-supplied ECC memory). Sure enough, it is failing - at ~912 MB! One stick bad, one OK. The memory is under guarantee, so that's OK, but, 18 months of hell to find out...

One question to ask: the mem-checker reported that the ECC was switched off. This could be just because of the testing routine, but does anyone know of a Linux util to report ECC status?
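One possible route, offered as a sketch and not as something an RH8-era kernel would have: later Linux kernels with an EDAC driver loaded expose memory-controller error counters under /sys/devices/system/edac/mc, with `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files per controller. The helper below, with the made-up name `read_edac_counts`, simply reads those files. An empty result only means no EDAC controller was found; it does not prove ECC is switched off in the BIOS.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}} from EDAC sysfs.

    Returns an empty dict if the directory is missing (no EDAC driver loaded,
    or a kernel without EDAC support).
    """
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts
    for controller in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text().strip())
            uncorrected = int((controller / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter files absent or unreadable for this controller
        counts[controller.name] = {
            "corrected": corrected,
            "uncorrected": uncorrected,
        }
    return counts
```

A non-zero corrected count on a controller suggests ECC is active and catching errors; uncorrected errors point at memory that needs replacing.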