Forum Moderators: phranque

Message Too Old, No Replies

trying to figure out why my server went dark

         

Megunticook

10:03 pm on Oct 1, 2018 (gmt 0)

5+ Year Member



My Apache server on AWS EC2 went dark for an hour early this morning. Trying to figure out why but having a little trouble deciphering the logs. Any comments or insights would be most appreciated.
This is Apache 2.4.34 on Amazon Linux 2, 4GB memory (typically only 25-30% used). It hosts a single WordPress site, with the database hosted on a dedicated RDS instance.

Looking at my metrics and logs, it appears there was a big memory spike just after 2am, it went up to over 90%, then there's an hour and a half gap during with the server was unresponsive (and no metrics being collected during that time, no log entries).

httpd error log shows nothing right before the crash, but has this at the recovery point:
03:49:36 [Mon Oct 01 06:20:54.569826 2018] [proxy_fcgi:error] [pid 18162:tid 140310726268672] (70007)The timeout specified has expired: [client 11.111.111.111:43245] AH01075: Error dispatching request to : (polling)


messages log shows this:

Oct 1 06:10:01 ip-111-11-11-111 systemd: Starting User Slice of root.
02:10:02
Oct 1 06:10:01 ip-111-11-11-111 systemd: Started Session 2880 of user root.
02:10:02
Oct 1 06:10:01 ip-111-11-11-111 systemd: Starting Session 2880 of user root.
02:10:02
Oct 1 06:10:01 ip-111-11-11-111 systemd: Removed slice User Slice of root.
02:10:02
Oct 1 06:10:01 ip-111-11-11-111 systemd: Stopping User Slice of root.
02:39:28
Oct 1 06:19:42 ip-111-11-11-111 dhclient[3079]: XMT: Solicit on eth0, interval 113520ms.
03:49:36
Oct 1 06:19:46 ip-111-11-11-111 dhclient[3079]: XMT: Solicit on eth0, interval 129820ms.
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: amazon-cloudwat invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: amazon-cloudwat cpuset=/ mems_allowed=0
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: CPU: 0 PID: 5768 Comm: amazon-cloudwat Not tainted 4.14.62-70.117.amzn2.x86_64
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: Call Trace:
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: dump_stack+0x5c/0x82
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: dump_header+0x94/0x229
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: ? do_try_to_free_pages+0x2a1/0x330
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: ? get_page_from_freelist+0x525/0xba0
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: oom_kill_process+0x213/0x410
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: out_of_memory+0x2af/0x4d0
03:49:36
Oct 1 06:39:28 ip-111-11-11-111 kernel: __alloc_pages_slowpath+0xa19/0xe00
03:49:36


And ssl_error log has this:

02:39:28
[Mon Oct 01 06:11:26.699501 2018] [proxy_fcgi:error] [pid 18162:tid 140310625556224] (70007)The timeout specified has expired: [client 11.11.11.111:6018] AH01075: Error dispatching request to : (polling), referer: https://example.com/product/phoebe-hat-mitts-kit/
03:49:36
[Mon Oct 01 06:11:26.699465 2018] [proxy_fcgi:error] [pid 18162:tid 140310785017600] (70007)The timeout specified has expired: [client 11.11.1.111:48590] AH01075: Error dispatching request to : (polling), referer: https://example.com/product-category/hand-knitting-patterns/


What exactly happened to cause the memory spike?

Do I want to rethink my configuration in my httpd.conf file? Here's what I have now:


<IfModule mpm_event_module>
ServerLimit 300
MaxRequestWorkers 300
StartServers 2
MinSpareThreads 25
MaxSpareThreads 75
ThreadLimit 64
ThreadsPerChild 25
MaxConnectionsPerChild 1000
</IfModule>


Please help me understand exactly what happened here and how I can fix it. Thanks!

justpassing

7:44 am on Oct 2, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yes it ran out of memory.

I am not familiar with this kind of logs, at my own servers, when I was unexperimented and dumber, it happened I ran out of memory, but in that case, the logs were reporting more information, about the memory usage (which process uses what).

Don't you have logs about PHP , PHP-FPM, MySQL/MariaDB ?

iamlost

4:03 pm on Oct 2, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Apache appears to be loading your PHP page using fcgi via PHP-FPM (FastCGI Process Manager), which is taking longer than the configured time to respond.

I'd check to see if:
* your PHP is stuck in a loop
* a database request is timing out
* there is a hard coded non-responsive request in your PHP

justpassing

4:22 pm on Oct 2, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



If it's PHP-FPM, it's possible that too many childs were spawned.

Megunticook

2:15 pm on Oct 3, 2018 (gmt 0)

5+ Year Member



Thanks for the responses everyone.

I don't see anything in the php-fpm logs at the time of the shutdown. But I found these entries in the log file "error.log-20181002." If the timestamps are accurate, all of this occurred after the outage, but perhaps there's a clue here. Do I need to revisit my <IfModule mpm_event_module> block in the Apache config?

Here's the log activity:

[01-Oct-2018 07:49:36] NOTICE: fpm is running, pid 3119
[01-Oct-2018 07:49:36] NOTICE: ready to handle connections
[01-Oct-2018 07:49:36] NOTICE: systemd monitor interval set to 10000ms
[01-Oct-2018 15:51:30] NOTICE: [pool www] child 3215 exited with code 0 after 28914.058528 seconds from start
[01-Oct-2018 15:51:30] NOTICE: [pool www] child 29529 started
[01-Oct-2018 15:51:39] NOTICE: [pool www] child 3217 exited with code 0 after 28922.768615 seconds from start
[01-Oct-2018 15:51:39] NOTICE: [pool www] child 29537 started
[01-Oct-2018 15:51:39] NOTICE: [pool www] child 3218 exited with code 0 after 28922.995524 seconds from start
[01-Oct-2018 15:51:39] NOTICE: [pool www] child 29538 started
[01-Oct-2018 15:52:51] NOTICE: [pool www] child 3216 exited with code 0 after 28995.217703 seconds from start
[01-Oct-2018 15:52:51] NOTICE: [pool www] child 29681 started
[01-Oct-2018 15:53:23] NOTICE: [pool www] child 4710 exited with code 0 after 27921.342289 seconds from start
[01-Oct-2018 15:53:23] NOTICE: [pool www] child 29710 started
[01-Oct-2018 20:40:35] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 3 idle, and 11 total children
[01-Oct-2018 20:40:36] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 3 idle, and 12 total children
[02-Oct-2018 08:59:53] NOTICE: [pool www] child 13161 exited with code 0 after 44356.284643 seconds from start
[02-Oct-2018 08:59:53] NOTICE: [pool www] child 19861 started
[02-Oct-2018 09:08:52] NOTICE: [pool www] child 13182 exited with code 0 after 44875.929775 seconds from start
[02-Oct-2018 09:08:52] NOTICE: [pool www] child 20324 started
[02-Oct-2018 09:18:52] NOTICE: [pool www] child 13180 exited with code 0 after 45476.997745 seconds from start
[02-Oct-2018 09:18:52] NOTICE: [pool www] child 20859 started
[02-Oct-2018 14:01:18] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 2 idle, and 15 total children
[02-Oct-2018 14:01:34] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 1 idle, and 16 total children
[03-Oct-2018 10:46:37] NOTICE: [pool www] child 12282 exited with code 0 after 64784.146714 seconds from start
[03-Oct-2018 10:46:37] NOTICE: [pool www] child 4290 started

Megunticook

2:56 pm on Oct 11, 2018 (gmt 0)

5+ Year Member



I reviewed and tweaked my server settings a bit. We'll see if this resolves it.

In httpd.conf:


#configure MPM event module
# Optimized settings for AWS EC2 t2.medium
<IfModule mpm_event_module>
ServerLimit 500
StartServers 2
MinSpareThreads 25
MaxSpareThreads 75
ThreadLimit 64
ThreadsPerChild 25
MaxRequestWorkers 500
MaxConnectionsPerChild 1000
</IfModule>


In www.conf:

pm.max_children = 40
pm.start_servers = 8
pm.min_spare_servers = 4
pm.max_spare_servers = 8
pm.max_requests = 1000

robzilla

7:31 pm on Oct 11, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are you sure it was a local issue and not the infrastructure?