

Is this Googlebot 'stress testing' my server?

     
5:02 am on Oct 11, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


Normally, I get about 15k page requests from Googlebot per day (the site has about 7.5k unique pages plus another 5k category and tag pages).

Today, I was going through the logs and noticed that Googlebot's crawl has been showing sudden 'bursts' of requests: as many as 13 requests per second for 2-3 seconds at a time. This repeats every 10 minutes or so.

I don't remember seeing this before. I used to be on Cloudfront, and moved to a 'naked server' (with just 2 GB of RAM and 2 vCPUs) in Singapore two days ago.

Could this be Google stress testing the new configuration to see how much load it can take? Should I increase the power of the server? Frankly, it's been sized for the regular traffic of about 1-10 requests per second. However, if I enlarge the server, I have to pay for a whole month, so I don't want to do it unless it's actually some kind of 'stress test' and I stand to gain from making it bigger.

Has anyone seen anything like this? Normally, Google requests only 5 pages or so in a minute, and not 13 pages in a second.

I know that they do 'test' a server initially. But I've never seen this level of testing before.

You can see the logs here: [drive.google.com...]

Also, if this is Google stress testing it, how much RAM and how many vCPUs should I provision for my site? During heavy load I typically add extra servers and then remove them when the flood is over (usually stories going viral and so on). I've served 38 million requests per month (according to GA) using a 4 GB + 2 vCPU Linode instance. During peak times, I did offload some of the IPv6 traffic to a separate server for a couple of days, though.

Even at peak, CPU and RAM utilization never went above 60-70%, according to Linode's charts.
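
(For anyone who wants to reproduce this kind of check, here is a minimal sketch that tallies Googlebot requests per second from a combined-format access log. The log path and the 66.249. prefix filter are assumptions for illustration, not details from the post itself.)

    import re
    from collections import Counter

    # Count per-second request rates for hits coming from the 66.249.x.x range
    # in a combined-format access log (path is an assumption).
    LOG_PATH = "/var/log/nginx/access.log"
    per_second = Counter()

    with open(LOG_PATH) as log:
        for line in log:
            if not line.startswith("66.249."):
                continue
            match = re.search(r"\[([^\]\s]+)", line)  # e.g. 11/Oct/2019:05:02:13
            if match:
                per_second[match.group(1)] += 1

    # Print the ten worst one-second bursts
    for timestamp, count in per_second.most_common(10):
        print(timestamp, count)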
5:12 pm on Oct 11, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


Update - in the interest of testing, I temporarily moved the server back behind Cloudfront, and this behavior disappeared: the crawl rate fell and there were no spikes. I don't know if it has anything to do with the fact that Cloudfront sends a lot of 304 responses and the 'naked server' doesn't.
5:48 pm on Oct 11, 2019 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Aug 30, 2019
posts:145
votes: 25


Hello-

Today, over a 12-hour period, I got 10 times more requests from Googlebot than on an average day. Maybe it's an index refresh.
6:03 pm on Oct 11, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 887


I don't know if it has anything to do with the fact that cloudfront sends a lot of 304 responses and the 'naked server' doesn't
It might. Do your pages include dynamic content that would normally prevent the server from sending out a 304 response? And then when they're behind a CDN, the CDN serves “flat” cached copies, so a 304 becomes possible? “Dynamic” doesn’t necessarily mean that anything has actually changed; it just means anything built on the fly, like a PHP navigation header. Google ought to be able to tell the difference and see whether anything has really changed, but maybe it can’t.

As you probably know, Googlebot ignores the “Crawl-delay” directive; you have to tell them in GSC what speed you want.

I'm assuming you know how to distinguish the real Googlebot from fakers.
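
(For reference, Google's documented check is a reverse DNS lookup on the requesting IP followed by a forward lookup that must map back to the same IP. A minimal sketch in Python; the example IP is illustrative, not taken from the logs in this thread.)

    import socket

    def is_real_googlebot(ip):
        # Reverse-DNS the IP, confirm the host is under googlebot.com or google.com,
        # then forward-resolve that host and check it maps back to the same IP.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return socket.gethostbyname(host) == ip
        except socket.gaierror:
            return False

    print(is_real_googlebot("66.249.66.1"))  # example IP for illustration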
5:46 am on Oct 12, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


Thanks. On the point of fake Googlebots, I screen for Googlebot by filtering for requests from 66.249.x.x.

This spurt was occurring exactly at intervals of 10 minutes, and went on for 12-13 hours, after which I moved it behind Cloudfront again and it stopped.

When I said 304 responses, I should have been clearer: I meant 304 responses from my origin (when Cloudfront is on). In other words, my origin server sends 304s only to Cloudfront, and not, as far as I have seen, when it's out front facing the world.

I'm not a networking expert, but perhaps Cloudfront is asking 'has it changed?' (a conditional request) instead of fetching the page outright, while others (including Googlebot) may simply be issuing a plain GET request and expecting a 200?
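
(If it helps to test that guess: a conditional GET is just an ordinary GET plus an If-Modified-Since or If-None-Match header. A sketch with Python's standard library; the hostname, path, and date are made up for the example.)

    import http.client

    # A conditional GET: a server (or CDN origin fetch) that supports it answers
    # 304 with no body if nothing changed, otherwise 200 with the full page.
    conn = http.client.HTTPSConnection("www.example.com")
    conn.request("GET", "/some-article/",
                 headers={"If-Modified-Since": "Fri, 11 Oct 2019 00:00:00 GMT"})
    response = conn.getresponse()
    print(response.status)  # 304 if unchanged, 200 if the server re-sends the page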

Anyway, I'm not complaining, given that these are essentially static pages, except for the comments section, which is anyway loaded via JS from Facebook.

Nevertheless, I would seriously like to know what the 'spurts' mean, and whether others have noticed it too. I plan to 'go naked' again to see if it recurs.
10:50 am on Oct 13, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


Moved it back to the naked server, and I can see that Google now requests as many as 35 pages at once (see the log: [drive.google.com...])
6:13 pm on Oct 13, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 887


as many as 35 pages at once
Yikes. Since they refuse to honor “Crawl-Delay”, have you tried setting a rate in GSC? 35 at once really seems like overkill, unless you've got the kind of site that gets thousands of human visitors every minute, so the odd 35 wouldn't even be noticeable. (I'm guessing you don't, since you did notice ;))
2:36 am on Oct 14, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


Well, there are times it gets hit by a ton of visitors, so I have to be ready, technically, for even 100 visitors a second. It's a news site, and sometimes it does get hectic. But the site's been having some issues in the last two months, and has seen intermittent downtimes (of up to 3 minutes) multiple times while we were fixing it (not the whole thing, but sections/segments, especially the uncached pages). So perhaps Google's trying to figure out if we're really back? Also note that I started noticing it only after I went behind Cloudfront for a couple of weeks and came back. Perhaps this started while it was behind CF, but I couldn't see it because CF caches requests at its end and may have been serving those repeated requests from its cache.

Btw, the server is not getting overwhelmed as far as I can see. It's got two cores, and I upgraded it from 2 GB to 4 GB when I started noticing these spikes; both cores move up to only about 4% CPU (according to htop) when the spikes happen (and they reliably happen every 10 minutes). I'm assuming it's temporary, and leaving the 'let Google determine the crawl speed' setting on.

Moreover, I've lost 90% of my organic traffic over the last two months because of the technical issues and downtimes and their impact on Google. So I'm hoping this is some kind of test.
3:46 am on Oct 14, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1122


Gotta ask ... is there a problem with serving a 304? After all, that simply means "not modified". In most cases I just "ignore" these, though I wonder why the same critter wants to see the same page again that "soon".

[httpstatuses.com...]
3:49 am on Oct 14, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1122


Meant to add that my server's 304 response is all of 228 bytes ... hardly a strain on the server!

More concerned with too many 206s (partial content) than 304s!
3:54 am on Oct 14, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1122


Also note that I started noticing it only after I went behind Cloudfront for a couple of weeks and came back.


This falls into the "make up your mind" category. Me, I've never bothered with Cloudflare, as my sites aren't prone to attract DDoS attacks, and while there are other reasons to use Cloudflare, that's generally the biggie for many.
5:40 am on Oct 14, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


I guess you don't write about the Chinese Govt :)
More than that, for people who live outside the US and Europe, these CDNs help ensure a uniform, reliable experience for customers, because the CDNs' 'internal' networks tend to be far more reliable than the public internet trunk lines, which suffer from poor peering.

On tangor's point about 304s: it's quite healthy to see them, though I've never really dug deep into when they're used outside of proxy cases like CloudFront and Cloudflare. If I'm not wrong, that's when the proxies revalidate their cached copies. I've almost never seen 206s, so you might want to check what that's about. And 499 (nginx's code for a request abandoned by the client) usually signifies a serious problem.
7:38 am on Oct 14, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1122


My 206 entries are on PDF files I offer, some of which are 4 MB in size. :)
9:36 am on Oct 14, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


Ah, that explains it. Do PDFs rank well? In the last year or two, I've noticed a lot more PDF links in search results compared to earlier, when you'd have to specifically search filetype:pdf.
5:03 pm on Oct 14, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 887


I don't know the mechanics of it, but I do know that when I started using SSIs (the minimalist form of “dynamic” content) my server no longer sent out 304 responses on page requests. Most likely the server stops sending a Last-Modified header for pages assembled on the fly, so a conditional request has nothing to match against. At a guess, a 304 means “the server has taken no action relative to this material since the last time you asked”. This is different from “the content of the URL is identical to last time” (which is what search-engine computers are for).

One source of 206 responses is a request that comes in with a Range header: the server sends only the requested byte range and records a 206 (Partial Content) instead of a 200. An If-Modified-Since header, by contrast, produces a 304 when nothing has changed: the requester gets the same headers-only answer they would get from a HEAD request, and the full file only if the page really did change. It's less trouble for the requester because they don't have to send separate HEAD and GET requests--which you may notice some minor robots doing routinely.
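
(A quick way to see a 206 in your own logs is to send a Range request for part of a file. A short sketch; the hostname and file name are made up for the example.)

    import http.client

    # Ask for just the first kilobyte of a large file. A server that supports
    # byte ranges answers 206 Partial Content and logs a partial transfer.
    conn = http.client.HTTPSConnection("www.example.com")
    conn.request("GET", "/files/big-report.pdf", headers={"Range": "bytes=0-1023"})
    response = conn.getresponse()
    print(response.status, response.getheader("Content-Range"))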

I do notice a lot more PDF links in search results compared to earlier
Same here, now that you mention it. Are search engines getting better at reading PDFs, or are the PDFs themselves getting better? If the file is just a flat page image--which an older PDF may well be--do major search engines now run their own OCR, so there's text content to find in searches even if you, the human user, can't get your browser to find that same text?
11:11 pm on Oct 14, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1122


Getting afield of the OP's original question: PDFs are part of the modern browser's ordinary rendering. Older PDFs might be something different, but most I see these days are pretty "standard"...
7:00 am on Oct 15, 2019 (gmt 0)

New User

joined:Sept 25, 2019
posts:5
votes: 0


I found the same issue on my e-commerce store. I used robots.txt to block URLs that aren't useful to Google, like URLs containing "?" or "tag"; those pages were also creating duplicate-content issues. Try robots.txt to block Googlebot from them; it helps decrease Googlebot's crawl requests.
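
(If anyone wants to try the same thing, a minimal robots.txt sketch. The /tag/ path and the query-string pattern are assumptions about a typical store's URL structure, so adjust them to your own URLs, and keep in mind that robots.txt only stops crawling, not indexing of already-known URLs.)

    User-agent: Googlebot
    # Patterns assume parameterized URLs look like /page?sort=price and tag
    # archives live under /tag/ -- adjust to your own URL structure.
    Disallow: /*?
    Disallow: /tag/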
7:35 am on Oct 15, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


I won't turn it off until it crosses 100 simultaneous requests (actually they're not simultaneous: it takes about 3 seconds for Google to issue these 40 or so requests, which works out to roughly 70 ms each).

As I said, it's not really stressing the server much. CPU stays at just 4% during the spikes.
9:47 am on Oct 16, 2019 (gmt 0)

Junior Member

joined:Aug 22, 2017
posts: 71
votes: 3


Update - it's over as of 12 hours ago.
9:52 am on Oct 16, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10563
votes: 1122


Heh! ... One reason why I wait at least 14 days before I panic whenever g fiddles with things. :)

Thanks for the update!