Forum Moderators: Robert Charlton & goodroi

Pages Crawled Too Often - Getting an overeager Googlebot to chill

     
10:48 pm on Mar 26, 2016 (gmt 0) - Junior Member


I run a website that I've just given crawl permission to several bots. Now Googlebot is hitting my site very often, and it is downloading all my files repeatedly: a given file might be downloaded four or five times in one day. Yes, it really is Googlebot; I have confirmed that. It obeys robots.txt, so I could simply tell it to go away, and it would. I could also just block the IPs it is hitting me from. But I'm fine with being crawled; it's just a waste of bandwidth to get hammered repeatedly for the same files.

So how can I tell this bot to just, er, chill?

One thing that is a little odd is that these repeated hits often show different User Agents. Is this a case of Google's left hand not knowing what its right hand is doing, with different parts of Google wanting independent access?
11:16 pm on Mar 26, 2016 (gmt 0) - lucy24, Senior Member from US


these repeated hits often show different User Agents

At a minimum, page requests will come from three variants of the mobile Googlebot, in addition to the vanilla Googlebot. These should drop off after a while once they see that all of them receive identical content, though they will never disappear entirely. Requests with the "If-Modified-Since" header (they don't always send it) will give the time of the last visit by that specific User-Agent. So it isn't that one hand doesn't know what the other hand is doing; it's that the right hand needs to find out whether the left hand is receiving identical service.

Google ignores the "Crawl-Delay" directive, but professes to honor any limitations you set in GWT/GSC/whatever-it's-called-this-week. So be sure to try that.
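For anyone who wants to try it anyway, this is what the directive looks like in robots.txt - a minimal sketch with a hypothetical path. Googlebot skips the Crawl-delay line, but Bing and Yandex have honored it:

User-agent: *
# Seconds to wait between requests. Ignored by Googlebot,
# but honored by some other crawlers (e.g. Bing, Yandex).
Crawl-delay: 10
# Hypothetical example of a directory you don't want crawled at all.
Disallow: /private/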

If it's a newly indexed site, why not give it a few weeks? It's not every day that people complain about Google crawling their site too often ;)
11:43 pm on Mar 26, 2016 (gmt 0) - Junior Member


Thank you, I'll do that. That's interesting. FWIW, I did set Crawl-delay to 100, and the accesses from Googlebot really are just every minute or two, so maybe it is obeying that directive after all (100 seconds is just over a minute and a half, which would roughly match).

Why does Googlebot work this way? What's wrong with just a vanilla Googlebot? Why should it be concerned about getting different content from different variants? I guess I'm not sure what incentive is driving this.
1:11 am on Mar 27, 2016 (gmt 0) - lucy24, Senior Member from US


Why should it be concerned about getting different content from different variants?

It's about responsive web design vs. user-agent detection. I said "three mobile googlebots" but in fact the most common one-- the one with the iPhone UA-- no longer even calls itself Googlebot-Mobile. It's just Googlebot plus the rest of the UA string.

If someone out there has a site that relies heavily on user-agent detection and serving different content to different devices, they can probably shed more light on what, exactly, the various googlebots do, and how often they do it.
1:33 am on Mar 27, 2016 (gmt 0) - Junior Member


Hmm. You'd think that if it were denied access under one user agent, it would just retry with another. Instead it simply hits the site with all of them. But yes, maybe some sites really do serve different content to different user agents.

Yes, one of the UAs is in fact an iPhone one. That struck me as odd, but I guess if they want reassurance that mobile devices are getting the same content as everyone else, they've got to try it.
10:31 pm on Mar 27, 2016 (gmt 0) - Junior Member


As it turns out, Googlebot is not just retrieving my files under different user agents. A given file is also fetched more than once with *exactly* the same user agent, but with a different IP, sometimes less than a day apart. So maybe the left hand knows what the right hand is doing, but evidently one Googlebot IP doesn't know what another is doing.

There is something silly about this.
11:35 pm on Mar 27, 2016 (gmt 0) - lucy24, Senior Member from US


but with a different IP

Are they all legitimate Google IPs? Within the US, it should only be 66.249.64.0/20 for crawling. (The adjacent 66.249.80.0/20, like the other Google addresses in the US, is for various Googloid functions of varying legitimacy.) It's a widely spoofed UA, so I hope you're blocking anything that claims to be the Googlebot but comes from elsewhere.

Rumor has it they're also using non-US ranges, but only if you've got non-US-targeted content. Has anyone worked up a list of legitimate Google IPs outside the US? I don't remember seeing one.
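For what it's worth, you don't strictly need an IP list: Google's own documented advice is a reverse DNS lookup followed by a forward confirmation, which works no matter which range they crawl from, US or not. A minimal Python sketch of that test (the sample IP is an address in the 66.249.64.0/20 crawl range mentioned above):

import socket

def is_real_googlebot(ip):
    # Reverse-resolve the IP; a genuine crawler resolves to a host
    # under googlebot.com or google.com.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # otherwise the reverse record itself could be spoofed.
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

print(is_real_googlebot('66.249.64.174'))  # a genuine crawl-range address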

:: detour to check something ::

Ohhh yeah. Thousands of fake googlebots*-- but not even slightly convincing fakes, based on the few I spot-checked.


* I just searched logs for 403 responses to "Googlebot".
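That kind of spot check is easy to script, by the way. A rough Python sketch that tallies 403s served to self-proclaimed Googlebots, assuming a combined-format access log like the excerpts elsewhere in this thread (the filename is a placeholder):

# Tally 403 responses to requests whose User-Agent claims to be Googlebot.
fakes = {}
with open('access.log') as log:  # placeholder filename
    for line in log:
        # Combined format: IP - - [date] "request" status size "referer" "UA"
        parts = line.split('"')
        if len(parts) < 7:
            continue
        status = parts[2].split()[0]
        user_agent = parts[5]
        if status == '403' and 'Googlebot' in user_agent:
            ip = line.split()[0]
            fakes[ip] = fakes.get(ip, 0) + 1

for ip, count in sorted(fakes.items(), key=lambda kv: -kv[1]):
    print(ip, count)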
11:54 pm on Mar 27, 2016 (gmt 0) - Junior Member


66.249.64.174 - - [25/Mar/2016:01:11:48 -0500] "GET /xxx/yyy/yyy.pdf HTTP/1.1" 200 7624646 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.64.174 - - [25/Mar/2016:05:24:09 -0500] "GET /xxx/yyy/yyy.pdf HTTP/1.1" 200 7624646 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.69.174 - - [25/Mar/2016:22:38:56 -0500] "GET /xxx/yyy/yyy.pdf HTTP/1.1" 200 7624646 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


So there are two identical hits here from the same IP, plus one with a different User Agent (and, on closer inspection, a slightly different IP: 66.249.69 rather than 66.249.64), all on the same day. Those IPs check out as true Googlebot, I think. For particular files I also get other hits with the same User Agent but different IPs. All of my hits are from self-identified Googlebot crawlers whose IPs trace back to Google - basically all 66.249.

I'm seeing lots like this: multiple hits with the same IP and User Agent. Of course, I also get a few 304s, where very little bandwidth is used in the response. As I said, Googlebot appears a little overenthusiastic.
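Those 304s are worth dwelling on, because they're what keeps the bandwidth cost down: the 7,624,646-byte PDF in the log above costs roughly 7.6 MB every time it gets a 200, while a conditional request answered with 304 transfers headers only. A sketch of the crawler's side of that exchange in Python (the URL is hypothetical):

import urllib.error
import urllib.request

url = 'http://example.com/xxx/yyy/yyy.pdf'  # hypothetical URL
req = urllib.request.Request(url, headers={
    # Timestamp of the crawler's previous successful fetch.
    'If-Modified-Since': 'Fri, 25 Mar 2016 05:24:09 GMT',
})
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, '- changed, full body:', len(resp.read()), 'bytes')
except urllib.error.HTTPError as err:
    if err.code == 304:
        print('304 Not Modified - headers only, no body transferred')
    else:
        raise

Apache adds Last-Modified to static files automatically, so PDFs like these should already be 304-able; the saving only happens on visits where the crawler chooses to send the header, which, as noted earlier in the thread, Googlebot doesn't always do.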
1:02 am on Mar 28, 2016 (gmt 0) - encyclo, Senior Member from CA


How bad is the bandwidth and/or performance hit in real terms? Bandwidth is cheap these days, and if site performance is not adversely affected by Googlebot's indexing, why would you need to reduce it? Is the monetary cost excessive? Isn't it a good thing that Googlebot is enthusiastic about your site content? If site performance is affected, isn't the problem a lack of appropriate resources rather than excessive indexing?

I just don't see that four or five visits a day is either excessive or problematic.
1:52 am on Mar 28, 2016 (gmt 0) - Junior Member


That's a fair comment. I'm not paying anything for bandwidth, and site performance probably isn't reduced. I just find it a bit strange that Googlebots do this. I suspect that if Googlebots did this to every site, their own performance would suffer.

Four or five visits per day PER FILE isn't problematic. But I have thousands of files, and Google wants all of them. Sure messes up my logs.
8:14 am on Mar 28, 2016 (gmt 0) - keyplyr, Senior Member from US


I'm not paying anything for bandwidth, and site performance probably isn't reduced. I just find it a bit strange that Googlebots do this.

You'll get used to it; this is just the way it works. Google, and other search engines, have different bots for different purposes, and they all need to fetch the files.

Just an FYI - many here probably wonder why you're upset about this. Getting Googlebot to come around and fetch your files is a good thing :)
9:11 am on Mar 28, 2016 (gmt 0) - robert_charlton, Forum Moderator from US


...a website that I've just given crawl permission to several bots. Now Googlebot is hitting my site very often

The above raises a question for me: is this site fairly new to being crawled by Googlebot? That could perhaps account for a level of curiosity which Googlebot might not have for, say, an older, stable site whose behavior it knows.

I'm also wondering whether there is something about the site itself that might be prompting Googlebot to come back and check often (although the figures you present are not unusual). In particular, is there anything about the site that changes regularly? I ask in part because I've seen sites that purposely change content just to prompt crawling (not a technique I'd recommend, btw), so if something on the site were changing frequently, frequent crawling might be expected.

(In this regard, one of your posts from almost a year ago asks about "a conditional serve based on time/date", which got me thinking about content changes that Google might want to pay attention to. I have no idea whether this is the same site, or whether the change is something regular, but in this context it prompts my question.)
1:03 pm on Mar 28, 2016 (gmt 0) - Junior Member


This website is actually a forum for professional presentations. I have a large invite list for it, but I don't mind public access; that is, the priority is my invitees, not the public. No click bucks are being made off of it, so I have no great incentive to advertise it widely. So if Googlebot is going to be sloppy about crawling, I suppose I could simply deny it access again. To be honest, getting bots to crawl your site is a good thing if your highest priority is reaching the largest audience. That just isn't the priority here.

The website has been around for six or seven years, and I used to allow Googlebots to access it. Then it got to the point where those accesses were sort of manic, like they are now. I removed access for a year, and now I'm trying again. The site changes roughly once a week, as more presentations are added. Yes, I suppose Google is perplexed that the site reappeared, but that doesn't justify manic crawling.

I guess I'm just surprised that a solid outfit like Google wants to run bots that are really sloppy.

I haven't yet implemented that conditional serve based on time/date. I was just exploring possibilities for managing access for invitees.
9:06 pm on Mar 28, 2016 (gmt 0) - lucy24, Senior Member from US


I should think that any new site-- new to the search engine, that is-- gets an extra lot of crawling at first. They have to visit frequently just to get a sense of how often the content changes. Whether they care how often it changes is a whole different part of the process.

Wait a few months and they'll level off. If they're still requesting all your pages many times a day ... well, heck, it can only be because lots of people are responding* to SERPs, so all is good.


* Maybe. I have a couple of pages that seem to get crawled disproportionately often. I think there must be something about the content that causes them to show up often in searches, even if the search doesn't lead to an unusual number of human visitors.
9:20 pm on Mar 28, 2016 (gmt 0) - Junior Member


Yes, thanks. I *think* I'm seeing some relaxation in Googlebot. A few days ago about 95% of my accesses were from Googlebot, and now it's down to maybe 70%. Actually, even those are mostly 302s now. I guess that's because the bot doesn't yet trust me to keep things largely the same. In a few weeks, I guess, it'll conclude that repeated frequent accesses are kinda stupid.
9:26 pm on Mar 28, 2016 (gmt 0) - keyplyr, Senior Member from US


With my personal site, it's Bing. It crawls every one of my 260 pages three times a day. Then msnbot crawls them every other day. And don't even ask about their image bots. This has been the reality for 10 of the 18 years the site has been up.
9:38 pm on Mar 28, 2016 (gmt 0) - Junior Member


Heh. I banned Bing long ago. For exactly that reason. If anyone wants desperately to find me, they can use Google. Like I said, I'm not desperate for everyone to be able to find me.
2:56 am on Mar 29, 2016 (gmt 0) - lucy24, Senior Member from US


even those are mostly 302s now

I hope you meant 304 :)
3:13 am on Mar 29, 2016 (gmt 0) - Junior Member


Ouch. Right. 304!

There were a couple of 302s, but that's because they were looking for something in a place where I used to keep it long ago, and they got redirected as a result. Not entirely obvious why they'd use old info, but I guess their philosophy is that in order to find everything, you try everything.
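One footnote on those 302s: a 302 means "moved temporarily", so a crawler is entitled to keep retrying the old URL indefinitely. If the content has moved for good, a 301 is the signal that lets Google retire the old address. A minimal Apache sketch, with hypothetical paths:

# .htaccess: permanent (301) redirect for a file that has moved for good.
# Both paths here are hypothetical.
Redirect permanent /old/place/yyy.pdf /xxx/yyy/yyy.pdf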
7:06 am on Mar 29, 2016 (gmt 0) - lucy24, Senior Member from US


Not entirely obvious why they'd use old info

Wholly independent issue. They're apparently in the middle of some major housecleaning; in my case they've been requesting URLs that ceased to exist up to five years ago. I checked my notes, and this is no exaggeration. They're requesting stuff in /fun/ that was spun off to /hovercraft/ early in 2011, and other stuff in /hovercraft/ that was spun off to /fonts/ later the same year. Uh, yeah, Google, it's still at the new URL. The one you've been successfully crawling for several years now. (Admittedly, I only recently noticed that I have two completely unrelated pages, in unrelated directories, with the identical filename*-- and both have been redirected at least once. So sometimes I'm not sure what they're asking for.)

:: detour to check something ::

wtf? I know Google sometimes sends a referer when requesting stylesheets, but why are they giving /directory/index.html as the referer, when URLs ending in "index.html" have never existed on this site, and in some cases have never existed on any site?


* I mean, ahem, filenames within the visible URL. I also have lots and lots of files named index.html :)