Why would googlebot fetch a page on my site 200 times in a day?

ChanandlerBong

6:05 pm on Mar 10, 2011 (gmt 0)

10+ Year Member Top Contributors Of The Month



I use a CMS (Drupal) and noticed today that one particular page is being requested every 3 minutes: 250 times yesterday in total, and about 130 so far today. There are no errors on the page, and when I use "Fetch as Googlebot" in WMT, it fetches it fine and reports "success" with a big green tick. When I look at how Googlebot sees it, I see all the code right through to the closing html tag, so it's not choking or tripping up on anything.

Is there any polite way I can say "right, enough!" to Googlebot for that page?
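To put numbers on a pattern like the one described above, the access log can be tallied per URL. This is only a rough sketch: it assumes the Apache "combined" log format, and `googlebot_hits` is a made-up helper name, not anything Drupal or Apache provides. It also trusts the user-agent string, which can be spoofed.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format. Adjust the pattern for other formats.
LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

def googlebot_hits(lines):
    """Count requests per URL path whose user agent claims to be Googlebot."""
    hits = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and "Googlebot" in match.group(2):
            hits[match.group(1)] += 1
    return hits
```

Passing an open log file (for example `googlebot_hits(open("access.log"))`, where the path is a placeholder) and calling `.most_common(5)` on the result would list the most-requested URLs.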

goodroi

12:58 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It could have just been a glitch. Imagine if you were Google and had to regularly crawl billions of pages. No matter how hard you try, at that scale the bot will have an occasional glitch.

You may also want to look up the IP address just to make sure it is really Googlebot and not a broken scraper bot impersonating Googlebot's user agent.
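One common way to do the lookup suggested above is a reverse DNS query followed by a forward confirmation. The sketch below assumes genuine crawler hostnames end in `.googlebot.com` or `.google.com`; `is_real_googlebot` is an illustrative name, and the resolver arguments are only there so the check can be tried without network access. It handles IPv4 only.

```python
import socket

# Assumption: genuine Googlebot hosts reverse-resolve under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    # Default resolvers use the system DNS; callers may inject their own.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the hostname must map back to the same address.
        return ip in forward_lookup(host)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

A scraper can fake the user-agent string, and it can even set its own reverse DNS record, which is why the forward confirmation step matters.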

tedster

1:53 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does your server respond with a "304 Not Modified" status - or is your server just replying with all the html and a "200 OK" status, perhaps changing something small about the page every time?

ChanandlerBong

3:27 am on Mar 11, 2011 (gmt 0)

10+ Year Member Top Contributors Of The Month



Yes, it's definitely googlebot. I checked as soon as it got into double figures.

My server responds with "HTTP/1.1 200 OK" but also with:

Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: store, no-cache, must-revalidate, post-check=0, pre-check=0

Wondering if either of those is causing a problem. Having said that, every page on my site has exactly the same data in the header.

indyank

3:31 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Tedster, don't most CMS platforms return 200 OK even when nothing has changed?

TheMadScientist

3:35 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually, most dynamic pages always return a 200 OK, not a 304 (at least on Apache). If you want to serve a 304 Not Modified response, you have to make the PHP (or whatever) script do it manually.
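Doing it "manually" roughly means comparing the request's If-Modified-Since header against the time the page last changed. The following is a language-neutral sketch written in Python rather than PHP; `conditional_response` and its arguments are hypothetical names, and it assumes the application can tell when the page content last changed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_response(request_headers, last_modified, body):
    """Return (status, headers, body); last_modified must be a UTC-aware datetime."""
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            if last_modified <= parsedate_to_datetime(since):
                return 304, headers, b""  # Unchanged: send headers only.
        except (TypeError, ValueError):
            pass  # Unparseable or zone-less date: fall through to a full response.
    return 200, headers, body
```

Note that a 304 is only possible when the client sends a conditional header in the first place, and the server has to send Last-Modified (or an ETag, not shown here) for the client to have something to send back.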

indyank

3:39 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You are right.

ChanandlerBong

3:41 am on Mar 11, 2011 (gmt 0)

10+ Year Member Top Contributors Of The Month



Just been reading up on that 1978 expiry date in the header and it's a Drupal "thing"... the birthday of the founder of Drupal.

Is it normal to have an Expires date way in the past?

TheMadScientist

3:46 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's actually a way to make sure the page gets refreshed on every new load ... I think it was one of those 'super special, just for IE' tricks to get it to reload the page every time it was opened instead of serving it from the local cache. I don't remember for sure, and I'm not sure it's still necessary, but it should not be an issue.

If it was causing the problem, then the issue should be present on all (or most for sure) Drupal pages, and it's not, so that's probably not the issue.

tedster

3:58 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"most dynamic pages always return a 200 OK not a 304 (at least on Apache)"

Yes, but unless the content at the URL changes very frequently, you can cache it on your server and get a lot of savings. Drupal, WordPress, Joomla, Movable Type, TypePad - most CMSs in fact - all have caching systems available.

But that wasn't my purpose in asking the question. I wanted to explore the chance that something is slightly different every time Googlebot requests that URL. When certain anomalies crop up, Google is known to test servers for various kinds of potential problems. Still, the most likely scenario is some kind of crawling bug, I think.
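One way to check whether something is slightly different on every request is to save two fetches of the page and compare them. This is a minimal sketch: `fingerprint` and `changed_lines` are illustrative helpers, and how the two copies are fetched is left open.

```python
import difflib
import hashlib

def fingerprint(body: bytes) -> str:
    """Hash a response body so two fetches can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()

def changed_lines(first: str, second: str):
    """Return only the removed ('- ') and added ('+ ') lines between two fetches."""
    diff = difflib.ndiff(first.splitlines(), second.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

If the fingerprints differ, `changed_lines` shows what moved. Typical culprits would be things like a generated timestamp, a session token, or a rotating block, though nothing in this thread confirms any of those for the page in question.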

inbound

4:07 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is your site present in Google News? I don't have the type of site that would be, so I don't have first-hand experience, but if I were Google, news is the only reason I could see for needing to look at a page every 3 minutes.

TheMadScientist

4:11 am on Mar 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, I was wondering if there was a Twitter feed or something on the page that was causing some 'interest', but then I got sidetracked by the header discussion ... Good to know about the caching ... I don't use off-the-shelf software installations, but I know people who do, so that's good knowledge to have and be able to pass along.

ChanandlerBong

4:55 am on Mar 11, 2011 (gmt 0)

10+ Year Member Top Contributors Of The Month



No, the site is actually very new. Four days ago there were only 3 pages in the Google index; now that's up to about 500. So no inbound links yet, no Twitter, no Digg, nothing like that. It just seems Googlebot has decided on its own that this one page is really important.

It's now moved on to another one. It's still crawling many others but keeps returning to this one particular page, fetching it every 1-2 minutes for an hour or so, then going off again.

Honestly... I think Googlebot's been at the bottle.