Site Returning 301 Response when Crawled by Specific Googlebot.

Coleman123

7:56 am on Oct 16, 2013 (gmt 0)

Hey everyone,

I frequent this forum quite often and really appreciate all the dedicated members of this site!

Unfortunately, I've encountered an issue that I'm unable to find a previous discussion about.

Over the last several days I've been monitoring my data logs and have found that any page on my site returns a 301 header when crawled by this googlebot IP address: 66.249.74.145

Then googlebot recrawls with this IP address and a 200 response is given: 66.249.74.220

Has anyone experienced this?

My only thought is that this IP is crawling using HTTPS, but I'm just learning to use my log data and can't figure out how to track this.

Thanks for any response in advance!

lucy24

10:14 am on Oct 16, 2013 (gmt 0)

Do Not Panic.

Like any good search engine, the googlebot periodically asks for pages using the "wrong" hostname (with or without www., whichever form you don't use). Assuming you redirect these requests to the right form, it will show up in logs as a 301 just like any other redirect.

HTTPS to/from HTTP is another perfectly reasonable possibility, assuming you've got a redirect in place.

:: detour to check ::

Well. I have no idea what that's all about, but if I manually request my own site with https (which I don't use), the browser simply hangs. Tried two different browsers to make sure. Wonder what it's doing? Anyone know what's supposed to happen when you request the wrong protocol?

Samizdata

10:19 am on Oct 16, 2013 (gmt 0)

My only thought is that this IP is crawling using HTTPS

I would say the most likely cause is that the bot is hitting the non-www version of your site, in which case a 301 is exactly the response that most webmasters would want to serve.

No cause for alarm, IMHO.

...

Coleman123

10:34 am on Oct 16, 2013 (gmt 0)

@Lucy24
@Samizdata
- Great! Thanks for the quick responses!

g1smd

8:43 pm on Oct 16, 2013 (gmt 0)

@Lucy : What should your site do if https is requested?

Depends what it has been set up to do and what you want it to do. In many cases, it won't resolve to anything and nothing will be returned.

The most useful response is a redirect to www and http.
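A minimal sketch of that canonicalization decision, written in Python purely for illustration (on a real site this would normally live in the server configuration). The hostname `www.example.com` and the function name `canonical_redirect` are assumptions, not anything from this thread:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed canonical form: http + www. Both values are placeholders.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.example.com"


def canonical_redirect(url):
    """Return (301, location) if the URL is not in canonical form, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme == CANONICAL_SCHEME and host == CANONICAL_HOST:
        return None  # already canonical: serve the page with a 200
    # Keep the path and query string, swap in the canonical scheme and host.
    target = urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )
    return 301, target


for requested in (
    "http://www.example.com/foobar.html",
    "http://example.com/foobar.html",
    "https://example.com/",
):
    print(requested, "->", canonical_redirect(requested))
```

Note that the https case can only be redirected if the server actually accepts the secure connection in the first place; otherwise the request never gets as far as any redirect rule.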

lucy24

10:06 pm on Oct 16, 2013 (gmt 0)

What should your site do if https is requested?

"I don't know-- something."

As far as I can make out, requests in https go into limbo and never reach my site. All the redirect forms I can think of--
%{SERVER_PROTOCOL} HTTPS [not sure if that's even right, but it doesn't create an error]
%{HTTPS} on
%{THE_REQUEST} HTTPS
--have no effect. If I sit back and do nothing, I eventually get a browser error saying the attempted connection timed out.

Maybe I should ask the host.

For comparison purposes, https requests to the present site lead to a browser error message:

An error occurred during a connection to www.webmasterworld.com.

SSL received a record that exceeded the maximum permissible length.

(Error code: ssl_error_rx_record_too_long)

The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.

* Please contact the web site owners to inform them of this problem.

web site owners, hm, does that mean engine? Yeah, I guess someone might like to know ;)

JD_Toims

10:13 pm on Oct 16, 2013 (gmt 0)

--have no effect. If I sit back and do nothing, I eventually get a browser error saying the attempted connection timed out.

Your host probably doesn't set accounts without an SSL certificate to Listen on port 443 [or whatever port they decide to use for https], so when a browser tries to connect it can't: Apache isn't "listening" and your .htaccess is never seen [or something along those lines].

Note: I get the same error from one hosting acct. as you're seeing here. I haven't tested others to see their response.

phranque

6:09 am on Oct 17, 2013 (gmt 0)

yes - most likely that host isn't listening on the secure port and the secure requests aren't getting any response.
note that with https: the secure handshake must occur before any web server stuff happens.
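One way to see which of these cases applies is to probe the secure port directly. The sketch below is only an illustration using Python's standard `socket` and `ssl` modules; the function name `probe_https` and its return labels are made up for this example. It separates "nothing is listening" (refused or timed out) from "something answered, but the TLS handshake failed", which is roughly what an error like ssl_error_rx_record_too_long tends to indicate when a server replies in plain HTTP on the secure port.

```python
import socket
import ssl


def probe_https(host, port=443, timeout=5.0):
    """Classify what happens when a client tries to open a TLS connection."""
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "timeout"        # no answer at all: request "goes into limbo"
    except ConnectionRefusedError:
        return "refused"        # nothing is listening on that port
    except OSError:
        return "unreachable"    # DNS failure, no route, etc.

    # Certificate checks are disabled on purpose: we only care whether a
    # handshake can complete, not whether the certificate is trustworthy.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(raw, server_hostname=host):
            return "tls_ok"     # handshake done; only now can HTTP be spoken
    except ssl.SSLError:
        return "tls_handshake_failed"   # e.g. the server replied in plain HTTP
    except OSError:
        return "tls_no_response"        # connected, but handshake timed out or reset
    finally:
        raw.close()
```

Calling `probe_https("www.example.com")` (hostname is a placeholder) returns one of the labels above. Anything other than "tls_ok" means the web server never saw an HTTP request, so no redirect rule could have run.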

Robert Charlton

7:42 am on Oct 17, 2013 (gmt 0)

any page on my site returns a 301 header when crawled by this googlebot IP address...

Trying to get my head around a description of this in layman's language... what if anything should be done about this?

It sounds like this particular Googlebot is in effect requesting https protocol and is getting redirected to http, which is what I assume canonicalization would be doing. Except that the http page isn't returned on that same visit, so this Googlebot comes back and requests http separately.

But I think I'm leaving something out of the description here.

JD_Toims

8:21 am on Oct 17, 2013 (gmt 0)

what if anything should be done about this?

Nothing since "bot 2" is getting the correct page and response.

If it were more widely reported there could probably be a more definite conclusion, but as a guess, it sounds to me almost like split processing: e.g. "bot 1 on IP 1" starts a spidering run and requests non-www [or https or whatever]. If it receives a 200 OK it sends the page to processing, but if it receives a redirect the new location is passed to "bot 2 on IP 2", and "bot 2" spiders the URLs "bot 1" was redirected to.

It actually makes a bit of sense to me they might "pass redirects" to a different bot/ip since they have said if there are more than 3 or so redirects they may not be followed.

It also seems like they could speed up indexing and processing by "passing redirects" to different bots. They could have one bot spider "everything" and send "good" requests to be processed; then another, with a dynamic URL list, request only the locations the first bot was redirected to and send "good" requests to be processed; then another, with a dynamic URL list, request only the locations the second bot was redirected to and send "good" requests to be processed; then just dump anything after N redirects, which could be 3 or 4 or whatever they feel like today.
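To make the guess above concrete, here is a toy model of that "pass the redirect to the next bot" idea. It is purely a sketch of the speculation in this post, not a description of how Googlebot is known to work; the function name `crawl_in_tiers`, the `responses` lookup table, and the example URLs are all invented for illustration.

```python
def crawl_in_tiers(start_urls, responses, max_hops=3):
    """Simulate tiered crawling.

    responses maps URL -> (status, redirect_location_or_None) and stands in
    for real HTTP fetches. Tier 0 is the first bot; tier N only requests the
    locations tier N-1 was redirected to. Anything still redirecting after
    max_hops redirects is dropped.
    """
    processed = []            # (tier, url) pairs that returned 200
    queue = list(start_urls)
    for tier in range(max_hops + 1):
        next_queue = []
        for url in queue:
            status, location = responses.get(url, (404, None))
            if status == 200:
                processed.append((tier, url))
            elif status in (301, 302) and location:
                next_queue.append(location)   # handed to the next bot
        queue = next_queue
        if not queue:
            break
    dropped = queue           # URLs that would need more than max_hops redirects
    return processed, dropped


# Hypothetical site: non-www redirects to www, which serves the page.
responses = {
    "http://example.com/": (301, "http://www.example.com/"),
    "http://www.example.com/": (200, None),
}
print(crawl_in_tiers(["http://example.com/"], responses))
```

In this model the 301 and the 200 are logged by different "bots", which would match the two-IP pattern reported at the top of the thread, but other explanations fit the same logs just as well.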

lucy24

10:39 am on Oct 17, 2013 (gmt 0)

It's especially noticeable with the bingbot because of its morbid fascination for robots.txt. So you'll see requests running in tandem-- robots.txt, robots.txt, page, page, with one of each pair getting 301.

It sounds like this particular Googlebot is in effect requesting https protocol

We don't know that; it was simply Option B. It's far more likely to be the with/without www. issue. Don't know how it works when you've got your own server, but logs on shared hosting don't mention hostname at all. So all you see is a request for /foobar.html with 301 response, followed by a second request for /foobar.html this time getting a 200.

they have said if there are more than 3 or so redirects they may not be followed

Most robots, including search engines, don't really "follow" redirects in the way that a human does, where you type something in and the browser instantly bounces over to the new URL. Generally they just make a note of the information and act on it later.

Same goes for 403s. It's very rare (in fact it's scary!) for a malign robot to behave differently depending on the response it gets; most of them just hammer away through the shopping list.

:: detour to most recent available logs ::

Ooh, look, here's exactly what I was talking about, only it isn't the bingbot. 4 consecutive lines:

66.249.66.111 - - [16/Oct/2013:18:34:23 -0700] "GET /robots.txt HTTP/1.1" 301 581 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
66.249.66.169 - - [16/Oct/2013:18:34:23 -0700] "GET /robots.txt HTTP/1.1" 200 672 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
66.249.66.111 - - [16/Oct/2013:18:34:23 -0700] "GET / HTTP/1.1" 301 561 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
66.249.66.169 - - [16/Oct/2013:18:34:24 -0700] "GET / HTTP/1.1" 200 2764 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
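For anyone who wants to pull these pairs out of their own logs, a rough Python sketch follows. It assumes the combined log format shown in the four lines above; the regular expression and the helper names `parse` and `redirect_pairs` are illustrative, not part of any standard tool. As noted earlier in this post, these log lines do not record the hostname or the protocol requested, so the script can show that a 301 was followed by a 200 for the same path but not why.

```python
import re

# Matches the combined log format used in the sample lines.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def redirect_pairs(lines):
    """Find paths that got a 301 and then a 200, with the IP behind each request."""
    pending = {}   # path -> IP that most recently received a 301 for it
    pairs = []
    for line in lines:
        entry = parse(line)
        if entry is None:
            continue
        if entry["status"] == "301":
            pending[entry["path"]] = entry["ip"]
        elif entry["status"] == "200" and entry["path"] in pending:
            pairs.append((entry["path"], pending.pop(entry["path"]), entry["ip"]))
    return pairs


SAMPLE = [
    '66.249.66.111 - - [16/Oct/2013:18:34:23 -0700] "GET /robots.txt HTTP/1.1" 301 581 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.169 - - [16/Oct/2013:18:34:23 -0700] "GET /robots.txt HTTP/1.1" 200 672 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.111 - - [16/Oct/2013:18:34:23 -0700] "GET / HTTP/1.1" 301 561 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.169 - - [16/Oct/2013:18:34:24 -0700] "GET / HTTP/1.1" 200 2764 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for path, redirected_ip, served_ip in redirect_pairs(SAMPLE):
    print(path, "301 to", redirected_ip, "then 200 to", served_ip)
```

To run it against a real log file, pass the open file object to `redirect_pairs` instead of `SAMPLE`.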