
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Strange Googlebot behavior
Grabbing same page over and over in one hit
grandma genie




msg:4578697
 6:54 pm on May 28, 2013 (gmt 0)

Hello,

I am used to seeing Googlebot crawl my pages. Its hits usually look something like this:

66.249.73.17 - - [22/May/2013:10:08:53 -0700] "GET /direcctory/sample.html HTTP/1.1" 301 231 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

But only recently I've begun seeing this odd behavior. Is this the new normal for googlebot?

66.249.73.17 - - [22/May/2013:10:08:56 -0700] "GET /https://example.com/https://example.com/https://example.com/https://v.com/https://example.com/direcctory/sample.html/example.com/direcctory/sample.html/example.com/https:/example.com/direcctory/sample.html/example.com/direcctory/sample.html/example.com/https:/example.com/https:/example.com/direcctory/sample.html/example.com/direcctory/sample.hl/example.com/https:/example.com/direcctory/sample.hl/example.com/direcctory/sample.hl/example.com/https:/example.com/https:/example.com/https:/example.com/direcctory/sample.hl/example.com/direcctory/sample.hl/example.com/https:/example.com/direcctory/sample.hl/example.com/direcctory/sample.hl/example.com/https:/example.com/https:/example.com/direcctory/sample.hl/example.com/direcctory/sample.hl/example.com/https:/example.com/direcctory/sample.hl/example.com/direcctory/sample.hl HTTP/1.1" 301 276 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

My logs are full of this repetitious behavior.

-- Grandma
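
For anyone who wants to check their own logs for this pattern, here is a minimal sketch. It assumes the Apache "combined" log format shown in the sample lines above; the function name `find_nested_url_hits` and the "scheme inside the path" heuristic are illustrative choices, not anything the posters described.

```python
import re

# Apache "combined" format:
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Heuristic (an assumption): a URL scheme inside the request path
# suggests a malformed or doubled-up link.
EMBEDDED_SCHEME = re.compile(r'https?:/')


def find_nested_url_hits(lines, agent_substring="Googlebot"):
    """Return (ip, status, scheme_count, path) for each matching request."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a combined-format line
        if agent_substring not in match.group("agent"):
            continue  # some other client
        path = match.group("path")
        scheme_count = len(EMBEDDED_SCHEME.findall(path))
        if scheme_count:
            hits.append(
                (match.group("ip"), match.group("status"), scheme_count, path)
            )
    return hits
```

Pass it the lines of an access log (for example `find_nested_url_hits(open("access.log"))`, where the file name is a placeholder). The status column in the result shows at a glance whether these requests are being answered with a 301.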

 

lucy24




msg:4578799
 11:41 pm on May 28, 2013 (gmt 0)

Was that your own thread title or a moderator's? What you're quoting is not "the same page over and over"; it's some kind of garbage link. Wait a couple of days and you may not find the source in GWT. (Or is it only Bing that does this with 404s? I forget.)

:: trying to figure out if it was my browser that changed every occurrence of "html" into "h{trademark symbol}l" or did it happen earlier :) ::

grandma genie




msg:4579036
 2:25 pm on May 29, 2013 (gmt 0)

Hi Lucy,
I don't know why those little th's appeared, but I'll try it again now. This is a sample of Googlebot's odd indexing behavior. I have not seen it again for several days. Why all the repetitions?
66.249.73.nn - - [23/May/2013:01:11:44 -0700] "GET /https://example.com/https://example.com/https://example.com/https://example.com/https://example.com/directory/same_page.html/example.com/directory/same_page.html/example.com/https:/example.com/directory/same_page.html/example.com/directory/same_page.html/example.com/https:/example.com/https:/example.com/directory/same_page.html/example.com/directory/same_page.html/example.com/https:/example.com/directory/same_page.html/example.com/directory/same_page.html/example.com/https:/example.com/https:/example.com/https:/example.com/directory/same_page.html/example.com/directory/same_page.html/example.com/https:/example.com/directory/same_page.html/example.com/directory/same_page.html/example.com/https:/example.com/https:/example.com/directory/same_page.html/example.com/directory/same_page.html/example.com/https:/example.com/directory/same_page.html/example.com/directory/same_page.html HTTP/1.1" 301 291 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

lucy24




msg:4579150
 7:08 pm on May 29, 2013 (gmt 0)

Right now you don't know if the googlebot itself has the hiccups, or it's following a link from some hiccupy third party.

You don't have to 'adopt' all wrong requests. Sometimes "ignore it and it will go away" really is the best fix ;)

And, er, when I said "may not" I meant "may". Careless editing.

Key_Master




msg:4579185
 8:13 pm on May 29, 2013 (gmt 0)

I would be concerned. If that page doesn't exist, the server should respond with a 404 or 410, not a 301.

I think you have a faulty redirect somewhere, most likely having to do with port 443. What happens when you follow "/https://example.com/"?
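
One way to answer that question without a browser silently following the redirect is to send a single request and look at the raw status and `Location` header. The sketch below uses Python's standard `http.client`, which does not follow redirects on its own. The helper names `fetch_status` and `assess` are made up for illustration, and the wording of the verdicts is only a reading of the advice above.

```python
import http.client


def fetch_status(host, path, use_tls=False, timeout=10):
    """Send one HEAD request; return (status, Location header or None).

    http.client never follows redirects, so a 301 is reported as a 301.
    """
    conn_cls = (
        http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    )
    conn = conn_cls(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def assess(status, location):
    """Turn a status code into a short verdict for a path that should not exist."""
    if status in (404, 410):
        return "ok: server reports the page as missing"
    if status in (301, 302, 307, 308):
        return f"check redirect rules: {status} redirect to {location}"
    if status == 200:
        return "page is served directly"
    return f"unexpected status {status}"
```

For example, `assess(*fetch_status("example.com", "/https://example.com/"))` would report whether the bogus path gets a 404/410 or is being redirected somewhere, with `example.com` standing in for the real hostname.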

lucy24




msg:4579213
 9:25 pm on May 29, 2013 (gmt 0)

I didn't even notice the 301. Is it followed by another request? If not, the punch line may be "Oh! I didn't realize that 'example.com/blahblah/https://example.com/blahblah/https://example.com/blahblah' is the same page as 'www.example.com/blahblah/https://example.com/blahblah/https://example.com/blahblah'. I'll just scratch it off the list then."

On shared hosting I don't know if it's possible to find out more about the original request. The two obvious variables are: with/without leading www in hostname, and http vs. https protocol.

Does the form
https://example.com/
by itself lead to a real (non-redirected) page?
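
The two variables mentioned above (www or no www, http or https) give four forms of any URL. A small sketch that lists them so each can be requested in turn; `host_variants` is a hypothetical helper name, and `example.com` stands in for the real domain.

```python
from itertools import product


def host_variants(bare_domain, path="/"):
    """List the four scheme/hostname combinations for one path."""
    urls = []
    for scheme, prefix in product(("http", "https"), ("", "www.")):
        urls.append(f"{scheme}://{prefix}{bare_domain}{path}")
    return urls


for url in host_variants("example.com"):
    print(url)
```

On a correctly configured site, exactly one of the four should answer with the page itself, and the other three should redirect to it in a single step.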

grandma genie




msg:4579963
 4:57 pm on May 31, 2013 (gmt 0)

I only noticed those googlebot searches on one day. They all came from the same IP. The https://example.com does lead to a real page. It is not a redirect. However, there are some redirects in my htaccess file that involve port 443. My host put them there when I first moved my site to his VPS server. All my old links were from https and he changed everything to use http://example.com. It used to be https://www.example.com. But those redirects did not work. I had to wait for all the old https links to expire before anyone could find me again. (Took about a month.) Most of those old links are gone now. But the (non-working) htaccess redirects are still there. I assume I can remove them. I've asked my host about them. Waiting for his reply.
-- Grandma


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved