Googlebot Errors

Looking for "ht" extension instead of "htm"

     
6:02 am on Jan 29, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 5, 2002
posts:48
votes: 0


I am getting some hits from Googlebot where it is looking for example.ht pages instead of example.htm. I analyzed the error pages and reviewed the pages where the robot could have picked up those bad links, but everything there seems to be OK.

Any suggestions?

6:13 am on Jan 29, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


Incorrect external links, mayhap?
7:30 am on Jan 29, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 5, 2002
posts:48
votes: 0


Thank you for your reply, but I do not see any backlinks from other sites. Maybe they have low PR. I really doubt that anyone linked to those pages, as they contain only limited information.

I will be checking my access logs from specific sites this month.

7:46 am on Jan 29, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


This happened to my site yesterday also. Googlebot hit a file called deny.htm. The real filename was deny.html. Googlebot came close to requesting a file prohibited by robots.txt and triggering a spider trap. :)

Actually, Googlebot has been very well behaved lately and I only posted this in the event there is a problem that you should know about. I thought the last character being stripped from the extension was curious because AltaVista's Scooter had similar problems a few months ago (on a massive scale).

7:51 am on Jan 29, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


<added>I just rechecked the logs and Googlebot is still looking for files with a .htm extension. I think you have got a problem there. I'll E-mail the details to search quality(at)google.com</added>
7:53 am on Jan 29, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


Thanks, Key_Master. We'll check it out.
9:45 am on Jan 29, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 6, 2002
posts:59
votes: 0


It might be a Google thing - I have the same. Googlebot has been looking for *.ht files on my site today. I will be checking my site, but I have had no problems before.
9:50 am on Jan 29, 2003 (gmt 0)

New User

5+ Year Member

joined:Oct 4, 2008
posts:
votes: 0


same here, some examples:

64.68.82.46 - - [29/Jan/2003:11:36:17 -0500] "GET /projects.ht HTTP/1.0" 302 275
"-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

should be projects.htm

this also occurs on directories:

64.68.82.51 - - [29/Jan/2003:11:35:13 -0500] "GET /tristan HTTP/1.0" 301 299 "-"
"Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

should be the directory /tristan/

11:25 am on Jan 29, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 3, 2003
posts:58
votes: 0


Just wanted to confirm that I've encountered a similar phenomenon. Freshbot requested a sizable number of pages and in each case the final character of the URL was missing.

Gringo

6:19 pm on Jan 29, 2003 (gmt 0)

Inactive Member
Account Expired

 
 


By the way, I've already seen the deep crawl Googlebot (IP beginning with 216) on my index page.

Anybody else?

9:54 pm on Jan 29, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


Just talked to someone about this. Problem found, and should be solved now. Post an update here if you see any problems from now on. Thanks for reporting this! :)
7:57 am on Jan 30, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Dec 5, 2002
posts:219
votes: 0


Hi Tristan,

It looks like your "404 - Not found" redirection isn't set up properly as the line:
64.68.82.46 - - [29/Jan/2003:11:36:17 -0500] "GET /projects.ht HTTP/1.0" 302 275
"-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

shouldn't show a 302 header ...

When using the "ErrorDocument 404 ..." directive in your .htaccess file, don't use full URLs; otherwise you'll never return the correct 404 header. Use something like:
ErrorDocument 404 /myerrorfile.htm

The 302 header means "temporarily moved" ... are your files located on a cluster?
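
To make the difference concrete, here is a minimal sketch. The file name /myerrorfile.htm and the host www.example.com are placeholders, not anyone's real setup. As I understand the Apache documentation, a local path is served internally and the response keeps its 404 status, while a full URL makes Apache send the client a redirect instead:

```apache
# Local path: Apache serves this page internally,
# so the response still carries the 404 status.
ErrorDocument 404 /myerrorfile.htm

# Full URL (placeholder host): Apache sends a redirect to that URL,
# so the client sees a 302 instead of the original 404.
# Left commented out because it is the form to avoid.
# ErrorDocument 404 http://www.example.com/myerrorfile.htm
```

After changing the directive, request a page that does not exist and look at the status code in the access log: it should read 404, not 302.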

Dan

11:32 am on Jan 30, 2003 (gmt 0)

Preferred Member

10+ Year Member

joined:July 26, 2002
posts:535
votes: 0


If anyone needs another reason why Google is number one, read this thread carefully. When Microsoft is notified of a problem with their software, they generally take months to resolve it, not a few hours. For a company of Google's size, this response time is phenomenal.

It's not just the quality of the search results y'know...

11:44 am on Jan 30, 2003 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member fathom is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 5, 2002
posts:4110
votes: 109


Very nice jetboy_70! ;)

Totally, absolutely correct.

7:19 pm on Jan 30, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Well spotted, hetzeld!

Use ErrorDocument 404 /myerrorfile.htm,
not ErrorDocument 404 [example.com...]

This can cause serious trouble with search engines, the least of which is that your custom 404 page will start showing up in the SERPs.

See the Apache Core Features [httpd.apache.org] documentation for details.

Jim

8:35 pm on Jan 30, 2003 (gmt 0)

New User

5+ Year Member

joined:Oct 4, 2008
posts:
votes: 0


hey, thanks a lot dudes!
gonna fix that in my .htaccess files now

again: thanks!

8:46 pm on Jan 30, 2003 (gmt 0)

New User

5+ Year Member

joined:Oct 4, 2008
posts:
votes: 0


I just changed my .htaccess to contain the lines:

ErrorDocument 400 /
ErrorDocument 403 /
ErrorDocument 404 /
ErrorDocument 500 /

(before it was "ErrorDocument 404 [<mysite.com...]")

but now when I test the 404 redirection, and for example go to
[<mydomain>.com...]

I get the / page, but the address in my address bar doesn't change
to [<mydomain>.com...]
Is it possible to give the correct 404 headers, AND change the address
to [<mydomain>.com...]

Since this isn't really on topic anymore, feel free to sticky me

thanks a lot!

10:57 am on Jan 31, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Dec 5, 2002
posts:219
votes: 0


Hi Tristan,

You won't be able to change the apparent URL (the one in your browser's address bar) using the ErrorDocument directive.

To achieve this, you could use mod_rewrite with an external redirect (the [R] flag).
I have used this a few times for "file not found - 404" errors, since that case can be handled with a fairly trivial rule, but I'm not sure that this kind of URL rewrite can be applied to all error codes (especially the 500 code).
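
A rough sketch of such a rule, assuming it lives in the .htaccess file at the site root and that the error page is called /notfound.htm (a made-up name; the file must actually exist, otherwise the rule would redirect to itself in a loop):

```apache
# Hypothetical sketch: externally redirect any request that matches
# neither an existing file nor an existing directory.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# [R=302] sends an external redirect, so the address bar changes;
# [L] stops processing further rules for this request.
RewriteRule ^ /notfound.htm [R=302,L]
```

Keep in mind that the client then receives a 302 rather than a 404, which is the trade-off discussed earlier in the thread: the address bar changes, but robots no longer see a "not found" status for the missing page.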

Dan