Google News Archive Forum

Googlebot Errors
Looking for "ht" extension instead of "htm"

 6:02 am on Jan 29, 2003 (gmt 0)

I am getting some hits from Googlebot where it is looking for example.ht pages instead of example.htm. I analyzed the error pages and reviewed the page where the robot could have picked up those bad links, but everything there seems to be OK.

Any suggestions?



 6:13 am on Jan 29, 2003 (gmt 0)

Incorrect external links, mayhap?


 7:30 am on Jan 29, 2003 (gmt 0)

Thank you for your reply, but I do not see any backlinks from other sites. Maybe they have low PR. I really doubt that someone linked to those pages, as they contain limited information.

I will be checking my access logs from specific sites this month.


 7:46 am on Jan 29, 2003 (gmt 0)

This happened to my site yesterday also. Googlebot hit a file called deny.htm. The real filename was deny.html. Googlebot came close to violating a robots.txt prohibited file and triggering a spider trap. :)

Actually, Googlebot has been very well behaved lately and I only posted this in the event there is a problem that you should know about. I thought the last character being stripped from the extension was curious because AltaVista's Scooter had similar problems a few months ago (on a massive scale).


 7:51 am on Jan 29, 2003 (gmt 0)

<added>I just rechecked the logs and Googlebot is still looking for files with a .htm extension. I think you have got a problem there. I'll E-mail the details to search quality(at)google.com</added>


 7:53 am on Jan 29, 2003 (gmt 0)

Thanks, Key_Master. We'll check it out.


 9:45 am on Jan 29, 2003 (gmt 0)

It might be a Google thing - I have the same. Googlebot has been looking for *.ht files at my site today. I will be checking my site, but have had no problems before.


 9:50 am on Jan 29, 2003 (gmt 0)

Same here, some examples:

- - [29/Jan/2003:11:36:17 -0500] "GET /projects.ht HTTP/1.0" 302 275 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

should be projects.htm

This also occurs on directories:

- - [29/Jan/2003:11:35:13 -0500] "GET /tristan HTTP/1.0" 301 299 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

should be the directory /tristan/


 11:25 am on Jan 29, 2003 (gmt 0)

Just wanted to confirm that I've encountered a similar phenomenon. Freshbot requested a sizable number of pages and in each case the final character of the URL was missing.


 6:19 pm on Jan 29, 2003 (gmt 0)

By the way, I've already seen the deep crawl Googlebot (IP beginning with 216) on my index page.

Anybody else?


 9:54 pm on Jan 29, 2003 (gmt 0)

Just talked to someone about this. Problem found, and should be solved now. Post an update here if you see any problems from now on. Thanks for reporting this! :)


 7:57 am on Jan 30, 2003 (gmt 0)

Hi Tristan,

It looks like your "404 - Not found" redirection isn't set up properly, as the line:

- - [29/Jan/2003:11:36:17 -0500] "GET /projects.ht HTTP/1.0" 302 275 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

shouldn't show a 302 header ...

When using the "ErrorDocument 404... " directive in your .htaccess file, don't use full URLs. With a full URL, Apache sends the client a redirect to that URL, so you'll never return the correct 404 header. Use a local path instead, something like:
ErrorDocument 404 /myerrorfile.htm
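
To make the contrast concrete, here is a rough .htaccess sketch of the two forms. The host www.example.com is only a placeholder I made up for illustration, and the second directive is commented out on purpose:

```apache
# Local URL-path: Apache serves the page internally and keeps the 404 status.
ErrorDocument 404 /myerrorfile.htm

# Full URL (hypothetical host, shown commented out): Apache redirects the
# client to it, so the original request is answered with a 302, not a 404.
# ErrorDocument 404 http://www.example.com/myerrorfile.htm
```

Requesting a page that doesn't exist and checking the status code in the access log should show which behavior you are getting.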

The 302 header means "temporarily moved" ... are your files located on a cluster?



 11:32 am on Jan 30, 2003 (gmt 0)

If anyone needs another reason why Google is number one, read this thread carefully. When Microsoft is notified of a problem with their software, they generally take months to resolve it, not a few hours. For a company of Google's size, this response time is phenomenal.

It's not just the quality of the search results y'know...


 11:44 am on Jan 30, 2003 (gmt 0)

Very nice jetboy_70! ;)

Totally, absolutely correct.


 7:19 pm on Jan 30, 2003 (gmt 0)

Well spotted, hetzeld!

Use ErrorDocument 404 /myerrorfile.htm,
not ErrorDocument 404 [example.com...]

This can cause serious trouble with search engines, the least of which is that your custom 404 page will start showing up in the SERPs.

See the Apache Core Features [httpd.apache.org] documentation for details.



 8:35 pm on Jan 30, 2003 (gmt 0)

Hey, thanks a lot, dudes!
Gonna fix that in my .htaccess files now.

again: thanks!


 8:46 pm on Jan 30, 2003 (gmt 0)

I just changed my .htaccess to contain the lines:

ErrorDocument 400 /
ErrorDocument 403 /
ErrorDocument 404 /
ErrorDocument 500 /

(before it was "ErrorDocument 404 [<mysite.com...]")

But now when I test the 404 redirection, and for example go to

I get the / page, but the address in my address bar doesn't change to [<mydomain>.com...]
Is it possible to give the correct 404 headers AND change the address to [<mydomain>.com...]

Since this isn't really on topic anymore, feel free to sticky me

Thanks a lot!


 10:57 am on Jan 31, 2003 (gmt 0)

Hi Tristan,

You won't be able to change the apparent URL (the one in your browser's address bar) using the ErrorDocument directive.

To achieve this, you could use mod_rewrite with an external redirect ([R] flag).
I have used this a few times for "file not found - 404" errors, since a fairly trivial rule handles that case, but I'm not sure this kind of URL rewrite can be applied to all error codes (especially the 500 code).
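
For what it's worth, a rule of that kind might look roughly like the sketch below. It assumes mod_rewrite is enabled and allowed in .htaccess, and /notfound.htm is a hypothetical error page that must actually exist (otherwise the rule would redirect to itself):

```apache
RewriteEngine On

# Only act when the request matches neither an existing file nor a directory.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

# External redirect ([R]) to the hypothetical error page, so the browser's
# address bar changes; [L] stops further rule processing.
RewriteRule ^ /notfound.htm [R=302,L]
```

Keep in mind the trade-off: because this is an external redirect, the client (Googlebot included) receives a 302 for the missing URL rather than a 404, which is exactly what the ErrorDocument advice above tries to avoid. You get the changed address or the correct 404 header, not both.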


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved