
Why Can't Google Crawl with this .htaccess


bennyb3bil

3:28 am on Feb 18, 2009 (gmt 0)

10+ Year Member



Working on a friend's site for him... he's having problems getting Google to crawl his site. I fixed a bunch of sitemap issues he already had and finally got the sitemap to go through without errors, but now Google is reporting a 4xx error in Webmaster Tools under the "Errors for URLs in Sitemaps" category under Web crawl.

Here's what he has in his .htaccess

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# If the requested path is not an existing file...
RewriteCond %{REQUEST_FILENAME} !-f
# ...and not an existing directory...
RewriteCond %{REQUEST_FILENAME} !-d
# ...then hand the request to WordPress's front controller
RewriteRule . /index.php [L]
</IfModule>

When I remove any of this, the links to posts on his site no longer work... he's using WordPress 2.7 as his CMS.

Any ideas?

jdMorgan

4:08 am on Feb 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's the standard WordPress code, and unlikely to be the cause of the problem.

All it does is rewrite any request for a non-blank URL-path to WordPress, as long as that URL-path does not resolve to a physically existing file or directory. So if you have a "real" file named "blah.html", then a request for example.com/blah.html won't go to WordPress, but will get that real file instead. If the file does not exist, the request is passed to WordPress, and WP decides whether any content can be served for the requested URL.

Check that the entries in sitemap.xml correspond exactly to the canonical domain and URLs on the site -- no "www versus non-www" or letter-case discrepancies should be allowed. Also, beware that WMT reports can be quite stale; allow several weeks for your changes to be reflected in those reports, especially if this site isn't completely crawled on a daily or weekly basis.
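If a "www versus non-www" mismatch does turn up, one common fix is a 301 redirect to the canonical hostname, placed above the WordPress rules in .htaccess. This is just a sketch -- "www.example.com" is a placeholder for the site's real canonical hostname:

RewriteEngine On
# 301-redirect requests for any non-canonical hostname to the
# canonical www hostname; the trailing "?" in the pattern leaves
# empty Host headers (sent by some HTTP/1.0 clients) alone
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]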

Also check the robots.txt file and make sure its structure and syntax are 100% correct. If there isn't a robots.txt file, then that request too will be passed to WordPress...
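For reference, a minimal robots.txt that allows all crawling and points spiders at the sitemap would look like this (again, the hostname is a placeholder):

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml

An empty Disallow line means "nothing is disallowed," so all well-behaved spiders may crawl everything.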

Also be aware that some hosts may block search engine spiders at the firewall, either due to what they deem to be excessive crawling, or simply due to an error.

Jim

bennyb3bil

6:15 pm on Feb 18, 2009 (gmt 0)

10+ Year Member



Thanks so much for your insight, Jim. I will check all of the issues you've mentioned.

I will let you know when/how it gets resolved.