MSN has been spidering pages that are prohibited by robots.txt, and those pages can be found in the MSN cache. I believe there are no errors in the robots.txt file, which looks like this:
User-agent: *
Disallow: /directory1
Disallow: /directory2
...and yet MSN is spidering and caching /directory1 (for example).
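One way to double-check that the rules themselves read the way I think they do is to run them through a parser. Here is a minimal sketch using Python's standard urllib.robotparser. The user-agent string "msnbot" and the example paths are assumptions for illustration only. It shows how a compliant parser interprets the file, not what MSN's crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, fed in directly instead of fetched from a live site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory1
Disallow: /directory2
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if a compliant crawler may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "msnbot" and these paths are illustrative assumptions.
    for path in ("/directory1/page.html", "/directory2/", "/index.html"):
        print(path, is_allowed("msnbot", path))
```

If the two directory paths come back as not allowed, the file is at least being read as intended, and the problem lies with the crawler rather than the syntax.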
Also, lately, when I've set up a 301 (permanent) redirect from an old page to a new page, I've found both the old and new pages sitting in the MSN and Yahoo caches.
Another problem with Yahoo is that, even when I am using search-engine-friendly URLs with mod_rewrite, and there are no links to any dynamic pages on my site, Yahoo will spider many dynamic URLs. This happens with my own URL rewrites as well as with WordPress, where Yahoo has spidered the entire blog's dynamic URLs instead of the search-engine-friendly ones.
Google has indexed hundreds of pages on one of my sites that are identical to each other. This was caused by an error in a calendar script that I had used, and when I removed the calendar, it left hundreds of identical pages in Google's cache with URLs like:
/?url-1
/?url-2
/?url-3
This threw all of my pages but the home page into the supplemental results. I have a PHP script that sends a 404 header whenever a dynamic URL (one with a query string) is requested, but it hasn't stopped Google from adding even more nonexistent dynamic pages to its index.
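For anyone who wants to reproduce the idea, here is a rough Python analogue of that check, written as a tiny WSGI app. This is not my PHP script, just a sketch of the same logic: any request that carries a query string gets a 404. The names `app` and `status_for` are hypothetical helpers made up for this example.

```python
def app(environ, start_response):
    """Hypothetical WSGI app: answer 404 to any request with a query string."""
    if environ.get("QUERY_STRING"):
        status, body = "404 Not Found", b"Not Found"
    else:
        status, body = "200 OK", b"OK"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def status_for(path_and_query: str) -> str:
    """Call `app` with a minimal fake request and return the status line."""
    path, _, query = path_and_query.partition("?")
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": "GET",
        "PATH_INFO": path,
        "QUERY_STRING": query,
    }
    app(environ, start_response)
    return captured["status"]


if __name__ == "__main__":
    for url in ("/", "/?url-1", "/?url-2"):
        print(url, status_for(url))
```

Checking the status line this way at least confirms the server side is doing its part before blaming the crawler.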
I tried removing these with Google's URL removal tool, and it reported that the pages had been removed, but they weren't. I also added the following lines to the robots.txt file (found on Google's robots.txt information page), and that hasn't worked either:
User-agent: Googlebot
Disallow: /*?
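To see which URLs the `/*?` rule above should cover, here is a small sketch of a wildcard matcher. It assumes the wildcard semantics Google describes, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function name `rule_matches` is made up for this example, and note that Python's own urllib.robotparser does not handle these wildcards.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a wildcard robots.txt pattern matches `url_path`.

    Assumed semantics: `*` matches any sequence of characters, a trailing
    `$` anchors the end, and matching always starts at the beginning.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and join the pieces with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, url_path) is not None


if __name__ == "__main__":
    for url in ("/?url-1", "/blog/index.php?p=5", "/page.html"):
        print(url, rule_matches("/*?", url))
```

Under those assumptions, every URL containing a `?` matches the rule, so the calendar URLs listed earlier should all be blocked for a crawler that honors it.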
The Google URL removal tool has been down for a while now...