Forum Moderators: open
One thing is: I used different spellings for the same URLs (index.htm vs. index.HTM). The pages are identical, but Google seems to treat them as different pages. I have now made all links lower case, and I put the wrong spellings in my robots.txt file. Google doesn't spider the excluded pages any more, so the robots.txt syntax seems to be OK. But the duplicate pages still linger in the cache and are probably regarded as duplicate content (all the pages do have PageRank, though).
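The approach above relies on robots.txt path matching being case-sensitive, so `Disallow: /index.HTM` blocks only that exact spelling. That can be sanity-checked with Python's standard robots.txt parser; the host and paths below are made up for illustration:

```python
# Check that a robots.txt Disallow blocks only the exact (wrong-case)
# spelling, leaving the canonical lower-case URL crawlable.
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /index.HTM
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Path matching is case-sensitive: only the exact spelling is blocked.
print(rp.can_fetch("Googlebot", "http://example.com/index.HTM"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.htm"))  # True
```

Note that this only tells you the rule blocks fetching; as discussed below, it says nothing about whether the already-indexed URL gets removed.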
Any idea how I can get the old pages deleted without putting the real pages at risk?
If they were seen as duplicates, then they would just be merged; one of the two URLs would inherit the backlinks and PageRank of the other.
Now that you've unlinked the URLs, they should eventually disappear, but this year unlinked URLs haven't dropped out with the same regularity we were used to previously.
To have them removed, I would suggest removing the /robots.txt exclusion and using a robots meta tag to exclude them instead (e.g. you could use XSSI or similar to insert it conditionally). Then, when the robot visits, it should remove them.
in that case I'll just wait - I thought it was my fault they'd keep the old pages.
What's XSSI (extended server-side includes)?
But if I used the meta tag, I would have to know which spelling the bot used to access the page (e.g. "index.cfm" or "InDex.cfm"). (Which isn't hard to code, but I'm not sure how reliable that info is.)
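The idea of checking the requested spelling server-side can be sketched in a few lines. The thread suggests XSSI for this; here is the same logic as a hypothetical Python helper (the function name and the assumption that the server hands you the path exactly as requested are mine, not from the thread):

```python
# Hypothetical server-side check: emit a noindex robots meta tag
# whenever the requested spelling differs from the canonical
# all-lower-case spelling, so only the duplicates get dropped.
def robots_meta(requested_path: str) -> str:
    canonical = requested_path.lower()
    if requested_path != canonical:
        # Wrong-case duplicate: ask robots to remove it from the index.
        return '<meta name="robots" content="noindex">'
    # Canonical spelling: index as normal.
    return '<meta name="robots" content="index,follow">'

print(robots_meta("/InDex.cfm"))  # noindex variant
print(robots_meta("/index.cfm"))  # index,follow
```

In a CGI-style setup the requested path would typically come from something like the REQUEST_URI environment variable; how faithfully that preserves the bot's original casing is exactly the reliability question raised above.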
BTW, does Googlebot spider pages that no longer have any links to them anyway?
> does Googlebot spider pages that no longer have any links to them anyway?
Normally I think not, but it can happen. For example, some pages that have had no links pointing to them since early this year are still being crawled.
If the URL is /robots.txt excluded, then Google won't request the URL, and so it will never see the META tag robots exclusion.
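This ordering problem can be made concrete with the standard-library parser: while a URL is disallowed, a polite robot never fetches it, so a noindex meta tag on the page is invisible; drop the exclusion and the tag becomes reachable. Host and paths are again made up:

```python
# While a URL is disallowed in robots.txt, a polite crawler never
# fetches it, so any noindex meta tag in its HTML goes unseen.
from urllib import robotparser

def crawler_would_fetch(robots_txt: str, url: str) -> bool:
    """True if a polite robot may fetch the URL (and so could
    read a robots meta tag in the returned HTML)."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("Googlebot", url)

blocked = "User-agent: *\nDisallow: /index.HTM\n"
allowed = "User-agent: *\nDisallow:\n"  # empty Disallow = allow all

# While excluded, the meta tag can never be read:
print(crawler_would_fetch(blocked, "http://example.com/index.HTM"))  # False
# Remove the exclusion and the robot can fetch the page and act on noindex:
print(crawler_would_fetch(allowed, "http://example.com/index.HTM"))  # True
```

Hence the advice above: lift the /robots.txt exclusion first, then let the meta tag do the removal.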
A /robots.txt exclusion does not ask an engine to remove a URL from its index.
The point is, with all the reasons not to keep a disallowed page in their index, I can't think of one reason why a search engine would want to. And if they did, how would they know when that page no longer exists? A page could stay indexed indefinitely, because the SE would never know when it stopped existing.
I'm not arguing against you, just offering my reasoning. I don't have any hard facts or prior experience to base my thoughts on, but maybe you do. And maybe I misunderstood you in the first place ;)
Unless things have changed recently, Google will list a /robots.txt excluded URL without fetching it (so no title or snippet). This can allow members-only URLs to be listed if people link to them.
Sensitive information that the site doesn't want crawled cannot be protected by /robots.txt, which is merely a mechanism to request polite robots not to fetch those URLs.