Forum Moderators: Robert Charlton & goodroi
I wish to remove pages with session IDs from a large site that suffers from many indexed supplementals.
I would use a line like
Disallow: /*sessionid
And then tell the Google removal tool to revisit the robots.txt.
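For anyone wondering how that pattern matches, Google's wildcard rules can be approximated by translating the Disallow pattern into a regular expression. This is a rough sketch of the documented behaviour (* matches any run of characters, $ anchors the end of the URL), not Google's actual implementation:

```python
import re

def googlebot_blocked(url_path, disallow_pattern):
    """Approximate Google's robots.txt wildcard matching:
    '*' matches any run of characters, '$' anchors the end."""
    regex = re.escape(disallow_pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, url_path) is not None

# A URL carrying a session ID matches the wildcard rule; a clean one doesn't.
print(googlebot_blocked("/shop/item.php?sessionid=abc123", "/*sessionid"))  # True
print(googlebot_blocked("/shop/item.php", "/*sessionid"))                   # False
```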
My issue is this:
Many pages on this site are indexed via external links containing the sessionid variable - the links come from partner sites. There are no links on the site itself that contain these partner session IDs.
Will the removal bot simply crawl the site and remove URLs it comes across with *sessionid in them? Or is the removal bot smart enough to check all pages currently indexed in Google for that site?
All pages we want removed are now served automatically with noindex, nofollow when sessionid appears in the URL. But Google's cache of these pages is months old, and it simply isn't recrawling them because the links are on external pages, which it apparently rarely revisits.
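The conditional noindex serving described above can be sketched like this. This is a hypothetical handler for illustration only; the parameter name sessionid and the helper function are assumptions, not the actual site code:

```python
from urllib.parse import urlparse, parse_qs

def robots_meta_for(url):
    """Return the robots meta tag to emit for this request:
    session-ID URLs get noindex,nofollow; clean URLs get the default."""
    query = parse_qs(urlparse(url).query)
    if "sessionid" in query:
        return '<meta name="robots" content="noindex,nofollow">'
    return '<meta name="robots" content="index,follow">'

print(robots_meta_for("/page.php?sessionid=xyz"))
print(robots_meta_for("/page.php"))
```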
I could build a huge, stupid sitemap linking to all of these pages so Google will crawl them again, but that seems backwards. I could also remove them with the removal tool one at a time... but there are far too many.
Supplemental pages can have various causes as well.
Even though Google says it supports the wildcard (*) in robots.txt, as explained at the following link, the URL removal tool simply does not recognise this line!
To remove dynamically generated pages, you'd use this robots.txt entry:
User-agent: Googlebot
Disallow: /*?
[google.com...]
On the URL Removal tool it says:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW /*?
But if you have this in your robots.txt and you test a URL against it in the relevant Sitemaps tool, it says that the URL is being blocked.
I am totally confused, but it seems to be working.
The other strange behaviour with the removal tool is that it will happily remove some URLs that now return 404, but not others. My blog has an archive feature with links to dates, all of which were indexed by Google. I have now removed those pages, yet the URL removal tool says "request denied"...
There must be a way to disallow ALL dynamic URLs on a site. I can't believe that the wildcard is not yet a standard.
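The wildcard is indeed not part of the original robots.txt standard. As an illustration, Python's standard-library robots.txt parser follows the original rules and treats Disallow paths as literal prefixes, so a Googlebot-style wildcard rule never matches a dynamic URL (a small sketch, not how Googlebot itself behaves):

```python
import urllib.robotparser

# The original standard has no wildcards: this parser reads "Disallow: /*?"
# as a literal path prefix, so the dynamic URL below is NOT considered blocked.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /*?",
])
print(rp.can_fetch("Googlebot", "http://example.com/page.php?sessionid=abc"))
```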
The "request denied" message is probably about some tehnical issue or other. However, I would not advise over using the removal tool -- save it for important issues and let Google handle your ordinary 404 responses however it sees fit.
By trying to "micro-manage" the Google index, many people have accidentally created significant ranking problems. No other search engine that I know of even has such a tool -- I suggest treating it like the very heavy duty action that it is.
There's a difference between what you can put in the robots.txt for Googlebot (it accepts wildcards) and what the URL removal tool will accept (it must be one unique URL at a time).
Apart from the "remove URL" option, the tool also allows you to remove content according to your robots.txt. That is the option I am talking about.