I also tried to remove our site search URLs, but nothing happened even though we have blocked them through robots.txt.
The only thing that happened, after about 3-4 months, is that all the site search URLs now appear in the crawl error report as 404s.
Changes to your gwt prefs tend not to be retroactive. That is, they won't index new stuff but they'll keep the old stuff unless you explicitly tell them to delete it. Has anyone tried removing pages with parameters from the existing cache and index? Can it even be done, or do you just get a message saying "We've already removed that page"?
If you formerly had 10,000 and still have 10,000 I suspect something is working, because otherwise it would be 15,000 or 20,000 by now.
|Changes to your gwt prefs tend not to be retroactive. |
Right - exactly. If a type of URL is already indexed (such as Site Search) I use a 2-step approach for the clean-up. First, add a robots noindex to the template for about 4 weeks. Then add a robots.txt Disallow rule.
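The order matters because the crawler has to be able to fetch a page to see its noindex; once the Disallow rule is in place, it can no longer read the tag. Before moving on to step two, it may be worth confirming that the template really emits the tag. Here is a minimal Python sketch of such a check, assuming you already have the page HTML as a string (the names RobotsMetaParser and has_noindex are made up for this example):

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html):
    """Return True if the HTML carries a robots meta tag that includes noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Feed it the source of a few site search result pages; if has_noindex comes back False for any of them, the template change has not taken effect there yet.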
Hmm, OK, that is interesting; I assumed that it would be retroactive. The parameter I am trying to remove comes from dynamic URLs that were created by an indexing problem we had a year ago. That problem has been fixed so it can't happen again, but the URLs are still valid. It would be almost impossible to add a Disallow for each of these pages because they are dynamic. Is there a better way to do it? Can you disallow a parameter in robots.txt?
Thanks for all of the great info!
|Can you disallow a parameter in robots.txt? |
Most likely, if the character string that's used as your parameter name doesn't also appear in the file and directory structure that the site uses.
This requires a pattern matching wild card "*" within the Disallow rule - that's an extension of the earlier robots.txt specification that Google supports. So imagine you want to disallow crawling of any URL that uses the parameter "pdq".
The rule Disallow: /*pdq would do it. But if your parameter is "sch" and you also have a URL like /kirschwasser.php - then you're in a bit of trouble.
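To see the collision concretely, here is a rough Python sketch that approximates the wildcard matching by turning a Disallow pattern into a regular expression. It is only an approximation for testing your own rules, not Google's actual matcher; the function names and the sample URLs other than /kirschwasser.php are invented for the example.

```python
import re


def robots_pattern_to_regex(pattern):
    """Approximate a wildcard Disallow pattern as a compiled regex.

    Assumed semantics: the pattern matches from the start of the URL path,
    '*' matches any run of characters, and a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """Return True if the path (including any query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# Hypothetical URLs for illustration only.
print(is_blocked("/*pdq", "/search.php?pdq=shoes"))   # parameter URL
print(is_blocked("/*pdq", "/about.html"))             # unrelated page
print(is_blocked("/*sch", "/products.php?sch=red"))   # parameter URL
print(is_blocked("/*sch", "/kirschwasser.php"))       # ordinary page caught by the same rule
```

The last call is the trouble case: the rule has no way of knowing that "sch" inside "kirschwasser" is not a parameter, so the page is blocked along with the parameter URLs.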
If it is the first parameter then these would work:
If it is a non-first parameter, could you use
? Or does the ampersand have a robots.txt-specific meaning that I've forgotten about?
So if the problem parameter is amp;amp; (because of the way the system generates &), and I set up Disallow: /*amp;amp; then that should take care of the problem?
Yes. All those code snippets look like they would each be valid for their specific purposes.