|Experiences with the Google URL removal tool|
A nice tool, but use it with care!
The Google removal tool has often been mentioned in this forum as a cure for many diseases. I have used it quite often to remove obsolete content from my site and I am happy with its existence, but there are some side effects I have observed that are worth sharing. Most of the testing mentioned below was done on a non-commercial personal blog where removal from Google didn't hurt me; people wanting to use the tool on commercial sites should do so with caution.
URLs are filtered, not removed
After you use the tool, you may see that the number of pages Google reports is not lowered, but the actual number of pages shown by a site: search is. This is because the URLs' data is not deleted; the URLs are merely filtered out of the index. After the removal period is over, they suddenly pop up again in the site: search, complete with a cache. This also happened for me with pages that had already been deleted before I used the removal tool. So Google didn't reload the page and take a fresh cache copy, but merely restored data still stored somewhere in its index.
Length of removal
In the early days the removal tool removed a URL for a period of 90 days. This was later silently changed to 180. Now it seems to be a mix of the two. URLs I submitted on April 1st are back in the index and show the status expired in the tool, while URLs I submitted in February are still hidden and have the status complete. It seems that Google changed the period back to 90 days in March or April, although the documentation currently says 180 days.
(toolbar) PR is completely removed
I have observed this on a site where I have a set of pages tagged with <meta robots="noindex,follow">. I submitted some of these pages with the URL removal tool; the other pages were removed automatically after a few spider cycles of Googlebot. The interesting thing is that after the recent toolbar PR update, all tool-removed pages show toolbar PR0, while all Googlebot-removed pages show toolbar PR3. So, although none of these pages are indexed, the "noindex,follow" pages still carry PR and are probably capable of transferring PR to other pages, whereas the tool-removed pages are not. This is especially important for people who want to transfer domains and do a 301 redirect from the old domain to the new one. DO NOT USE the URL removal tool on the old domain in this case, as the value of all incoming links from external sites to the old domain may be discarded!
Links on removed pages are not spidered
Links from removed pages are effectively dead. On the same site mentioned above I have rows of pages linked as page1->page2->page3->... After removing page2 with the removal tool, page3 and all subsequent pages weren't spidered.
There is no way back
If you want a page to reappear after you used robots.txt or the meta robots tag to delete it from the search engine, you can easily reverse the removal. This is not the case with the removal tool. You have to wait out the 3 or 6 months.
So, using the tool for obsolete content is no problem. But if you want to speed up a 301 redirect, fix a duplicate content problem or the like, it might be the wrong road. For duplicate content my advice is to use the <meta robots="noindex,follow"> tag instead and let Googlebot sort it out, because it works the same way (it just takes a little more time) but the unindexed page will keep its value and is, in my experience, capable of transferring PR.
I removed a domain in Feb, it will be 180 days in 2 days from now, but as of today it says "complete" in the removal section.
So, after the 180 days is up, how long until the site gets indexed again?
Also, do you know if it is then subject to the sandbox again or if things have been "ticking in the background" and it should come back with full strength?
After the waiting period is over, all pages will pop up again in the SERPs. One day they are nowhere, the next day they are all back. Those that were supplemental when I removed them are still supplemental. The others are normally indexed.
I did not specifically test ranking, sandbox etc. I did the tests mainly on a non-commercial site where ranking was not an issue. When doing some searches now, the removed pages are at approximately the same position as other pages of the site that weren't removed, indicating that ranking restores after the removal period is finished.
I have noticed that everything pops up again once the removal period is over, even if the files no longer exist. I am now waiting for Google to remove these non-existent URLs.
One side effect I have noticed on two rather dormant sites I'm resurrecting is that using robots.txt to remove a page results in the site as a whole getting a good crawl from Gbot.
A small correction to my first post. <meta robots="noindex,follow"> is not the correct syntax. It should be:
<meta name="robots" content="noindex,follow">
I wish I could just get it to work. I tried emailing G and got only the standard reply. I have joined Google Groups to get a more definitive answer as to why I cannot remove pages. Still nothing works. Old pages have been deleted since February and they are still in the SERPs.
I ended up recreating the pages and going the 301 route. At least now the pages are going somewhere.
Have you noticed the removed urls returning on days 91 or 181? Also, I too have noticed urls that I do not want in the index returning after 6 months. These urls are from an old format and my site now uses a different url format. Ideally, Googlebot would hit those old urls and see my 301 to the new one. BUT, the cache that shows is the same cache that was there before I submitted it for removal, indexed with the old title/desc. Googlebot simply won't return to the url.
I want to use the Google URL removal option "Remove pages, subdirectories or images using a robots.txt file" to tell Googlebot to stay out of my images directory and to completely exclude their image bot. I have added the lines to my robots.txt file and left it otherwise unchanged.
But I saw the following blurb on that same web page: "Please keep in mind that submitting via the automatic URL removal system will cause a temporary, six months, removal of your site from the Google index."
Can someone please confirm for me that simply submitting a new robots.txt with a minor change using their tools does *not* completely wipe my site out of Google?
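For reference, the lines I added are along these lines (a sketch; the directory name /images/ is just an illustration of my setup):

```
# Keep Google's image crawler out entirely
User-agent: Googlebot-Image
Disallow: /

# Keep all crawlers out of the images directory
User-agent: *
Disallow: /images/
```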
The pages I submitted on April 1st this year returned after 91 days, but those that I submitted earlier in February are in the 180-day cycle and will return at the end of this month.
|the cache that shows is the same cache that was there before I submitted it for removal, and indexed with old title/desc. Googlebot simply won't return to the url |
Yep, one of the big problems of the URL removal tool. The tool doesn't delete the content from Google's index but only makes it invisible for 6 months. After that period you get the same old URLs back in the SERPs, so the removal tool doesn't work to permanently remove them. The URL will come back, will often become supplemental and can exist in this zombie state for years.
The only thing I know that works to get those old URLs permanently deleted is to 410 or 301 them in a .htaccess file and put the URL in a link on a daily-spidered high PR page. Googlebot will then revisit the URL and see the 410 or 301. After a few spider cycles the URL will be deleted from the index. I prefer the 301 redirect to a comparable page, because if there are still some pages linking to the old URL, the value of those links will be transferred to the new page.
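For example, a minimal .htaccess sketch (Apache with mod_alias assumed; the paths and target URL are only placeholders):

```
# Tell Googlebot this URL is permanently gone (410)
Redirect gone /old/obsolete-page.html

# Or send the old URL, and its link value, to a comparable new page (301)
Redirect permanent /old/moved-page.html http://www.example.com/new/moved-page.html
```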
|Can someone please confirm for me that simply submitting a new robots.txt with a minor change using their tools does *not* completely wipe my site out of Google? |
One minor error in your robots.txt and you can book a six-month holiday. Always test your robots.txt with a validator before you feed it to Google. Read these posts for motivation:
http://www.webmasterworld.com/forum30/30438-2-10.htm message 14 [webmasterworld.com]
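If you want to double-check the effect of a rule set yourself, Python's standard robotparser module gives a quick local sanity check before you upload anything (a sketch; the rules and URLs below are only illustrations):

```python
# Sanity-check robots.txt rules locally before feeding them to Google's
# removal tool. The rule set and URLs are illustrative, not from my site.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow: /old-stuff/",
]

rp = RobotFileParser()
rp.parse(rules)

# Blocked: the URL path falls under the Disallow prefix
print(rp.can_fetch("Googlebot", "http://www.example.com/old-stuff/page.html"))  # False
# Allowed: no matching Disallow rule
print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))  # True
```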
|Links on removed pages are not spidered |
My experience was just the opposite. I wonder what it depends on, but I had no problem with new pages linked from a removed page being spidered, and even with PR passing from the removed page to them.
The URL removal tool is worse than useless.
After removing the urls, they all come back in 3-6 months and cause havoc on your site. And, from my experience, you can't remove them again using the tool.
A monkey could have programmed a more useful tool.
Removal should mean removal from the index for good. When the time limit expires, it should simply mean that the spider is allowed to respider those pages again, provided it is not restricted by robots.txt.
After the expiration, all the stuff reappears - even if it no longer exists. Then you are stuck with waiting for it to fade away.
When you look at the serps, it almost seems like Google has just been abandoned. No changes, stale serps, old removed urls returning even if they don't exist anymore.
|No changes, stale serps, old removed urls returning even if they don't exist anymore. |
What we see happening now is the return of all those URLs that were removed with the URL removal tool at the beginning of this year, when we thought it was the ultimate solution to the 302 hijacking problem. Now, after six months, all that crap is coming back and polluting the SERPs. Because many of these URLs no longer have any incoming links, they will remain stale in the SERPs as supplemental jurassic fossils.
Lammert - Exactly. Around Feb. there was a big push to use the URL removal tool to try to solve the 302 hijacking issue. This stuff will all be re-added to the indexes over the next few weeks.
Maybe this is why I am noticing stuff in the serps with cache dates of Mar. 2004.
|My experience was just the opposite. I wonder what it depends on, but I had no problem with spidering new pages linked from removed page, and even with passing PR from removed page to them. |
Wizard, it could be that we used different methods to remove the URLs from the SERPs. Most URLs I removed either returned a 404, or I placed a meta tag in the header. I only once used the robots.txt to remove a larger set of URLs.
I have one more observation to mention about the disappearing (toolbar) PR. A few days ago, a series of URLs that I removed in February came back in the SERPs. Those pages had toolbar PR0 while they were removed, but when I checked this morning, all the pages have their original toolbar PR back (mostly PR2 or PR3). I haven't seen any mention of a toolbar PR update on this forum in the last week, so I can only conclude that while a page is removed with the URL removal tool, the toolbar returns PR0, but the actual toolbar PR is stored somewhere in the Google database. As soon as the removal expires, that PR is visible again. This is the same behaviour as with the cache and snippet: they are stored in the index, but during the removal period they are inaccessible.
Although my experiences with spidering and PR transfer to other pages are different from yours, Wizard, my tests were on too small a scale to give any scientific evidence, so your observation could be true and mine just caused by something else. I am now thinking about how to set up a double-blind experiment. If there are any results, I will mention them here.
When your pages returned to the index with pagerank, did they begin ranking on terms/phrases as they did before? I had a page returned that was removed 6 months ago. It returned overnight, pagerank and all. But it was nowhere to be found on any search phrases, not even exact snippets of text searched in quotes.
Almost all the pages I deleted didn't rank very well before I deleted them. Most of them were on a personal blog where ranking was not an issue. I have checked now, and most of them are visible in the SERPs when searching for text phrases, but only a few of them are at #1. Mostly they are somewhere between #10 and #100 for a search query with a few thousand results.
Oddly, I think that there may be two issues happening here - they seem related but might not be...
For instance, I have had much the same experience as lammert and crobb305 - I used the Google removal tool in late Jan to remove pages from a subdirectory on my kids' homework website. I certainly didn't care about the very tiny page rank they had, so I used those pages as a test of the Google removal tool.
After removing them from the old site, I started a new domain and it steadily gained some nice PR and came up on page 1 for relevant searches. Then, 6 months later to the day, the old pages showed back up on the old site as supplemental results. They must have retained some page rank, because initially they began showing up in SERPs again, and my new website tanked (I assume from a duplicate content penalty). This all happened in the space of a couple of days, so during the 6-month "removal", I think that Google was somehow still running the algo on them and had already pegged them as duplicate content.
They only showed up in SERPs for about a week, then lost all page rank and quit showing up, but were still listed in the site: search. Realizing that Google had no backlinks to those old pages, and no reason to ever spider the urls again to get the appropriate 404, I created some backlinks on other sites, and slowly but surely they are now dropping off. But I expect the new website will be slow to recover from the penalty.
ON THE OTHER HAND
I have also been watching a competitor's website in an entirely different (commercial) topic area. They had an .asp database website until February, then launched a new all-.html website. They still have the old defunct .asp database on their server - they didn't take it down, so Google still sees it, even though the new .html site has no links to it. Eventually all of the .asp stuff lost rank and became Supplemental Results. It just sat there like that. UNTIL late July. Really weird, but those Supplemental Results all of a sudden started showing up here and there in the SERPs again. I repeat, I keep close track of this website, and they did NOT ever use the Google removal tool.
So, at this point, I think that the odd Supplemental Results that are popping up in various SERPS might only be "coincidentally" related to many of our January/February tests with using the Google removal tool.
We are trying to remove wildcard-type pages from our indexed site, i.e. pages that are being indexed with a ? mark after them, like www.name.com/?
So we used the information for such a removal and placed it in our robots.txt file.
However, when attempting to use the url removal tool for this, we get the following message:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
So what should it be? How can we get these pages removed?
If Google ignores the ? and * wildcards in the robots.txt, then it might be deduced that what Googlebot sees in your syntax is this:
I suggest you change your robots.txt asap.
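Since Disallow rules in the original robots.txt protocol are literal path prefixes rather than patterns, a wildcard-free rule might already do what you want (a sketch; adjust it to your own paths):

```
# Disallow values are literal prefixes, not patterns.
# "/?" matches www.name.com/? and anything that starts with it.
User-agent: Googlebot
Disallow: /?
```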
I have had this question in the past myself.
The real problem is that I find various "pages" indexed in Google with strange strings of wildcards - things that have never been strings in my standard .php code, but somehow Gbot is following oddities.
In my opinion, robots.txt and removal tool are not a solution for wildcards, because of the risk of blocking out your good pages. Sorry, I don't have a suggestion for how to solve the wildcard oddities.
Well, I am afraid that from first-hand experience it doesn't work.
I used the tool to remove my domain www.google.com and ...
I am looking to use the removal tool to have Google remove non-www pages in hopes of clearing up some duplicate content issue.
How can I set up a robots.txt to make sure Google is removing the non-www pages? I feel I need to specify the absolute path (and not one relative to my server).
Say I want to remove 'directory'
normally I would do:
However, how do I know this is removing the non-www pages? I feel like I need something like:
Will that work? I'm not sure if I am allowed to put the domain name in a robots.txt and specify the full path.
Please also note that I have a 301 redirect in place which redirects from non-www to www. I don't want Google to remove my www pages.
|I am looking to use the removal tool to have Google remove non-www pages in hopes of clearing up some duplicate content issue. |
There is a bug/feature in the URL removal tool which causes deletion of the www version to automatically delete the non-www version as well, and vice versa. Some members saw their complete websites disappear from the Google index in a matter of days when they tried to solve their www/non-www problem this way in February this year. Never use the URL removal tool for this kind of thing. It might wipe out your site.
The best thing is to install the 301 redirect--as you already did--and let Googlebot in. This is important: the only way Googlebot will know the 301 is in place is when the page is actually spidered. Putting the unwanted URLs in the robots.txt will stop spidering of those files and will prevent deletion. Those URLs will go supplemental after some time and might take months or years to disappear. So the best thing is to let Googlebot do the work. Any manual interference with the removal tool or robots.txt might do more harm than good.
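For reference, the usual non-www to www 301 looks something like this in .htaccess (Apache with mod_rewrite assumed; example.com is only a placeholder for your domain):

```
# Send every request for the bare domain to the www host with a 301
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```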
lammert, thanks so much, the response was very helpful and has helped me avoid a bad situation.
A little late to the thread, but I have an issue with how to use the Remove Tool and hope I can get some feedback.
I have a PHP page with no "HTML" content - it's simply a redirect page taking a parameter from the query string.
Google has this page indexed 1000+ times but as it's a redirect page, I'm concerned it may be viewed as cloaking or some other devious tactic.
I understand I cannot use wildcards in the robots.txt for the Remove Tool, but using a META tag on the page won't work either: the page doesn't produce any content (and can't, or the automatic redirect would break, as the content-type for the page would already be set). Hope that makes sense!
Any ideas on the best way to remove this page entirely from the index? I had thought about listing each "version" of the page in the robots.txt (with the parameters), but I believe the robots.txt has a limitation of 100 URLs. I could just repeat the process 10+ times, but obviously I'd rather look at another solution first!