
Removing entire website via robots.txt

Consequences of removing an entire website

   
8:41 pm on Jul 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



At the beginning of this year one of our websites had problems caused by duplicate content. To fix this, we decided to remove the entire website from the index by changing the robots.txt to "User-agent: * Disallow: /". We used Google's automatic URL removal system to speed up the process. After a few days all pages had been removed, and we changed the robots.txt back to "User-agent: * Allow: /".
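For reference, the blocking robots.txt described above is just the two lines below (shown here on separate lines, as the file would actually be written):

```
User-agent: *
Disallow: /
```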

This action led to the following behaviour:

- After five months there are still no pages in the index. When we removed the website, Google stated that this "will cause a temporary, 90 day removal". Some time later, Google increased this period to "180 days". One would expect only websites removed after that change to be affected; however, our website is affected as well.

- In the past, websites that excluded Googlebot from crawling their pages were still in the index: the results appeared as URL-only entries, but they could be found via incoming links and their anchor text. Now, however, the site no longer exists in the index at all. Even searching for the domain name doesn't bring it up. Also, in the past PR was passed to pages of such websites, while now these pages have PR0.

- In the past there was no effect on the directory. Now, however, the domain has been removed from Google's directory as well.

The consequences of removing the entire website were not only different than expected; one could also use this behaviour to harm other websites (if you have access, e.g. if you want to hurt a client's site). Just change the robots.txt and use the automatic URL removal system, then change the robots.txt back after one or two days. The website will be removed for (at least) half a year, and it will be hard to find the reason. Even more time will pass before the original situation (all pages indexed and carrying PR) is restored.

To avoid such problems, I would suggest that Google change their policy and reinclude a website once its robots.txt is changed back. I would also prefer that excluding Googlebot not lead to removal of the directory entry.

12:42 pm on Jul 15, 2005 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



> Allow:

is not something recommended by the robots.txt [google.com] standard [robotstxt.org]. We have some ancillary evidence that it may be confusing some search engines and causing indexing problems. Its usage is not recommended.

[robotstxt.org...]

12:58 pm on Jul 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just checked the robots.txt again. We actually used "User-agent: * Disallow: " (and not "User-agent: * Allow: /" as written in my first post), i.e. the behaviour isn't caused by wrong syntax.
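To illustrate the difference between the two variants discussed here (an empty "Disallow:" permits everything, while "Disallow: /" blocks everything), here is a small sketch using Python's standard-library robots.txt parser. This is just a modern illustration of the standard's semantics, not something from the original discussion:

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /" blocks the whole site for all user agents.
blocker = RobotFileParser()
blocker.parse(["User-agent: *", "Disallow: /"])

# An empty "Disallow: " means nothing is disallowed, i.e. allow everything.
allower = RobotFileParser()
allower.parse(["User-agent: *", "Disallow: "])

# example.com is a placeholder URL for illustration.
print(blocker.can_fetch("Googlebot", "http://example.com/page.html"))  # False
print(allower.can_fetch("Googlebot", "http://example.com/page.html"))  # True
```

So the poster's corrected file ("Disallow: " with no path) is valid, standard-compliant allow-all syntax, unlike the non-standard "Allow: /".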
10:41 pm on Jul 15, 2005 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I am seeing stuff being reincluded (as URL-only entries) after 90 days even though the robots.txt still has the Disallow: /cgi-bin in place. Do I really have to resubmit it all again? Aaarrgghhh!
12:07 pm on Jul 19, 2005 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I would be surprised if the URL-only entries did not pop back up in the index. Robots exclusion usually doesn't stop that. You will also see full crawls on the banned site. The only way to stop that is with an htaccess ban.
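An htaccess ban of the kind mentioned above could look like the sketch below, assuming an Apache server with mod_rewrite available (the user-agent match and the choice of a 403 response are assumptions; the thread doesn't specify the exact rules):

```apache
# Hypothetical sketch: return 403 Forbidden to Googlebot at the server level,
# so the pages are never fetched at all (unlike robots.txt, which only asks).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F]
```

Unlike a robots.txt Disallow, which a crawler honours voluntarily and which still leaves URL-only entries possible, a server-level ban prevents the content from being retrieved in the first place.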
12:21 pm on Jul 19, 2005 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




Changed all the links pointing to that stuff to be rel="nofollow" for all those pages that are behind a password, and have put "meta noindex" on all the others that Google could otherwise index, instead of the entries in robots.txt. Maybe that will work.
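The two changes described in that post would look roughly like this in the page markup (illustrative HTML; the actual paths and link text are made up):

```html
<!-- On pages Google could otherwise index: ask engines not to index them. -->
<meta name="robots" content="noindex">

<!-- On links into the password-protected area: don't pass link weight. -->
<a href="/members/" rel="nofollow">Members area</a>
```

A meta noindex differs from a robots.txt Disallow in that the page must still be crawled for the tag to be seen, but it then keeps the page (including a URL-only entry) out of the index.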
3:14 pm on Jul 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I would be surprised if the url only entries did not pop back up in the index.

This is what I expected. In the past one could find URL-only entries of sites which banned Googlebot. (I would be thankful for such behaviour - I never wanted a complete removal.) However, all information about this domain has been removed - even the directory entry.