I may need to do this soon, but I have little enthusiasm for submitting pages to the removal tool, especially in cases where there are a lot of pages (necessary site redesign). I know, I know. :/
Redirect gone /folder/file.html

or, for a lot of files:

RedirectMatch gone ^/folder/(subfolder1|subfolder2|subfolder3|subfolder4)/.*

These are examples from pages that were really gone. I noticed Googlebot stops requesting files that are served a 410 Gone, and they stop showing up in the SERPs. So maybe there's a way, using mod_rewrite, to serve a 410 to Googlebot only? Something like this:
RewriteEngine On
# Only act on requests whose user-agent contains "Googlebot"
RewriteCond %{HTTP_USER_AGENT} Googlebot
# ...and only for this specific file (note the escaped dot in the regex)
RewriteCond %{REQUEST_URI} ^/folder/file\.html$
# Serve 410 Gone; [G] implies [L], so processing stops here
RewriteRule .* - [G]
I haven't tested this; it's just an idea.
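If anyone does try it, a quick way to check the behaviour before relying on it is to compare responses with and without a spoofed user-agent, e.g. with curl (the URL here is just a placeholder):

curl -I -A "Googlebot" http://www.example.com/folder/file.html
curl -I http://www.example.com/folder/file.html

If the rules are working, the first request should come back 410 Gone and the second 200 OK.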
From reading these forums, I understand Google also crawls from some not-so-well-known IPs and uses at least one other UA string (Python?), so be careful.
Anyway, the noindex,nofollow meta tag alone won't do the job here: Google has to fetch the page to read that tag.
Regarding cloaking for Googlebot and serving a 410 response: yes, it would probably work. But then you'd presumably have on-site and inbound links pointing to a page that returns 410 whenever Googlebot tries to spider it. While the obsolete inbound links won't matter for a long time, or maybe ever, links remaining on your own site that point to a 'dead page' might be one of those 'more than a hundred' indicators of page/site quality that Google uses.
Many people get in trouble when they (mis)use HTTP response codes and mechanisms as 'quick' or 'easy' ways to do something other than exactly what those codes were intended to do and mean. A classic disaster is the 'easy' method of invoking a PHP content-handler script by defining that script as the custom error document for 404 Not Found. Sure, it will serve pages, but each one goes out with a 404 Not Found status code, and then people wonder why the pages don't rank. I advise using HTTP server response codes and mechanisms in a simple, straightforward manner only.
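To make that anti-pattern concrete, here's a minimal sketch of what the broken setup looks like (the script name is only an illustration, not something from this thread):

# Looks convenient: every unmatched URL is handed to a PHP script...
ErrorDocument 404 /content-handler.php
# ...but Apache keeps the 404 status line on everything the script
# outputs, so visitors see pages while crawlers see "Not Found".

If a script really must generate pages for arbitrary URLs, one straightforward route is an ordinary rewrite to the script, which keeps the normal 200 status on successful pages instead of hijacking the error handler:

RewriteEngine On
# ^/? makes the pattern work in both server and .htaccess context
RewriteRule ^/?articles/ /content-handler.php [L]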
If you want to keep page content out of Google, Disallow the page in robots.txt. If you don't want Google to even mention the URL in search results, then don't Disallow it in robots.txt; put a <meta name="robots" content="noindex"> tag on the page instead.
Pages disallowed in robots.txt often get a URL-only listing if Googlebot finds a link to them: Google has complied with robots.txt by not fetching the page, but it can use the link text it finds as keywords and return the bare URL in search results.
On the other hand, if you put the noindex meta-tag on the page and allow Googlebot to fetch it (so it can read the tag), then the page and URL should stay out of the index.
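Putting the two mechanisms side by side (the path is just a placeholder):

To block fetching (the URL may still be listed from link text alone):

User-agent: *
Disallow: /private-page.html

To keep the URL out of the index entirely, leave the page fetchable and add this to its <head>:

<meta name="robots" content="noindex">

Whatever you do, don't combine them: if robots.txt blocks the page, Googlebot never gets to read the noindex tag.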
Anyway, if you've got pages you want removed fast and you want them to stay removed, then use the removal tool and also follow the Disallow or noindex procedure above.
If you're not in a hurry, then the same steps apply, but you needn't use the removal tool.
Jim
"If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, the webmaster must first insert the appropriate meta tags into the page's HTML code. Doing this and submitting via the automatic URL removal system will cause a temporary, 180-day removal of these pages from the Google index, regardless of whether you remove the robots.txt file or meta tags after processing your request."
"Please keep in mind that submitting via the automatic URL removal system will cause a temporary, six months, removal of your site from the Google index. You may review the status of submitted requests in the column to the right."
I assume that my whole site won't be removed, just the few URLs that I submit?