I was just about to put a folder onto one of my sites and do my best to stop G from crawling it by using nofollow and robots.txt, so what's the answer to stopping G from crawling something?
I too would really like Google to stop indexing some pages. I've done what they recommend (robots.txt), but the old pages live on.
Ironic, after working so hard and waiting to get them ranked originally.
Having dug around a little more, it seems that Gbot is almost uncontrollable and will not obey anything. So what options do you have to stop it from crawling, as it appears it can now follow JavaScript as well?
If some other site is linking to that page then googlebot will find it through the inbound link.
If the page exists in Google's cache, then Googlebot will try to refresh it.
Solution: make sure robots.txt validates, and submit the robots.txt file to the removal tool. Those pages will be removed for a minimum of 6 months. (Be very sure of your robots.txt file: do not remove your entire site like others have done.)
Make sure that your disallow syntax is 100% correct. Each disallowed path must start with a / to be valid, and there must be a blank line before each User-agent line (other than the first) to separate the records.
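One quick way to sanity-check your disallow rules before submitting anything is Python's standard-library robots.txt parser. This is just an illustrative sketch; the paths and rules below are made-up examples, not anyone's real robots.txt:

```python
from urllib import robotparser

# Check what a given user-agent may fetch under a robots.txt.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /members/",    # disallowed paths must start with /
    "",                       # blank line separates records
    "User-agent: *",
    "Disallow: /cgi-bin/",
])
rp.modified()  # mark the rules as "read" so can_fetch() will answer

print(rp.can_fetch("Googlebot", "/members/login.html"))  # False
print(rp.can_fetch("Googlebot", "/index.html"))          # True
```

If a URL you meant to block comes back as fetchable here, the syntax is probably wrong, and you definitely don't want to feed that file to the removal tool.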
If you disallow spidering of a page or folder by Google then that does not actually stop Google from knowing that the URL "exists" when it is seen in links to that page from other sites.
In that case the URL will be shown as a URL-only entry in Google search results. In Yahoo results, they will then also use the anchor text from the other site (as long as it is not something like "click here") to build a title for the page that they didn't even spider.
If Google has already indexed a page, and you then add a robots.txt exclusion for that page, Google often reverts to showing a cached copy of the page from a time just before you disallowed it, usually as a Supplemental Result. Those are impossible to remove from their index.
If you do not want the page to show up at all, then instead you must use the <meta name="robots" content="noindex"> tag on the page itself. Again, for Supplemental Results the update takes a very long time. For normal results they usually then drop out within days to weeks.
If you make an error there is an out.
Since the removal tool is an axe on steroids maybe I should tell a bit about it.
1. Don't use a complicated robots.txt file (you never know what could go wrong); replace it with a simple robots.txt designed specifically for Googlebot to remove those pages. (Whatever is disallowed will be axed from Google.)
2. In the URL removal tool there are 3 options; one option is to submit the URL of your robots.txt file.
3. After you submit it you will be given a list of pages "pending removal".
4. If something went wrong (your home page or some important page is in the list) then don't panic; it takes 24-36 hrs for the removal to happen, and Google will revisit robots.txt during the process.
5. If you like what is going to be removed, leave the robots.txt alone until the removal goes through; but if you do not want one of the "pending" pages removed, simply remove the robots.txt file from your site and the removal process will fail. (It will come up "page removed" or "removal failed".)
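As a sketch, a stripped-down removal-only robots.txt along those lines might look like this (the paths are placeholders for whatever you actually want axed):

```
User-agent: Googlebot
Disallow: /old-folder/
Disallow: /another-old-page.html
```

Keep it this simple for the duration of the removal, then put your normal robots.txt back afterwards.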
|Pages that are banned in robots.txt or have noindex, nofollow on them are being indexed and are ranking poorly. |
How are you determining this? Are you doing site: searches? If so, this has no bearing on anything really. If the pages are showing up for search queries, then it may be another story.
Googlebot is going to fetch everything that has a reference to it. In the instance of stuff blocked via robots.txt protocol, you're going to see a URI only listing when doing site: searches.
Want to keep those pages out of the index? Don't use the robots.txt protocol; instead, drop a robots meta tag on that page. I've been testing this for a couple of years now and it works like a charm.
Instructions for utilizing the Robots META Tag
<meta name="robots" content="none">
Place the above directive (Robots META Tag) right after the <head> of your document.
That should prevent those bots that obey the protocol from indexing and following links on the page.
Now, if you want to allow the links on that page to be followed, you might do this...
<meta name="robots" content="noindex, follow">
Both Google and MSN have developed specific Robots META Tags that you can use to keep just their bots away from those pages.
<meta name="googlebot" content="none">
<meta name="msnbot" content="none">
There are other robots terms that you can use to manipulate spider behavior.
I'm sure this has been discussed before, but I am relatively new to WebmasterWorld. Does it violate Google TOS to send a 404 to Googlebot for pages that it already indexed that you now want it to remove? What about dynamically avoiding internal links to pages you don't want Googlebot to see? (Kind of reverse cloaking.)
|If some other site is linking to that page then googlebot will find it through the inbound link. |
Nope, only internal linking. It is also indexing my redirects, which are in a special folder /redirect/ that is banned in robots.txt.
|If the page exists in Google's cache, then Googlebot will try to refresh it. |
These pages were banned in robots.txt and the noindex tag inserted before even uploading them to the web.
|Make sure that your disallow syntax is 100% correct. |
|solution: make sure robots.txt validates |
Thanks, and I'm sure that will help others, but I'm a big "validator": nearly everything I upload has been through an HTML/link/PHP/robots/crawler etc. validator before going live.
|If you do not want the page to show up at all, then instead you must use the <meta name="robots" content="noindex"> tag on the page itself. |
That's what I said, my pages have those and they are still being indexed. Some are just URLs only but some aren't.
|That's what I said, my pages have those and they are still being indexed. |
Remove the Disallow: from the robots.txt file. ;)
As I said above, URL-only entries happen when Google knows that the URL exists, as they have seen it in links, but have been disallowed from indexing the content of the page by the entries in the robots.txt file. Having something disallowed in the robots.txt file says nothing about whether they can list the URL or not (they can). It is merely a directive to not index the content at that URL.
As they are not seeing the content, then they don't see the meta robots noindex tag; so they don't follow it. Remove the robots.txt directive and only let them see the meta tag. That will get you what you want.
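That interaction is the crux of the whole thread, so here is a toy model of it (my own sketch, not Google's actual pipeline): a disallowed URL is never fetched, so its meta noindex tag is never seen, yet the URL can still appear as a URL-only entry.

```python
from urllib import robotparser

# Made-up example data: a disallow rule plus a page carrying noindex.
ROBOTS = ["User-agent: *", "Disallow: /members/"]
PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'

def listing_for(path, robots_lines=ROBOTS, html=PAGE):
    """Crude model of the crawl/index decision described above."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    rp.modified()
    if not rp.can_fetch("Googlebot", path):
        # Never fetched: the noindex tag is invisible, but the bare
        # URL can still be listed.
        return "URL-only entry (content never fetched)"
    if "noindex" in html.lower():
        return "dropped from index (noindex tag seen)"
    return "indexed normally"

print(listing_for("/members/page.html"))  # URL-only entry (content never fetched)
print(listing_for("/page.html"))          # dropped from index (noindex tag seen)
```

In other words, the robots.txt disallow and the meta noindex tag work against each other: you have to let the bot in before it can read the instruction to stay out of the index.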
Having read through this thread I just wanted your input for our situation which is with a new site/new pages that G hasn't had the opportunity to crawl yet. I have a 'members' section/folder which I have added to the robots.txt file and then been specific to G:
Then within the code on the pages within this folder to be doubly sure I've added <meta name="robots" CONTENT="noindex, nofollow" />
Do you think this is all I can do to protect this folder from a G invasion?
Best bet is to do what P1R and g1smd advised and to remove the robots.txt disallow and rely on the meta robots noindex.
Thanks for the feedback; having to stop G from seeing pages is all very new to me.
Read my post again, and ask yourself this:
- do you want the pages to NOT be spidered (saving bandwidth), with a risk that the pages will show in the index as URL-only entries?
- do you mind having the pages spidered for search engines to read the meta tags, so that the pages will never appear in the index?
The pages in the 'members' section actually just hold the login and join pages via affiliate codes, so there is actually no content to be crawled. But in light of recent announcements, we want to make sure we still get members to sign up without fear of being penalised for it.
I followed everyones views here on stopping G crawling pages using this
<meta name="robots" CONTENT="noindex, nofollow" />
Guess what: G ignored it completely and crawled the pages.
So, any other ideas on how I can stop Gbot crawling? If anyone wants to see the pages, let me know, as it clearly shows that G just doesn't obey the noindex, nofollow command.
I'm having the same problems.
I removed the robots file as suggested and worked with on-page noindex, nofollow commands; I even went to the extent of trying different commands on different pages, including the 'none' value in a Googlebot-specific meta tag.
Still didn't work! Any suggestions gratefully received.
|<meta name="robots" CONTENT="noindex, nofollow" /> |
guess what G ignored it completely and crawled the pages.
They have to crawl the pages to be able to see the noindex tag, so they are always going to fetch the page first. The question is: does it rank, or is it indexed as anything other than URL-only?
My solution to stopping all indexing in a particular directory was to use mod_rewrite to send 403 Forbidden messages to any Googlebot user agent visiting the excluded directory.
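For anyone wanting to try the same approach, a minimal .htaccess sketch along those lines might look like this (the /private/ directory name is a placeholder; requires mod_rewrite):

```apache
RewriteEngine On
# Serve 403 Forbidden to Googlebot for anything under /private/
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^private/ - [F]
```

The [F] flag returns the 403 without serving any content, so there is nothing for the bot to index; note that user-agent matching only stops bots that identify themselves honestly.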
If you do a very specific search then yes, the pages are ranking; when I say ranking, the only page that comes up is mine.
The idea was to stop G looking at these pages at all, and I followed the ideas suggested here. TBH this is all very new for me, as I've never tried to stop pages from getting crawled, so if you wouldn't mind, could you give me some more info on this? I'm completely out of my depth here.
>My solution to stopping all indexing in a particular directory was to use mod_rewrite to send 403 Forbidden messages to any Googlebot user agent visiting the excluded directory.
Dealing with the same thing... I used the robots meta tag to block Googlebot and thought this would work. However, we now have a bunch of stuff in the index that was not desired, mostly from our cgi area. In the past we used "robots" in the meta tag; we now specifically call out "Googlebot", and hopefully this will get some of the stuff removed. The one thing I have learned: it is difficult to get something completely out of the index once it is in. The only thing I have not yet tried is the 410.
I have been getting some hits to some of my graphics from images.google.com, despite having this in robots.txt for all spiders:
Google may still crawl disallowed areas looking for TOS violations etc., but it should not index them. A URL-only listing just shows that Google knows the URL exists; it doesn't mean it gets called up in searches.
I have had immediate success using the URL removal tool. I also use 410, but I can't say for sure that a 410 will cause something to be removed.
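For those curious about the 410 route, a minimal Apache sketch would be something like the following (the filename is a placeholder; uses mod_alias):

```apache
# Answer "410 Gone" for a page you want dropped from the index
Redirect gone /old-page.html
```

A 410 tells well-behaved crawlers the page is permanently gone, which is a stronger signal than a 404, though as noted above there's no guarantee of how quickly Google acts on it.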
I'm in this situation also, very new to all of this, and have been told that I need to block Google completely from my subdomain using a mod_rewrite rule on the server (Windows) and also robots.txt, but I can't find an example of either!
There seems to be so much conflicting information about this out there!
Can anyone help me?
Hmm, I see some pages that were last modified in 2003 May and which have had the <meta name="robots" content="noindex"> tag on each one since at least that time, that are listed in Google as Supplemental Results with a full title and snippet, and with a full cache from 2005 June.
Google has been ignoring the robots meta tag. I know: I designed that site. I am the only one with FTP access, the files have not been altered since 2003, and they have always had the noindex on them since then.
If you use the meta robots noindex, make sure your pages validate so Google can read the tag properly.
Also, if your pages are already in the index, you might want to use:
<meta name="robots" content="noindex,follow,noarchive" >
That removes the page from the index and the cache, and follows every link to read their robots meta tags.
When the pages are gone (it could take a while, about the same time it takes to get a new page listed), change it to:
<meta name="robots" content="noindex,nofollow,noarchive" >
That stops it following the links as well.
Then, after your pages are gone, wait a fair while and add the pages to robots.txt.
Hope it helps