If the page exists in Google's cache, then Googlebot will try to refresh it.
Solution: make sure your robots.txt validates, then submit the robots.txt file to the URL removal tool. Those pages will be removed for a minimum of six months. (Be very sure of your robots.txt file - do not remove your entire site, as others have done.)
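For example, a minimal robots.txt that removes just one folder (the /old-section/ path here is made up for illustration; note that a bare "Disallow: /" would remove your entire site):

User-agent: *
Disallow: /old-section/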
If you disallow spidering of a page or folder in robots.txt, that does not actually stop Google from knowing that the URL "exists": Google still sees the URL in links to that page from other sites.
In that case the URL will be shown as a URL-only entry in Google search results. In Yahoo results, they will also use the anchor text from the other site (as long as it is not something like "click here") to build a title for a page that they never even spidered.
If Google has already indexed a page, and you then add a robots.txt exclusion for that page, Google often reverts to showing a cached copy of the page from just before you disallowed it, usually as a Supplemental Result. Those are nearly impossible to remove from the index.
If you do not want the page to show up at all, then you must instead use the <meta name="robots" content="noindex"> tag on the page itself. Again, for Supplemental Results the update takes a very long time; for normal results the pages usually drop out within days to weeks.
Pages that are banned in robots.txt or have noindex, nofollow on them are being indexed and are ranking poorly.
How are you determining this? Are you doing site: searches? If so, this has no bearing on anything really. If the pages are showing up for search queries, then it may be another story.
Googlebot is going to fetch everything that has a reference to it. For stuff blocked via the robots.txt protocol, you're going to see a URL-only listing when doing site: searches.
Want to keep those pages out of the index? Don't use the robots.txt protocol; drop a robots meta tag on the page instead. I've been testing this for a couple of years now and it works like a charm.
Place the Robots META Tag right after the <head> of your document:

<html>
<head>
<meta name="robots" content="none">
</head>

That should prevent those bots that obey the protocol from indexing the page and following the links on it.
Now, if you want to allow the links on that page to be followed, you might do this...
<html>
<head>
<meta name="robots" content="noindex, follow">
</head>

Both Google and MSN have developed specific Robots META Tags that you can use to keep just their bots away from those pages.
For Googlebot:

<html>
<head>
<meta name="googlebot" content="none">
</head>

For MSNBot:

<html>
<head>
<meta name="msnbot" content="none">
</head>

There are other robots terms that you can use to manipulate spider behavior.
If some other site is linking to that page, then Googlebot will find it through the inbound link.
Nope, only internal linking. It is also indexing my redirects, which are in a special folder, /redirect/, that is banned in robots.txt.
>If the page exists in Google's cache, then Googlebot will try to refresh it.
These pages were banned in robots.txt and had the noindex tag inserted before they were even uploaded to the web.
Make sure that your disallow syntax is 100% correct.
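One easy-to-miss point of syntax, and a common mistake a validator will catch: in the original robots.txt spec a blank line ends a record, so a stray blank line orphans the Disallow from its User-agent:

# broken - the blank line cuts the Disallow off from its User-agent
User-agent: *

Disallow: /redirect/

# correct
User-agent: *
Disallow: /redirect/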
>Solution: make sure your robots.txt validates
>If you do not want the page to show up at all, then you must instead use the <meta name="robots" content="noindex"> tag on the page itself.
That's what I said: my pages have those, and they are still being indexed. Some are just URL-only entries, but some aren't.
Because robots.txt stops them from fetching the content, they never see the meta robots noindex tag, so they can't obey it. Remove the robots.txt directive and only let them see the meta tag. That will get you what you want.
User-agent: *
Disallow: /members/

User-agent: Googlebot
Disallow: /members/
Disallow: /members/login.html
Disallow: /members/join.html
Then, within the code of the pages in this folder, to be doubly sure, I've added <meta name="robots" content="noindex, nofollow" />.
As I don't want any spiders in this folder, I was then going to put the links to it in JavaScript, unless that is a bad idea from a human visitor's point of view?
Do you think this is all I can do to protect this folder from a G invasion?
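For reference, the kind of JavaScript-written link being described is something like this (hypothetical markup; spiders that don't execute JavaScript won't see the link, but neither will visitors with JavaScript turned off):

<script type="text/javascript">
// The anchor only exists once the script runs, so non-JS crawlers never see it
document.write('<a href="/members/">Members area</a>');
</script>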
Best bet is to do what P1R and g1smd advised: remove the robots.txt disallow and rely on the meta robots noindex.
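A sketch of what that looks like, using the /members/ setup above (assuming nothing else needs blocking): drop the Disallow lines so the bots can actually fetch the pages,

User-agent: *
Disallow:

and let each page in /members/ carry the meta tag:

<meta name="robots" content="noindex, nofollow" />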
- do you want the pages to NOT be spidered (saving bandwidth), with a risk that the pages will show in the index as URL-only entries?
OR
- do you mind having the pages spidered for search engines to read the meta tags, so that the pages will never appear in the index?
<meta name="robots" content="noindex, nofollow" />

Guess what: G ignored it completely and crawled the pages.

So, any other ideas on how I can stop Gbot crawling? If anyone wants to see the pages, let me know, as it clearly shows that G just doesn't obey the noindex, nofollow command.
><meta name="robots" content="noindex, nofollow" /> Guess what: G ignored it completely and crawled the pages.
They have to crawl the page to be able to see the noindex tag, so they are always going to fetch it first. The question is: does it rank, or is it indexed as anything other than URL-only?
My solution to stopping all indexing in a particular directory was to use mod_rewrite to send 403 Forbidden messages to any Googlebot user agent visiting the excluded directory.
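A minimal .htaccess sketch of that approach (the /members/ directory and the simple user-agent match are assumptions; adjust to your own setup):

RewriteEngine On
# Return 403 Forbidden to Googlebot for anything under /members/
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^members/ - [F]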
If you do a very specific search then yes, the pages are ranking. When I say ranking, the only page that comes up is mine.
The idea was to stop G looking at these pages at all, and I followed the ideas suggested here. TBH, this is all very new for me, as I've never tried to stop pages from being crawled, so if you wouldn't mind, could you give me some more info on this? I'm completely out of my depth here.
>My solution to stopping all indexing in a particular directory was to use mod_rewrite to send 403 Forbidden messages to any Googlebot user agent visiting the excluded directory.
Thanks.
There seems to be so much conflicting information about this out there!
Can anyone help me?
Google has been ignoring the robots meta tag. I know. I designed that site. I am the only one with FTP access, the files have not been altered since 2003, and they have had the disallow on them ever since.
[validator.w3.org...]
Also, if your pages are already in the index, you might want to use:

<meta name="robots" content="noindex,follow,noarchive">

That removes the page from the index and the cache, and follows every link so that the linked pages' robots meta tags get read as well.
When the pages are gone (which could take a while, about the same time it takes to get a new page listed), change it to:

<meta name="robots" content="noindex,nofollow,noarchive">

to stop it following the links as well.
Then... after your pages are gone, wait a fair while, then add the pages to robots.txt.
Hope it helps