I have a problem that I'm not sure how to fix.
Basically, I have some content on my site that I pay for by the impression. I would like to stop robots from crawling these pages.
I've read up on robots.txt files and, as I understand it, I can place a Disallow entry for the directory where they are stored.
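For anyone who wants to check such a rule before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. The directory name `/paid-content/` and the `example.com` URLs are made up for illustration, and this only shows how a well-behaved crawler would interpret the file, not what any particular bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "/paid-content/" is an assumed directory name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /paid-content/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("Googlebot", "http://example.com/paid-content/page1.html"))
    print(is_allowed("Googlebot", "http://example.com/index.html"))
```

A rule like this only helps with bots that choose to obey robots.txt.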
The main problem I think I will have is external links to these pages. Does Google check the robots.txt file every time it visits?
So if it follows an external link to one of these pages, will it check the robots.txt file first, find that it should not visit the page, and not follow the link from the external website?
Should I place a JavaScript link on the page that needs a mouse click to show the rest of the page that I will be charged for?
Does Google follow this type of link? I'm not worried about passing PR etc.
I'd really like to be able to stop Google, or any SE for that matter, from costing me a fortune.
What's the best legal way of getting around this problem?
I really do not want any bot accessing these pages, as the bill will run up.
My site is quite large, 50K static pages, and I get a lot of visits from bots each day.
Can anyone else confirm this: meta tags only stop indexing, not the bot pulling the page?
Has anyone got another way?
What I'm thinking of doing is having a button that will need to be clicked to show the portion of the page that I pay for.
Questions that I'm not sure about are:
1. Will Gbot follow the link on the page?
2. Is this cloaking?
I cannot afford the Gbot hits on these parts of the page, or being banned as a cloaker.
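To make the button idea concrete, here is a rough sketch of the server-side logic in Python. It assumes (and this is only an assumption, not something confirmed in this thread) that crawlers request pages with GET and do not submit forms, so the paid block is only rendered when the button's form is POSTed. The function name `render_page` and the HTML snippets are hypothetical, and the sketch does not settle whether this would be treated as cloaking.

```python
def render_page(method: str, free_html: str, paid_html: str) -> str:
    """Build the page body, including the paid block only after a button click.

    Assumption: a button click arrives as a POST, while ordinary link-following
    (by visitors or crawlers) arrives as a GET.
    """
    if method.upper() == "POST":
        # The visitor clicked the button, so the pay-per-impression block is shown.
        return free_html + paid_html
    # Any other request gets the free part plus the button that reveals the rest.
    button = '<form method="post"><button type="submit">Show full page</button></form>'
    return free_html + button


if __name__ == "__main__":
    print(render_page("GET", "<p>Free intro</p>", "<p>Paid content</p>"))
```

Every visitor, human or bot, gets the same response for the same request, which is the main difference from serving different pages based on user agent.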
Any help on this would be greatly appreciated.
I've seen the software in use on news sites, but I'm not sure where to get it. Implementing it should be quite easy.
How can you tell just GoogleBot to avoid certain pages?
Google does support its own specific robots tag...
Googlebot obeys the noindex, nofollow, and noarchive Robots META Tag. If you place the tag in the head of your HTML/XHTML document, you can cause Google to not index, not follow, and/or not archive particular documents on your site.
<meta name="googlebot" content="robots-terms">
The robots term of noindex will produce the following effect; Googlebot will retrieve the document, but it will not index the document.
The robots term of nofollow will produce the following effect; Googlebot will not follow any links that are present on the page to other documents.
The robots term of noarchive will produce the following effect; Google maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Google will not provide an archive copy for the document.
Further information on this specific robots tag can be found here with additional instructions.
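As an illustration of how the tag quoted above might be read by a program, here is a small sketch using Python's standard `html.parser`. The class and function names are hypothetical, and treating the generic `robots` meta name the same as `googlebot` is an assumption made for this example; it is not a description of how Googlebot itself parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="googlebot"> / <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("googlebot", "robots"):
            content = attr.get("content") or ""
            # Terms are comma-separated, e.g. "noindex, nofollow".
            for term in content.split(","):
                term = term.strip().lower()
                if term:
                    self.directives.add(term)


def googlebot_directives(html: str) -> set:
    """Return the set of robots terms found in the page's meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="googlebot" content="noindex, nofollow"></head></html>'
    print(sorted(googlebot_directives(page)))
```

Note that the page has to be downloaded before any of these terms can be read.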
BUT that doesn't stop Google listing it as a URL-only result in the SERPs (indicative of a page that Google knows is important due to inbound links, but hasn't, for any reason, crawled yet).
The easy way that I know of to do this is to use the <META> robots tag. But then Gbot has to be able to crawl the page to find it, which won't happen if you block the page in robots.txt.
There's the catch.
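The trade-off can be written out as a tiny model in Python. This is only a sketch of the behaviour described in this thread (a blocked page is never fetched, so its meta tag is never read and a URL-only listing remains possible); the function name, the dictionary keys and the example URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawl_outcome(robots_txt: str, url: str, page_has_noindex: bool,
                  user_agent: str = "Googlebot") -> dict:
    """Model what a robots.txt-compliant crawler would learn about `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # Never fetched: no impression is billed, but the meta tag is never
        # seen, so the URL can still show up as a URL-only listing.
        return {"fetched": False, "meta_seen": False,
                "indexed": False, "may_list_url_only": True}
    # Fetched: the impression is billed, and a noindex tag can be honoured.
    return {"fetched": True, "meta_seen": True,
            "indexed": not page_has_noindex, "may_list_url_only": False}


if __name__ == "__main__":
    blocked = "User-agent: *\nDisallow: /paid-content/\n"
    print(crawl_outcome(blocked, "http://example.com/paid-content/a.html", True))
```

In other words, under these assumptions you can avoid the fetch or you can have the noindex tag read, but not both for the same page.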
As for a few ugly solutions, you could try the user-interaction technique (enter 3 chars, etc.), or keep the robots.txt entry and occasionally use Google's auto-removal tool to wipe the URL-only listings from their index.
That's how I place all of my email addresses on my websites, and so far I don't seem to be getting any junkmail (except from people in Africa with vast fortunes who need my help to liberate their fortune, for which they promise to give me 30%--I think these people are manually gathering email addresses).