What links will google follow. - (deprecated) Google News Archive forum at WebmasterWorld - WebmasterWorld

What links will google follow.

I'd like to stop certain pages being crawled.

Vimes

11:29 am on Dec 20, 2004 (gmt 0)

10+ Year Member

Hi,

I have a problem which I’m not sure how to fix.

Basically, I have some content on my site that I pay for by the impression. I would like to stop robots crawling these pages.

I’ve read up on robots.txt files and, as I understand it, I can place a disallow entry for the directory where they are stored.

User-agent: *
Disallow: /directory/

The main problem I think I will have is external links to these pages. Does Google check the robots file every time it visits?
So if it follows an external link to one of these pages, will it hit the robots file first, find that it should not visit the page, and not follow the link on the external website?

Or

Should I place a JavaScript link on the page that needs a mouse click to show the rest of the page that I will be charged for?

Does Google follow this type of link? I’m not worried about passing PR etc.
I’d really like to be able to stop Google, or any SE for that matter, from costing me a fortune.

What’s the best legal way of getting around this problem?

Vimes.

nancyb

12:22 am on Dec 21, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I would put a meta tag on the page excluding bots

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

or search Google's webmaster info page [google.com] for solutions.

Vimes

2:17 am on Dec 21, 2004 (gmt 0)

10+ Year Member

But from what I've read, that does not stop the page being pulled by the spider; it only stops the indexing of the page.

I really do not want any bot accessing these pages, as the bill will run up.
My site is quite large, 50K static pages, and I get a lot of visits from bots each day.

Can anyone else confirm this: meta tags only stop indexing, not the bot pulling the page?

Has anyone got another way?

Vimes

energylevel

3:01 am on Dec 21, 2004 (gmt 0)

10+ Year Member

I have files and folders listed as disallowed in the robots.txt file, and I added the robots meta tag (noindex, follow) to the files individually. Still the files get listed in Google. I don't seem to be able to stop Google listing all SWF files either.

Has Googlebot gone mad!

Vimes

3:23 am on Dec 21, 2004 (gmt 0)

10+ Year Member

Does Google follow JavaScript?

What I’m thinking of doing is having a button that will need to be clicked to show the portion of the page that I pay for.

Questions that I’m not sure on are:

1. Will Gbot follow the link on the page?

2. Is this cloaking?

I cannot afford the Gbot hits on these parts of the page, or being banned as a cloaker.

Any help on this would be greatly appreciated.

Vimes.

Hugene

8:41 pm on Dec 21, 2004 (gmt 0)

10+ Year Member

A year ago, Googlebot wasn't following JavaScript links on my site, so I removed them. Now, I remember reading here that Googlebot does follow JavaScript links. But I am sure you can write some kind of JS that will fool Googlebot. Also, I think that changing robots.txt should be OK, no matter where your page is linked from. Isn't that what robots.txt is all about, to disallow robots?
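To illustrate the robots.txt point, here is a minimal sketch using Python's standard `urllib.robotparser`, fed the rules Vimes posted at the top of the thread. The helper name `may_crawl` and the example paths are made up for illustration. It only models how a well-behaved robot reads the file; it says nothing about bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rules from the first post in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent: str, path: str) -> bool:
    """Return True if the rules allow this user agent to fetch the path.

    Hypothetical helper: the answer depends only on the user agent and
    the path, not on which page the link to that path was found on.
    """
    return parser.can_fetch(user_agent, path)


print(may_crawl("Googlebot", "/directory/paid-page.html"))  # False
print(may_crawl("Googlebot", "/index.html"))                # True
```

Nothing in the check takes the referring page as input, so an external link to a disallowed URL gets the same answer as an internal one.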

mars9820

9:04 pm on Dec 21, 2004 (gmt 0)

10+ Year Member

robots.txt should be the way to go.

If you really want to be very sure, you can rewrite your .htaccess and deny the Google IP range access to the directories containing the pages you don't want crawled.

tictoc

11:49 pm on Dec 21, 2004 (gmt 0)

10+ Year Member

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

How can you tell just GoogleBot to avoid certain pages?

Vimes

11:12 am on Dec 22, 2004 (gmt 0)

10+ Year Member

Thanks all,

It wouldn't hurt to do both, I suppose:
robots.txt and the JavaScript on the page to show the content.

Vimes.

Marcia

11:30 pm on Dec 22, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>How can you tell just GoogleBot to avoid certain pages?

In addition to excluding, make sure there are no links pointing to them.
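One way to exclude only Googlebot is a robots.txt group that names it, leaving other robots unrestricted. The sketch below checks such a file with Python's standard `urllib.robotparser`. The `/directory/` path is carried over from the first post and the second bot name is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with one group for Googlebot and a permissive default group.
# An empty "Disallow:" line means nothing is disallowed for that group.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /directory/

User-agent: *
Disallow:
"""

rules = RobotFileParser()
rules.parse(GOOGLEBOT_ONLY.splitlines())

# Googlebot matches its own group and is kept out of /directory/.
print(rules.can_fetch("Googlebot", "/directory/page.html"))  # False

# A robot with a different name (hypothetical) falls through to the "*" group.
print(rules.can_fetch("OtherBot", "/directory/page.html"))   # True
```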

mars9820

12:40 am on Dec 23, 2004 (gmt 0)

10+ Year Member

You can also put them in a user registration area. This way people have to sign up.

Another way is to make a script with a form and let people type their email address before going to the file. By doing that you can target the user later with a polite request.

irishaff

12:46 am on Dec 23, 2004 (gmt 0)

10+ Year Member

Inbound links direct to the pages could be a problem, and not one that you can easily fix. I'd look for a script that asks people to type in a 3-letter word from a graphic. Email address requests may put people off.

I've seen the software in use on news sites, but am not sure where to get it. Implementing it should be quite easy.

Vimes

2:42 am on Dec 23, 2004 (gmt 0)

10+ Year Member

Food for thought, thanks. I’ve got some ideas now.

Vimes.

pageoneresults

3:57 am on Dec 23, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>How can you tell just GoogleBot to avoid certain pages?

Google does support its own specific robots tag...

Googlebot obeys the noindex, nofollow, and noarchive Robots META Tag. If you place the tag in the head of your HTML/XHTML document, you can cause Google to not index, not follow, and/or not archive particular documents on your site.

The code...

<meta name="googlebot" content="robots-terms">

The robots term of noindex will produce the following effect; Googlebot will retrieve the document, but it will not index the document.

The robots term of nofollow will produce the following effect; Googlebot will not follow any links that are present on the page to other documents.

The robots term of noarchive will produce the following effect; Google maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Google will not provide an archive copy for the document.

Further information on this specific robots tag can be found here with additional instructions.

How can I prevent Googlebot from following links from a particular page or archiving a copy of a page? [google.com]
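As a rough illustration of why the meta tag cannot save the page request, here is a small sketch that reads the robots terms out of a page with Python's standard `html.parser`. The class and function names are made up for this example, and it is not how Googlebot itself is implemented. The point is only that the terms live inside the document, so a robot has to retrieve the document before it can see them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect robots terms from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.terms = {}  # meta name -> set of lower-cased robots terms

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content", "")
            found = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.terms.setdefault(name, set()).update(found)


def robots_terms(html: str) -> dict:
    """Return the robots terms found in an already-downloaded HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.terms


# Example page (invented); the tag sits in the head as described above.
PAGE = """<html><head>
<meta name="googlebot" content="noindex, nofollow">
</head><body>Paid content</body></html>"""

print(robots_terms(PAGE))
```

Note that `robots_terms` needs the full HTML as its input, which is exactly the download Vimes is trying to avoid paying for.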

Vimes

4:02 am on Dec 23, 2004 (gmt 0)

10+ Year Member

Thanks pageone, that's what I understood about the meta tags: noindex does not stop Google asking for the page.

Vimes.

celenoid

6:34 am on Dec 23, 2004 (gmt 0)

10+ Year Member

You're right, Vimes. That's why you need the entry in robots.txt -- to ensure the page doesn't get pulled by gbot.

BUT that doesn't stop Google listing it as a URL-only result in the SERPs (indicative of a page that Google knows is important due to inbound links, but hasn't, for any reason, crawled yet).

The easy way to do this that I know of, is to use the <META> robots tag. But then, gbot has to be able to crawl the page to find this. Which won't happen if you use robots.txt.

There's the catch.

As a few ugly solutions, you should try the user-interaction technique (enter 3 chars, etc.), or keep the robots.txt entry and occasionally use Google's auto-removal tool to wipe their index of the url-only listings.
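The catch can be written out as a tiny decision function. This is only a toy model of the behaviour described in this post, not a description of how Google actually works; the function name, flag names and outcome strings are all invented for the illustration.

```python
def expected_outcome(blocked_by_robots_txt: bool, has_noindex_meta: bool) -> str:
    """Toy model of the robots.txt / meta tag interaction described above."""
    if blocked_by_robots_txt:
        # The page is never fetched, so any meta tag on it is never read.
        # Inbound links can still make the bare URL show up in results.
        return "not fetched; may still appear as a URL-only listing"
    if has_noindex_meta:
        # The page must be fetched for the robot to find the noindex term.
        return "fetched; not indexed"
    return "fetched; indexed"


# Every combination of the two settings.
for blocked in (True, False):
    for noindex in (True, False):
        print(blocked, noindex, "->", expected_outcome(blocked, noindex))
```

In this model no combination gives both "not fetched" and "guaranteed not listed", which is why the workarounds above involve either user interaction or the removal tool.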

fclark

7:58 am on Dec 23, 2004 (gmt 0)

10+ Year Member

No need for complicated user interaction.
Why not use a simple form submit action without JavaScript?
Gbot doesn't seem to follow these on my sites.

Small Website Guy

7:57 pm on Dec 23, 2004 (gmt 0)

10+ Year Member

document.write("first half of link HTML"); document.write("second half of link HTML");

I doubt that any spider is going to follow that JavaScript, not because it couldn't be programmed to, but because interpreting JavaScript would simply place too much burden on the spider, limiting how fast it operates.

That's how I place all of my email addresses on my websites, and so far I don't seem to be getting any junkmail (except from people in Africa with vast fortunes who need my help to liberate their fortune, for which they promise to give me 30%--I think these people are manually gathering email addresses).