

Google News Archive Forum

    
What links will Google follow?
I'd like to stop certain pages being crawled.
Vimes




msg:87490
 11:29 am on Dec 20, 2004 (gmt 0)

Hi,

I have a problem which I'm not sure how to fix.

Basically I have some content on my site that I pay for by the impression. I would like to stop robots crawling these pages.

I've read up on robots.txt files and, as I understand it, I can place a disallow entry for the directory where they are stored.

User-agent: *
Disallow: /directory/

The main problem I think I will have is external links to these pages. Does Google check the robots file every time it visits?
So if it follows an external link to one of these pages, will it hit the robots file first, find that it should not visit the page, and not follow the link on the external website?

Or

Should I place a JavaScript link on the page that needs a mouse click to show the rest of the page that I will be charged for?

Does Google follow this type of link? I'm not worried about passing PR etc.
I'd really like to be able to stop Google, or any SE for that matter, from costing me a fortune.

What's the best legal way of getting around this problem?

Vimes.
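
On the external-link question: robots.txt rules are matched against the path on your own site, so it makes no difference where the link to the page came from. A minimal sketch for checking the two lines quoted above, using Python's standard urllib.robotparser. The page paths and the helper name is_allowed are made up for illustration, and this only models a crawler that chooses to obey robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a compliant crawler named user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths: one inside the disallowed directory, one outside.
    for path in ("/directory/paid-page.html", "/index.html"):
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", path))
```

The check takes only the user agent and the path, which is the point: the referring site never enters into it.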

 

nancyb




msg:87491
 12:22 am on Dec 21, 2004 (gmt 0)

I would put a meta tag on the page excluding bots

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Or search Google's webmaster info page [google.com] for solutions.

Vimes




msg:87492
 2:17 am on Dec 21, 2004 (gmt 0)

But from what I've read, that does not stop the page being pulled by the spider; it only stops the indexing of such a page.

I really do not want any bot accessing these pages, as the bill will run up.
My site is quite large, 50K static pages, and I get a lot of visits by bots each day.

Can anyone else confirm this: meta tags only stop indexing, not the bot pulling the page?

Has anyone got another way?

Vimes

energylevel




msg:87493
 3:01 am on Dec 21, 2004 (gmt 0)

I have files and folders listed as disallowed in the robots.txt file, and I added the robots meta tag (noindex, follow) to the files individually... still the files get listed in Google. I don't seem to be able to stop Google listing all .swf files either.

Has Googlebot gone mad!

Vimes




msg:87494
 3:23 am on Dec 21, 2004 (gmt 0)

Does Google follow JavaScript?

What I'm thinking of doing is having a button that will need to be clicked to show the portion of the page that I pay for.

Questions that I'm not sure on are:

1. Will Gbot follow the link on the page?

2. Is this cloaking?

I cannot afford the Gbot hits on these parts of the page, or being banned as a cloaker.

Any help on this would be greatly appreciated.

Vimes.

Hugene




msg:87495
 8:41 pm on Dec 21, 2004 (gmt 0)

A year ago, Googlebot wasn't following JavaScript links on my site, so I removed them. Now, I remember reading here that Googlebot does follow JavaScript links. But I am sure you can write some kind of JS that will fool Googlebot. Also, I think that changing robots.txt should be OK, no matter where your page is linked from. Isn't that what robots.txt is all about, to disallow robots?

mars9820




msg:87496
 9:04 pm on Dec 21, 2004 (gmt 0)

robots.txt should be the way to go.

If you really want to be very sure, you can edit your .htaccess and deny the Google IP range for the directories with the pages you don't want to get crawled.
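
The same idea, expressed as application logic rather than .htaccess, might look like the sketch below. It denies requests to the protected directory when the client address falls inside a blocked network. The network used here is a reserved documentation range, not a real Google range, and the names BLOCKED_NETWORKS, PROTECTED_PREFIX and should_deny are hypothetical; you would have to find and maintain the real crawler ranges yourself.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is reserved for documentation.
# It is NOT a Google range; substitute the ranges you actually want to block.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# The directory holding the pay-per-impression pages (from the robots.txt example).
PROTECTED_PREFIX = "/directory/"


def should_deny(client_ip: str, path: str) -> bool:
    """Return True if the request should be refused before any content is served."""
    if not path.startswith(PROTECTED_PREFIX):
        return False  # pages outside the protected directory stay open to everyone
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

Unlike robots.txt, this does not depend on the crawler cooperating, but it is only as good as the list of address ranges behind it.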

tictoc




msg:87497
 11:49 pm on Dec 21, 2004 (gmt 0)

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

How can you tell just GoogleBot to avoid certain pages?

Vimes




msg:87498
 11:12 am on Dec 22, 2004 (gmt 0)

Thanks all,

It wouldn't hurt to do both, I suppose:
robots.txt and the JavaScript on the page to show the content.

Vimes.

Marcia




msg:87499
 11:30 pm on Dec 22, 2004 (gmt 0)

>>How can you tell just GoogleBot to avoid certain pages?

In addition to excluding, make sure there are no links pointing to them.
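
For the exclusion part, robots.txt lets you name a single crawler in its own User-agent group. A small sketch, again with Python's urllib.robotparser, showing that a group written for Googlebot leaves other agents unaffected. The directory comes from the first post; "SomeOtherBot" and the page path are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with one group that applies to Googlebot only.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /directory/
"""


def may_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a compliant crawler named user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, may_fetch(GOOGLEBOT_ONLY, agent, "/directory/page.html"))
```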

mars9820




msg:87500
 12:40 am on Dec 23, 2004 (gmt 0)

As well, you can put them in a user registration area. This way people have to sign up.

Another way is to make a script with a form and let people type their email address before going to the file. By doing that you can target the user later with a polite request.

irishaff




msg:87501
 12:46 am on Dec 23, 2004 (gmt 0)

Inbound links direct to the pages could be a problem, and not one that you can easily fix. I'd look for a script that asks people to type in a 3-letter word from a graphic. Email address requests may put people off.

I've seen the software in use on news sites, but am not sure where to get it. Implementing it should be quite easy.

Vimes




msg:87502
 2:42 am on Dec 23, 2004 (gmt 0)

Food for thought, thanks. I've got some ideas now.

Vimes.

pageoneresults




msg:87503
 3:57 am on Dec 23, 2004 (gmt 0)

How can you tell just GoogleBot to avoid certain pages?

Google does support its own specific robots tag...

Googlebot obeys the noindex, nofollow, and noarchive Robots META Tag. If you place the tag in the head of your HTML/XHTML document, you can cause Google to not index, not follow, and/or not archive particular documents on your site.

The code...

<meta name="googlebot" content="robots-terms">

The robots term of noindex will produce the following effect; Googlebot will retrieve the document, but it will not index the document.

The robots term of nofollow will produce the following effect; Googlebot will not follow any links that are present on the page to other documents.

The robots term of noarchive will produce the following effect; Google maintains a cache of all the documents that we fetch, to permit our users to access the content that we indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish us to archive a document from your site, you can place this tag in the head of the document, and Google will not provide an archive copy for the document.

Further information on this specific robots tag can be found here with additional instructions.

How can I prevent Googlebot from following links from a particular page or archiving a copy of a page? [google.com]
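
To make the quoted behaviour concrete, here is a rough sketch of how a crawler could read these tags, using Python's standard html.parser. It is not Google's code; the class and function names are invented. Note that the input is HTML that has already been downloaded, which is why a meta tag on its own cannot prevent the request.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the terms of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name -> set of lower-cased terms

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser already lower-cases tag and attribute names.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            terms = {
                term.strip().lower()
                for term in attr.get("content", "").split(",")
                if term.strip()
            }
            self.directives.setdefault(name, set()).update(terms)


def robots_directives(html: str) -> dict:
    """Return the robots directives found in an already-fetched HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```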

Vimes




msg:87504
 4:02 am on Dec 23, 2004 (gmt 0)

Thanks pageone, that's what I understood about the meta tags: noindex does not stop Google asking for the page.

Vimes.

celenoid




msg:87505
 6:34 am on Dec 23, 2004 (gmt 0)

You're right, Vimes. That's why you need the entry in robots.txt -- to ensure the page doesn't get pulled by gbot.

BUT that doesn't stop Google listing it as a URL-only result in the SERPs (indicative of a page that Google knows is important due to inbound links, but hasn't, for any reason, crawled yet).

The easy way I know of to prevent that listing is the <META> robots tag. But then gbot has to be able to crawl the page to find it, which won't happen if you block the page in robots.txt.

There's the catch.

As a few ugly solutions, you should try the user-interaction technique (enter 3 chars, etc.), or keep the robots.txt entry and occasionally use Google's auto-removal tool to wipe their index of the url-only listings.
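
The catch can be laid out as a small decision function. This is only a model of the behaviour described in this thread, not a statement of how Google actually works, and the function name and result strings are invented.

```python
def crawl_outcome(blocked_by_robots_txt: bool,
                  has_noindex_meta: bool,
                  has_inbound_links: bool) -> str:
    """Model the outcome for one page, following the reasoning in this thread."""
    if blocked_by_robots_txt:
        # The page is never requested, so any meta tag on it is never seen.
        if has_inbound_links:
            return "not fetched; may appear as a URL-only listing"
        return "not fetched; not listed"
    if has_noindex_meta:
        # The tag can only be read after the page has been requested.
        return "fetched; not indexed"
    return "fetched; indexed"


if __name__ == "__main__":
    for blocked in (True, False):
        for noindex in (True, False):
            print(blocked, noindex, crawl_outcome(blocked, noindex, True))
```

For a pay-per-impression page the first branch is the one that matters: the fetch is avoided, at the cost of a possible URL-only listing.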

fclark




msg:87506
 7:58 am on Dec 23, 2004 (gmt 0)

No need for complicated user interaction.
Why not use a simple form submit action without JavaScript?
Gbot doesn't seem to follow these on my sites.
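
A minimal sketch of that idea as a plain WSGI application, assuming the paid content is only ever rendered in response to a POST. A plain GET, which is what a crawler following a link sends, gets only the button. The function names and the placeholder markup are invented, and whether any particular bot submits forms is just what has been observed in this thread.

```python
def paid_content_app(environ, start_response):
    """Serve the paid content on POST only; a plain GET gets just a submit button."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # Placeholder for the markup that triggers the paid impression.
        body = b"<p>Paid-per-impression content goes here.</p>"
    else:
        body = (b'<form method="post" action="">'
                b'<input type="submit" value="Show content"></form>')
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(method: str):
    """Invoke the app directly, without a server, and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(paid_content_app({"REQUEST_METHOD": method}, start_response))
    return captured["status"], body
```

Any WSGI server can host paid_content_app; call_app is only there to try both request methods by hand.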

Small Website Guy




msg:87507
 7:57 pm on Dec 23, 2004 (gmt 0)

document.write("first half of link HTML"); document.write("second half of link HTML");

I doubt that any spider is going to follow that JavaScript, not because it couldn't be programmed to, but because interpreting JavaScript would simply place too much burden on the spider, limiting how fast it operates.

That's how I place all of my email addresses on my websites, and so far I don't seem to be getting any junkmail (except from people in Africa with vast fortunes who need my help to liberate their fortune, for which they promise to give me 30%--I think these people are manually gathering email addresses).

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved