Forum Moderators: Robert Charlton & goodroi


How to reduce number of links Googlebot sees on my site

OK to limit what Googlebot sees?


CharlieGeek

9:48 pm on Jun 11, 2006 (gmt 0)

10+ Year Member



So I have a site with a lot of dynamic content, with the dynamic URLs masked via mod_rewrite. When I designed the site structure carelessly once upon a time, I paid little attention to how many permutations of unique URLs I was creating that interlink across my site. Now Google is seeing tens of thousands of pages, many of them duplicates, which make sense from a webmaster's standpoint but are largely invisible to the web viewer. But I worry that this flood of internal links is hurting my PR, and maybe keeping Google from indexing all the pages I care about at the expense of duplicates.

I know the proper way to tell Googlebot to remove pages or whole directories is to use robots.txt, but my URL structure is simply too complicated for that. So I'm contemplating something more radical: what I call reverse cloaking. That is, I want to hide content from Googlebot.

What I'd do is mark in my database which pages I want Googlebot (and all the other bots) to see. If I detect a bot, I'll dynamically generate links only to pages flagged as "crawlable" and cloak out the rest. And from now on I'll return a 410 status whenever a bot requests a non-crawlable page.
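To make the scheme concrete, here is a minimal sketch of the idea described above, with hypothetical names throughout: `CRAWLABLE` stands in for the database flag, `is_bot` is crude User-Agent sniffing, and the example URLs are invented. This is an illustration of the approach, not a recommendation (later replies in this thread warn that it risks a cloaking penalty).

```python
# Hypothetical sketch: send a 410 to known bots on pages flagged
# non-crawlable, and filter the link list shown to bots.
BOT_TOKENS = ("googlebot", "slurp", "msnbot")  # substrings matched in the User-Agent

# Stand-in for the database flag described above: URL -> crawlable?
CRAWLABLE = {
    "/widgets/red": True,
    "/widgets/red?sort=price": False,  # duplicate view of the same content
}

def is_bot(user_agent):
    """Crude User-Agent sniffing; real bot detection is more involved."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)

def status_for(url, user_agent):
    """HTTP status to send: 410 (Gone) for bots requesting non-crawlable pages."""
    if is_bot(user_agent) and not CRAWLABLE.get(url, False):
        return 410
    return 200

def links_for(urls, user_agent):
    """Bots see only the crawlable links; human visitors see everything."""
    if is_bot(user_agent):
        return [u for u in urls if CRAWLABLE.get(u, False)]
    return list(urls)
```

Note that because the output differs depending on the detected User-Agent, this is exactly the kind of per-visitor divergence the question is asking about.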

Main question: does this scheme violate Google's TOS? I know that showing Googlebot more than regular web viewers see ("cloaking") is a no-no, but what about the reverse?

I am pondering restructuring my URLs completely and starting over to avoid all these duplicate URLs, but it's not yet clear how to do this, and even then I'm going to take a hit for a while until Googlebot catches up. Is there a best-known method to do this properly?

Many thanks for your insights.

jonrichd

11:50 pm on Jun 12, 2006 (gmt 0)

10+ Year Member



Depending on how your dynamic content is served, have you thought about using a NOINDEX, NOFOLLOW metatag on the pages you don't want crawled? That might be one way to accomplish what you want, and it certainly stays within the guidelines. If you have a DB of the URLs you don't want indexed, you could write some sort of include for your head section that checks whether the current URL is on the list and inserts the tag.
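The "include in your head section" idea above can be sketched as a tiny lookup, assuming some hypothetical names: `NO_INDEX_URLS` stands in for the DB of unwanted URLs, and the example entries are invented. The returned string would be emitted inside the page's `<head>`.

```python
# Hypothetical sketch of the conditional-metatag include: look the current
# URL up in a "do not index" list and emit the robots metatag if it matches.
NO_INDEX_URLS = {
    "/widgets/red?sort=price",  # example duplicate views of canonical pages
    "/widgets/red?view=print",
}

def robots_meta(url):
    """Return the metatag to insert into the head section, or '' if none."""
    if url in NO_INDEX_URLS:
        return '<meta name="robots" content="noindex, nofollow">'
    return ""
```

Because the same markup is sent to bots and humans alike, this stays on the safe side of the cloaking line, unlike User-Agent-based filtering.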

Another thing to keep in mind, if you're trying to get PR to flow, is that Google will still know the link is there even if it can't crawl it. Most of the PR-conserving schemes I've seen use JavaScript or some other technology to hide the link from the bot.

CharlieGeek

10:45 pm on Jun 13, 2006 (gmt 0)

10+ Year Member



Thanks for your response. NOINDEX is one possibility, but as you say, Google will still know about the link, it just won't index the page, and I wonder if that means it will still count the page as part of my site, something I really don't want. NOFOLLOW won't work because most of my pages have some links I want Google to see and some I don't.

Again, does anyone have any thoughts on whether Google objects to having links hidden from it? Does it somehow violate the TOS?

daveVk

11:44 am on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Consider using some form of JavaScript link. This relies on the bots not understanding JavaScript. The upside is that the same page content is sent to everyone, so no bot detection or cloaking is involved. The downside is that visitors with JavaScript disabled will not see these links either.

fishfinger

2:48 pm on Jun 14, 2006 (gmt 0)

10+ Year Member



JavaScript doesn't hide links from Googlebot, if Analytics is anything to go by. I used to use document.write to insert links I didn't want spidered, and Analytics shows these as links. I've tested, and Googlebot doesn't crawl them, but it appears to count them. Make your link an image and you can use JavaScript inside the image tag; those appear to be invisible to Googlebot.

<img src="img.gif" onClick="window.open('http://www.domain.com','_self');" style="cursor:pointer">

You can also use image-swap JavaScript inside an image tag, without needing an <a href=""> at all.

g1smd

2:52 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You need the <meta name="robots" content="noindex"> tag on all the pages that you do not want to be indexed.

trinorthlighting

2:53 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I were you, I would clean up the pages and clean up the links. Attempting to play tricks on Google exposes you to the possibility of getting hit hard with a penalty.

daveVk

4:11 am on Jun 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



<img src="img.gif" onClick="window.open('http://www.domain.com','_self');" style="cursor:pointer">

I assume

<span onClick="window.open('http://www.domain.com','_self');" style="cursor:pointer">Click Me</span>

would work similarly, without the bother of an image.

fishfinger: Interesting observation on document.write and analytics.