Forum Moderators: open


Stop Google Getting Certain Dynamic Pages

How can I stop Google?

         

davejs

5:05 pm on Oct 23, 2003 (gmt 0)

10+ Year Member



Hi

If I have 3 dynamic pages listed on Google, such as...

widgets.php?size=big
widgets.php?size=medium
widgets.php?size=small

How can I stop Google listing just the "small" entry?
Can I just use a meta noindex type thing when I dynamically create the small page?

If I do it this way, will the other pages remain on Google?

Many thanks

John_Caius

5:25 pm on Oct 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is the content the same on the three pages? If so, Google may well automatically choose just one of the three anyway - search WW for "duplicate content filter"... :)

AthlonInside

5:52 pm on Oct 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot is stubborn, they won't listen. I use robots.txt and meta robots tags to disallow them, but they still insist on crawling and indexing the file. They are notorious little creatures which eat anything they see!

But it won't hurt even a bit, in terms of ranking, when they get your files... so stop worrying unless they're getting some sensitive information indexed.

dirkz

5:58 pm on Oct 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How can i stop google listing just the "small" entry?

What is your goal? To get all the pages indexed?

Are the pages duplicate in content?

You can use some meta tags to control crawling, but from what I read here, it doesn't always work.

davejs

6:27 pm on Oct 23, 2003 (gmt 0)

10+ Year Member




The pages have different content, but I have a lot of pages to get indexed, so I'd rather drop some "less" important ones in the hope of getting more of the important pages indexed.

Does Google have a limit on the number of pages it will eventually pick up?

Is there a definite way to stop them crawling certain pages?
If the meta noindex thing doesn't work, what's the point of it!?

Can I list a dynamic page such as..
widgets.php?size=small
in robots.txt, and if I can, will that not stop ALL of the widgets.php pages I have links to?
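
i.e. I'm imagining something like this in robots.txt (just guessing at the syntax - as I understand it this is a prefix match, so it shouldn't touch the big/medium URLs):

```
User-agent: Googlebot
Disallow: /widgets.php?size=small
```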

Lots of questions there!


davejs

8:11 pm on Oct 23, 2003 (gmt 0)

10+ Year Member



Any ideas anyone!?

bakedjake

8:29 pm on Oct 23, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Pseudocode aside, in your PHP file it would be something like:


if (strpos(strtolower($_SERVER['HTTP_USER_AGENT']), 'googlebot') !== false) {
    // pretend the page doesn't exist for Googlebot
    header('HTTP/1.0 404 Not Found');
    exit;
}
// otherwise fall through and serve the content as normal

davejs

8:40 pm on Oct 23, 2003 (gmt 0)

10+ Year Member



Would Google not see this as some sort of trickery? I don't want to get into any trouble!

Would the return 404 not cause me any problems elsewhere with Google? Will it definitely just stop this page being indexed, and not the other dynamic pages with the same base name?

BlueSky

8:45 pm on Oct 23, 2003 (gmt 0)

10+ Year Member



This is the format I've used to keep Googlebot away from certain dynamic pages on my site. It has worked fine for me.

User-agent: Googlebot
Disallow: /*size=*$

NickW tried something similar but with a slightly different format; Googlebot ignored it and indexed the pages anyway.

User-agent: Googlebot
Disallow: /index.php?*$

He listed User-agent: * above the one for Googlebot, which may or may not be the reason it didn't work - don't know. According to Google, though, if you don't want their bot crawling any dynamic pages, they say to use this format:

User-agent: Googlebot
Disallow: /*?

See FAQ #12: [google.com...]

Since you want some indexed, you can try what I did. Googlebot was ignoring my noindex, nofollow metatags. Once I used the above in robots.txt, he started behaving himself. Last week, I experimented again with metatags on some other pages without using robots.txt entries. So far, he seems to be obeying them. Perhaps he's had his attitude adjusted. I'd recommend using both robots.txt and metatags, then keep an eye on your logs to see if he goes wandering off.

bakedjake

8:54 pm on Oct 23, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



would google not see this as some sort of trickery

No, it works fine for me.

davejs

9:10 pm on Oct 23, 2003 (gmt 0)

10+ Year Member



Thanks for the help.

I'd rather steer clear of the robots.txt method if possible, as I want the dynamic pages to "change" their index/noindex state each time Google updates, depending on the content being read in from another source at that point in time.

So I need the index/noindex state to be as dynamic as the page itself (if you see what I mean!)

I don't think that would be possible with a robots.txt file.

Has anyone else got any experience with the meta noindex tags? Let us know whether they work or not!
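
To show what I mean, the sort of thing I'd do in the PHP (just a sketch - the $show_in_index flag and the lookup_importance() helper are made up):

```php
<?php
// Made-up helper: decides, per request, whether this page matters
// enough to be indexed, based on the external data source
$show_in_index = lookup_importance($_GET['size']);
?>
<head>
<title>Widgets</title>
<?php
if (!$show_in_index) {
    // tell robots not to index this particular response
    echo '<meta name="robots" content="noindex">';
}
?>
</head>
```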

Thanks

hitchhiker

10:54 pm on Oct 23, 2003 (gmt 0)

10+ Year Member



<META name="robots" content="noindex">

G ignores it; my pages are still in the index (but with no associated content - no cache or page, etc.)

BlueSky

11:57 pm on Oct 23, 2003 (gmt 0)

10+ Year Member



On my site, the content and metatags also change dynamically based on the variables passed to the page. Other SE bots have honored the tags, but not little Googlebot (until recently). I went to robots.txt, using certain variables as the filter, because he's a little brighter than the others in handling wildcards there. If that had not worked, my next approach would have been to mod the script to use different page names -- one to be indexed and the other not. Never had to do that. I opted not to go the route of feeding the bot error codes, because it would pollute my error logs given the high volume of indexing he does.

Metatags seem to work on some people's sites and not others. If that's what you want to use, go for it. If he doesn't follow the tags, you can always contact Google, tell them their bot is misbehaving, and hope they change their software.

hitchhiker

6:46 am on Oct 24, 2003 (gmt 0)

10+ Year Member



1) I don't really care about 'duplicate content' as a developer; I believe it's completely inevitable at some level - SEs should just deal with it (it's their issue, not ours; we don't design sites to be 'flat' - although I just spent 5 months getting rid of thousands of querystrings on their behalf)

2) My use of the "noindex" directive was to avoid wasting bandwidth. If G-bot hasn't got that right, it's just plain 'ignant. They're wasting their own resources..

<PLEA>G? WHY! - You need the bandwidth, right?</PLEA>

davejs

8:11 am on Oct 24, 2003 (gmt 0)

10+ Year Member



I've not used the robots.txt file before... I've just read a few pages on this method...

Am I right in thinking that I could add an extra variable to the URL link and then include that part in the robots.txt file as a filter?

ie.

widgets.php?size=big
widgets.php?size=small&stop=1

then in robots.txt :-

User-agent: Googlebot
Disallow: /*stop=1

Would this stop any URL with &stop=1 being indexed,
and allow all the others through?

Can I make the robots.txt in Notepad? I read something about needing to be careful about how you create the file.

thanks again...

dirkz

8:27 am on Oct 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



so id rather drop some "less" important ones so as to hopefully get more of the important pages indexed

My guess is that it won't work as you expect. Since the number of pages crawled and indexed seems to be a function of PR, I would rather get more inbound links with good anchor text.

IMHO you should never lock Googlebot out of something that could be important in the future.

A short-term measure would be to make the "more important" pages more important to Googlebot, e.g. by placing more links to them, listing them in a dedicated sitemap linked from several pages, wrapping the links to them in h1 tags, etc.
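
For example, a dedicated sitemap page can be as simple as a plain list of links (the file name here is made up):

```html
<!-- sitemap.html - link this page from several other pages on the site -->
<h1>Site Map</h1>
<a href="widgets.php?size=big">Big widgets</a><br>
<a href="widgets.php?size=medium">Medium widgets</a><br>
<a href="widgets.php?size=small">Small widgets</a>
```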