Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

Bots ignoring my robots.txt
In place for a week, validates, but no use
roddy
10+ Year Member
Msg#: 184 posted 5:59 am on Nov 3, 2003 (gmt 0)

I started a forum a few months ago, and Googlebot's always loved it, especially since we went to PR6. At first I just let it run around as it wanted - I wasn't anywhere close to using up my bandwidth as it was, and I didn't want to scare it off by messing up my robots.txt.

However, last month it took about half a gig of bandwidth, and so far this month it's on course for almost twice that.

I put a robots.txt in place about a week ago. I checked it carefully, validated it, and waited. No joy. It's still fetching the posting, log-in and search pages I've tried to disallow. I checked Google's help pages, and they seem to say that the robots.txt should get read every 24 hours - so it should have kicked in by now.

Any ideas? Any way I can force Googlebot to read my robots.txt?

This is what I'm using. I want to restrict all bots to reading only the index, forum and topic pages, so I disallowed everything else.

User-agent: *
Disallow: privmsg.php
Disallow: search.php
Disallow: faq.php
Disallow: memberlist.php
Disallow: groupcp.php
Disallow: profile.php
Disallow: login.php
Disallow: posting.php
Disallow: viewonline.php

Roddy

 

ukgimp
WebmasterWorld Senior Member, 10+ Year Member
Msg#: 184 posted 9:01 am on Nov 3, 2003 (gmt 0)

First thing to check: is your robots.txt valid?

[searchengineworld.com...]

roddy
10+ Year Member
Msg#: 184 posted 11:46 am on Nov 3, 2003 (gmt 0)

I checked it carefully, validated it

Apologies, perhaps that wasn't clear enough. Yes I validated it. Twice. And again just now . . .

Roddy

ukgimp
WebmasterWorld Senior Member, 10+ Year Member
Msg#: 184 posted 11:55 am on Nov 3, 2003 (gmt 0)

>>Apologies, perhaps that wasn't clear enough

No you were, it was I that missed that bit :)

I had a similar problem with images - I was still getting requests from the Google image search. It seems to be less and less now. You have to wait for the index to update, I would guess, so that all the old data has been replaced by new. With all the update mania I tend to wait and see how things progress. I know that's no real help for you right now, but I am sure it will work in the end.

Ahh, do you see requests for the robots.txt in your logs? If you do then bingo - you know it has been read, and you will just have to wait it out.
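That log check is easy to script. Here's a minimal sketch in Python with made-up Apache-style log lines - in practice you'd read your real access log file instead; the IPs, paths and timestamps here are hypothetical:

```python
# Minimal sketch: scan Apache-style access-log lines for Googlebot
# fetches of robots.txt. The lines below are made-up examples; in
# practice you would read them from your server's access log.
sample_log = [
    '66.249.66.1 - - [03/Nov/2003:05:12:00 +0000] "GET /robots.txt HTTP/1.0" 200 210 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [03/Nov/2003:05:12:05 +0000] "GET /posting.php?t=123 HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
]

# Keep only Googlebot requests for robots.txt
robots_fetches = [line for line in sample_log
                  if '"GET /robots.txt' in line and "Googlebot" in line]
print(len(robots_fetches))  # 1 -> the bot has requested robots.txt
```

If the count is zero over a week of logs, the bot hasn't read the file at all, which is a different problem from it ignoring the rules.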

roddy
10+ Year Member
Msg#: 184 posted 12:04 pm on Nov 3, 2003 (gmt 0)

So I've just got to wait, unless they've requested robots.txt, in which case I've . . . just got to wait.

Which I've already done, for one week, which is 7 times as long as Google says it should take to be registered. Hmmmmmm.

Roddy

Nick_W
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 184 posted 12:06 pm on Nov 3, 2003 (gmt 0)

Are the pages it's requesting dynamic - like privmsg.php?x=y&a=b?

I think bots see that as a different page.

Nick

roddy
10+ Year Member
Msg#: 184 posted 12:09 pm on Nov 3, 2003 (gmt 0)

Yes, they are dynamic.

I've disallowed (for example) posting.php

Would that still allow posting.php?t=123?

Roddy

Nick_W
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 184 posted 12:12 pm on Nov 3, 2003 (gmt 0)

I think so yes. Check out the last couple of msgs here: [webmasterworld.com...]

Although I've not checked my logs in a few days, the last time I looked those pages were still being picked up despite following Google's own advice.

Nick

roddy
10+ Year Member
Msg#: 184 posted 12:17 pm on Nov 3, 2003 (gmt 0)

Looks useful, but this will prevent Google from crawling ALL dynamic pages - I only want to prevent crawling of certain pages.

(actually I'm quite happy to treat all bots the same, but Google is the only one taking any significant bandwidth)

Roddy

Nick_W
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 184 posted 12:22 pm on Nov 3, 2003 (gmt 0)

Yes, sorry - didn't think! I rewrite my URLs.

You might have to cloak them to return 404s to bots.

Nick

jdMorgan
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 184 posted 3:02 am on Nov 5, 2003 (gmt 0)

roddy,

May I suggest:

User-agent: *
Disallow: /privmsg.php
Disallow: /search.php
Disallow: /faq.php
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /profile.php
Disallow: /login.php
Disallow: /posting.php
Disallow: /viewonline.php

Ref: [robotstxt.org...]

The robots.txt validator -- like most other validators -- indicates only that the 'code' is valid, not that it will do what you want it to do.

Disallowing /xyz.php will also disallow /xyz.php?anything

Jim
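Jim's point about prefix matching can be checked with Python's standard-library robots.txt parser (urllib.robotparser in Python 3) - a quick illustration of how a leading-slash rule behaves, not of how Googlebot itself parses:

```python
# Minimal sketch: confirm that "Disallow: /posting.php" also blocks
# query-string variants like /posting.php?t=123, since rules match
# by path prefix.
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /posting.php
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/posting.php"))        # False
print(rp.can_fetch("*", "http://example.com/posting.php?t=123"))  # False
print(rp.can_fetch("*", "http://example.com/viewtopic.php"))      # True
```

Note the rule without the leading slash ("Disallow: posting.php") does not match "/posting.php", which is exactly the bug in the original file.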

roddy
10+ Year Member
Msg#: 184 posted 7:10 am on Nov 5, 2003 (gmt 0)

Thanks for that. Actually, over the last 24 hours Googlebot seems to have calmed down and started paying attention to the robots.txt. I'll need to wait a while to be really sure, and if I have any more problems I'll try your suggestion.

Thanks for all the help

Roddy
