Forum Moderators: goodroi

Keeping robots out of cgi bin

What's the best format?

         

Reno

5:55 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I want to keep robots from indexing anything in my cgi bin, is one of these more effective than the other? Or should I use both?

Disallow: /*.cgi$

Disallow: /cgi-bin/

..................................

jimbeetle

6:20 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'd use the second option, as Googlebot is the only SE bot that currently recognizes the special characters and operators.
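
For anyone who wants to see the difference, here's a sketch using Python's standard-library robots.txt parser, which implements only the original prefix-matching spec (example.com and the `blocked()` helper are my own illustration, not anything from the thread):

```python
from urllib.robotparser import RobotFileParser

def blocked(rule: str, path: str) -> bool:
    """True if `path` is blocked by a single Disallow rule, per
    Python's standard parser (original spec: prefix matching only)."""
    rp = RobotFileParser()
    rp.modified()  # mark the file as "read" so can_fetch() answers
    rp.parse(["User-agent: *", "Disallow: " + rule])
    return not rp.can_fetch("mybot", "http://example.com" + path)

# The directory form blocks the script for any spec-compliant bot...
print(blocked("/cgi-bin/", "/cgi-bin/test.cgi"))  # True
# ...but a plain-spec parser treats "/*.cgi$" as a literal prefix,
# so the wildcard rule blocks nothing.
print(blocked("/*.cgi$", "/cgi-bin/test.cgi"))    # False
```

A bot that only implements the original spec sees `/*.cgi$` as a literal path starting with an asterisk, which never matches a real URL, which is why the directory form is the safer choice.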

goodroi

7:46 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I would use option #2. Keep it simple so it is easy for the spiders to obey.

Reno

8:10 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you, and I'm glad I asked, as I had the other version in a number of my robots.txt files. I'll now go back and make the change.

So given your advice, I guess this is less useful also:

Disallow: /*.jpg$

How would you handle that?

..............................

jimbeetle

8:30 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Put all of your images in one directory, then disallow it.

Disallow: /images
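
One caveat worth knowing: Disallow rules are plain prefix matches, so `/images` (no trailing slash) also blocks paths like /images.html, while `/images/` limits the rule to that folder. A quick sketch with Python's standard parser (example.com is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Disallow rules are prefix matches: "/images" (no slash) would also
# block /images.html and /images2/, while "/images/" stops at the folder.
rp = RobotFileParser()
rp.modified()
rp.parse(["User-agent: *", "Disallow: /images/"])
print(rp.can_fetch("mybot", "http://example.com/images/logo.jpg"))  # False
print(rp.can_fetch("mybot", "http://example.com/images.html"))      # True
```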

Reno

8:46 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Put all of your images in one directory, then disallow it

I have a few sites going back to the mid-'90s that were originally hosted at services which did not allow subfolders (Homestead was like that at the time). While they've been off those hosts for years at this point, the primary structure still has many, many images at the top level (and all the img src coding pointing to them at that location!). Cleaning that up might be a good winter project, but for now I have to leave it alone because the time commitment would be considerable.

So, given that (unfortunate) status quo, is there a disallow for .jpg's that may still work?

Thanks again.....

.....................................

goodroi

9:22 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you are going to use robots.txt to prevent image bandwidth theft, it is best to store all images in one folder.

For older sites with legacy issues, it is sometimes easier to use an .htaccess file to prevent image bandwidth theft.
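
Hotlink protection in .htaccess usually takes a form like this (a sketch assuming Apache with mod_rewrite enabled; example.com is a placeholder for your own domain):

```apache
RewriteEngine On
# Allow requests with no referer (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Allow requests referred from your own site
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
# Refuse image requests from anywhere else
RewriteRule \.(jpe?g|gif|png)$ - [F,NC]
```

Unlike robots.txt, this is enforced by the server itself, so it works regardless of whether a bot chooses to honor robots.txt.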

Reno

10:12 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks goodroi. I do use .htaccess to control image theft -- my concern here is keeping the bots from using an excessive amount of bandwidth as they crawl. For whatever reason, MSNbot seems particularly guilty of this -- some days my logs show 4 or 5 MB of bandwidth for it at sites that are really rather small. Multiplied over the course of a month, that gets over 100 MB just for MSN (Yahoo and Google do not seem to use as much).

That being the case, I was hoping to control it somewhat by disallowing almost everything except html files:

User-agent: msnbot
Crawl-delay: 120
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$
Disallow: /cgi-bin/
Disallow: /art/
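
For reference, the * and $ extensions that some engines support can be modeled as simple regular expressions -- a rough sketch of the matching semantics in Python (the helper names are mine, not any engine's actual code):

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Model a robots.txt path rule using the * and $ extensions:
    * matches any run of characters, a trailing $ anchors the end."""
    anchored = rule.endswith("$")
    core = rule[:-1] if anchored else rule
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path: str, rules) -> bool:
    # A rule matches if it matches from the start of the path.
    return any(rule_to_regex(r).match(path) for r in rules)

rules = ["/*.jpg$", "/*.gif$", "/*.cgi$", "/cgi-bin/", "/art/"]
print(is_disallowed("/photos/cat.jpg", rules))   # True
print(is_disallowed("/index.html", rules))       # False
```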

..................................

jimbeetle

10:34 pm on Sep 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, how do you like that: I just checked, and MSN is now recognizing wildcards. So it looks like you're good to go, Reno.

Hmmm, feels like goodroi's following me around today. Guess I'd better watch my back.

Reno

6:00 pm on Sep 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks very much for checking, jimbeetle -- I'll go with that format and hopefully get MSN's crawl down to a more reasonable bandwidth usage ...

.......................................