Forum Moderators: goodroi


robots.txt vs htaccess for control

If robots.txt is not dependable, why not just use .htaccess?

ccubed99

11:07 pm on Jan 27, 2006 (gmt 0)

10+ Year Member



I am trying to understand the mechanics of controlling spiders, as this is new to me. I understand the meta tags, I have a basic understanding of the contents of robots.txt, and I know some of the basics of .htaccess.

That said... if compliance with robots.txt depends on the spider being designed to read it and follow its rules -- and a spider can just as easily be designed to ignore robots.txt -- why not just use .htaccess to control where the spider can go?

Stefan

11:38 pm on Jan 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



.htaccess is certainly a better way of banning bad bots than robots.txt, but if you use it to tell G or Y what they can visit, it could get tricky, eh? They would be finding internal links from allowed pages to disallowed pages, and then getting 403s or something when they try to retrieve them later. Maybe it would work, I never really thought about it before - might the whole thing seem a bit dodgy to the SEs though? I'm interested in seeing what others have to say on it.
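Something like the following is what it would take, I suppose -- the directory and the bot patterns here are just hypothetical placeholders:

    # hypothetical: fencing a directory off from the big SEs via mod_rewrite
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp) [NC]
    RewriteRule ^private/ - [F]

Every internal link into /private/ would then come back to them as a 403.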

And welcome to WebmasterWorld.

Pfui

11:50 pm on Jan 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I use .htaccess with mod_rewrite to do 99.9% of my robot (and various bad/iffy visitor) access control.

That said, I usually don't recommend that route when fielding Qs about robots.txt because: not all servers have the mod_rewrite module compiled in; the module is head-bangingly complicated; .htaccess can be complicated in its own right; and many ISPs don't allow users to 'see' dot-files, so .htaccess simply isn't an option.
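For the curious, a bare-bones version of the sort of rule I mean -- the user-agent patterns are only examples, substitute whatever you actually need to block:

    # deny any request whose User-Agent matches the listed patterns
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (EmailSiphon|EmailWolf|WebZIP) [NC]
    RewriteRule .* - [F]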

jdMorgan

11:52 pm on Jan 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Stefan,

Dodgy, yes. Very. Or at least 'unprofessional' -- and possibly counted as a "quality indicator."

ccubed99,

Welcome to WebmasterWorld!

  • Use robots.txt to control what you want robots.txt-compliant 'good' spiders to *fetch* -- bandwidth control, in other words. If they are nice enough to ask, give them a polite robots.txt reply. I specifically said 'fetch' here, because that is all that robots.txt accomplishes. Some search engines, including some of the majors, don't need to fetch a page to list it in their results; they can create a search result based on links they find on other sites pointing to your page, and the link text associated with those links. I find this annoying, but that leads us to...

  • Use the on-page meta-robots tag to control what search engines *list* in their results. (If you mark a page as "noindex," then you must allow it to be fetched in robots.txt -- otherwise the spider can't fetch and read the page to find the robots meta tag.)

  • Use .htaccess to stop rogue spiders that don't fetch robots.txt, or that fetch it and then ignore it, and to ensure that good spiders don't wander into forbidden territory due to a bug in their code or an error in your robots.txt or on-page meta-robots tags. (Minimal sketches of all three methods follow below.)

The three methods are complementary, but in no way are any of them equivalent.
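For illustration, here is a bare-bones sketch of each layer. The paths and the 'BadBot' name are hypothetical placeholders, and the .htaccess sketch uses mod_setenvif rather than mod_rewrite, so it works even where mod_rewrite isn't available:

    # robots.txt -- a polite request; compliant spiders won't fetch these paths
    User-agent: *
    Disallow: /private/

    <!-- on-page meta-robots -- the page may be fetched, but shouldn't be listed -->
    <meta name="robots" content="noindex,follow">

    # .htaccess -- enforcement; denies the fetch whether or not the bot honors robots.txt
    SetEnvIfNoCase User-Agent "BadBot" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot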

Jim

Stefan

12:12 am on Jan 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great summary, jd. I thought it had to be dodgy, somehow. At the very least, it would look like a lot of your internal links were bad. I wasn't even sure what kind of error code you would throw them.

Ccubed99 - your first post was a good question, and you got a good answer. You're batting a thousand so far :-)

ccubed99

3:39 am on Jan 28, 2006 (gmt 0)

10+ Year Member



Thank you all for the warm welcome and the succinct answers. The big picture was feeling a little fuzzy and gray while I was putting the details together, and that definitely cleared up any questions I had about how the three methods relate.