|what we do for our robots|
| 10:43 pm on Aug 4, 2011 (gmt 0)|
Not sure if this is philosophical meandering or a statistical question. (Also not sure if I'm in the right forum, but the next passing Moderator will know.)
I took a closer look at my htaccess files. Plural*, because I've got one directory-specific file thanks to massive rearranging of the whole directory. That one's got nothing but unconditional redirects; everything else is in the real htaccess.
Every single thing in the secondary htaccess, and at least 90% of the main htaccess, exists solely for the benefit of robots. (The other 10% is the no-hotlinking routine and my personal decision to lock out an entire country. Oh, and one picture that I've got linked from a forum but I can't remember where, so I can't edit its address at the source.)
If there were no robots, the /paintings/ directory would get along handily with its directory-specific 404 page listing the new subdirectories, along with a one-line htaccess drawing attention to it. Within the directory, all links are current and correct.
When I change an illustration's format from jpg to png, or put it into a subdirectory, up goes another block of htaccess. Humans don't need it; the illustrations are called by the page they're on.
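That kind of block is usually a one-line mod_alias redirect. A sketch of the shape it takes (the filename and directory here are made up for illustration, not taken from the thread):

```apache
# Hypothetical: an illustration converted from jpg to png,
# redirected so robots holding the old URL get pointed at the new one
RedirectMatch 301 ^/paintings/sample-sketch\.jpg$ /paintings/sample-sketch.png
```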
When I make a mistake in a link, so people are pointed to a nonexistent page, up goes another block of htaccess to intercept any robots who happened to see the incorrect link. (It was only up for a few days and thankfully did not catch the attention of g###, or that particular redirect would have to stay there forever.)
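For a one-off typo in a link, the interception is typically a plain Redirect from the misspelled path to the real one (both paths below are invented examples):

```apache
# Hypothetical: a link briefly pointed at a misspelled page name;
# this catches any robot that saw it before the link was fixed
Redirect 301 /paintings/galery.html /paintings/gallery.html
```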
Would the world end if everyone said To ### with it and just wrote their htaccess for humans?
* Dual, if you speak the appropriate language. I don't count the occasional Options +Indexes one-liners.
| 9:56 pm on Aug 5, 2011 (gmt 0)|
You'd see your bandwidth usage shoot up.
You'd have to add a shed load more filters to your logging.
Your site would probably fail more often.
Need I go on?
Be aware that if you have internal rewrites in the .htaccess file in the root and you also have some external redirects in the .htaccess file in a folder, those redirects will expose previously rewritten requests back out on to the web as new URLs.
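A minimal sketch of that interaction, with invented file names. The root file rewrites internally, so the visitor never sees the /folder/ path; but the folder's own external redirect then fires on the rewritten request and sends the internal URL back out to the browser:

```apache
# Root .htaccess: internal rewrite -- the address bar still shows /pretty-page
RewriteEngine On
RewriteRule ^pretty-page$ /folder/real-page.html [L]

# /folder/.htaccess: external redirect -- this also matches the request
# that was just rewritten above, so the browser is now redirected to
# /folder/new-page.html and the previously hidden path is exposed as a new URL
Redirect 301 /folder/real-page.html /folder/new-page.html
```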
| 1:36 am on Aug 6, 2011 (gmt 0)|
|Be aware that if you have internal rewrites in the .htaccess file in the root and you also have some external redirects in the .htaccess file in a folder, those redirects will expose previously rewritten requests back out on to the web as new URLs. |
You would weep big tears if you could see my htaccesses (dual) :) because I use both Rewrite and Redirect. But nothing that has been touched by the root-level htaccess can ever reach the /paintings/ htaccess. The only potential overlap is hotlinks, and those get rewritten (not redirected) to a different directory.
But I wasn't thinking about locking out unwanted robots or sending the really foul ones away to contemplate their navels at 127.0.0.1. (When you're on shared hosting I think that's the final arrow in your quiver, darn it.) What exasperates me is all those redirects that exist purely to keep the googlebot from looking for files that haven't existed since 2007 and are linked from nowhere-- and then complaining that it can't find them.
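The 127.0.0.1 trick mentioned above is usually done with a user-agent match and an external redirect; a sketch, with the bot name as a placeholder:

```apache
# Hypothetical: send a known-bad user-agent off to contemplate its navel
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* http://127.0.0.1/ [R=301,L]
```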
| 4:34 am on Aug 6, 2011 (gmt 0)|
Note that if you mix directives from different modules (mod_alias and mod_rewrite), you must account for the order of processing and the use of the PT flag:
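The standard illustration of the PT flag, adapted from the Apache documentation (this form belongs in the server config, since Alias is not allowed in .htaccess; in per-directory context PT is implied):

```apache
# mod_rewrite runs first; without [PT] its result would be treated as a
# filesystem path and never handed back to mod_alias, so the Alias below
# would silently never match
RewriteEngine On
RewriteRule ^/abc(.*) /def$1 [PT]
Alias /def /ghi
```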
| 12:24 am on Aug 7, 2011 (gmt 0)|
|What exasperates me is all those redirects that exist purely to keep the googlebot from looking for files that haven't existed since 2007 and are linked from nowhere |
Google never forgets a URL and will continue returning to see what they look like. Returning 404 and/or 410 errors is appropriate unless you're trying to redirect visitors or salvage rank for the replacement pages. Since you say they 'are linked from nowhere', I think it's safe to clean up your htaccess, let Googlebot complain about the 404s, and get on with building us a great site!
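If you want to be explicit about the 410, mod_alias has a status keyword for it (the path here is just an example):

```apache
# Hypothetical: answer 410 Gone instead of carrying a redirect forever
Redirect gone /paintings/long-dead-page.html
```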
Block the pages in robots.txt and perform a removal request if you want Googlebot to stop looking for them?
| 2:22 am on Aug 7, 2011 (gmt 0)|
|Block the pages in robots.txt and perform a removal request if you want Googlebot to stop looking for them? |
Holy ###. Do you know, it never occurred to me that I can block nonexistent (or, in this case, no-longer-existent) directories just as easily as real ones. That one step alone should clear up a lot of garbage.
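And blocking a no-longer-existent directory looks exactly like blocking a real one (directory names invented for the example):

```
# robots.txt -- Disallow works the same whether the directory exists or not
User-agent: *
Disallow: /old-illustrations/
Disallow: /paintings-2007/
```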
:: mopping brow ::