I am a little confused about how to address spiders in my robots.txt file. Some are straightforward, like ArchitextSpider; some are not. The Spider Spotting Chart on this site lists the Alta Vista spider as:
Scooter/2.0 G.R.A.B. X2.0
Can I address this spider as Scooter only? Do I need the version (2.0, 1.0)? Should I make a robots.txt file for each of these?
Also, I read some advice from a professional at the Web Position Gold forum that said to make a doorway page for each keyword, but I don't observe this in any page when looking at the source code. In fact, pages have come up on a search for my keywords that don't even list them as keywords. Would this be beneficial if I have only a half dozen or so keywords?
Any advice would be appreciated. I have been researching for months and I am ready to dig in and start submitting.
Hi Kenny, sorry I missed this post the first time around. Yes, you should be able to do just "scooter" alone. It isn't perfect, but it does work.
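For anyone who wants to sanity-check a rule before uploading it, here is a minimal sketch using Python's standard urllib.robotparser. The /private/ path and the rule set are made up for illustration, and this only shows how Python's parser matches the short name against the full agent string. Alta Vista's own matching may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the short name "Scooter" with no version number.
rules = """\
User-agent: Scooter
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The full agent string as listed on the Spider Spotting Chart.
full_agent = "Scooter/2.0 G.R.A.B. X2.0"

blocked = parser.can_fetch(full_agent, "/private/page.html")
allowed = parser.can_fetch(full_agent, "/index.html")
print(blocked, allowed)
```

One robots.txt file holds a record for every spider you want to address, so there is no need for a separate file per spider or per version.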
I wouldn't do traditional doorway pages with 'spam' keywords. I've never cared for them, and they are easier to spot these days than ever before.
As this thread was already started may I please follow it up here with a similar question? I'm a bit confused about robots.txt.
I've read just about everything I can find on it. I understand how to make the file. I know where it must reside on the server. I understand about spider agent names and spider IP addresses. However, what I can't find out is how I'm supposed to monitor the robots.txt file once it's up there.
I'm already running Web Trends and Hit Box simultaneously, so I do get information about spiders visiting my site. But it sure would be nice to be able to access one place (i.e. robots.txt) where I could get a read on spider activity only.
If anyone has any advice they can give, it would be appreciated very much. I know I'm close to having this down pat, but not quite there yet.
p.s. Wonderful site.
This is rather bizarre, but I've been sitting here trying to get Apache to execute an SSI on a TXT file (very tricky, but doable). Once that is done, I can run a logger from the robots.txt and monitor pulls.
The only other way to monitor robots.txt is via your server logs. If you don't have a robots.txt then look in the site error log for "file not found".
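As a rough sketch of the server-log route, the Python below pulls only the robots.txt requests out of an access log and counts them per agent. It assumes the log is in Apache's combined format (which includes the user-agent field). The sample lines, IP addresses, and the robots_hits name are invented for illustration. A missing robots.txt shows up here as a 404 status.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def robots_hits(lines):
    """Yield (host, status, agent) for every request to /robots.txt."""
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path").split("?")[0] == "/robots.txt":
            yield m.group("host"), int(m.group("status")), m.group("agent") or "-"

# Made-up sample lines; in practice, iterate over open("access_log").
sample = [
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 200 68 "-" "Scooter/2.0 G.R.A.B. X2.0"',
    '192.0.2.11 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"',
    '192.0.2.12 - - [10/Oct/2000:13:57:12 -0700] "GET /robots.txt HTTP/1.0" 404 209 "-" "ArchitextSpider"',
]

hits = list(robots_hits(sample))
by_agent = Counter(agent for _, _, agent in hits)
for agent, count in by_agent.items():
    print(count, agent)
```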
Hitbox cannot track a spider. Webtrends can, of course, since it uses your server log files, but Hitbox is a graphic counter only (it records about 75% of your hits), and spiders generally don't fetch the counter graphic.
Would you care to publish that Apache trick to execute SSIs from txt-files here, please? This sounds most interesting.
I found this older post while I was looking for... oh heck, I forgot what I was looking for.
Anyway, Brett, I'm wondering how did you configure Apache to use SSI on a text file? Did you do AddType type via .htaccess or was it... more involved, tiresome, lengthy?
I, too, would like to view all the hits to robots.txt separately from my other logfiles, and to ban bots using wildcards. It seems I have to get too specific in my plain robots.txt file, and I keep getting hit with subtle variations of the same pesky bots.
Rather than make my robots.txt look like Webster's Dictionary, I'd just like some little exec file that logged 'em, welcomed them, or banned them. It'd also be slick to re-use the same little script for multiple clients as a separate deal instead of configuring and reconfiguring robots.txt. Update one script - and let it fly.
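To make the idea concrete, here is a minimal sketch of what such a script might look like in Python. Everything in it is hypothetical: the pattern list, the robots_body and log_visit names, and the choice of a flat allow-all or deny-all reply. How it gets wired to the robots.txt URL (SSI, CGI, or something else) is left open. Keep in mind that robots.txt is only advisory, so a bot that ignores it will not be stopped by this.

```python
from fnmatch import fnmatchcase

ALLOW_ALL = "User-agent: *\nDisallow:\n"
DENY_ALL = "User-agent: *\nDisallow: /\n"

# Hypothetical wildcard patterns, matched against the lowercased agent string.
BANNED_PATTERNS = ["peskybot*", "*harvester*"]

def robots_body(user_agent):
    """Return the robots.txt text to serve to this user agent."""
    agent = (user_agent or "").lower()
    for pattern in BANNED_PATTERNS:
        if fnmatchcase(agent, pattern):
            return DENY_ALL
    return ALLOW_ALL

def log_visit(log, user_agent, remote_addr):
    """Record one robots.txt pull; here the log is just a list of tuples."""
    banned = robots_body(user_agent) == DENY_ALL
    log.append((remote_addr, user_agent, "banned" if banned else "welcomed"))

visits = []
log_visit(visits, "Scooter/2.0 G.R.A.B. X2.0", "192.0.2.10")
log_visit(visits, "PeskyBot-v2", "192.0.2.66")
log_visit(visits, "PeskyBot/3.1 (variant)", "192.0.2.67")
for entry in visits:
    print(entry)
```

Because the patterns live in one list, the same script could be reused across sites by swapping only that list.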
In Apache you only need AddType if you want to define .txt files to be something else (such as HTML). IE is broken in all versions I've tried though, so don't expect it to treat the HTTP Content-Type correctly. :(
If you have access to your HTTP daemon configuration files then use "AddHandler server-parsed .txt" in (usually but not always) /etc/httpd/conf/httpd.conf
If you want to use an .htaccess file (or whatever it's called in the AccessFileName directive) then you can add "AllowOverride FileInfo" (or "AllowOverride All") to httpd.conf (or ask your server admin).
Performance can suffer considerably if overrides are enabled, so don't be surprised if your admin won't let you use .htaccess.
Warning: www dot apache dot org is likely to be far more accurate and reliable than me.