Even if you 'block' log spammers with 403s, they still get into your logs.
The only real solution is to password-protect your log pages, which removes their incentive for spamming you.
Sorry, I don't understand.
Since they're accessing the site with spiders, how will they get a 403 if I block them in robots.txt? Aren't they stopped completely?
If they are visiting for spam purposes it is unlikely that they would obey the robots.txt file.
Can you explain the purpose of this spam technique, I have not heard of it before.
The technique is commonly called "referral log spamming". Someone is hired by a website to produce links to the site in the weblogs and webstats of other sites (like mine). This will generate hits (curious webmasters like myself will wonder who that referrer is and click the link) or incoming links in blogs, which in turn will e.g. increase search engine rankings.
This "someone", i.e. the spammer, is using a spider disguised often as a search engine spider and there's no way AFAIK to grab the IP the spider is coming from (the spammer) and there's a meaning of blocking the domains of the spammers clients since.
The above is what I understand. I might err here and there, though, but basicallym that's it.
Do a Google search for "referral log spamming" and you'll get a few hits.
So, I still wonder how I configure the robots.txt to allow certain spiders but block the rest.
As Dan_Vendel said above, you can't block anyone using robots.txt. It's a voluntary protocol and only legitimate spiders will obey it (and not even all of them).
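For what it's worth, this is what the robots.txt syntax for allowing certain spiders while disallowing the rest looks like (Googlebot is just an example of a spider you might want to allow) — but again, only cooperating bots will honor it:

```
# Allow Google's crawler everywhere; an empty Disallow means "allow all"
User-agent: Googlebot
Disallow:

# Every other spider: disallow the whole site
User-agent: *
Disallow: /
```

A spam spider will simply ignore this file, so it does nothing against log spammers.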
The way to actually block certain types of visitors from your site is to use the .htaccess file and identify them by User Agent or IP address (or various other parameters). There are literally thousands of references to this technique in the Apache Web Server forum.
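A minimal .htaccess sketch of that technique, using Apache's mod_setenvif (the User-Agent strings and the IP range below are placeholders, not a real blocklist):

```apache
# Flag requests whose User-Agent matches known spam bots
SetEnvIfNoCase User-Agent "EvilSpider"  bad_bot
SetEnvIfNoCase User-Agent "SpamCrawler" bad_bot

# Deny flagged bots and one example IP range; allow everyone else
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 192.0.2.0/24
```

Note that this returns a 403, so as mentioned above the request still appears in your raw access log — it just never reaches the page.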
The purpose of log spamming is not to lure curious webmasters into clicking a link, but to get the spammer's link into your log stats, which then get indexed by Google, giving their site another referrer and a higher PR.
The spammer's URL will get into your log stats whether you block them (403) or not. So, as I said above, the only way to stop them is to take away the incentive and make your log stats inaccessible to the general public (this can also be done using .htaccess).
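A rough sketch of password-protecting a stats directory with .htaccess — the file paths and user name here are hypothetical, and your host must have AllowOverride AuthConfig enabled for the directory:

```apache
# .htaccess placed in the stats directory
AuthType Basic
AuthName "Site statistics"
# Password file created beforehand with: htpasswd -c /home/user/.htpasswd statsuser
AuthUserFile /home/user/.htpasswd
Require valid-user
```

Once the stats pages are behind a password, Google can no longer index them, which removes the spammer's payoff.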
Put them in your log file filter and never look back. An "access denied" will just cost you more server resources than simply filtering them out. They will just get new IPs anyway.
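One simple way to do that kind of filtering is to strip the spam lines out of the raw log before feeding it to your analyzer. A minimal sketch (the spam domains are placeholders, and the log file name assumes a typical Apache setup):

```shell
# Remove any log line whose referrer matches a known spam domain,
# writing the cleaned log to a new file for the stats program to read.
grep -v -E 'spam-domain\.example|casino-spam\.example' access_log > access_log.clean
```

You would maintain the pattern list as new spammers show up; as noted above, they rotate IPs and domains, so this is ongoing housekeeping.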
dcrombie, you say:
"..the only way to stop them is to take away the incentive and make your log stats inaccessible to the general public (also can be done using .htaccess)."
I haven't a clue how to make them inaccessible to the public. As a matter of fact, I took it for granted that they were.
I'm on a *nix box with Apache and CPanel. Can you tell from that how I do it, and which unpleasant consequences this might have, if any?
Appreciate your help!
I'd be happy to put them in the log file filter if I knew how to do it.
I'm on a *nix server with Apache and CPanel. Will that be enough to give me a hint on how to do it?
Your help is appreciated!
I use webalizer which has a config file. In this you can get webalizer to ignore agents, IPs etc so that they are never reported. i.e. they will never appear in your logfile reports.
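A short webalizer.conf sketch of those Ignore* keywords (the patterns are made-up examples; check your own config for the exact matching rules):

```
# webalizer.conf — drop spam entries from the generated reports
IgnoreReferrer  spam-domain.example/
IgnoreAgent     EvilSpider
IgnoreSite      192.0.2.*
```

After adding entries like these, the matching records simply stop appearing in the referrer, agent, and site tables of the reports.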
I would imagine that most logfile analysis programs are configurable in this way.
I presume that this is what brett was describing.