Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

Referral log spamming: How to use robots.txt to keep them out?

 4:33 pm on Mar 23, 2004 (gmt 0)

In my stats, I have about 50% referrals coming from BS sites using referral log spamming.
As I understand it, they do this by crawling the site like a SE spider.
I thought I'd try to block them in the robots.txt, but I'm not sure how to do this without blocking legitimate bots as well.
Can I "allow" the legitimate and "disallow" the rest, sort of?

TIA for info



 5:32 pm on Mar 23, 2004 (gmt 0)

Even if you 'block' log spammers with 403s, they still get into your logs.

The only real solution is to password-protect your log pages, which removes their incentive for spamming you.


 6:43 pm on Mar 23, 2004 (gmt 0)

Sorry, I don't understand.
As they're accessing the site by spiders, how will they get a 403 if I block them in the robots.txt? Aren't they stopped completely?


 6:48 pm on Mar 23, 2004 (gmt 0)

If they are visiting for spam purposes it is unlikely that they would obey the robots.txt file.

Can you explain the purpose of this spam technique? I have not heard of it before.



 8:09 pm on Mar 23, 2004 (gmt 0)


The technique is commonly called "referral log spamming". Someone is hired by a website to produce links to the site in the weblogs and webstats of other sites (like mine). This will generate hits (curious webmasters like myself will wonder who that referrer is and click the link) or inbound links in blogs, which in turn will e.g. increase search engine rankings.
This "someone", i.e. the spammer, is using a spider that is often disguised as a search engine spider. AFAIK there's no way to grab the IP the spider is coming from (the spammer), and there's no point in blocking the domains of the spammer's clients.

The above is what I understand. I might err here and there, but basically that's it.

You can Google "referral log spamming" and you'll get a few hits.

So, I still wonder how I configure the robots.txt to allow certain spiders but block the rest.
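
To make the question concrete, here is a minimal sketch (not from the thread) of the allow-list layout being asked about, checked with Python's standard urllib.robotparser module. The choice of Googlebot as the one allowed spider and the agent name SomeSpamBot are illustrative assumptions. It only shows how a crawler that chooses to honor robots.txt would read the file.

```python
from urllib.robotparser import RobotFileParser

# Allow-list style robots.txt: one named spider may crawl everything,
# every other user agent is asked to stay out entirely.
# "Googlebot" is just an example of a spider you might want to allow.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, url="/index.html"):
    """Return True if a robots.txt-obeying client with this agent may fetch url."""
    return parser.can_fetch(agent, url)


print(may_fetch("Googlebot"))    # True: matches the first record
print(may_fetch("SomeSpamBot"))  # False: falls through to the "*" record
```

A client that sends "Googlebot" as its user agent, or that never reads robots.txt at all, is not affected by any of this.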



 9:35 am on Mar 24, 2004 (gmt 0)

As Dan_Vendel said above, you can't block anyone using robots.txt. It's a voluntary protocol and only legitimate spiders will obey it (and not even all of them).

The way to actually block certain types of visitors from your site is to use the .htaccess file and identify them by User Agent or IP address (or various other parameters). There are literally thousands of references to this technique in the Apache Web Server forum.

The purpose of log spamming is not to lure curious webmasters into clicking on a link, but to get their link into your log stats and then indexed by Google which gives their site another referrer and a higher PR.

The spammer's URL will get into your log stats whether you block them (403) or not. So, as I said above, the only way to stop them is to take away the incentive and make your log stats inaccessible to the general public (this can also be done using .htaccess).


 9:56 am on Mar 24, 2004 (gmt 0)

Put them in your log file filter and never look back. An "access denied" will just cost you more server resources than simply filtering them out. They will just get new IPs anyway.


 12:27 pm on Mar 24, 2004 (gmt 0)

dcrombie, you say:
"..the only way to stop them is to take away the incentive and make your log stats inaccessible to the general public (also can be done using .htaccess)."

I haven't a clue how to make them inaccessible to the public. As a matter of fact, I took for granted that they were.

I'm on a *nix box with Apache and CPanel. Can you tell from that how I'd do it, and what unpleasant consequences this might have, if any?

Appreciate your help!



 12:29 pm on Mar 24, 2004 (gmt 0)


I'd be happy to put them in the log file filter if I knew how to do it.
I'm on a *nix server with Apache and CPanel. Is that enough for you to give me a hint on how to do it?

Your help is appreciated!



 12:36 pm on Mar 24, 2004 (gmt 0)

I use Webalizer, which has a config file. In this you can get Webalizer to ignore agents, IPs etc. so that they are never reported, i.e. they will never appear in your logfile reports.

I would imagine that most logfile analysis programs are configurable in this way.

I presume that this is what Brett was describing.
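
The filtering idea in this post can also be sketched as a small pre-processing step that drops spam-referred lines before the log reaches the stats program. This is only a rough illustration, assuming Apache's combined log format; the domain names are placeholders and the function names are made up for the example. The stats package's own ignore settings, where available, are usually the simpler route.

```python
import re
from urllib.parse import urlsplit

# Placeholder blocklist; replace with the referrer domains seen in your own stats.
SPAM_REFERRER_DOMAINS = {"spam-example.test", "casino-example.test"}

# Tail of a combined-format line: "request" status bytes "referrer" "user-agent"
LINE_TAIL = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)" "[^"]*"$')


def referrer_host(line):
    """Return the host name of the referrer field, or None if there is none."""
    match = LINE_TAIL.search(line.rstrip("\n"))
    if not match:
        return None
    return urlsplit(match.group("referrer")).hostname


def is_spam(line):
    """True if the referrer host is a blocklisted domain or one of its subdomains."""
    host = referrer_host(line)
    if host is None:
        return False
    return any(host == d or host.endswith("." + d) for d in SPAM_REFERRER_DOMAINS)


def filter_log(lines):
    """Keep only the lines whose referrer is not blocklisted."""
    return [line for line in lines if not is_spam(line)]


SAMPLE = [
    '1.2.3.4 - - [24/Mar/2004:09:56:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"http://www.spam-example.test/" "Mozilla/4.0"',
    '5.6.7.8 - - [24/Mar/2004:09:57:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/4.0"',
]
print(len(filter_log(SAMPLE)))  # 1: only the line without a spam referrer is kept
```

Because the match is on the referrer rather than the IP, it keeps working when the spammer switches addresses, though the blocklist still has to be maintained by hand.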

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved