Forum Moderators: incrediBILL
Just yesterday I was reading an interesting discussion dating from 2001, highly relevant to the current Google changes, which I found through a favourite search engine...
Those who cannot learn from history are doomed to repeat it.
Is that served to everyone? Or just to unrecognised robots / IPs?
If it's to everyone, then I think that is drastic but possibly inspired action.
If it's just to anyone unauthorised, then that would be in line with what has always been done with meta tags, etc, would it not?
Seeing what effect it will have on unauthorized bots. We spend 5-8hrs a week here fighting them. It is the biggest problem we have ever faced.
We have pushed the limits of page delivery and of IP-based and agent-based banning to avoid the rogue bots, but it is becoming an increasingly difficult problem to control.
Also, everyone will have to log in to access the site starting now.
A solution is being tested and worked on. It will probably take at least 60 days for the old pages to be purged from the engines.
Does robots.txt not get parsed in sequence - i.e. you do the inverse of what you did previously - allow the crawlers you specifically want at the top of the file, and the last line in the file is ban everything?
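One way to check the ordering question is to feed both layouts to a parser and compare the answers. The sketch below uses Python's standard `urllib.robotparser`; the rules and the bot name `SomeRogueBot` are made up for illustration and are not this site's actual file. It only shows how this one parser reads the file: it matches on the User-agent group, not on the position of the group in the file. Other crawlers may behave differently, and rogue bots will not read it at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: let Googlebot in, keep every other agent out.
ALLOW_FIRST = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Same two groups in the opposite order.
BAN_FIRST = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

def can_fetch(agent: str, path: str, robots_txt: str) -> bool:
    """Ask the stdlib parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    for name, text in (("allow first", ALLOW_FIRST), ("ban first", BAN_FIRST)):
        print(name,
              can_fetch("Googlebot", "/forum/", text),
              can_fetch("SomeRogueBot", "/forum/", text))
```

With this parser both orderings give the same result, because the `*` group is only used when no named group matches the agent.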
Wouldn't it have been better to have this system in place before stopping the bots crawling the site and cutting off our only way of searching the forum?
That is part of it and part of the testing. I have found that the majority DO obey robots. However, most of them use weird agent names or browser agent names. The majority certainly do not support cookies.
Agreed, tiger, but 12 million page views by rogue bots last week, while we were away at the conference, forced a change in that timeline.
Not sure why? (Apologies if this is off topic)
I can certainly agree that you have a major problem on your hands. However I don't quite understand the logic of banning all bots.
I would suggest that you redirect all requests for robots.txt to a script called robots.php or whatever, look up the IP in a list of known Googlebot IPs, and then feed Googlebot your usual robots.txt and anybody else the ban-all-bots robots.txt.
You could also log all requests for robots.txt to a DB or log file and go back and analyze which bots are following the disallow directives.
I'm doing something similar on my site and it's working pretty well.
Maybe you are already doing something along those lines, since you mentioned "cloaking".
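The robots.txt suggestion above could look roughly like this. It is a sketch in Python rather than PHP, and everything named here is an assumption: `TRUSTED_NETWORKS` holds a documentation-only address range (not real Googlebot addresses), `USUAL_ROBOTS` stands in for whatever the normal file contains, and the log is a plain list where a real script would write to a DB or log file.

```python
import ipaddress

# Hypothetical allowlist. 192.0.2.0/24 is a documentation range (RFC 5737),
# NOT a real crawler network; a real list would have to be maintained by hand.
TRUSTED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# Placeholder for the site's usual robots.txt (assumed content).
USUAL_ROBOTS = "User-agent: *\nDisallow:\n"
# The "ban all bots" version served to everyone else.
BAN_ALL_ROBOTS = "User-agent: *\nDisallow: /\n"

def robots_for(ip: str, user_agent: str, log: list) -> str:
    """Pick the robots.txt body for this requester and record the request."""
    addr = ipaddress.ip_address(ip)
    trusted = any(addr in net for net in TRUSTED_NETWORKS)
    # Keep every request so later page hits can be compared against it
    # to see which bots actually follow the disallow rules.
    log.append((ip, user_agent, trusted))
    return USUAL_ROBOTS if trusted else BAN_ALL_ROBOTS

if __name__ == "__main__":
    requests_log = []
    print(robots_for("192.0.2.10", "Googlebot/2.1", requests_log))
    print(robots_for("203.0.113.5", "WeirdAgent/0.1", requests_log))
    print(requests_log)
```

Matching on IP rather than on the agent string is the point of the design, since the agent name is trivial to fake.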
So, we start by banning bots, and then follow immediately with required cookies/logins for everyone. That will stop most of the bots. The ones it doesn't, we will follow up with session IDs and auto-ban in htaccess for page view abuse. Lastly, we will move to captcha logins, and then random login challenges with other captcha gfx requirements.
> Why in "Foo"?
I hate talking about it at all. It is like talking about security problems in public (given I believe that the majority of bots we see here are owned by members). However, it is better brought up by us, than someone else.
but based on known ip ranges of good bots?
The idea is that a legitimate user will not request more than X number of pages in a specified amount of time. So you limit access to the ones that go over the limit.
You should see the appropriate X from your stats, and make exception for the known major bot networks.
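A minimal sketch of that "X pages per time window" idea, assuming an in-memory sliding window keyed by some client identifier (IP, session, or login). The class name `PageRateLimiter`, its parameters, and the whitelist entry are all made up for illustration; a real site would need shared storage and a maintained list of known bot networks.

```python
import time
from collections import defaultdict, deque

class PageRateLimiter:
    """Allow at most `max_pages` page views per `window_seconds` per client."""

    def __init__(self, max_pages, window_seconds, whitelist=()):
        self.max_pages = max_pages
        self.window = window_seconds
        self.whitelist = set(whitelist)   # known good bots are never limited
        self.hits = defaultdict(deque)    # client -> timestamps of recent views

    def allow(self, client, now=None):
        if client in self.whitelist:
            return True
        now = time.monotonic() if now is None else now
        recent = self.hits[client]
        # Forget page views that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_pages:
            return False                  # over the limit: block or challenge
        recent.append(now)
        return True

if __name__ == "__main__":
    limiter = PageRateLimiter(max_pages=3, window_seconds=60,
                              whitelist={"known-good-bot"})
    print([limiter.allow("visitor", now=t) for t in (0, 1, 2, 3)])
```

The `now` argument exists only so the behaviour can be checked without waiting; in normal use it is left out.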
Wouldn't work here, where it is not uncommon to have more than 1000 visitors that will view more than 500 pages a day, or 200 visitors that will visit more than 1000 pages in an 8hr day, or 50 visitors that view more than 2000 pages a day. Diminishing returns on a script like that.
You could put a cap on the number of pageviews based on the 95-percentile of your normal users' stats, and have a whitelist system for those who really read 1000s of pages per day.
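To make the percentile idea concrete, here is a small sketch that derives the cap from per-user daily page view counts using the nearest-rank method. The function names and the sample numbers are invented for illustration; they are not stats from this site.

```python
def percentile_cap(daily_pageviews, pct=95):
    """Nearest-rank percentile of per-user daily page view counts."""
    ordered = sorted(daily_pageviews)
    if not ordered:
        raise ValueError("need at least one user's page view count")
    # Ceiling of pct% of the number of users, in integer arithmetic.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def over_cap(user, views, cap, whitelist):
    """True if this user should be limited: above the cap and not whitelisted."""
    return views > cap and user not in whitelist

if __name__ == "__main__":
    # Made-up sample: 100 users viewing 1..100 pages a day.
    cap = percentile_cap(range(1, 101))
    print(cap)
    print(over_cap("heavy-reader", 2000, cap, {"heavy-reader"}))
    print(over_cap("unknown-client", 2000, cap, set()))
```

Users who really do read thousands of pages go on the whitelist, so the cap only bites on clients nobody has vouched for.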
If you changed the robots.txt file to the following syntax (in the order shown below), wouldn't that allow only Googlebot in and keep all the other good bots out? Rogue bots will completely ignore the robots.txt anyway, but at least the site search would still work: