That's going to be an issue for the site search [webmasterworld.com].
Yeah, SE traffic is soooo 2001.
I've been using such a robots.txt for ages!
Works like a charm :P
And plenty of visitors from search engines :)
|That's going to be an issue for the site search. |
Yeah - that's my only problem with it. Is there anything we can use to replace search facilities for our own use if the SE's are no longer allowed in here?
Yeah, that kind of mandates a decent search facility onsite, one would think... Anything like that in the pipelines?
Since a good proportion of the current worth of WebmasterWorld (as a long-term subscribing member) is in the archives, it would seem a questionable decision to cut off access to it.
I was just yesterday reading an interesting discussion highly relevant to current Google changes dating from 2001 and found through a favourite search engine...
|Those who cannot learn from history are doomed to repeat it. |
Is that served to everyone? Or just to unrecognised robots / IPs?
If it's to everyone, then I think that is drastic but possibly inspired action.
If it's just to anyone unauthorised, then that would be in line with what has always been done with meta tags, etc, would it not?
Personally I could live a month or so without a site search, just for the insight we will all gain from this experiment. Maybe do it in two parts though: the first blocking all but Google, the second all but Yahoo - or maybe do it to searchengineworld?
You're not feeling the hit of the Rackspace bandwidth charges are you?
to see if they will respect the robots.txt, how fast you will be reincluded, or...?
Seeing what effect it will have on unauthorized bots. We spend 5-8hrs a week here fighting them. It is the biggest problem we have ever faced.
We have pushed the limits of page delivery, banning, ip based, agent based, to avoid the rogue bots - but it is becoming an increasingly difficult problem to control.
also - everyone will have to log in to access the site starting now.
a solution is being tested and worked on. It will probably take at least 60 days for the old pages to be purged from the engines.
Seems wrong to me to try and use robots.txt to ban rogue crawlers - the truly rogue crawlers don't obey it anyway.
Does robots.txt not get parsed in sequence - i.e. you do the inverse of what you did previously - allow the crawlers you specifically want at the top of the file, and the last line in the file is ban everything?
>a solution is being tested and worked on. It will probably take at least 60 days for the old pages to be purged from the engines
wouldn't it have been better to have this system in place before stopping the bots crawling the site and cutting off our only way of searching the forum?
> rogue crawlers don't obey it anyway.
That is part of it and part of the testing. I have found that the majority DO obey robots.txt. However, most of them use weird agent names or browser agent names. The majority certainly do not support cookies.
Agreed tiger, but 12 million page views by rogue bots last week, while we were away at the conference, caused a change in that timeline.
Ok, thanks - I can understand that.
Strange - I had been checking Google's cache for the robots.txt to see what Google sees. Now, there is no cache... has it been removed through the URL removal tool? or perhaps Google is getting a 404 (seems the easiest way!)
Not sure why? (Apologies if this is off topic)
I can certainly agree that you have a major problem on your hands. However, I don't quite understand the logic of banning all bots.
I would suggest that you redirect all requests for robots.txt to a script called robots.php or whatever, look up the IP in a list of known Googlebot IPs, and then feed Googlebot your usual robots.txt and anybody else the ban-all-bots robots.txt.
You could also log all requests for robots.txt to a DB or log file and go back and analyze which bots are following the disallow directives.
I'm doing something similar on my site and it's working pretty well.
Maybe you are already doing something along those lines, since you mentioned "cloaking".
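The IP-based robots.txt serving described above could be sketched like this in Python (a minimal sketch only - the netblock, filenames, and response bodies are placeholders, and real Googlebot verification is normally done with a reverse-plus-forward DNS check rather than a static IP list):

```python
# Serve a permissive robots.txt to known-good crawler IPs and a
# ban-everything robots.txt to anyone else, logging every fetch so you
# can later check which bots actually obey the disallow directives.
import ipaddress

# Hypothetical whitelist of crawler netblocks (placeholder values).
GOOD_BOT_NETS = [ipaddress.ip_network("66.249.64.0/19")]

ALLOW_GOOGLEBOT = "User-agent: Googlebot\nDisallow:\n"
BAN_ALL = "User-agent: *\nDisallow: /\n"

def robots_txt_for(remote_addr: str) -> str:
    """Return the robots.txt body to serve for this client IP."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in net for net in GOOD_BOT_NETS):
        return ALLOW_GOOGLEBOT
    return BAN_ALL

def log_request(remote_addr: str, user_agent: str,
                logfile: str = "robots_requests.log") -> None:
    """Append each robots.txt fetch for later obedience analysis."""
    with open(logfile, "a") as f:
        f.write(f"{remote_addr}\t{user_agent}\n")
```

The logging half is the interesting part for the test being run here: comparing which IPs fetched robots.txt against which IPs kept crawling tells you who ignores it.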
Foo? Why in "Foo"?
P.S. I would really love my competitors to try what you just did :-)
phpmaven, we have been doing EVERYTHING you can think of. This is a part of that ongoing process. We can't require all people to login and allow bots onto the site (eg: pure cloaking - we aren't the new york times!). Even the random ad scripts we cloak off to keep bots from seeing session-id-like content get grumbles from a lot of members. The claims are that we are either selling links (they claimed our links to westhost, and now rackspace, were paid), or that we are cloaking to get higher pr when we block bots from seeing session ids. eg: a no-win situation for us.
So, we start by banning bots, and then follow immediately with required cookies/logins for everyone. That will stop most of the bots. The ones it doesn't, we will follow up with session ids and auto-bans in htaccess for page view abuse. Lastly, we will move to captcha logins, and then random login challenges with other captcha gfx requirements.
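The htaccess auto-ban step mentioned above could be as simple as a script appending deny lines for abusive IPs (a sketch, assuming Apache 2.2-style access control; the IPs are placeholders):

```apache
# Hypothetical .htaccess fragment maintained by an auto-ban script.
# Offending IPs get appended as Deny lines when they exceed page view limits.
Order Allow,Deny
Allow from all
Deny from 192.0.2.17
Deny from 198.51.100.0/24
```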
> Why in "Foo"?
I hate talking about it at all. It is like talking about security problems in public (given I believe that the majority of bots we see here are owned by members). However, it is better brought up by us, than someone else.
Brett, I know the rule about the urls, but have you tried something like that, based on known IP ranges of good bots?
The idea is that a legitimate user will not request more than X number of pages in a specified amount of time. So you limit access to the ones that go over the limit.
You should see the appropriate X from your stats, and make exception for the known major bot networks.
I'm glad you're going to do captcha. Until you do, what is going to happen is people will write bots that get a username and then hit randomly to look like they are humans. Unfortunately, captcha has been broken - I have seen articles where bots can get past it. The good news is that what you are planning will keep away almost all bots. If somebody wants to crawl your site badly enough, there is not much you can do about it. You can just make it harder on them and more costly.
Wouldn't work here, where it is not uncommon to have more than 1000 visitors that view more than 500 pages a day, or 200 visitors that view more than 1000 pages in an 8hr day, or 50 visitors that view more than 2000 pages a day. Diminishing returns on a script like that.
It does not have to be just per day; you can try counting on an hourly basis with a fallback of a daily limit.
You could put a cap on the number of pageviews based on the 95-percentile of your normal users' stats, and have a whitelist system for those who really read 1000s of pages per day.
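The throttling scheme discussed above - an hourly cap with a daily fallback, plus a whitelist for known bots and approved heavy readers - might be sketched like this in Python (the limits and IPs are placeholders; the real numbers would come from your own stats, e.g. the 95th percentile of normal users' page views):

```python
# Per-IP page view throttle: hourly limit with a daily fallback,
# exempting whitelisted clients. Limits below are illustrative only.
import time
from collections import defaultdict, deque
from typing import Optional

HOURLY_LIMIT = 120            # placeholder - derive from real traffic stats
DAILY_LIMIT = 1000            # placeholder daily fallback
WHITELIST = {"66.249.66.1"}   # known good bots / approved heavy readers

_hits = defaultdict(deque)    # ip -> timestamps of recent page views

def allow_request(ip: str, now: Optional[float] = None) -> bool:
    """Record one page view and report whether this client is within limits."""
    if ip in WHITELIST:
        return True
    now = time.time() if now is None else now
    hits = _hits[ip]
    hits.append(now)
    while hits and now - hits[0] > 86400:   # drop views older than a day
        hits.popleft()
    last_hour = sum(1 for t in hits if now - t <= 3600)
    return last_hour <= HOURLY_LIMIT and len(hits) <= DAILY_LIMIT
```

A request that trips the limit could trigger a captcha challenge or an htaccess ban rather than an outright block, which keeps false positives recoverable.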
damn dangerous game Brett ... what if someone uses the url removal tool in google ..
Don't you think that any such request for WebmasterWorld would raise some sort of extra scrutiny within Google? I guess this forum is very well known... (at least that's what I hope).
it's automated .. all you need is to add the robots.txt, which Brett has done
I sure hope you'll keep us posted about any side-effects to this Brett! ;)
Incidentally, any chance of getting a better site-search now that Google and AllTheWeb won't be indexing new content?
If you changed the robots.txt file to the following syntax (in the order shown below), wouldn't that allow only Googlebot in and keep all the other good bots out? Rogue bots will completely ignore the robots.txt anyway, but at least the site search would still work:
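The robots.txt the poster seems to have in mind (reconstructed here, since the original example was not preserved) would pair a Googlebot record having an empty Disallow with a catch-all ban:

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```

Under the robots exclusion convention a compliant crawler uses the record that names it specifically and falls back to `*` otherwise, so Googlebot would see no restrictions while every other obedient bot would see the blanket ban - record order matters less than which User-agent line matches.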
> any side-effects to this Brett!
Ya, the site is as fast as it has ever been.
Why would you possibly allow Google with a nonstandard robots.txt entry and not allow Jeeves, Yahoo, and MSN?