Forum Moderators: open
Required login is the real story here...
MSN and Yahoo bots were blocked in October. This just does the same to everyone else.
Y'all need to get a clue about how much of a problem rogue bots are.
Well, I understand Brett's issues, but there are lots of other ways to stop rogue bots than just turning off access to all robots. I have similar issues to Brett's: rogue bots can shut down my site for a few minutes (like the WW slowdown a week ago) when they hammer certain CPU-intensive dynamic pages and request them all in just a few seconds, those greedy little pigs.
A simple solution I found to slowing and stopping unauthorized bots was to just limit the number of pages they can download within a certain amount of time, and block them automatically (via the dynamic pages) if there are too many page requests within a minute. For instance, a human actually reading pages sure as heck can't download 100 pages in a minute, and isn't likely to read 100 pages in 5 minutes either, so when that behavior starts I just start serving up error pages unless it's an authorized bot.
For valid bots with a known range of IPs like Google, Yahoo, MSN, Jeeves, etc. I let them through, but everyone else gets errors.
The only problem I run into is good old Google doesn't seem to use crawl-delay, sigh.
[edited by: incrediBILL at 6:12 pm (utc) on Nov. 23, 2005]
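The throttle described above can be sketched in a few lines. This is a hypothetical illustration, not incrediBILL's actual code; `WINDOW`, `MAX_REQUESTS`, `ALLOWED_BOTS`, and `is_allowed` are all names and values invented for the example:

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch of the throttle described above -- not the poster's
# actual code. MAX_REQUESTS, WINDOW, ALLOWED_BOTS and is_allowed are all
# names invented for this example.

WINDOW = 60          # sliding window, in seconds
MAX_REQUESTS = 100   # the "100 pages in a minute" threshold from the post

# Known-good crawler IPs get waved through; a real list would be far larger.
ALLOWED_BOTS = {"66.249.66.1"}

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def is_allowed(ip, now=None):
    """Return True to serve the page, False to serve an error page instead."""
    if ip in ALLOWED_BOTS:
        return True
    now = time.time() if now is None else now
    q = _hits[ip]
    # Discard timestamps that have aged out of the window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```

A deque per IP keeps the check O(1) amortized per request; anything fancier (token buckets, exponential backoff bans) layers on the same idea.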
site:webmasterworld.com
What am I missing - is just Google removed?
cheers
This is not good. As a newb I must use the search function!
I remember it took me a few weeks to find out where the google search function was located. When I did it was like "breathing fresh air".
Now I'm spoiled - it takes way too long to plod through every thread, especially for technical questions. Is there an alternative search function?
Yahoo shows 135000 pages of searchable content
Are you saying that Google possessed a more complete record than Yahoo?
And does banning via robots.txt actually remove the pages from the SERPs, or was the Google removal tool the reason no results are being shown in Google?
As a former Admin here I have a pretty good idea about the problems we faced with rogue bots and scrapers and I can safely assume it's not gotten any better.
It seems to me, though, Brett, that you are basically classing googlebot, slurp, et al. in the rogue spider camp, banning them along with everyone else while continually saying that the issue is about rogue spiders.
We all know that if you really wanted to just ban rogue spiders it would be easy enough to force a login and cookies on everyone while still allowing the major search engines to crawl based on their IP addresses and then to throttle the crawl.
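For what it's worth, the "let the majors in by IP" approach is usually hardened with forward-confirmed reverse DNS: resolve the connecting IP to a hostname, check the hostname's domain, resolve the hostname back, and require a match. A minimal sketch (nothing here is from the thread; the domain list is illustrative, and the resolvers are injectable so the check can be exercised without live DNS):

```python
import socket

# Sketch of "forward-confirmed reverse DNS" for the allow-by-IP idea above.
# Nothing here is from the thread: the domain list is illustrative, and the
# resolver functions are injectable so the check can be tested without DNS.

CRAWLER_DOMAINS = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def is_verified_crawler(ip,
                        reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                        forward=lambda host: socket.gethostbyname(host)):
    """True only if ip -> hostname -> ip round-trips through a crawler domain."""
    try:
        host = reverse(ip)          # PTR lookup: IP to hostname
    except OSError:
        return False
    if not host.endswith(CRAWLER_DOMAINS):
        return False                # hostname isn't in a known crawler domain
    try:
        return forward(host) == ip  # A lookup must point back at the same IP
    except OSError:
        return False
```

The round-trip matters because anyone can fake the reverse record for their own IP range; only the forward confirmation proves the hostname really belongs to the crawler's operator.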
What has me curious at this point, though, is how fast WebmasterWorld would be fully indexed again if the ban was lifted and the 180 days at G were past. So say 7 months from now Brett lifts the ban - how fast does Google pick up those 2 million pages?
However, I am sure Brett is well aware of the consequences.
I wish him well, I thank him for providing this site, and I admire this move he has taken.
Regards from England Brett
Rod
Either way, I used Google a lot to find stuff on WebmasterWorld. Anytime I needed info on a css tactic, htaccess code, php help, or anything technical, I used a site:webmasterworld.com search in Google. Still a great forum, but I guess it won't be used as a resource of information like it used to.
A simple solution I found to slowing and stopping unauthorized bots ... start serving up error pages unless it's an authorized bot
I use a similar automated solution [webmasterworld.com] to stop them, which works fine. The difference is, I do not distinguish between "good" bots and "bad" bots. If the bad bots behave well they get through, and if the good bots behave badly they do not. What other criterion should we have? Any other attitude says that we are harlots (they wave enough money and we let them in - definition of a harlot).
So say 7 months from now Brett lifts the ban - how fast does Google pick up those 2 million pages?
Here's something for you guys to chew on: While WW has been removed from Google's SERPS the pages are STILL IN GOOGLE'S INDEX!
"How do you know that?!?" you ask?
Because, as it happens, I used the url removal tool 181 days ago on the forum section of one of my sites.
I put in the robots.txt file restricting the bot from "/"
I used the url removal tool.
All SERPS gone, and even removed from the directory.google.com directory (an unexpected but understandable result). PR0 from PR4.
Changed the robots.txt file to allow Google access to / but nothing else.
180 days pass.
Yesterday my PR went back to PR4.
And ALL OF MY OLD PAGES ARE BACK IN THE SERPS marked as "Supplemental Result", but none of the new.
Google kept the old cache of pages, and has obeyed the robots.txt file, which demands that the bots stay away, but does NOT instruct them to remove the pages from their index.
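For reference, the two robots.txt states described above would look roughly like this (the exact files weren't posted, so this is a guess at their shape). First, the full block:

```
User-agent: *
Disallow: /
```

Then, to let Googlebot have `/` but nothing else (`Allow` and the `$` end-anchor are Google extensions, not part of the original robots.txt standard):

```
User-agent: Googlebot
Allow: /$
Disallow: /
```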
Good Luck, Brett.
I don't want money, backlinks, or anything else. I'm quite busy, but I can solve your problem in my free time, just for fun. I'm sure!
Contact me.
> A simple solution I found to slowing and stopping
> unauthorized bots was to just limit the number of
> pages they can download within a certain amount
> of time, and blocking them automatically
and knowing your pattern of usage - you use the site more than a lot of the bots. You would look more like a bot to an algo than many of the bots do.
This ain't your local phentermine five-and-dime site! We have members who legitimately view thousands of pages a day. One former moderator regularly hit 4k page views a day (a great mod at that).
> Don't have time to read through all this thread right now
Then don't participate if you don't have the decency to read the recent responses. I understand when it is a huge thread, but this one is doable and there's a lot of good info up there (eg: all your concerns were addressed).
> As a newb I must use the search function
And if you read back a bit, you'll see I addressed that with a solution. I agree, the best thing a newbie can do is READ.
I am pretty surprised how fast we fell out of the indexes, but I am sure it's because someone threw in a removal request, as mentioned above... so I thought we had a lot more time before I needed to roll out the new engine.
> Are you saying that google possessed a more complete record than Yahoo?
ya, but it was all supplemental ;-)
> problems we faced with rogue bots and scrapers
5 to 1 over the humans. If I hadn't banned 1k ips, banned 200 agent names, and required login for about 70k users a day on isp X - we would be looking at 50 to 1.
Moral of the story - be careful if you have a lot of content, and that content is easily indexable.
We all know that if you really wanted to just ban rogue spiders it would be easy enough to force a login and cookies on everyone while still allowing the major search engines to crawl based on their IP addresses and then to throttle the crawl.
You hit the nail on the head OIL - I can't require cookies (eg: login) and allow the se bots in - or it would be classified as cloaking outright (which we have flirted with here for several years because of this very problem). I've heard that rarely a week goes by where we don't get accused of something by someone and reported to the engines...
What has me curious at this point tho is how fast would WebmasterWorld be fully indexed again if the ban was lifted and the 180 days at G were past. So say 7 months from now Brett lifts the ban - how fast does Google pick up those 2 million pages?
....alms - alms - links for the poor... lol
> simply cloaking for Google, MSN, and Yahoo by IP would do the trick.
ya, and do the trick at getting us removed for cloaking. Sheesh, there are those with their shorts in a wad just because I hide session ids now.
> you did it just because you were po'd that google wouldn't give you a pr8.
lol. Although it was/is disappointing that we would not get a pr8, that is the last thing on my brain here...
> which demands that the bots stay away, but does NOT instruct them to remove the pages from their index.
Actually - Google interprets it to mean that they can still crawl, and use for index purposes, but not display results. So, even if you ban the bot via robots.txt, you will still get crawled by Google. (in our case, the required login will put Gbot at the login page over and over)
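In other words, a robots.txt Disallow only governs fetching; it doesn't tell an engine to forget URLs it already knows about. The mechanism that actually drops a page from the index is a robots meta tag on a page that remains crawlable:

```
<meta name="robots" content="noindex">
```

The catch is that the two can't be combined: if robots.txt blocks the page, the bot never fetches it and so never sees the noindex.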
> The site feels faster to me, so kudos.
Thanks - it definitely is...
> Upside:
I think they are good for the moment. I understand those that want to peruse the archives and currently can't. I am surprised how fast it fell out of the index...
> Man, brett, so you're really going to hand off all that traffic to your competitors?
I have never valued se traffic of the search engines for community building.
Communities are built right here in the one on one responses - not from some random search. Yes, it is the evil point of entry we all must deal with, but you don't stick around here because we are in index X - you stick around to be involved with the members - to get answers to fresh questions and to answer others questions.
Like I said - it is an experiment. Some are good - and some are bad, but so far - I like the idea of not being beholden to the engines for traffic. Codependency - takes two to tango...
Life without the engines. hmmmm Can it be done?
Actually - Google interprets it to mean that they can still crawl, and use for index purposes, but not display results.
So interpretations or not, in practice they are not crawling restricted pages, and they ARE now displaying the old versions of them.
MASSIVE GIGA OUCH!
I would have done this long ago if that was my server and my money ..
No problems with this decision at all ..
And if it also cuts down on the dumb posts then thats fine by me too ( and as a semi domesticated quasi hacker fora troll who doesn't list a profile ..I don't have an axe to grind ..here ;-) ..
Lot of people here whining ..
Some things about this site might be in the realms of "lets discuss" ..
I dont think this is one of them ..
I'm surprised Brett discussed such a self evident point ..I wouldn't have bothered ..
( BTW Mr Tabke ..your bad word filter is getting more and more relaxed here ;-)) ...
Oh and WebmasterWorld is gone from the SE's in France but they are still hooking into the pubcon domain for phrases that have been used here ..if the bandwidth is hurtin it might be an idea to lock that down a little too ..
better internal search
Better? Search WebmasterWorld leads to a now obsolete thread about using Google to find relevant content.
I hope Brett shares traffic information about how this shakes out. I'd guess that at least half the traffic was coming from Google searches. He does not monetize on traffic so there are different factors in the decision than for most here at WebmasterWorld.
..and knowing your pattern of usage..
Oh oh... I could be in trouble. I may even have to get a life! :)
>>Life without the search engines..?
At this point in the history of WW I'd be surprised if life, as in stability and growth, without the SEs wasn't a distinct possibility.
Brett, wherever you're off to, have a good holiday.
A forum without new members will wither and die.
Good luck to Brett and all of "us" members and/or employees of WW. It will be interesting to see how it turns out.
As I am not the gambling sort, I'm awfully glad I don't have money on it, either way.
[edited by: RonS at 12:46 am (utc) on Nov. 24, 2005]
Yes, but, speaking for myself, I came here initially because I started noticing that I was ending up here on various searches routinely enough that it brought WebmasterWorld up on my radar, the same way I found expertsexchange and devshed and a few others that seem to give me the answers I look for time and time again.
However, I can see why you'd want to try it, there are dangers though, freshness, new members, there will be a drop off. If I were in your shoes I'd be talking to the debian guy or someone like him, but each to his own, many will be happy out in webland, that's for sure.
Seriously tho, I don't remember how I first happened upon this site - most probably it was a search - but I just can't envision a site sustaining traffic on word of mouth alone for an extended period of time.
Granted, this is not your normal site as it has a huge built-in audience, but eventually it must begin to dwindle (especially without a site search, I mean c'mon now, really ;) )...
Doesn't it....
Or is it that we have it so ingrained in our heads that the search engines are the only way to draw new visitors.
An interesting experiment indeed.
bingo - that's what's got us all worked up - we don't buy it ;)
>>I am pretty surprised how fast we fell out of the indexes, but I am sure because some threw in a removal request as some mentioned above... so I thought we had a lot more time before I needed to roll out the new engine.
and I have some swamp land in Florida to sell ya - hehe
Dump all the cached pages from all search engines, actually write an in-house search that works, make the in-house search available only to the Supporters Forum and then resubmit only a few select top level pages to the search engines.
Whether his actions are good, bad, or indifferent, this little experiment has hit the web like wildfire, with backlinks from blogs all over the place, so it's a brilliant viral marketing move regardless of the outcome with Google.
Just a thought, we shall see what happens.