Forum Library, Charter, Moderators: incrediBILL & lawman

Foo Forum

This 223 message thread spans 8 pages: < < 223 ( 1 2 3 [4] 5 6 7 8 > >     
lets try this for a month or three...
last recourse against rogue bots
Brett_Tabke




msg:329347
 1:21 am on Nov 19, 2005 (gmt 0)

[webmasterworld.com...]

required login the real story here...
MSN and yahoo bots were blocked in October. This does everyone else.

 

kevinpate




msg:329437
 5:58 pm on Nov 23, 2005 (gmt 0)

Safe landings Brett, for you and for the site. :)

Brett_Tabke




msg:329438
 6:04 pm on Nov 23, 2005 (gmt 0)

i agree that no website is an island...but i've never tried it..

incrediBILL




msg:329439
 6:07 pm on Nov 23, 2005 (gmt 0)

> ya'll need to get a clue at how much of a problem rogue bots are.

Well, I understand Brett's issues, but there are lots of other ways to stop rogue bots than to just turn off access to all robots. I have similar issues to Brett's in that rogue bots can shut down my site for a few minutes (like the WW slowdown a week ago) when they hammer on certain CPU-intensive dynamic pages and request them all in just a few seconds, those greedy little pigs.

A simple solution I found to slowing and stopping unauthorized bots was to just limit the number of pages they can download within a certain amount of time, and blocking them automatically (via the dynamic pages) if there are too many page requests within a minute. For instance a human actually reading pages sure as heck can't download 100 pages in a minute and isn't likely to read 100 pages in 5 minutes either, so when that behavior starts I just start serving up error pages unless it's an authorized bot.
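A minimal sketch of that kind of throttle, assuming a per-IP sliding window. The class name, the in-memory storage, and the exact thresholds are illustrative assumptions based on the "100 pages in a minute" figure above, not the actual code running on any site mentioned in this thread.

```python
import time
from collections import defaultdict, deque

# Assumed thresholds, borrowed from the figures in the post above:
# more than 100 page requests inside 60 seconds triggers a block.
MAX_REQUESTS = 100
WINDOW_SECONDS = 60


class PageRateLimiter:
    """Track recent request times per client and block clients that go too fast."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)  # client IP -> timestamps of recent requests
        self.blocked = set()            # clients that have already tripped the limit

    def allow(self, client_ip, now=None):
        """Return True to serve the page, False to serve an error page instead."""
        if client_ip in self.blocked:
            return False
        now = time.time() if now is None else now
        recent = self.hits[client_ip]
        # Forget requests that have fallen out of the time window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            self.blocked.add(client_ip)
            return False
        return True
```

A dynamic page would call `allow()` with the visitor's address before doing any expensive work. Once an address trips the limit it stays blocked in this sketch; a real deployment would probably want the block to expire.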

For valid bots with a known range of IPs like Google, Yahoo, MSN, Jeeves, etc. I let them thru but everyone else gets errors.
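One way to picture the "known range of IPs" check is a lookup against a table of networks, sketched here with Python's standard `ipaddress` module. The network names and ranges are placeholders (reserved documentation addresses), not the real ranges of any search engine; a site would have to fill in and maintain its own list.

```python
import ipaddress

# Placeholder ranges only: these are reserved documentation networks,
# NOT the actual address ranges of Google, Yahoo, MSN, or anyone else.
AUTHORIZED_BOT_NETWORKS = {
    "example-engine-a": ipaddress.ip_network("192.0.2.0/24"),
    "example-engine-b": ipaddress.ip_network("198.51.100.0/24"),
}


def authorized_bot(client_ip):
    """Return the name of the matching authorized crawler, or None if there is none."""
    addr = ipaddress.ip_address(client_ip)
    for name, network in AUTHORIZED_BOT_NETWORKS.items():
        if addr in network:
            return name
    return None
```

Anything for which `authorized_bot()` returns `None` would be subject to the page limits described above, while a match would be let through.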

The only problem I run into is good old Google doesn't seem to use crawl-delay, sigh.

[edited by: incrediBILL at 6:12 pm (utc) on Nov. 23, 2005]

digicam




msg:329440
 6:11 pm on Nov 23, 2005 (gmt 0)

The Yahoo search still works fine:-

site:webmasterworld.com

what am I missing, is just google removed?

cheers

DaveN




msg:329441
 6:17 pm on Nov 23, 2005 (gmt 0)

digicam, the other 2.2 million pages that Google had

DaveN

keno




msg:329442
 6:37 pm on Nov 23, 2005 (gmt 0)

Don't have time to read through all this thread right now - but it looks like the search function got the axe.

This is not good. As a newb I must use the search function!

I remember it took me a few weeks to find out where the google search function was located. When I did it was like "breathing fresh air".

Now I'm spoiled - it takes way too long to plod through every thread, especially for technical questions. Is there an alternative search function?

digicam




msg:329443
 6:42 pm on Nov 23, 2005 (gmt 0)

DaveN just humour me :)

Yahoo shows 135000 pages of searchable content

Are you saying that Google possessed a more complete record than Yahoo?

And does banning via robots.txt actually remove the pages from the SERPs?

OR was the Google removal tool the reason for no results being shown in Google?

oilman




msg:329444
 6:49 pm on Nov 23, 2005 (gmt 0)

>> ya'll need to get a clue at how much of a problem rogue bots are.

As a former Admin here I have a pretty good idea about the problems we faced with rogue bots and scrapers and I can safely assume it's not gotten any better.

It seems to me tho, Brett, that you are basically classing googlebot, slurp et al in the rogue spider camp by banning them along with everyone else, while continually saying that the issue is about rogue spiders.

We all know that if you really wanted to just ban rogue spiders it would be easy enough to force a login and cookies on everyone while still allowing the major search engines to crawl based on their IP addresses and then to throttle the crawl.

What has me curious at this point tho is how fast would WebmasterWorld be fully indexed again if the ban was lifted and the 180 days at G were past. So say 7 months from now Brett lifts the ban - how fast does Google pick up those 2 million pages?

phantombookman




msg:329445
 7:34 pm on Nov 23, 2005 (gmt 0)

I don't pretend to understand exactly how these bots work or why they do what they do, nor, most importantly, do I appreciate what the figures Brett quoted equate to in terms of $.

However, I am sure Brett is well aware of the consequences.
I wish him well, I thank him for providing this site, and I admire this move he has taken.

Regards from England Brett
Rod

bears5122




msg:329446
 8:04 pm on Nov 23, 2005 (gmt 0)

Not going to get into the debate on whether it is a good move or not. I would think blocking all with a login page and simply cloaking for Google, MSN, and Yahoo by IP would do the trick.

Either way, I used Google a lot to find stuff on WebmasterWorld. Anytime I needed info on a css tactic, htaccess code, php help, or anything technical, I used a site:webmasterworld.com search in Google. Still a great forum, but I guess it won't be used as a resource of information like it used to.

econman




msg:329447
 8:09 pm on Nov 23, 2005 (gmt 0)

Upside:

1. May reduce the number of low quality posts from new participants.

2. Might make it feasible to allow some changes to the TOS (e.g. allowing links in posts by the more senior members)

Downside: pressure for Brett to add a better internal search tool (Lucene?)

AlexK




msg:329448
 8:16 pm on Nov 23, 2005 (gmt 0)

incrediBILL:
> A simple solution I found to slowing and stopping unauthorized bots ... start serving up error pages unless it's an authorized bot

My site has similar experiences to your own, and to this site (though at a much, much lower level).

I use a similar automated solution [webmasterworld.com] to stop them, which works fine. The difference is, I do not distinguish between "good" bots and "bad" bots. If the bad bots behave well they get through, and if the good bots behave badly they do not. What other criterion should we have? Any other attitude says that we are harlots (they wave enough money and we let them in - definition of a harlot).

cabbie




msg:329449
 8:46 pm on Nov 23, 2005 (gmt 0)

Banning search engines from websites is no big deal.
I do it all the time.. Oh hold on.. No, that's not right.. they ban me.

2by4




msg:329450
 10:07 pm on Nov 23, 2005 (gmt 0)

Man, brett, so you're really going to hand off all that traffic to your competitors? Makes some sense, you don't do the ad garbage so more page views don't really add that much value overall, but this is pretty major. You're opening a lot of doors for others out there though, you sure you want to do that? Forums will start getting stale as new member joins start trickling away. Cookie required to view will do the same thing. I'd like to see the logs at expertexchange and the rest for the last few days; those guys are grinning, that's for sure.

jonknee




msg:329451
 10:22 pm on Nov 23, 2005 (gmt 0)

Good luck! The logic is backwards for most of the members here (who probably make their money from advertising, which means SE traffic is valuable) but if it increases the utility of WebmasterWorld then it's for the best. The site feels faster to me, so kudos.

RonS




msg:329452
 11:20 pm on Nov 23, 2005 (gmt 0)

> So say 7 months from now Brett lifts the ban - how fast does Google pick up those 2 million pages?

Here's something for you guys to chew on: While WW has been removed from Google's SERPS the pages are STILL IN GOOGLE'S INDEX!

"How do you know that?!?" you ask?

Because I used the URL removal tool 181 days ago, as it just so happens, on the forum section of one of my sites.

I put in the robots.txt file restricting the bot from "/"

I used the url removal tool.

All SERPS gone, and even removed from the directory.google.com directory (an unexpected but understandable result). PR0 from PR4.

Changed the robots.txt file to allow Google access to / but nothing else.

180 days pass.

Yesterday my PR went back to PR4.
And ALL OF MY OLD PAGES ARE BACK IN THE SERPS marked as "Supplemental Result", but none of the new.

Google kept the old cache of pages, and has obeyed the robots.txt file, which demands that the bots stay away, but does NOT instruct them to remove the pages from their index.
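For anyone wanting to see what a blanket robots.txt rule permits, here is a small sketch using Python's standard `urllib.robotparser`. It only models the crawl permission a well-behaved bot would compute from the file; it says nothing about what an engine keeps in its index, which is the behavior I'm reporting above. The example.com URL and the helper name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Restrict every bot from "/", i.e. from the whole site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# An empty Disallow line restricts nothing.
ALLOW_EVERYTHING = """\
User-agent: *
Disallow:
"""

if __name__ == "__main__":
    page = "http://example.com/forum/thread1.htm"  # hypothetical URL
    print(crawl_allowed(BLOCK_EVERYTHING, "Googlebot", page))
    print(crawl_allowed(ALLOW_EVERYTHING, "Googlebot", page))
```

The first rule set tells a compliant bot to stay away from every path; nothing in the file format itself carries a "delete what you already have" instruction.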

Good Luck, Brett.

nsqlg




msg:329453
 11:45 pm on Nov 23, 2005 (gmt 0)

Brett, believe me. My English is not so good and I'm not an SEO expert either, but secure servers (like Debian with grsecurity, SELinux, LIDS or another ACL-based system) with very high traffic, load balanced, at very low cost on cheap machines from well-known DC companies (like EV1), programming, MySQL, backups, mission-critical uptime, blocking bad guys' IPs with automatic custom tools... this is my life!

I don't want money, backlinks or anything else. I'm quite busy, but I can solve your problem in my free time, just for fun. I'm sure!

Contact me.

Brett_Tabke




msg:329454
 11:50 pm on Nov 23, 2005 (gmt 0)

hello from dfw...argh - what a zoo on the busiest travel day of the year. yeow!

> A simple solution I found to slowing and stopping
> unauthorized bots was to just limit the number of
> pages they can download within a certain amount
> of time, and blocking them automatically

and knowing your pattern of usage - you use the site more than a lot of the bots. You would look more like a bot to an algo than many of the bots.

This ain't your local phentermine five and dime site! We have members legitimately view thousands of pages a day. One former moderator regularly hit 4k page views a day (a great mod at that).

> Don't have time to read through all this thread right now

Then don't participate if you don't have the decency to read the recent responses. I understand when it is a huge thread, but this one is doable and there is a lot of good info up there (eg: all your concerns were addressed).

> As a newb I must use the search function

And if you read back a bit, you'll see I addressed that with a solution. I agree, the best thing a newbie can do is READ.

I am pretty surprised how fast we fell out of the indexes, but I am sure it is because someone threw in a removal request, as some mentioned above... so I thought we had a lot more time before I needed to roll out the new engine.

> Are you saying that google possessed a more complete record than Yahoo?

ya, but it was all supplemental ;-)

> problems we faced with rogue bots and scrapers

5 to 1 over the humans. If I hadn't banned 1k ips, banned 200 agent names, and required login for about 70k users a day on isp X - we would be looking at 50 to 1.

Moral of the story - be careful if you have a lot of content, and that content is easily indexable.

> We all know that if you really wanted to just ban rogue spiders it would be easy enough to force a login and cookies on everyone while still allowing the major search engines to crawl based on their IP addresses and then to throttle the crawl.

You hit the nail on the head OIL - I can't require cookies (eg: login) and allow the se bots in - or it would be classified as cloaking outright (which we have flirted with here for several years because of this very problem). I've heard that rarely a week goes by where we don't get accused of something by someone and told to the engines...

> What has me curious at this point tho is how fast would WebmasterWorld be fully indexed again if the ban was lifted and the 180 days at G were past. So say 7 months from now Brett lifts the ban - how fast does Google pick up those 2 million pages?

....olms - olms - links for the poor... lol

> simply cloaking for Google, MSN, and Yahoo by IP would do the trick.

ya, and do the trick at getting us removed because of cloaking. Sheese, there are those with their shorts in a wad that I hide session ids now.

> you did it just because you were po'd that google wouldn't give you a pr8.

lol. Although it was/is disappointing that we would not get a pr8, that is the last thing on my brain here...

> which demands that the bots stay away, but does NOT instruct them to remove the pages from their index.

Actually - Google interps it to mean that they can still crawl, and use for index purposes, but not display results. So, even if you ban the bot via robots.txt, you will still get crawled by Google. (in our case, the required login will put Gbot at the login page over and over)

> The site feels faster to me, so kudos.

Thanks - it definitely is...

> Upside:

I think they are good for the moment. I understand those that want to peruse the archives and currently can't. I am surprised how fast it fell out of the index...

> Man, brett, so you're really going to hand off all that traffic to your competitors?

I have never valued traffic from the search engines for community building.

Communities are built right here in the one on one responses - not from some random search. Yes, it is the evil point of entry we all must deal with, but you don't stick around here because we are in index X - you stick around to be involved with the members - to get answers to fresh questions and to answer others' questions.

Like I said - it is an experiment. Some are good - and some are bad, but so far - I like the idea of not being beholden to engines for traffic. Codependency - takes two to tango...

Life without the engines. hmmmm Can it be done?

RonS




msg:329455
 11:52 pm on Nov 23, 2005 (gmt 0)

> Actually - Google interps it to mean that they can still crawl, and use for index purposes, but not display results.

Brett, this may be what you've been told, but I *assure* you that they are now displaying - as Supplemental Results - pages that have been restricted by robots.txt since before I used the removal tool, and according to my logs, googlebot has not crawled any of those restricted pages.

So interpretations or not, in practice they are not crawling restricted pages, and they ARE now displaying the old versions of them.

Leosghost




msg:329456
 11:55 pm on Nov 23, 2005 (gmt 0)

Having done the calculations from what Opera shows is the average page size in fora (I'm set to 25 messages per page, and that gives around 75 KB per page), multiplied by Brett's hits figure ..for 4 days! only!

MASSIVE GIGA OUCH!

I would have done this long ago if that was my server and my money ..

No problems with this decision at all ..

And if it also cuts down on the dumb posts then that's fine by me too ( and as a semi domesticated quasi hacker fora troll who doesn't list a profile ..I don't have an axe to grind ..here ;-) ) ..

Lot of people here whining ..

Some things about this site might be in the realms of "lets discuss" ..

I dont think this is one of them ..

I'm surprised Brett discussed such a self evident point ..I wouldn't have bothered ..

( BTW Mr Tabke ..your bad word filter is getting more and more relaxed here ;-)) ...

Oh and WebmasterWorld is gone from the SE's in France but they are still hooking into the pubcon domain for phrases that have been used here ..if the bandwidth is hurtin it might be an idea to lock that down a little too ..

notsleepy




msg:329457
 12:05 am on Nov 24, 2005 (gmt 0)

We haven't had an option to search the supporters forum for a long time now so no loss for me with the Google drop.

I would like to see a bit of capitalism injected into this site. Sell some web real estate to fund the thing so we can get the features we need!

adamxcl




msg:329458
 12:09 am on Nov 24, 2005 (gmt 0)

It just makes the site less usable, more time consuming requiring more page views looking for what you want and less of a great source of information. Especially if you pay to support something that doesn't make it any easier to use....people will dwindle away over time to other more usable forums or blogs that highlight the best news...and having less users will help speed it up in the long run. So there is good and bad in every call.

joeduck




msg:329459
 12:09 am on Nov 24, 2005 (gmt 0)

> better internal search

Better? Search WebmasterWorld leads to a now obsolete thread about using Google to find relevant content.

I hope Brett shares traffic information about how this shakes out. I'd guess that at least half the traffic was coming from Google searches. He does not monetize on traffic so there are different factors in the decision than for most here at WebmasterWorld.

ken_b




msg:329460
 12:38 am on Nov 24, 2005 (gmt 0)

..and knowing your pattern of usage..

Oh oh... I could be in trouble. I may even have to get a life! :)

>>Life without the search engines..?

At this point in the history of WW I'd be surprised if life, as in stability and growth, without the SEs wasn't a distinct possibility.

Brett, wherever you're off to, have a good holiday.

RonS




msg:329461
 12:45 am on Nov 24, 2005 (gmt 0)

I don't know at what point a forum will reach critical mass and continue to grow and prosper without organic search traffic or paid referrals, but I have a very strong feeling that point can be severely underestimated, i.e. it is much higher and much harder to reach than expected.

A forum without new members will wither and die.

Good luck to Brett and all of "us" members and/or employees of WW. It will be interesting to see how it turns out.

As I am not the gambling sort, I'm awfully glad I don't have money on it, either way.

[edited by: RonS at 12:46 am (utc) on Nov. 24, 2005]

2by4




msg:329462
 12:45 am on Nov 24, 2005 (gmt 0)

"but you don't stick around here because we are in index X"

Yes, but, speaking for myself, I came here initially because I started noticing that I was ending up here on various searches routinely enough, that brought WebmasterWorld up on my radar, same way I found expertsexchange and devshed and a few others that seem to give me the answers I look for time and time again.

However, I can see why you'd want to try it, there are dangers though, freshness, new members, there will be a drop off. If I were in your shoes I'd be talking to the debian guy or someone like him, but each to his own, many will be happy out in webland, that's for sure.

netscan




msg:329463
 12:49 am on Nov 24, 2005 (gmt 0)

My guess was that you lost a bet to GoogleGuy....

Seriously tho, I don't remember how I first happened upon this site, most probably it was a search, but I just cannot envision a site that can sustain traffic on word of mouth alone for an extended period of time.

Granted this is not your normal site as it has a huge built in audience, eventually it must begin to dwindle (especially without a site search, I mean cmon now, really ;) )...

Doesn't it....

Or is it that we have it so ingrained in our heads that the search engines are the only way to draw new visitors.

An interesting experiment indeed.

ken_b




msg:329464
 12:53 am on Nov 24, 2005 (gmt 0)

We'll just see more of those "Favorite Threads" Threads, and "Does Anyone Have A Bookmark For the .... Thread?" threads.

Oil up your bookmark buttons folks.

oilman




msg:329465
 1:18 am on Nov 24, 2005 (gmt 0)

>>The logic is backwards for most of the members here

bingo - that's what's got us all worked up - we don't buy it ;)

>>I am pretty surprised how fast we fell out of the indexes, but I am sure because some threw in a removal request as some mentioned above... so I thought we had a lot more time before I needed to roll out the new engine.

and I have some swamp land in Florida to sell ya - hehe

incrediBILL




msg:329466
 1:23 am on Nov 24, 2005 (gmt 0)

Maybe Brett's moving into PHASE II of the Webmaster World monetization plan.

Dump all the cached pages from all search engines, actually write an in-house search that works, make the in-house search available only to the Supporters Forum and then resubmit only a few select top level pages to the search engines.

Whether his actions are good bad or indifferent, this little experiment has hit the web like wildfire with backlinks from blogs all over the place so it's a brilliant viral marketing move regardless of the outcome with Google.

Just a thought, we shall see what happens.

willybfriendly




msg:329467
 1:28 am on Nov 24, 2005 (gmt 0)

I would guess that the 257,000 links returned in Yahoo for a linkdomain will go a long ways towards maintaining traffic flow.

Most of those links are in very targeted areas - blogs, other forums, etc. - not in useless links pages.

I am also reminded of a rather niche Yahoo group I belong to that has been plugging along for a full decade with a membership of a couple of thousand. It requires registration, and nothing in it appears in any SE. It exists solely on WOM

We'll see how things shake out. Good luck Brett...

WBF

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved