Attack of the Robots, Spiders, Crawlers, etc.
part two
Brett_Tabke

WebmasterWorld Administrator brett_tabke us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 7:55 pm on Nov 25, 2005 (gmt 0)

Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:

[webmasterworld.com...]


Summary:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download the entire 1m+ pages of this site. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.

Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.


It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.

The more advanced tech people understand some of the technical issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required-login action.

It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago, from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has grown every month we have been here. They were easy to spot in the beginning, but it has grown and grown to the point where I can't keep up. I have asked and asked the engines about the problem - looking for a tech solution - but up until this week not one of them would even acknowledge the problem to us.

The action here was not about banning bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots whose operators will manually log in and then cut-and-paste the cookie over to the bot, or hand-walk the bot through the login page.
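
For those wondering what the cookie gate amounts to in practice, it is roughly this kind of check in front of every page. This is a bare-bones sketch, not our production code - the cookie name and login URL are invented, and a real version would also validate the cookie value rather than just its presence:

#####################
#!/usr/bin/perl
# Bare-bones sketch of a login-cookie gate in front of a page.
# The cookie name and login URL are invented for the example.
use strict;
use warnings;

my %cookies;
for my $pair (split /;\s*/, $ENV{'HTTP_COOKIE'} || '') {
    my ($name, $value) = split /=/, $pair, 2;
    $cookies{$name} = $value;
}

if (!$cookies{'member_session'}) {
    # No cookie at all - most bots stop right here. Send them to login.
    print "Status: 302 Found\n";
    print "Location: /login\n\n";
    exit;
}

# Cookie present - serve the page.
print "Content-type: text/html\n\n";
print "member content goes here\n";
#####################

The point is that most off-the-shelf rippers never send the cookie back, so they never get past the redirect - that is the 95% mentioned above.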

How big an issue was it on an IP level? 4000 IPs banned in the .htaccess since June, when I last cleared it out. If left unchecked, I'd guess we would be seeing somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hours a week (about an hour a day) fighting them.

We have been doing everything you can think of from a tech standpoint, and this is a part of that ongoing process. We have pushed the limits of page delivery and banning - IP based, agent based, and downright agent-cloaking a good portion of the site - to avoid the rogue bots. Looking back now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.

Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session IDs and random page URLs (i.e., we have made the site uncrawlable again). Some of the worst offenders were the monitoring services; at least 100 of those IPs are still banned today. All they do is try to crawl your entire site looking for trademarked keywords for their clients.
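
The session-id/random-URL trick is nothing exotic: it amounts to stamping internal links with a token that changes from visit to visit, so saved or guessed URLs go stale and a ripper can't walk a predictable URL space. A minimal sketch (the parameter name and example URL are invented for illustration):

#####################
#!/usr/bin/perl
# Sketch of the session-id trick: stamp internal links with a token that
# changes from visit to visit, so a ripper cannot walk a predictable URL
# space. The parameter name and example URL are invented.
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# A per-visit token; a real one would be tied to the login session.
my $token = substr(md5_hex(time() . $$ . rand()), 0, 10);

# Append the token to any internal link.
sub rewrite_link {
    my ($url) = @_;
    my $sep = ($url =~ /\?/) ? '&' : '?';
    return $url . $sep . "sid=$token";
}

print rewrite_link("/forum3/9618.htm"), "\n";   # e.g. /forum3/9618.htm?sid=4f2a9c01de
#####################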

> how fast we fell out of the index.

Google, yes - I was taken aback by that. I totally overlooked the automatic URL removal system. *kick can - shrug - drat* Can't think of everything. Not the first mistake we've made - won't be the last.

It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the URLs without a snippet.

The spidering is controllable by Crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that unilateral change as legitimate. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it wholesale, then let's get together and rewrite the thing with all parties involved - not just the little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standard is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
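
For anyone who hasn't run into it, this is the nonstandard syntax in question - a per-agent directive that Slurp and MSNbot honor, but which was never part of the original robots.txt standard:

User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10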

Steps we have taken:

  • Page View Throttling: many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times. Thus, for the majority of bots there is absolutely no way to determine whether it is a bot or a human at the keyboard. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download - making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there.
  • Bandwidth Throttling: mod_throttle was tested on the old system. It tends to clog up the system for other visitors - it is processor intensive and very noticeable to everyone when you flip it on. Bandwidth is not much of an issue here - it is system load that is.
  • Agent Name Parsing: the bad bots don't use anything but real browser agent names, so there is little to filter on - and some sites require browser agent names to work at all.
  • Cookie Requirements (e.g., login): I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password. Their operators hand-walk the bot through the login, or cut-and-paste the cookie over to the bot.
  • IP Banning: takes excessive hand monitoring (which is what we've been doing for years). The major problem is that once you get 3000-4000 IPs in your .htaccess, it tends to slow the whole system down. And what happens when you ban a proxy server that feeds an entire ISP?
  • One-Pixel Links and/or Link Poisoning: we throw out random 1-pixel links or no-text hrefs and see who takes the link - only the bots should take it (see the sketch after this list). It is difficult to do, because you essentially have to cloak for the engines and let them pass. It is very easy to make a mistake, which we have done even recently when we moved to the new server.
  • Cloaking and/or Site Obfuscation: making the site uncrawlable only to the non-search-engine bots. It is pure cloaking or agent cloaking, and it goes against SE guidelines.
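
For the curious, the link-poisoning trap mentioned above boils down to something like this sitting behind the invisible link. A rough sketch only - the script path, image, and log file are invented for the example, and letting the legitimate engines skip the trap is exactly the cloaking headache described in that bullet:

#####################
#!/usr/bin/perl
# Sketch of a bot-trap endpoint sitting behind an invisible link, e.g.:
#   <a href="/cgi-bin/trap.cgi"><img src="1px.gif" width="1" height="1" alt=""></a>
# Humans never see or click it; anything that requests it gets logged,
# and the logged IPs can be fed to an automated .htaccess ban.
use strict;
use warnings;

my $ip    = $ENV{'REMOTE_ADDR'}     || 'unknown';
my $agent = $ENV{'HTTP_USER_AGENT'} || 'unknown';

# Record who walked into the trap.
open(my $log, '>>', '/var/log/bot-trap.log') or die "log: $!";
print $log scalar(localtime), " $ip $agent\n";
close($log);

# Give the visitor nothing useful back.
print "Content-type: text/plain\n\n";
print "nothing here\n";
#####################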

An intelligent combination of all of the above is where we are right now.

The biggest issue is that I couldn't take the above steps and require login for users without blocking the big SE crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother asking, or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot, etc. - and it is hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% just because of the bots.

That is the main point I wanted you to know - this wasn't some strange action aimed at banning search engines. Trust me, no one is more upset about that part than I am - I adore SE traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.

The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an .htaccess mod_rewrite setup that hides your CGI parameters behind static-looking URLs, you will see a big increase in spidering that will keep growing. Ten times as many off-the-shelf site rippers support static URLs as support CGI-based URLs.
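
That kind of setup is typically nothing more than a rewrite rule in .htaccess mapping a static-looking URL back onto the real CGI - something like the following (an illustrative rule; the path and parameter names are made up). That friendliness is exactly what invites every off-the-shelf ripper along with the engines:

RewriteEngine On
# /forum3/9618.htm  ->  /cgi-bin/forum.cgi?board=3&thread=9618
RewriteRule ^forum([0-9]+)/([0-9]+)\.htm$ /cgi-bin/forum.cgi?board=$1&thread=$2 [L]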

Thanks
Brett

Rogue Bot Resources:


 

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 10:08 pm on Dec 1, 2005 (gmt 0)

> Eh? The data the google bot sees is EXACTLY the same data the user would see. You're just requiring the user to login. The data behind the login page is still the same. No one is being deceived.

The USER who clicks on Google's result is being deceived - instead of seeing a page with relevant data, he will see a login page. I feel seriously pissed off when I click on links in Google News only to be greeted with a subscribers-only page - it may be acceptable for Google News, since they have trusted feeds, but it's not for the Google search engine.

Granted, registration is free, but that's not the point - the point is that there is no way computers can distinguish between good Brett cloaking for a good reason and bad Bill-the-Spammer cloaking for bad reasons; thus anybody who cloaks should be penalised, because machines simply can't see the difference.

Brett_Tabke

WebmasterWorld Administrator brett_tabke us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 10:12 pm on Dec 1, 2005 (gmt 0)

2by4 - I've also noticed that folks who run with bot agent names are most often bots trying to slip by filters, and thus they end up autobanned by the cron job at the end of the day.

e.g., if you have an IP that has been banned and you are running with a bot agent name - that's why...

Actual IP parsing is left as an exercise for the reader:

#####################
#!/usr/bin/perl
# Append the requesting IP to the "deny from" line of .htaccess.
# (Validation of the IP itself is left out, as noted above.)
use strict;
use warnings;

# use CGI::Carp qw(fatalsToBrowser);

print "Content-type: text/plain\n\n"; # if needed...

my $ip  = $ENV{'REMOTE_ADDR'};
my $hta = ".htaccess";
my (@htaccess, @out, $done);

GetHtaccess();
BurnIP($ip);
PutHtaccess();

# Add the IP to the first existing "deny from" line,
# or append a new "deny from" line if there isn't one yet.
sub BurnIP {
    my $z = shift;
    foreach my $t (@htaccess) {
        if ($t =~ /deny from/i && !$done) {
            $t .= " $z";
            $done++;
        }
        push(@out, $t);
    }
    push(@out, "deny from $z") if !$done;
    @htaccess = @out;
    @out      = ();
}

# Write the modified lines back out to .htaccess.
sub PutHtaccess {
    open(FILE2, ">", $hta) or die "can't write $hta: $!";
    foreach my $t (@htaccess) {
        print FILE2 "$t\n";
    }
    close(FILE2);
}

# Slurp the current .htaccess into @htaccess, one line per element.
sub GetHtaccess {
    return 0 if !-e $hta;
    open(FILE3, "<", $hta) or return 0;
    @htaccess = <FILE3>;
    chomp @htaccess;
    close(FILE3);
    return 1;
}

#####################

2by4

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 9618 posted 10:52 pm on Dec 1, 2005 (gmt 0)

LOL, I should have known better, Brett - thanks for the sample code. On the bright side, I get to play with w3m and vi, always fun.

Takes me back to the good old days, I guess. By the way, if you haven't tried w3m lately, check it out - it's pretty cool.

BeeDeeDubbleU

WebmasterWorld Senior Member beedeedubbleu us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 7:13 am on Dec 2, 2005 (gmt 0)

Can we disable the site search link and post a notice at the top explaining the change, to stop infrequent visitors from posting about it?

kaled

WebmasterWorld Senior Member kaled us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 11:55 am on Dec 2, 2005 (gmt 0)

Having thought about it a little more, I think Lord Majestic has a point. However, I think a compromise/alternative approach might be possible.

If not logged in :-
1) Disable all outward links from every page.
2) If cookies are enabled, insert a login form at the top of the page, otherwise insert a "cookies required" message at the top of the page.

By disabling all the outward links, robots would be totally screwed. By displaying the indexed content (albeit with a login form at the top of the page) those that enter the site directly from a search engine will not be overly annoyed/disappointed.

Kaled.

SebastianX

10+ Year Member



 
Msg#: 9618 posted 1:31 pm on Dec 2, 2005 (gmt 0)

Brett, could you allow bots and simple RSS readers like Feedreader to request the RSS feed again?

I wouldn't mind if you have to move it to another location to make this happen ;)

TIA

Play_Bach

WebmasterWorld Senior Member play_bach us a WebmasterWorld Top Contributor of All Time 5+ Year Member



 
Msg#: 9618 posted 2:52 pm on Dec 2, 2005 (gmt 0)

> Can we disable the site search

What site search are you referring to?

BeeDeeDubbleU

WebmasterWorld Senior Member beedeedubbleu us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 3:06 pm on Dec 2, 2005 (gmt 0)

The one on the menu at the top of this and every page.

HelenDev

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 3:11 pm on Dec 2, 2005 (gmt 0)

Re the site search: I apologise for my probable ignorance in the matter, but how come it's not possible to just have a normal, erm, site search? i.e., not a Google search or whatever?

I don't know anything about the workings of this site, but I presume it's driven by some sort of database - couldn't a search function be written for it?

Is it because the site is too big and it would take too long to do a search?

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 3:15 pm on Dec 2, 2005 (gmt 0)

Thanks kaled ;)

> By disabling all the outward links, robots would be totally screwed.

It seems to me that this would defeat the point - if robots can't find links then they won't index much, and hence there will be no results in Google to get traffic from. That is fine if you don't want bots, but the way I understand the situation here (with alleged cloaking of robots.txt and possibly other content depending on user-agent), the point is to make people register to view content, yet some bots are allowed in via cloaking.

It's a case of either having your cake or eating it.

kaled

WebmasterWorld Senior Member kaled us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 4:34 pm on Dec 2, 2005 (gmt 0)

Perhaps I was a little too brief....

You would disable links if not logged in (thereby defeating unwanted robots) but enable the links if logged in and/or on a white-listed IP address. Thus, Googlebot et al could be allowed in.

This is still cloaking, but the user would see the same content as indicated by the search; however, links would be disabled and a login form would be displayed at the top of the page.

To disable links, simply set href="#" (which takes you to the top of the page, I think). You could also use JavaScript to focus the first item on the login form.
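
Server-side, that could look something like the following - a rough sketch, assuming the page HTML passes through a Perl step before delivery; is_logged_in() is a stand-in for whatever session check the site actually uses:

#####################
#!/usr/bin/perl
# Rough sketch: neuter every outward link for visitors who are not logged
# in, so robots that refuse cookies have nothing to follow, while the
# indexed content itself still shows.
use strict;
use warnings;

sub is_logged_in { return 0 }    # placeholder for the real session check

sub filter_page {
    my ($html) = @_;
    return $html if is_logged_in();
    # Crude regex for the sketch: point every href at the top of the page.
    $html =~ s/href\s*=\s*"[^"]*"/href="#"/gi;
    return $html;
}

my $page = '<a href="/forum3/9618.htm">some thread</a>';
print filter_page($page), "\n";   # <a href="#">some thread</a>
#####################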

Kaled.

effisk

5+ Year Member



 
Msg#: 9618 posted 4:48 pm on Dec 2, 2005 (gmt 0)

WW is the only forum of this importance without an RSS feed.

Is this something that will be implemented in the near future?

Another thing: I mentioned PunBB as being a very light and efficient BB system, but I forgot to mention the best system: MesDiscussions. I have no other words - it simply is the best. The only thing is, I'm not sure they have support in English (it's a French system).

cheers

Leosghost

WebmasterWorld Senior Member leosghost us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 4:57 pm on Dec 2, 2005 (gmt 0)

You're joking..!
There is no comparison... that is typical graphic-laden, smiley-ridden MySQL weirdness which would collapse under the volume and usage of this place...
Just because TF1 have used it does not make it the best, or big enough, or versatile enough for this community...

Plus it can be "taken down" too easily for most usages... France has some of the best coders in the world... they did not work on "MesDiscussions".

Support is French only... 24-hour delay except holidays... France has lots of holidays... and problems go through a form-mail submission system...

BTW... do you seriously think that TF1 has the specific bot problems we have here? Most of the posters to the TF1 fora can barely read, let alone run a bot!

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 5:14 pm on Dec 2, 2005 (gmt 0)

> This is still cloaking, but the user would see the same content as indicated by the search; however, links would be disabled and a login form would be displayed at the top of the page.

Some people will certainly be confused at seeing no links - but I suppose in this case it could actually be a reasonable compromise. The only thing I don't like is producing different content depending on whether requests come from particular search engine IPs - all good search engines should keep an unknown range of IPs to catch exactly this sort of thing, and that brings up the ultimate problem with cloaking: how can a machine determine whether it is cloaking with the best intentions or just plain black-hat spamming? It can't, so the only reasonable course of action is to ban all cloakers. (Note: geo-IP delivery is not cloaking.)

effisk

5+ Year Member



 
Msg#: 9618 posted 5:35 pm on Dec 2, 2005 (gmt 0)

Leosghost,

I was not aware that TF1 used this system at any stage.

MesDiscussions is used for some of the largest online communities, including hardware.fr (about 350,000 members and not far from 30 million messages).
The "smiley" thing is not an issue, it can be deactivated. The support however can become an issue.

And by the way, I suggest that those who doubt systems such as phpBB can handle large communities have a look at the "Gaia Online" forums. You'll be surprised... I think they're not far from 1,000,000 new posts each month.

kaled

WebmasterWorld Senior Member kaled us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 5:39 pm on Dec 2, 2005 (gmt 0)

Ok, with a little more thought, it can be done without cloaking at all (well, almost)....

1) Use entirely cgi links.
2) Unless logged in or on an IP address white-list, all links will lead to a login page.
3) The login page must include a noindex robots meta (to avoid duplicate content penalty!)

Kaled.

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 5:51 pm on Dec 2, 2005 (gmt 0)

> 2) Unless logged in or on an IP address white-list, all links will lead to a login page.

This is based on the assumption that you know all the IPs of a given search engine - perhaps this will work now, but with the explosion of spam sites it seems to me that using cloaking exposes you to a serious penalty.

This also does not address the issue of users coming from a search engine being misled - they expect to see the content they searched for on the first page after the click, but instead they will have to register, etc. When I come across these things I just close the window and search harder.

kaled

WebmasterWorld Senior Member kaled us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 7:49 pm on Dec 2, 2005 (gmt 0)

Scratch my last suggestion. Mixing Google with redirects is probably a really bad idea anyway.

Kaled.

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 7:56 pm on Dec 2, 2005 (gmt 0)

IMO the best, most honest policy is this: you either close access to search engines or you allow them (non-abusive bots only, of course) to roam free - doing anything else is likely to push the site beyond the line in the sand that separates good content sites from spammy junk.

2by4

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 9618 posted 10:16 pm on Dec 2, 2005 (gmt 0)

ip unbound... I will be good, I will be good....

effisk, regarding those French forums: I was going to post a code sample from their pages that certainly does nothing to support your claim that they are the best, but leosghost has already covered the question.

phpBB forums are a different idea from these: they are DB driven, while this BBS is flat-file driven. Different animals. PunBB does look interesting, but as far as I know it hasn't been stress-tested yet on a major forum site. It's probably more or less just phpBB lite, with some extras and some subtractions - though, as I noted, it has the best CSS/HTML output of any forum software I've yet looked at. And it is very quick. But these forums aren't going to migrate to any generic solution, so there's really no point in bringing that up.

Key_Master

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 8:41 am on Dec 3, 2005 (gmt 0)

Hello Brett, it's been a long time. I have 155 messages in my inbox that I can't read. lol :)

You can ban the majority of the bots using Apache. Won't disclose how here though. Sticky me if interested.

Brett_Tabke

WebmasterWorld Administrator brett_tabke us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 3:54 pm on Dec 3, 2005 (gmt 0)

thanks tm... your thread is still kicking...

JoaoJose

10+ Year Member



 
Msg#: 9618 posted 1:03 am on Dec 4, 2005 (gmt 0)

In the meantime, the absence of a search function for WW is making my life pretty difficult. I can never find what I want on other websites... I guess that's why WW was always in Google's top spots for my queries.

lammert

WebmasterWorld Senior Member lammert us a WebmasterWorld Top Contributor of All Time 5+ Year Member



 
Msg#: 9618 posted 9:57 am on Dec 4, 2005 (gmt 0)

Those totally addicted to using Google's site: search functionality to search WebmasterWorld might take a look at this thread [webmasterworld.com], where Receptional gives a workable alternative using the Microsoft search engine, which still contains a reasonable number of WW pages.

webjourneyman

5+ Year Member



 
Msg#: 9618 posted 2:53 pm on Dec 4, 2005 (gmt 0)

I found WebmasterWorld on Google and continued to use it because I could search for answers to particular problems with a site: search on Google. If it had not been for this feature, I would have continued using #*$!.
You should at least allow search if the user is a paying member.

ken_b

WebmasterWorld Senior Member ken_b us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 3:32 pm on Dec 4, 2005 (gmt 0)

I'm curious how this has affected the number of human visits.

walkman



 
Msg#: 9618 posted 3:38 pm on Dec 4, 2005 (gmt 0)

>> I'm curious how this has affected the number of human visits.

Not sure how to translate it into numbers, but the Alexa ranking is now at about 500, down from the top 300 or so. Sure, there's a drop, but it's holding up pretty well IMO.

motorhaven

10+ Year Member



 
Msg#: 9618 posted 4:39 am on Dec 5, 2005 (gmt 0)

If a 50% drop in toolbar visits is good, then I guess so. Look at the traffic details - there's a huge drop.

JAB Creations

WebmasterWorld Senior Member jab_creations us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 9618 posted 7:38 am on Dec 6, 2005 (gmt 0)

I spent ALL night trying to read and catch up on the 25 pages' worth of posts (because I know Brett isn't thrilled about posts from people who haven't done a full read), BUT when you read all that and then see another 24 pages' worth in a new thread(!)... I've got to post what's on my mind before I crash tonight...

Brett, how about selling your blacklists? If you have every bad bot in the Delta Quadrant coming at you, I would think it would be a trustworthy and effective blacklist that you could make a profit from (and, of course, make operations a bit cheaper). I *LOVE* access logs and will probably always love them, and I'm sure there are plenty of others who spend countless hours tracking like I do. Maybe you could hire someone to cover the work for you (unless you actually like dealing with the issue, though probably not)... either way, sell the lists, make a profit, and buy some new super servers or something? ;)

I fully agree with requiring cookies to serve content; it must be done to keep this - or any site under siege - operationally sound.

BReflection

10+ Year Member



 
Msg#: 9618 posted 5:15 pm on Dec 6, 2005 (gmt 0)

I can't count how many times I was searching Google for a webmaster related issue and ended up coming to WebmasterWorld.

WW is rarely the one to break a story. Usually stuff is posted on the front page days or weeks after the rest of the web gets it (I've noticed there are peaks and troughs - spurts and dry spells - in the stories posted). So basically you come to read the valuable comments posted by members. That will continue.

But searching for other people who have started threads on a topic I am interested in will now, unfortunately, have to happen elsewhere (and sorry, I'd rather not use MSN search... the web is about choices - you can't choose for me).

Absolutely nothing has changed in the abstract. Anyone who is willing to accept a cookie can still come to WebmasterWorld and rip the entire site. It's just an added requirement - whereas before, anyone with an internet connection could come and rip the entire site.

jdancing

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 9618 posted 6:41 pm on Dec 6, 2005 (gmt 0)

Offer tastefully done sponsorships for each WebmasterWorld sub-forum. Then use the extra ~$20K/mo. from those sponsorships to pay someone to migrate WebmasterWorld from flat files to a database-driven forum (like vBulletin) with search built in, and use the money left over to pay for more server power as needed. Problem solved.

I'd rather see a few non-obtrusive sponsorships than go without a search function to find the answers I need quickly at WebmasterWorld. I fear that unless WebmasterWorld gets search back, the redundant posts and lack of utility will cause this site to start to die on the vine.
