Attack of the Robots, Spiders, Crawlers, etc.
part two
Brett_Tabke




msg:307365
 7:55 pm on Nov 25, 2005 (gmt 0)

Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:

[webmasterworld.com...]


Summary:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download the entire 1m+ pages of the site. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.

Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.


It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.

The more advanced tech people understand some of the technical issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required-login action.

It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution - but up until this week, not one of them would even acknowledge the problem to us.

The action here was not about banning bots - that was an outgrowth of the required-login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots whose operators will manually log in and then cut-n-paste the cookie over to the bot, or hand-walk the bot through the login page.
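To illustrate the idea, here is a minimal PHP sketch of a cookie-based login gate (hypothetical cookie name and login URL; not the forum's actual code, which is Perl-based):

<?php
// Minimal sketch of a cookie-based login gate, assuming a hypothetical
// login cookie named "member_session" that gets set by /login.php.
// Anything that does not send the cookie back (which includes most
// off-the-shelf site rippers) never reaches the content.
if (empty($_COOKIE['member_session'])) {
    // No login cookie: bounce to the login page instead of serving content.
    header('Location: /login.php?return=' . urlencode($_SERVER['REQUEST_URI']));
    exit;
}
// A real system would validate the cookie value against the member
// database here; a determined bot can still copy a real cookie by hand,
// as described above.
// ...serve the requested page to the logged-in member...
?>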

How big of an issue was it on an IP level? 4,000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hours a week (about an hour a day) fighting them.

We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery and banning (IP based and agent based), and have downright agent-cloaked a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.

Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session IDs and random page URLs (eg: we have made the site uncrawlable again). Some of the worst offenders were the monitoring services. At least 100 of those IPs are still banned today. All they do is try to crawl your entire site looking for trademarked keywords for their clients.

> how fast we fell out of the index.

Google, yes - I was taken aback by that. I totally overlooked the automatic URL removal system. *kick can - shrug - drat* Can't think of everything. Not the first mistake we've made - won't be the last.

It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.

The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it wholesale, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standard is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
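For reference, the nonstandard extension being declined looks like this in robots.txt (Crawl-delay was honored by Slurp and MSNbot at the time, but it was never part of the original robots.txt standard; the 10-second value is only an illustration):

User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10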

Steps we have taken:

  • Page View Throttling: many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times. Thus, for the majority of bots there is absolutely no way to tell whether it is a bot or a human at the keyboard. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download. eg: making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there. (A throttling sketch follows after this list.)
  • Bandwidth Throttling: mod_throttle was tested on the old system. It tends to clog up the system for other visitors, it is processor intensive, and it is very noticeable to all when you flip it on. Bandwidth is not much of an issue here - system load is.
  • Agent Name Parsing: bad bots don't use anything but real (browser) agent names - some sites require browser agent names to work - so filtering on the agent string only catches the honest ones.
  • Cookie Requirements (eg: login): I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password. Their operators hand-walk the bot through the login, or cut-n-paste the cookie to the bot.
  • IP Banning: takes excessive hand monitoring (which is what we've been doing for years). The major problem is that when you get 3,000-4,000 IPs in your htaccess, it tends to slow the whole system down. And what happens when you ban a proxy server that feeds an entire ISP?
  • One-Pixel Links and/or Link Poisoning: we throw out random one-pixel links or no-text hrefs and see who takes the link. Only the bots should take the link. It is difficult to do, because you essentially have to cloak for the engines and let them pass (it is very easy to make a mistake - which we have done even recently when we moved to the new server).
  • Cloaking and/or Site Obfuscation: making the site uncrawlable only to the non-search-engine bots. It is pure cloaking or agent cloaking, and it goes against se guidelines.
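To make the throttling item concrete, here is a minimal per-IP page-view throttle sketch in PHP. The limits, storage path, and error handling are placeholders, not the forum's actual values or code:

<?php
// Crude per-IP page view throttle: allow at most $max_hits requests per
// IP per $window seconds, using one small counter file per IP.
$max_hits = 300;          // hypothetical ceiling per window
$window   = 3600;         // one hour
$dir      = '/tmp/throttle';

$ip   = $_SERVER['REMOTE_ADDR'];
$file = $dir . '/' . str_replace(':', '_', $ip);

if (!is_dir($dir)) {
    mkdir($dir, 0700, true);
}

$hits  = 0;
$start = time();
if (is_readable($file)) {
    list($start, $hits) = explode(' ', trim(file_get_contents($file)));
    if (time() - (int)$start > $window) {   // window expired, reset
        $start = time();
        $hits  = 0;
    }
}
$hits++;
file_put_contents($file, $start . ' ' . $hits);

if ($hits > $max_hits) {
    header('HTTP/1.0 503 Service Unavailable');
    header('Retry-After: ' . $window);
    exit('Slow down.');
}
// ...otherwise serve the page as usual...
?>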

An intelligent combo of all of the above is where we are at right now today.

The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't even bother asking if we could, or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot, etc. - and it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% just for the bots.

That is the main point I wanted you to know - this wasn't some strange action aimed at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.

The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod_rewrite setup that hides your cgi parameters behind static-looking content, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static URLs as support cgi-based URLs.
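For readers who have not set this up, the kind of htaccess mod_rewrite rule being described looks roughly like this (a generic illustration with made-up URL patterns, not the actual rules used here):

RewriteEngine On
# Serve a static-looking URL such as /forum7/123.htm from the underlying
# CGI script, so crawlers and rippers only ever see the static form.
RewriteRule ^forum([0-9]+)/([0-9]+)\.htm$ /forum.cgi?forum=$1&topic=$2 [L]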

Thanks
Brett

Rogue Bot Resources:


 

2by4




msg:307545
 4:18 am on Nov 30, 2005 (gmt 0)

"I'll bet *anything* that with a Googlebot IP address and useragent you again would not see all useragents dissallowed in robots.txt, and could crawl the site. But that I guess we'll never know..."

I'll bet you anything that at the time of this posting robots.txt is being cloaked by useragent.

I was just regretting not grabbing the old robots.txt for its bot list, but there it is, just like it used to be, no search bots banned at all... now it's saved though. Brett is still one of my favorite tricksters out there; I learn a lot from him, and some others here, all the time.

by the way, how do you cloak a .txt file? I remember trying that with .css but I couldn't figure it out, didn't spend very long on it though, that's in an apache php environment, is it just a htaccess thing? Anybody feel like sharing the code?

I've never seen the point of these though:

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

User-agent: EmailWolf
Disallow: /

do these actually obey robots.txt? Somehow I doubt it.

Of course the real question is what a google ip is seeing.

To the poster asking why Brett doesn't just add hardware: he's answered that several times - the bbs software isn't written for multiple servers yet, although he's working on it.

Craven de Kere




msg:307546
 4:56 am on Nov 30, 2005 (gmt 0)

>to the poster asking why brett doesn't just add hardware, he's answered that several times, the bbs software isn't written for multiple servers yet, although he's working on it.

The bbs software does not need to be "written for multiple servers" to use multiple servers.

In fact, within this very thread at least one member has posted a perfectly viable way to scale laterally, even if the data is being stored in a flatfile system.

Thing is, this is not hugely relevant to what I am saying. Using flatfiles is part and parcel of the position Brett is in. It saves a lot of overhead over a db if used correctly but introduces other challenges.

Many here "get" the challenges, but would merely position themselves differently. What my point was is that I see situations in which I think Brett is saying that people who would merely approach the problem differently don't "get" the problem.

It was really just a semantic quibble I have, as in arguments (not saying there was argument here) it's a common tactic to write off disagreement as obtuseness.

No biggie.

2by4




msg:307547
 5:02 am on Nov 30, 2005 (gmt 0)

stop being so damned coherent, it's disturbing... lol... I get your point.

keno




msg:307548
 6:49 am on Nov 30, 2005 (gmt 0)

It is now very frustrating that I must look for answers to technical questions outside of Webmasterworld, and I can’t find stuff that I know is here somewhere. I can post a redundant question on the board, but I don’t like waiting around for an answer, which I may never get, so I like to go searching instead. Oh – and I do post per WW guidelines.

While I have all your attention, I would like to search the supporters’ forum too. It is a huge waste of resource just sitting there, and it doesn’t make any sense to me.

The main reason for my post is that while paging back through this thread there was mention of 2K page views per user per day in some cases.

Then someone said turn off the auto-page-loader thingy, which I suppose is the click I hear every few minutes from the browser.

So, I’m wondering if the auto-page-loader is on, and people really don’t have 2k page views per day, does this mean that the user search statistics might be wrong too?

It was mentioned that 1 in 1,000 views, or visits, or users used the Google/Yahoo search function (I can't find that post anymore). But this number seems awfully low to me. Now, I am only emphasizing the need for a SEARCH utility in this post because:

1. It doesn't matter to me whether WW is plugged into Google or Yahoo. As a newb I searched for resources all over the Web and found plenty of them before I found WW in August 2005. I found SearchEngineWorld about 20 times before I found WW, and SearchEngineWorld does not have "the rest of the story" - not even a link to WebmasterWorld - and I haven't clicked on banner ads since 1996. So WW was downright difficult to find anyway. It seems if you are searching for Javascript or CSS you might get here more easily.

2. Could we not maintain a WebmasterWorld presence with several hundred pages, like the WW library, on the front end? The search engines would still be able to find WW, and the junk bots probably would not bring down a few hundred pages.

3. Build an internal search function on a mirror site that is 24 hours old, or whatever. I don't know how much resource this might cost, but that's all I've got as a suggestion given the situation.

Maybe this is the plan of action already, I don't know. I followed the thread to a certain point and skimmed over the last 10 pages or so... if there's gold here, I need a spade.

If this wasn’t such a great site, I wouldn’t be writing this “high quality” post for the last hour…good luck with the solution, whatever it may be!

effisk




msg:307549
 10:20 am on Nov 30, 2005 (gmt 0)

by the way, how do you cloak a .txt file? I remember trying that with .css but I couldn't figure it out, didn't spend very long on it though, that's in an apache php environment, is it just a htaccess thing? Anybody feel like sharing the code?

Lemme see... you can create robots.txt.php...

<?php header('Content-Type: text/plain;');
echo "User-agent: *\n";
if ($_SERVER["SERVER_NAME"]=="www.example.com")
echo "Disallow: /*";
?>

then I don't remember well what to put in the .htaccess file... maybe something like
<Files robots.txt>
ForceType application/x-httpd-php
</Files>

Not sure. I've done that for a specific problem when I had to redirect from an old URL to a new one while my pages were still the same (same server, etc.).

[edited by: trillianjedi at 12:26 pm (utc) on Nov. 30, 2005]
[edit reason] Examplifying [/edit]

effisk




msg:307550
 10:21 am on Nov 30, 2005 (gmt 0)

oops,

didn't put the right code in the "if" line but you get the point.

Brett_Tabke




msg:307551
 12:59 pm on Nov 30, 2005 (gmt 0)

> In fact, within this very thread at least one member has
> posted a perfectly viable way to scale laterally, even if
> the data is being stored in a flatfile system.

Who - where? I don't know of an off-the-shelf system that will keep 600k+ files in sync. We need software that can READ from any box but WRITE to all boxes. Not an insurmountable problem if you just run the writes through a routine that writes to all known boxes - that way it's in sync 24x7. That's not that big of a deal for perl, but it does take time to set up, write, and debug.
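A minimal sketch of that write-to-all-boxes idea - the routine described above would be Perl; this is an illustrative PHP version with hypothetical mirror paths (e.g. NFS mounts of the other boxes):

<?php
// Write a flat-file record to every known box so reads can be served
// from any of them. Paths are placeholders; real code would also need
// locking and proper failure handling.
$mirrors = array(
    '/var/www/data',      // local box
    '/mnt/box2/data',     // hypothetical mounts of the other boxes
    '/mnt/box3/data',
);

function write_everywhere($relative_path, $contents, $mirrors) {
    foreach ($mirrors as $base) {
        $target = $base . '/' . $relative_path;
        $tmp    = $target . '.tmp';
        // Write to a temp file first, then rename, so readers never see
        // a half-written post.
        if (file_put_contents($tmp, $contents) === false || !rename($tmp, $target)) {
            error_log("write_everywhere: failed on $target");
        }
    }
}

// Example: rewrite one thread file on all boxes in one call.
write_everywhere('forum7/thread.txt', "new post body\n", $mirrors);
?>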

> I've never seen the point of these though:
> User-agent: EmailSiphon
> Disallow: /

It does stop bots that are not those actual bots but that use those bots' agent names. It is a good place to account for them.

> how to enable a robots.txt as cgi

The simple way is to just put text files as executable in your htaccess file:

AddHandler cgi-script txt

Now all txt files will be executed (make sure to set permissions on the txt file to executable)
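To flesh that out: with that handler in place, robots.txt itself becomes a CGI program - it needs a shebang line, the executable bit, and it must print its own header block. A minimal sketch (PHP purely for illustration, assuming a CLI binary at /usr/bin/php; any CGI-capable language works, and the IP prefixes are documentation placeholders, not real crawler ranges):

#!/usr/bin/php
<?php
// robots.txt served as a CGI script (requires "AddHandler cgi-script txt",
// ExecCGI enabled for the directory, and chmod +x on this file).
// A CGI program prints its own header block followed by a blank line.
// Here the decision is made against an IP whitelist.
$whitelist = array('192.0.2.', '198.51.100.');   // placeholder prefixes
$ip = getenv('REMOTE_ADDR');

$allowed = false;
foreach ($whitelist as $prefix) {
    if ($ip !== false && strpos($ip, $prefix) === 0) {
        $allowed = true;
        break;
    }
}

echo "Content-type: text/plain\r\n\r\n";
echo "User-agent: *\n";
echo $allowed ? "Disallow:\n" : "Disallow: /\n";
?>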

> changing
> cookies on/off

Yes, we have been. It is this thing called "testing". That stuff takes a while to build up before you can look at log files and see the results. We are also building a whitelist of IPs.

> it's high time to make robots.txt an opt-in
> protocol instead of opt-out. It won't solve
> the problem of rogue bots,

a possible robots inclusion standard:



Access email: BotAccessRequest@exampledomain.com
Restrictions: All
Allowed: SearchEngines

User-Agent: Example Bot
Allow: *


Now bots that wanted access would have to email BotAccessRequest@exampledomain.com to request permission for inclusion.

That would not address the rogue bot issue, but it would address the rogue search engine issue.

Remember last year when a certain new search engine was using everyone's data to build their index and not sending out any traffic? They hit us here for millions of page views without permission.

Using a site's data should always be by request.

> I don't want to grab it.

Hmmm, IPs are public data, right? ;-)

> A simple example is the more hardware vs. less hardware lines.

More hardware is a band-aid solution. The real problem is unauthorized access, usage, and pilfering of YOUR content.

> regretting not grabbing the old robots.txt

[webmasterworld.com...]

Leosghost




msg:307552
 1:32 pm on Nov 30, 2005 (gmt 0)

Seems like you ain't the only one suffering, Brett...

You could always fine [dnsstuff.com] the abusers ;)

Freedom




msg:307553
 1:41 pm on Nov 30, 2005 (gmt 0)

Brett wrote back at the beginning of this thread on the question: "Could you require everyone but google to have a cookie to view the site?"

That exact question has been asked of the se's for years, and they say universally, that would be cloaking and against the major se guidelines and you would be subject to removal.

Google doesn't have the balls to ban WebmasterWorld. I say do it anyway.

WebmasterWorld is not some buy-my-viagra-home-mortgage-content-theft-lazy-ass-webmaster-made-for-adsense.com website.

A ban of a site with about 2 million pages would almost certainly require a human review, and I think your troubles with rogue bots are being noted by more than one Google engineer (read: Matt Cutts).

They'll understand what you are trying to do at this point and let your cloaking slide. WebmasterWorld by its nature is a positive for Google in how it educates white-hat seo and webmastering skills - I think Google will side with you and realize what you are trying to do, Brett.

I say go for it. Do what you have to do and let Google recognize it.

IMHO

Freedom

Scarecrow




msg:307554
 5:44 pm on Nov 30, 2005 (gmt 0)

by the way, how do you cloak a .txt file? I remember trying that with .css but I couldn't figure it out, didn't spend very long on it though, that's in an apache php environment, is it just a htaccess thing? Anybody feel like sharing the code?

> how to enable a robots.txt as cgi

The simple way is to just put text files as executable in your htaccess file:

AddHandler cgi-script txt

Now all txt files will be executed (make sure to set permissions on the txt file to executable)


Interesting. I didn't know about Brett's method. Here's how I would cloak a robots.txt. It's considerably more complex, but it is potentially much more powerful. You can transparently administer an entire site this way, and have different actions for different files. Meanwhile, you can look at the incoming IP address on the fly, and use IP delivery for the usual suspects. Some people might call this technique a perfect cloaking doorway, but of course I wouldn't know about such things. The "404 page script" does all the IP lookups, the filename conversion lookups, and steers the traffic.

For example, if you change some file-naming conventions, you can use this technique to steer your traffic from the old names to your new names. Meanwhile, you change your links to use the new names. Then you hope that someday the bots will get tired of the old names and start asking for the new names directly. (Yeah, sure, don't hold your breath. In my experience it will take years.)

By the way, I used a doorway like this written in 'C' that was redirecting my entire site. (I sold one domain and needed them to redirect my specific filenames to my other domain, so I stipulated in the contract that they use my program for a few months.) In C it didn't even make a dent in my load when I ran tests on it.

1. Delete your robots.txt

2. You put an ErrorDocument 404 in your htaccess that redirects to a script.

3. The script picks up the REDIRECT_URL from the environment table.

4. If the request was for robots.txt, you consult your black list or white list. If it wasn't, you issue your usual custom 404 page blurb. You will probably have to issue a Status: 404 Not Found on a line just before you issue Content-type: text/html plus a blank line. Only then should you send your blurb. Check all your headers after you've finished; your mileage may vary with the headers sent out by different versions of Apache, etc.

5. If it was for robots.txt, then issue a Content-type: text/plain header and the appropriate robots.txt lines that you want that particular bot to see. Apache probably sends out the Status: 200 by itself in this case. As far as I can tell, if it is done properly (be sure to compare and contrast your headers when you've finished!), there is no clue in the headers that your robots.txt was not a static file. (A sketch of steps 2-5 follows below.)
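A minimal sketch of steps 2-5 (PHP for illustration - Scarecrow's version was C - with a hypothetical user-agent check standing in for the white list):

# .htaccess - step 2: no robots.txt on disk; send all 404s to one script
ErrorDocument 404 /handler404.php

<?php
// handler404.php - steps 3-5 in outline. Apache puts the originally
// requested path into REDIRECT_URL before invoking the error document.
$requested = isset($_SERVER['REDIRECT_URL']) ? $_SERVER['REDIRECT_URL'] : '';
$ua        = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if ($requested === '/robots.txt') {
    // Step 5: build a robots.txt on the fly for this particular requester.
    header('HTTP/1.0 200 OK');            // override the pending 404 status
    header('Content-Type: text/plain');
    echo "User-agent: *\n";
    echo (stripos($ua, 'Googlebot') !== false) ? "Disallow:\n" : "Disallow: /\n";
    exit;
}

// Step 4: everything else gets the usual custom 404 blurb, with an
// explicit 404 status so nothing here gets indexed as a real page.
header('HTTP/1.0 404 Not Found');
header('Content-Type: text/html');
echo "<html><body><h1>404 Not Found</h1></body></html>\n";
?>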

If you are flooded by bots from Korea or Japan or China that actually look for robots.txt, and you know that you never get any decent traffic from those places anyway, then I can see the value of doing this. But it also discriminates against good, little bots that are trying to do the right thing, and therefore it simply perpetuates the monopoly of the big bots that are on your white list. Is there any such thing as a "good, little bot" these days, or am I hallucinating because it's too late for the good guys?

Question: How many of the dudes who use personal bots are in the habit of asking for robots.txt? I guess the better way of asking this question is, "Of all the personal bots available, how many of them check robots.txt by default?" I know you can defeat this check on almost all of them, but I'm curious about the default settings.

kaled




msg:307555
 6:10 pm on Nov 30, 2005 (gmt 0)

RE: Cloaking
The bottom line is that if Brett decides to go this route, it will be to solve a technical problem, not to enhance the SERPs for WW. Given that many sites use cloaking of one sort or another, there could not possibly be any justification for a penalty by any search engine.

Incidentally, it could be argued that Adsense is itself an example of cloaking. Should Google ban all sites that use Adsense?

Kaled.

Lord Majestic




msg:307556
 6:21 pm on Nov 30, 2005 (gmt 0)

Given that many sites use cloaking of one sort or another, there could not possibly be any justification for a penalty by any search engine.

Where this "many" estimate coming from?

Google is pretty clear on this: "Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."

You can't have your cake and eat it - either search engines get normal access or you lose the advantage of getting traffic from them. Using cloaking is a form of cheating that should attract the most severe automated SE penalties.

Scarecrow




msg:307557
 6:50 pm on Nov 30, 2005 (gmt 0)

Google is pretty clear on this: "Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."

Google is living in a dream world -- one that's as lofty as its market cap.

Here's a more realistic statement, but I don't expect Google to endorse it:

As the bots become more obtrusive, and more scrapers are stealing your content so that they can generate made-for-AdSense pages, legitimate webmasters -- the ones with years invested in the content that they own -- can be expected to protect their investment. These webmasters recognize the risk that they might lose some traffic from search engines if certain countermeasures are taken. However, they feel that the situation on the web has deteriorated to the point where the benefits outweigh the risks.

Leosghost




msg:307558
 7:06 pm on Nov 30, 2005 (gmt 0)

"Of all the personal bots available, how many of them check robots.txt by default?"

Based on what I've seen, what is sold customised to "what do you want it to do", and the bots available (for sale or as exchange items) for DDoS etc.:

Less than 10% of all bots take the slightest notice of robots.txt. Plus there are some manufacturers of commercial rippers that boast of this "ability", post scripts on their sites to take protected directories, and/or let their users exchange preloaded "copy, paste and violate" setups in their fora.

Lord Majestic




msg:307559
 7:08 pm on Nov 30, 2005 (gmt 0)

can be expected to protect their investment.

Oh, that's fine by me - you can do what you want with your content, including a complete ban of bots and password-protecting your site to be 100% sure. That's fine, no problem.

What you can't expect, however, is to take advantage of an Internet feature such as traffic from search engines if you ban bots. Of course, here we have a situation where some bots are allowed and cloaked content (including robots.txt) is used: this sort of behavior is indistinguishable (by a machine) from the stuff that bad spammers do, hence you take the risk of being treated as such, and quite rightly so.

I don't know for sure who lives in a dream world, but I think the legitimate people who cloak believe that automated machines can somehow tell the difference between their "good" content with some internal reason for cloaking, and bad outright spammers who also cloak. The only way to clean the web of spam is to have zero tolerance towards cloakers - whatever their reasons are.

Solution1




msg:307560
 8:22 pm on Nov 30, 2005 (gmt 0)

Google is pretty clear on this: "Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."

A user, however, never sees robots.txt. So if you cloak robots.txt, you are not deceiving users, and you are not presenting different content to users than to search engines.

Solution1




msg:307561
 8:25 pm on Nov 30, 2005 (gmt 0)

If you require cookies from human users, and no cookies from search engines, you're not deceiving anyone. The human users, who use cookies, see exactly the same things as the search engines, who don't.

Someone who doesn't use cookies, is technically not a user.

fliks




msg:307562
 9:06 pm on Nov 30, 2005 (gmt 0)

I've never been a very active member or lurker on WebmasterWorld, as I have only cared about the pubcons for the last 2+ years, and my third supporter's donation will definitely be the last. But I must really say that I'm impressed by the amateurish level that webmasterworld.com shows again and again.

Brett behaves as if his webmasterworld.com is in such a unique position that every piece of advice he gets does not fit his size. This heavy ignorance has always surprised me - most recently when I talked to him and his head tech in Vegas a few days ago.

There is simply no need to run a site as big and profitable as WebmasterWorld from a single box. A cluster is dirt cheap, would be able to cater to all your needs (including site search, etc.), and could scale WebmasterWorld to a multiple of its current size.

Why do you stress the existence of the very few members generating more than 1k pv/day when you could give them a special status (the same one that google, msn, etc. get) and ban all the others? Why no professional firewall? Why don't you move over to a real forum software that is known to work for many other "big boards", each of which gets more than 10k posts and millions of pageviews a day? WebmasterWorld and its problems are not unique, but you tend to always pick the worst solution.

Brett_Tabke




msg:307563
 9:07 pm on Nov 30, 2005 (gmt 0)

> Where this "many" estimate coming from?

I would bet you a day's wages that you could not find a site in the Alexa top 100 that does NOT cloak in some form or another.

Lord Majestic




msg:307564
 9:14 pm on Nov 30, 2005 (gmt 0)

I would bet you a day's wages that you could not find a site in the Alexa top 100 that does NOT cloak in some form or another.

Let's assume here that you are right (I doubt it, but I will take your word for it), so we have got 100 sites cloaking. Great. Now let's look at the remaining ~69,999,900 sites - would you bet your day's wages that "many" of them cloak? And by many I mean a high % of sites, excluding those that can be classified as outright spam.

And note - geo-targeting is NOT cloaking: that is valid presentation of different content to different IPs based on their country of origin, NOT on whether they originate from a search engine. What we are talking about here is not geo-targeting; it is a cloaking attempt to get better ranking by presenting different content than that shown to the user.

If you require cookies from human users, and no cookies from search engines, you're not deceiving anyone.

You are deceiving users - when I click on a link from search results I expect to see a page that has the words matched, not some "register here to login" page, especially if registration requires payment. This is only acceptable if I know in advance that such links are present.

The issue here, however, is different - search engines are not that smart, and even though they can detect cloaking, they can't tell apart those who cloak in "good faith" and those who are just spammers. Thus the only reasonable option for a search engine is to disregard ALL cloaked pages.

claus




msg:307565
 9:30 pm on Nov 30, 2005 (gmt 0)

Google doesn't have the balls to ban WebmasterWorld. I say do it anyway.

Uhm... I wouldn't bet on that. As far as I recall they did ban a Google site a few months ago.

Anyway, how's progress with the bot cure? Is the outlook good or too early to say?

2by4




msg:307566
 9:32 pm on Nov 30, 2005 (gmt 0)

brett, I'm curious, have you actually asked google about this directly? You say you have, but this doesn't really make sense.

washingtonpost.com for example requires login for users and lets google in. It's fine.

To me cloaking is presenting different content to users and bots. You've cloaked like this for years, in very subtle ways, but still it's different content.

I don't see how putting a cookie login for all non search bot entries or search referals is cloaking, you're just putting a door between the same content, it's not different content in any way unless you want to get really picky about the term cloaking.

And why would google hit you specifically for doing something that reasonable? They know who you are, they know you aren't presenting different content to bots and users beyond what you always have done. What's the deal here? This really doesn't make any sense to me. Is there some other issue with google? This isn't some random site, you have a relationship with those guys, and they let, as you noted, other large sites do this without penalty.

So what's the real problem here?

fliks - fighting words, that post. This thread is really getting to be one of my all-time favorites in terms of being seriously educational; it should be a classic for anyone looking at the issues of growing beyond a certain size.

I'm pretty sympathetic to brett on this one though, I think a lot of people just don't get a few things:

1- he likes the challenge of doing it on one box. For those of us who like challenges, you should be able to grasp this simple point. Forget all the technical shoulds and coulds and look at the challenge. When you are faced with a challenge, do you give up, or do you try to beat it? There are different types of people out there; my guess is Brett likes to beat it if he can.

2- WebmasterWorld is not monetized the way all those other megasites you're looking at are. In case you haven't noticed, there are no ads on this site, except the pubcon stuff. Of all the example multibox sites you're talking about, how many run with no ads? Think about it.

3- I'm assuming Brett is the main, if not the only, programmer for bbbs, which means he has a finite amount of time to deal with issues. Maybe that's wrong, maybe he has others working on it, but I sort of doubt it.

4- "why don't you move over to a real forum software"
This claim is totally ridiculous, I'm sorry. I don't know what planet this poster is from, but I have never seen forum software I like better in almost every single way than bbbs. I'm not kissing brett's butt here, the way these forums look and feel, it's all exactly the way I'd like any forum I visit to look and feel, and I imagine that feeling goes for a lot of longer time posters here. And it's why I tend to leave any other forum I go to behind after a while; I can't stand how other stuff is - vBulletin makes me want to puke, for example. Although punbb is intriguing, I don't know enough about it on the technical end to say more.

theBear




msg:307567
 9:59 pm on Nov 30, 2005 (gmt 0)

KISS is normally the best way to handle things.

But in this instance KISS is also part of the problem.

Get your site chewing bots here, get'em while they are hot, we have lwp_trivial, wget, curl, a five statement php starter kit, come one come all.

I think these things were once touted as being agents to save us time ;-).

BTW 2by4, you should clean out some of your sticky mail. I tried responding before calling it a night last night.

Brett, search routines are a royal pain; I'm spending way too much time looking at several.

Leosghost




msg:307568
 10:39 pm on Nov 30, 2005 (gmt 0)

4- "why don't you move over to a real forum software"
This claim is totally ridiculous, I'm sorry. I don't know what planet this poster is from, but I have never seen forum software I like better in almost every single way than bbbs. I'm not kissing brett's butt here, the way these forums look and feel, it's all exactly the way I'd like any forum I visit to look and feel, and I imagine that feeling goes for a lot of longer time posters here.

Same as regards the no butt kissing ..

This place .as is with no bots? speed etc ..it smokes ..why on earth anyone would want to ask some one who has hand built this lil' diamond ..to switch it for zircon ..'cos some people's bots etc cant play nice ..beats me ..

BBS ..work of art ..could never afford it ..respects ..BT

( gets lots of respect from the less reputable areas of the net also ..you wouldn't know 'cos g doesnt "yet" 'dex IRC and non gmail emails ....doesnt mean they aren't trying tho ..)

Go entirely private .. ..no ads ..premod a few more threads ..( that is more important for many than you might think Brett ) sometimes the S/N ratio attracts "non thinking"" .."me too" posted in some thread .. fora posters and "trolls" ;) ..and many more would sign if the S/N dropped .did I say "pre mod more"? :).( sometimes resembles the wurldz ..ick ..ick .. )

then take your time to decide how to filter the crap and the se"s and bots ..

site search ..yeah it's a pain ..always was ..but most people help each other out with pointers from old bookmarks etc for now and there is always this method [webmasterworld.com]for now ..

no long term view ...in the eyes of many ..or only take ..no give or discuss and learn ..in their heads ..

Play_Bach




msg:307569
 1:42 am on Dec 1, 2005 (gmt 0)

> site search ..yeah it's a pain

Agreed - and that's the one thing that Google was doing a damn good job of here. I don't know what Brett has in mind for his site search, but Google is not going to be easy to beat (just ask AltaVista...).

2by4




msg:307570
 4:35 am on Dec 1, 2005 (gmt 0)

theBear:

"StickyMail Total File Size: 0 k Quota:150k"

no way to know I guess what is actually used, have to go page by page to clear it out.

Ok, so bbbs has some issues with the sticky stuff, but otherwise I stand by my words.

Brett, I was just joking about downloading the site, you can unblock me any time you feel like it. And here I thought WebmasterWorld had gone down again... LOL...

Anyway, have to go along with leosghost on this one, this software is a gem, even if brett did block my ip, it's still the best around.

Kirby




msg:307571
 4:41 am on Dec 1, 2005 (gmt 0)

I agree with Oil in a lot of ways, but - like a lot of other guys - he isn't getting the scale and scope of the situation. It isn't just server management, page views, bandwidth, or servers. There are also issues of scraping, copyright, and liability. The site is here for the human members - not the bots.

More of us get it than you think. So let's assume you cannot win the war of the bots. What is the acceptable loss range? Scraping and copyright, no search feature, what? Does the host die to kill the parasite?

notsleepy




msg:307572
 5:02 am on Dec 1, 2005 (gmt 0)

>More of us get it than you think.

Hear, hear.

notsleepy




msg:307573
 5:08 am on Dec 1, 2005 (gmt 0)

2by4: I'll give you 2 gold stars for spotting the user-agent cloaking of WebmasterWorld but PLEASE stop the brown nosing. I'm vomiting over here.

carguy84




msg:307574
 9:54 pm on Dec 1, 2005 (gmt 0)

Google is pretty clear on this: "Don't deceive your users or present different content to search engines than you display to users, which is commonly referred to as "cloaking."

Eh? The data the google bot sees is EXACTLY the same data the user would see. You're just requiring the user to login. The data behind the login page is still the same. No one is being deceived.

Lord Majestic




msg:307575
 10:08 pm on Dec 1, 2005 (gmt 0)

Eh? The data the google bot sees is EXACTLY the same data the user would see. You're just requiring the user to login. The data behind the login page is still the same. No one is being deceived.

The USER who clicks on Google's result is being deceived - instead of seeing a page with the relevant data, they will see a login page. I feel seriously pissed off when I click on some links in Google News only to be greeted with a subscribers-only page. It may be acceptable for Google News, since they have trusted feeds, but it's not for the Google search engine.

Granted, registration is free, but that's not important. What's important is that there is no way computers can distinguish between good Brett cloaking for a good reason and bad Bill-the-Spammer cloaking for bad reasons, thus anybody who cloaks should be penalised, because machines simply can't see the difference.
