
Search Engine Spider and User Agent Identification Forum

Gridzoom Throws Down the Gauntlet
gridBot/0.3alpha
pendanticist




msg:402369
 9:19 pm on Oct 19, 2004 (gmt 0)

209.123.8.** - - [19/Oct/2004:11:52:21 -0700] "GET /robots.txt HTTP/1.1" 200 1705 "-" "gridBot/0.3alpha (+ [gridzoom.com...]
209.123.8.** - - [19/Oct/2004:11:52:21 -0700] "GET / HTTP/1.1" 200 20402 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
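As an aside, spotting that pattern in a log can be automated. A minimal sketch in Python (not anyone's actual tooling here; the combined log format and the file name access.log are assumptions): flag any IP that fetched /robots.txt with a bot UA but fetched pages with only browser UAs.

import re
from collections import defaultdict

# NCSA combined log format: ip - - [time] "METHOD path PROTO" status size "referer" "user-agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[.*?\] "\S+ (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

seen = defaultdict(lambda: {"robots": set(), "pages": set()})  # ip -> UAs by request type
for raw in open("access.log"):
    m = LINE.match(raw)
    if m:
        ip, path, ua = m.groups()
        seen[ip]["robots" if path == "/robots.txt" else "pages"].add(ua)

for ip, uas in seen.items():
    # a bot UA on robots.txt but only browser UAs on content is the giveaway
    if any("bot" in ua.lower() for ua in uas["robots"]) and \
       uas["pages"] and not any("bot" in ua.lower() for ua in uas["pages"]):
        print(ip, uas["robots"], uas["pages"])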

From their Mission Statement:

- So what will make Gridzoom different?

First of all, our spider rates a site by certain quality factors as a human reader would precieve them. Sites with no other purpose than getting at your wallet will be ranked respectively. Then sites will be ranked only by those factors a webmaster cannot influence. Of course, in an ideal world search engines with sophisticated full text algorythms would deliver better results. Unfortunately, this isn't an ideal world, but a web full of cheaters looking for a quick buck. We are out to clean up that mess.

Let the games begin!

<They have an add URL too.>

 

volatilegx




msg:402370
 2:55 pm on Oct 20, 2004 (gmt 0)

> certain quality factors as a human reader would precieve them

What in the heck does that mean?

From their home page:
DO NOT EXPECT ANY USEFUL RESULTS!

;)

pendanticist




msg:402371
 3:51 pm on Oct 20, 2004 (gmt 0)

Were I one who derived (or attempted to derive) income from my domain(s), I'd be more concerned with how they view the entire ad campaigns thing with respect to their engine, and just how popular they (and/or others) may become.

Not everyone feels that completely monetizing the Internet is the best thing to do.

Besides, I don't see that warning you've quoted as being preceded by the word... NEVER.

jdMorgan




msg:402372
 6:53 pm on Oct 20, 2004 (gmt 0)

"algorythms"?

Hope they code better than they spell...

They are suffering under the misapprehension that commercial sites are "bad." Look, some people use the Web because they want to find information. Some use it because they want to buy something. A search engine that penalizes commercial sites won't last long, because it will lose the latter part of that user-base.

I applaud efforts to clean up the SERPs, but this confrontational, judgmental attitude will be their undoing.

Jim

pendanticist




msg:402373
 5:21 am on Oct 21, 2004 (gmt 0)

Yeah, well, with everything said-n-read so far, the fools have crawled through today with the following UAs:

209.123.8.44 - - [20/Oct/2004:19:55:50 -0700] "GET /robots.txt HTTP/1.1" 200 1705 "-" "gridBot/0.3alpha (+http://www.gridzoom.com/gridbot.php)"
209.123.8.44 - - [20/Oct/2004:19:55:51 -0700] "GET /Blahblah.html HTTP/1.1" 200 4499 "http://www.example.com/blahblah.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
209.123.8.44 - - [20/Oct/2004:19:55:55 -0700] "GET /Blahblah.html HTTP/1.1" 200 6506 "http://www.example.com/blahblah.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

You'd think they'd continue using the gridBot moniker...

grandpa




msg:402374
 5:39 am on Oct 21, 2004 (gmt 0)

You'd think they'd continue using the gridBot moniker...

It looks like they've got their hands full. The programmers are probably doing their own Q/A. I wish them well.

volatilegx




msg:402375
 1:18 pm on Oct 21, 2004 (gmt 0)

The programmers are probably doing their own Q/A.

Look at the time between requests. 1 second. 4 seconds. Looks automated.
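That check is easy to mechanize. A minimal sketch, assuming the standard Apache combined-log timestamp and the IP from the logs above:

from datetime import datetime

times = []
for line in open("access.log"):
    if line.startswith("209.123.8.44 "):
        stamp = line.split("[", 1)[1].split("]", 1)[0]   # e.g. 20/Oct/2004:19:55:50 -0700
        times.append(datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"))

# steady single-digit gaps over a long run read as a script, not a human
for earlier, later in zip(times, times[1:]):
    print((later - earlier).total_seconds(), "seconds between requests")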

Thunderstorm




msg:402376
 2:44 pm on Oct 21, 2004 (gmt 0)

Damn, you guys are fast.

A few things:

1. As the front page says, it's a technology demo. It's not an alpha, or even a pre-alpha. We are cleaning out the index every other day, working on the ranking, refining the spider, yadda. The test box is an ole Pentium. The live hardware isn't even ordered yet.

In conclusion: Hold your horses. ;)

2. We are not out to keep webmasters from earning money. A good service is worth good money. A good site deserves an advertising income.
However, as you probably know full well, there are plenty of sites that pretend to be worthwhile but really only exist to make money, preferably for nothing. It's those we want to get.

3. We are webmasters ourselves. (Even some adult webmasters, which is probably the dirtiest (no pun intended) of business on the web.)
If I create a good site, I deserve to be ranked well without having to optimize my site for Google or whoever. If my site sucks, I don't deserve a good ranking. Easy as that.
It's a search engine's job to rank sites; it's not the webmaster's job. The only way up the ladder should be quality.

Thomas

P.S.: The spider only identifies itself when it requests robots.txt. God knows how many websites deliver different content when the agent is Google. We ain't gonna have that. In the production environment the content queries will also carry different IP addresses. (As in many, many different subnets. :)

[edited by: engine at 9:23 am (utc) on Oct. 22, 2004]
[edit reason] TOS [webmasterworld.com] [/edit]

volatilegx




msg:402377
 3:30 pm on Oct 21, 2004 (gmt 0)

Hi Thunderstorm, welcome to WebmasterWorld. Thanks for posting about your bot/engine. Good luck to your team!

So how do you plan to automate a scoring routine which judges how "good" a website is? That sounds like a very tall order!

Thunderstorm




msg:402378
 3:57 pm on Oct 21, 2004 (gmt 0)

Thanks volatilegx.

The description in the mission statement (which was whipped up after a long night in about 3 minutes :) might be a bit misleading.

Of course we do not judge your site by the nature of its content. We do prefer sites with unique content, but mostly we weed out dupes and penalize sites for annoying or unethical advertising methods. Things like that ...

In addition to that, we take every type of search engine optimization we've learned about in recent years and make sure it won't work on our concept. :)

There will be a white paper available about the exact algorithm in the future.

pendanticist




msg:402379
 5:34 pm on Oct 21, 2004 (gmt 0)

Allow me to also Welcome you to WebmasterWorld! :)

Your mention of robots.txt requests aside, I trust you understand that those who crawl the 'Net without accurate user agents tend to get banned? <Hence my use of the term fools previously.>

I personally don't mind being crawled, but with more and more engines coming out, demands that you respect my site(s) must prevail, for it is our backs upon which you tread.

Good Luck!

Pendanticist.

Thunderstorm




msg:402380
 7:10 pm on Oct 21, 2004 (gmt 0)

Well, there are two sides to this:

As a webmaster I want to know who crawls my site. But do I really need to know every page the spider has accessed, and when? I really don't, unless I want to serve different content to that spider than, say, to an Internet Explorer.

On the other hand, a search engine is always a good thing for a webmaster, as it drives traffic.

So, anyone who'd rather not have his "usual" content indexed, or who doesn't like this method on principle, is welcome to block the spider in robots.txt. :) (Which is kinda untested, since no one has blocked this specific robot yet. Wildcard blocks do work, though. ;)
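For anyone taking Thunderstorm up on that, the robots.txt entries would look like this (the token gridBot is taken from the UA the bot sends when it fetches robots.txt; whether the bot honours its own record is, as he says, untested):

# Block gridBot by name
User-agent: gridBot
Disallow: /

# Or the wildcard form he confirms works (this blocks all compliant bots)
User-agent: *
Disallow: /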

With the amount of content on the web, it's really not a loss for a search engine to be unable to index a page. There are only so many "top spots" - far fewer than there are sites deserving them. That's why we need more search engines with different algorithms, and why it really doesn't matter if a search engine only indexes half the web.

For thorough full-text research, no one will beat Google for a long time anyway. If I am looking for a brownie recipe, I'll use my search engine. If I am looking for exactly that brownie recipe my grandma always used to make, I'll use Google. ;)

So in conclusion, I don't care if your brownie recipe is in my index. ;-)

P.S.: My grandma never made brownies.

pendanticist




msg:402381
 1:37 am on Oct 22, 2004 (gmt 0)

>I don't care if your brownie recipe is in my index

As I recall, you already have it. ;)

bull




msg:402382
 10:10 am on Oct 22, 2004 (gmt 0)

Since this bot does its crawl with a fake User-Agent, I don't have the time to check whether it will behave well even if I do have an appropriate entry in robots.txt, so I denied the IP.

Additionally, its inventor implicitly states here that we are all bad people who cloak.

Lord Majestic




msg:402383
 12:16 pm on Oct 22, 2004 (gmt 0)

Additionally, its inventor implicitly states here that we are all bad people who cloak.

And what is your idea of how search engines can identify sites that cloak content depending on user agent? As far as I can see, the only way (without human checking, of course!) is to crawl the site with the bot's real UA and then crawl it with IE's UA. If you have another way of doing it, then please share.
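For what it's worth, that two-fetch check is only a few lines. A minimal sketch (the URL and UA strings are illustrative only; as noted later in the thread, legitimately dynamic pages will produce false positives):

import urllib.request

def fetch(url, ua):
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    return urllib.request.urlopen(req).read()

url = "http://www.example.com/blahblah.html"
as_bot = fetch(url, "gridBot/0.3alpha (+http://www.gridzoom.com/gridbot.php)")
as_ie = fetch(url, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")

# identical bodies rule out UA cloaking; differing bodies only *suggest* it
print("identical" if as_bot == as_ie else "differs - possible UA cloaking")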

Thunderstorm




msg:402384
 1:43 pm on Oct 22, 2004 (gmt 0)

I denied the IP

Since the IP is temporary and the final set of IP addresses will be hard to figure out, you'd be better off using robots.txt to block the bot.

bakedjake




msg:402385
 1:47 pm on Oct 22, 2004 (gmt 0)

is to crawl the site with the bot's real UA and then crawl it with IE's UA

And that's if they're doing crummy UA cloaking. IP cloaking is even harder to detect.

I'd be interested in how you're decloaking as well. Care to provide us with an overview?

Are you simply running spiders on separate (hidden) IP space?

bull




msg:402386
 1:52 pm on Oct 22, 2004 (gmt 0)

There are enough people here who disapprove of this and will figure out your final IP range.

Lord Majestic




msg:402387
 1:58 pm on Oct 22, 2004 (gmt 0)

And that's if they're doing crummy UA cloaking. IP cloaking is even harder to detect.

True, but the easiest way to cloak is to cloak by UA, and so far no one here has offered any other idea of how to uncover sites that cloak. So why be so aggressive toward bots that fake the UA for the purpose of uncovering cloaking?

Just as it's wrong to say that all SEO folk are search engine cheats, it is wrong to say that all bots that use a faked UA are bad.

Thunderstorm




msg:402388
 2:45 pm on Oct 22, 2004 (gmt 0)


Are you simply running spiders on separate (hidden) IP space?

Yes. We don't fetch the data twice and compare it, though we might do that in the future and assign a ranking penalty to sites that then deliver different content. But that might unfairly penalize legit dynamic sites that serve different content each time you access their homepage. One of my own pages, for example, displays a random user profile and blog entry every time you access the front page.

The robots.txt files will be fetched from a static IP with gridBot as the UA. This tells the webmaster that his content will be spidered "soon". The data will then be fetched from completely different IPs, with a random valid UA each session. We also provide as the referer the page spidered before on your site (if they are linked, of course).

The cloaked spider still obeys robots.txt.
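Read literally, the described scheme is: parse the rules under the bot's real name, then apply them even to the disguised fetches. A minimal sketch of that logic using Python's stock robotparser (the URL is illustrative):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
rp.read()  # in Gridzoom's scheme, this is the request sent with the real gridBot UA

# the later page fetch carries a browser UA from another IP, but permission
# is still evaluated against the "gridBot" record in robots.txt:
if rp.can_fetch("gridBot", "http://www.example.com/blahblah.html"):
    pass  # fetch the page, disguised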

I am sure people will figure it out sooner or later, but we will be on at least two rather large backbones to start with, and they have agreed to help us out with temporary IPs from their free ranges. Blocking their networks would be a bad idea.

It also really doesn't make sense. Everyone can easily block the bot through robots.txt. The ONLY thing one would need the IP for is cloaking.

volatilegx




msg:402389
 2:51 pm on Oct 22, 2004 (gmt 0)

I certainly wouldn't say this user agent switching bot will be a "cloak-buster"; any cloaker with half a brain is using IP detection. This particular bot will simply receive the "human" version of any cloaked page, just like any unknown search engine.

I'd like to bring up the basic ethical question behind a robot switching user agents. It seems to me that it is very similar to the ethical questions behind cloaking. A robot switching user agents when crawling is trying to hide its identity.

Isn't it the right of webmasters to decide who gets to view what content? As owner of my intellectual property (my web pages), I reserve all rights to it -- meaning it is definitely my right to selectively serve content.

A user or bot who intentionally falsifies his identity (by switching user agent strings) is, in effect, stealing content I have reserved for others. In a loose sense, this invasion is actually a type of fraud.

Some would proclaim that I am simply a cloaker whining that his methods are being made irrelevant (boo hoo). But you have to remember who the owner of the content is. It is not the search engine... it is the webmaster. The search engine is the one coming in (possibly uninvited), using the webmaster's server resources, perusing the intellectual property of the webmaster, and using it to make money.

Lord Majestic




msg:402390
 3:02 pm on Oct 22, 2004 (gmt 0)

As owner of my intellectual property (my web pages), I reserve all rights to it -- meaning it is definitely my right to selectively serve content.

What about shops refusing to sell their goods to people who happen to be of a different race?

Bots that hide their UA in order to bust cloakers might not provide a complete solution to the problem, but it's a good start, and while many people know Google's IPs, it still becomes a much more dangerous game for them - Google is rich enough to afford new IPs every month to bust cloakers. I'd imagine honest webmasters who don't cloak would be happy for bots to bust people who compete unfairly (by deception).

What is more important to you - making sure that people who compete with your sites using deception are taken down, or insisting on knowing who is browsing your sites at any given time? If the latter, then you should start a gun retail business instead - at least there it would be understandable why you ask for an ID, which it is not on the Net.

And let me ask you - do you insist on knowing the name, salary, and marital details of every single human user who hits your site? It seems to me the level of scrutiny you apply to bots is way higher than what you apply to human browsers.

stealing content I have reserved for others

As I walk down a high street past stores I see lots of free leaflets, catalogues, and magazines - I can just pick one up and do what I feel like with it. Whoever adopts the model of "free content to the end user" accepts that some people will just bin it in a second. Just like these established players, you will be better off accepting that under your model some things will not happen the way you like. Since you give away free digital content, it's not like it costs you anything - bandwidth is getting cheaper by the day, so it's no longer an excuse unless the bot is misbehaving (disobeying robots.txt, overloading the site for no reason, etc.).

Bots are an integral part of what made the Internet what it is now. You can't have it both ways, taking advantage of what the Internet has to offer on the one hand while cutting out things that come as part of the package, such as search engine bots, on the other.

[edited by: Lord_Majestic at 3:15 pm (utc) on Oct. 22, 2004]

victor




msg:402391
 3:08 pm on Oct 22, 2004 (gmt 0)

Various posts here suggest a misreading of the HTTP1.1 spec.

There is no requirement in the protocols for a user-agent to provide an ID, or even a consistent one.

The key word in RFC 2068 on the subject is "should", not "must".

In fact, they are permitted to do what they like as an aid to the "tracing of protocol violations".

Cloaking is not a given reason for responding differently to different user-agent IDs, except to "avoid particular user agent limitations". So other forms of cloaking may count as a protocol violation. If so, a user-agent is explicitly allowed to detect that.

If you want that "should" to be a "must" in the protocol, or other changes: contact the IETF, which publishes the HTTP RFCs.
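For reference, the wording in question from RFC 2068 §14.42, quoted here from memory of the spec, so verify against the RFC itself:

User-Agent = "User-Agent" ":" 1*( product | comment )

"The User-Agent request-header field contains information about the
user agent originating the request. This is for statistical purposes,
the tracing of protocol violations, and automated recognition of user
agents for the sake of tailoring responses to avoid particular user
agent limitations. User agents SHOULD include this field with requests."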

volatilegx




msg:402392
 3:31 pm on Oct 22, 2004 (gmt 0)

What about shops refusing to sell their goods to people who happen to be of a different race?

Ahh, but cloakers aren't violating anybody's civil rights, which is what happens in that example.

I'd imagine honest webmasters who don't cloak would be happy for bots to bust people who compete unfairly (by deception).

There is nothing unfair or deceptive about cloaking. Cloaking is a defensive technique used to keep others from stealing our optimized html. The optimized html used in cloaking has no advantage over optimized html used by non-cloakers.

Like any tool, cloaking can be used deceptively, for example to get a porn site ranked for the term "disney". But I would argue that the great majority of cloakers aren't doing anything like that.

However, I don't want to debate the ethics of cloaking here (there are plenty of other threads already doing that)... I simply wanted to compare them to the ethics of an engine that switches user agents in an attempt to make cloaking irrelevant.

And let me ask you - do you insist on knowing the name, salary, and marital details of every single human user who hits your site? It seems to me the level of scrutiny you apply to bots is way higher than what you apply to human browsers.

A marketer's dream come true :) If I had that data I could make a fortune! Realistically, of course, I don't expect that. However, there is an expectation that at least the visitor will report a truthful User-Agent. Plenty of websites vary content based on the User-Agent string, for many reasons: CSS incompatibilities, JavaScript, etc.
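A hypothetical .htaccess sketch of that kind of legitimate UA branching (mod_rewrite assumed; the filenames are made up): serve a stripped-down page to Netscape 4.x, whose CSS support was notoriously broken. A UA-switching bot would land in one bucket or the other at random.

RewriteEngine On
# Netscape 4.x identifies as Mozilla/4.x *without* the "compatible" token
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.
RewriteCond %{HTTP_USER_AGENT} !compatible [NC]
RewriteRule ^index\.html$ /index-nn4.html [L]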

As I walk down a high street past stores I see lots of free leaflets, catalogues, and magazines

I think this is a bad analogy. It disregards the fact that the web is a dynamic medium capable of discriminating between the identities of visitors. A better analogy would be walking down the street and seeing various hucksters handing out leaflets. The hucksters can choose whether to hand you their leaflet based on your appearance. A huckster handing out adverts for a strip bar might not hand one to a little old lady, but would be sure to hand them to a bunch of sailors :)

Since you give away free digital content its not like it costs you anything

But if I base the decision of whether to give away free content (or what content to serve) on the User-Agent string, then changing the string defrauds me.

Thunderstorm




msg:402393
 3:55 pm on Oct 22, 2004 (gmt 0)


I think this is a bad analogy. It disregards the fact that the web is a dynamic medium capable of discriminating between the identities of visitors. A better analogy would be walking down the street and seeing various hucksters handing out leaflets. The hucksters can choose whether to hand you their leaflet based on your appearance. A huckster handing out adverts for a strip bar might not hand one to a little old lady, but would be sure to hand them to a bunch of sailors :)

Let's use this analogy. So I am a little old lady with extra-large shoulder pads and a lot of makeup under a trenchcoat, disguised as a horny young male. If I get a leaflet now, would that be fraud? It's still a gift. They didn't put up a sign saying "No disguised little old ladies allowed". If you want, you can put one up. (robots.txt)

Lord Majestic




msg:402394
 4:05 pm on Oct 22, 2004 (gmt 0)

There is nothing unfair or deceptive about cloaking. Cloaking is a defensive technique used to keep others from stealing our optimized html. The optimized html used in cloaking has no advantage over optimized html used by non-cloakers.

Ahhh, now we getting to the bottom of the issue ;)

I suppose if you want to serve, ummm, "optimised html" to certain search engines that you care about, like Google, then you would serve the same non-optimised stuff to everyone else, in which case you would not suffer one tiny bit from bots that declare themselves to be IE6, right? So, what's your problem?

I guess you will have no problems until Google starts, if it has not started already, checking sites without telling the whole world it's Google ;)

Plenty of websites vary content based on the User-Agent string, for many reasons: CSS incompatibilities, JavaScript, etc.

Yes, fair play, but in this situation if a client (bot) fakes its user-agent, then it's the client's problem if it can't parse possibly wrong CSS or whatever - you should not be concerned about it. But you are, because you just want to exercise what I consider an unreasonable degree of control :)

But if I base the decision of whether to give away free content (or what content to serve) on the User-Agent string, then changing the string defrauds me.

Defrauding, strictly speaking, is a serious offence, and I would not use the word for a case where the world does not bend the way you want to suit your business model -- the world does not owe you a living, and while you can call it "defrauding" you, it will certainly not stand up in court :)

[edited by: Lord_Majestic at 4:08 pm (utc) on Oct. 22, 2004]

volatilegx




msg:402395
 4:08 pm on Oct 22, 2004 (gmt 0)

Thunderstorm,

Touche :) My analogy is flawed, too.

volatilegx




msg:402396
 4:15 pm on Oct 22, 2004 (gmt 0)

I suppose if you want to serve, ummm, "optimised html" to certain search engines that you care about, like Google, then you would serve the same non-optimised stuff to everyone else, in which case you would not suffer one tiny bit from bots that declare themselves to be IE6, right? So, what's your problem?

I have no problem with your statement. It fits my point exactly.

I guess you will have no problems until Google starts, if it has not started already, checking sites without telling the whole world it's Google

Who says they haven't? They may have already begun doing this. However, there are better ways for the search engines to do their "cloak-busting". There are also much better ways to make cloaking "irrelevant". But I'm not going to give anybody any free ideas ;)

Defrauding, strictly speaking, is a serious offence, and I would not use the word for a case where the world does not bend the way you want to suit your business model -- the world does not owe you a living, and while you can call it "defrauding" you, it will certainly not stand up in court

Quite right. Fraud is probably too strong a word, and I certainly didn't mean it in a legal sense.

Lord Majestic




msg:402397
 4:17 pm on Oct 22, 2004 (gmt 0)

I have no problem with your statement. It fits my point exactly.

Then perhaps I misunderstood you by thinking you are NOT okay with a bot crawling your site twice from different IPs, first using its own unique UA and then pretending to be something like IE6, in order to find people who cloak in a way that is deceptive, i.e. serving totally different content?

jdMorgan




msg:402398
 4:19 pm on Oct 22, 2004 (gmt 0)

Folks,

Let's not argue, please. It isn't going to do anyone any good.

I think that Thunderstorm has gotten a sample of both sides here now. Gridzoom is going to do what they're going to do, and there's not much we can do about it, except to use robots.txt or other fairly standard methods to control spider access.

There's a middle ground here, too: those who serve differing content to different user-agents for good reasons, and there are plenty (e.g. simple language selection depending on UA and Accept-Language). But ultimately, the market will decide whether Gridzoom's method is better than what we are accustomed to today.

Those of us who don't cloak can go on as usual with robots.txt, .htaccess, etc. Those of us who do will simply add Gridzoom to their IP-address-tracking watch lists and go on as usual (Thunderstorm's first words in this thread are revealing). Gridzoom is at present a new engine. If over time it becomes a major player, webmasters will do what they always do: adapt.

Jim
