
Search Engine Spider and User Agent Identification Forum

    
Whitevector Crawler
vordmeister
msg:3742919 - 6:33 pm on Sep 11, 2008 (gmt 0)

I'm being bugged by this crawly thing. Apparently Whitevector analyse forums and blogs and let paying clients know if anyone is saying anything about their company. I'd prefer those clients to come to my forum or blog and maybe join in, so I have blocked it.

Currently it's blocked in .htaccess, but I don't see any 403 requests for robots.txt. Also, it doesn't take the hint and keeps on pulling 403 responses a thousand times a day, which is annoying as it's cluttering the log files and getting in the way of some other debugging work.

Anyone know how to block the thing? Will it revisit robots.txt after a few days and respect a block there? Also, what user agent would I block? There's nothing I can find on their website.
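
For reference, a minimal .htaccess sketch of the kind of user-agent block being discussed here - the "Whitevector" substring is an assumption, since the exact UA string isn't documented:

# block any UA containing "Whitevector" (the match string is a guess)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Whitevector [NC]
RewriteRule .* - [F]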

 

jdMorgan
msg:3743164 - 2:48 am on Sep 12, 2008 (gmt 0)

> 403 requests for robots.txt

What does that mean? I suggest that you never 403 a request for robots.txt, as some search engines will consider an inaccessible robots.txt file as carte blanche to spider your site... and then the 403 creates more problems than you started with.

There's no telling what user-agent string you might try to use in robots.txt; if they don't tell you, then it's just a guess to try "Whitevector Crawler". But if that doesn't work, it could be that they don't respect robots.txt, or it could be that you didn't specify the proper string. So who knows?

The only way to stop "log cluttering" is to use a firewall that prevents their requests from ever reaching your server. Or if you have access to customize your log files, then define a custom-logging rule that suppresses logging for their requests.
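
A sketch of that conditional-logging idea, assuming Apache and access to the server config (CustomLog cannot go in .htaccess); the "Whitevector" match and the log path are assumptions:

# httpd.conf: tag the bot's requests...
SetEnvIfNoCase User-Agent "Whitevector" dontlog
# ...then log only requests that lack the tag
CustomLog logs/access_log combined env=!dontlog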

Other than that, about all you can do is serve a minimum-sized 403 response page along the lines of "Access Denied. Click here [webmasterworld.com] for more information."

Be sure that in addition to robots.txt, your 403 error page and the suggested "403 information page" are accessible regardless of any .htaccess blocks. Requests for robots.txt, custom 403, and 500-server error pages should bypass all .htaccess access control rules.
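
One way to sketch that with Apache 2.2-style access control, assuming the custom error page is /403.html (the filename is an assumption); note that a mod_rewrite block still needs its own RewriteCond exclusion, as shown later in this thread:

# let everyone fetch robots.txt and the custom 403 page,
# even when other access-control rules deny the request
<FilesMatch "^(robots\.txt|403\.html)$">
    Order Allow,Deny
    Allow from all
</FilesMatch>
# point Apache at the custom 403 page
ErrorDocument 403 /403.html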

Jim

phred
msg:3743294 - 8:34 am on Sep 12, 2008 (gmt 0)

When someone is banned, my PHP code that handles robots.txt serves:

User-agent: *
Disallow: /
(plus a couple of traps..)

Despite the 403s and the Disallow served for robots.txt, the spiders keep hitting the web site and keep getting 403s.
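
An .htaccess-only sketch of the same idea - serving a deny-all robots file to a blocked bot - where the "Whitevector" match and the robots-banned.txt filename are assumptions:

# hand the blocked bot a deny-all robots file instead of the real one
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Whitevector [NC]
RewriteRule ^robots\.txt$ /robots-banned.txt [L]

Here robots-banned.txt would contain just the two lines shown above.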

Phred

vordmeister
msg:3743534 - 2:15 pm on Sep 12, 2008 (gmt 0)

Thanks both. I need to get better at mod_rewrite. I used to do everything using mod_access, but having moved to a Windows server, that's a toy I can no longer use.

In mod_rewrite how would you go about excluding robots.txt from the rewrite?

Samizdata
msg:3743556 - 2:34 pm on Sep 12, 2008 (gmt 0)

In mod_rewrite how would you go about excluding robots.txt from the rewrite?

Before the RewriteRule insert this:

RewriteCond %{REQUEST_URI} !^/robots\.txt$
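
In context, that condition has to sit immediately before the blocking rule, since consecutive RewriteConds are ANDed together - a sketch, again assuming a "Whitevector" UA substring:

# robots.txt is exempt; everything else from the bot gets a 403
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} Whitevector [NC]
RewriteRule .* - [F]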

...

Lord Majestic
msg:3743558 - 2:39 pm on Sep 12, 2008 (gmt 0)

I'd prefer those clients to come to my forum or blog and maybe join

Those clients can't do that unless they know that something of interest (like their brand name) is mentioned in your forum - it is precisely when they know that they can actually come to your forum, register etc.

vordmeister
msg:3743568 - 2:53 pm on Sep 12, 2008 (gmt 0)

Thanks Samizdata!

People are able to search their company name in Google to find out if anything is being said about them on the internet. Or they can Google for stuff they are interested in. A number of company representatives have joined my forums after finding the site in these ways.

Lord Majestic
msg:3743569 - 3:01 pm on Sep 12, 2008 (gmt 0)

Or they can Google for stuff they are interested in.

That costs too much in staff time - it is much smarter to pay a small subscription to a custom search engine like the one (presumably) Whitevector is developing, which would (presumably) notify them automatically about matches, so that staff can go directly to your forum rather than spend time on Google every day searching for possibly hundreds of phrases, with little hope of ever seeing your forum because it is very hard to get into the top 10 for household brand names.

I am not affiliated in any way with these people - in fact I never heard of them until today. I'm just posting this to show that it seems you don't mind such traffic reaching your forum, and if so, blocking that startup search engine is not in your interests (unless they disobey robots.txt or something similar).

Samizdata
msg:3743585 - 3:13 pm on Sep 12, 2008 (gmt 0)

Vordmeister, before you rush to implement the example code I posted, read this from jdMorgan:

Be sure that in addition to robots.txt, your 403 error page and the suggested "403 information page" are accessible regardless of any .htaccess blocks. Requests for robots.txt, custom 403, and 500-server error pages should bypass all .htaccess access control rules.

My method of dealing with this is to put all my error pages in an "errors" directory, then exclude that directory from any RewriteRule - so the example I posted above would then become:

RewriteCond %{REQUEST_URI} !^/(robots\.txt|errors)
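
Putting the whole thing together - a sketch assuming the error pages live in a hypothetical /errors/ directory and the bot's UA contains "Whitevector":

# custom error page, kept outside all blocking rules
ErrorDocument 403 /errors/403.html
RewriteEngine On
# never block robots.txt or the error pages themselves
RewriteCond %{REQUEST_URI} !^/(robots\.txt|errors)
# block the bot everywhere else (UA substring is a guess)
RewriteCond %{HTTP_USER_AGENT} Whitevector [NC]
RewriteRule .* - [F]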

...

vordmeister
msg:3743609 - 3:35 pm on Sep 12, 2008 (gmt 0)

Good point, Samizdata. I've just moved server and haven't got around to writing error pages. But true enough, the standard ones seem to be blocked. I'll look into that.

Good point, Lord Majestic, too. These guys were requesting pages at the same rate as Googlebot. If every start-up company punched above its weight like this I would have no bandwidth left for visitors.

I'm happy to pay for the bandwidth if they can offer a business case, but I could find no webmaster page. In my speed-read of the site I came to the conclusion that they were using my bandwidth to offer an alternative to me rather than to help promote me - an anti-business case.

wilderness
msg:3743636 - 4:00 pm on Sep 12, 2008 (gmt 0)

I'm happy to pay for the bandwidth if they can offer a business case, but I could find no webmaster page. In my speed-read of the site I came to the conclusion that they were using my bandwidth to offer an alternative to me rather than to help promote me - an anti-business case.

There are many third-party services (and I use that term very loosely and BROADLY) that are NOT beneficial to webmasters and/or websites.

Each webmaster must determine what is beneficial or detrimental to their own site(s).

Lord Majestic
msg:3743647 - 4:11 pm on Sep 12, 2008 (gmt 0)

These guys were requesting pages at the same rate as Googlebot.

I don't know if they support Crawl-Delay but this could be an option.
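
If they do support it, the robots.txt entry would look something like this - the "Whitevector Crawler" user-agent token is a guess (as noted above, they don't document it), and plenty of crawlers ignore Crawl-delay entirely:

User-agent: Whitevector Crawler
Crawl-delay: 60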

If every start up company punched above their weight like this I would have no bandwidth left for visitors.

The thing is that Google has existed for 10 years now, and startups have to catch up without waiting 10 years - that can explain the high crawl rate. I recently visited a proper data center, and it turns out that these days the main cost (to them) is energy rather than bandwidth. If you are not getting a good deal on bandwidth from your current provider, then by all means switch away - bandwidth these days is pretty cheap.

Your other considerations are fairly good, however - I was merely trying to point out that, from the original description of the situation, such a specialised search engine (or some other one) could be of benefit.

vordmeister
msg:3743669 - 4:33 pm on Sep 12, 2008 (gmt 0)

Even Googlebot was blocked by some webmasters when it first started up. They thought "what the heck is this thing and why should I allow it?" (Also, it was badly behaved at times.)

Bandwidth isn't really the issue - it's more the principle. There are so many parasites roboting their way around the web these days. I allow the ones I believe aren't evil, as search engine competition can only benefit webmasters. But if someone takes a big chunk and can't be bothered to tell me why, they tend to end up on the naughty step.

I've made the point. Perhaps they'll spot the point using their technology and sort it out. Though I suspect they aren't allowed on here. :-)

Samizdata
msg:3743705 - 5:08 pm on Sep 12, 2008 (gmt 0)

I'm being bugged by this crawly thing

I'm a bit slow on the uptake today. What is the user-agent? Does it always use the same IP range?

If I understand correctly, it is a commercial "brand protection" bot with an identifiable UA and it does not request robots.txt - not a regular search engine but just another automated nuisance.

Presumably it found a mention of one of its paying customers somewhere on your site and wants to alert their PR (paranoid reaction) department so they can post corporate spin to your forum or blog.

If this is the case, Jim had it covered - drop it at the firewall, or 403 it and filter the logs.

...

vordmeister
msg:3743719 - 5:38 pm on Sep 12, 2008 (gmt 0)

A commercial brand protection bot? This is the first one I've noticed, but judging from their sales blurb I think you are right. What's the world coming to?

The user agent is Whitevector+Crawler in the logs, but I've not checked to see whether the IP 83.145.232.* remains the same. It did today.

Given their Scandinavian location I can guess the brand they are trying to protect as well. That brand only gets good press on my forum, as they make decent stuff. Maybe they are planning to slip something doubtful onto the market and are keeping watch for bad feedback?

I've allowed them to see robots.txt now thanks to your advice. Hopefully they'll stop or give up soon.
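
For completeness, an .htaccess sketch of an IP-range block for the range mentioned above - it assumes the whole 83.145.232.* range belongs to the crawler, and keeps robots.txt reachable per the advice in this thread:

# block the crawler's IP range, but leave robots.txt readable
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^83\.145\.232\.
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]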

[edited by: incrediBILL at 11:18 pm (utc) on Sep. 12, 2008]
[edit reason] Obscured IPs [/edit]

incrediBILL
msg:3743910 - 11:21 pm on Sep 12, 2008 (gmt 0)

There are lots of these spybots out there, Whitevector is just one of many.

That's why I block them all because I don't need parasites climbing all over my site.

My web site is NOT part of their ecosystem.
