Forum Moderators: DixonJones


what is this spider

what is vang.net.spider 1.6


PaulPA

8:38 pm on May 27, 2004 (gmt 0)

10+ Year Member



I keep seeing this one show up and I'm not sure if it should be banned. Does anybody know about it? The spider is: vang.net spider 1.6

webnerd

4:36 pm on May 28, 2004 (gmt 0)

10+ Year Member



Spider 1.6 is software anyone can download, which is apparently used to search for music files etc.
vang.net is located in the Netherlands. It is definitely not associated with a search engine like Google.
Check this forum entry:
[webmasterworld.com...]

I'm not sure how you would block it, given that it's a free piece of software anyone can run.

PaulPA

9:56 pm on May 28, 2004 (gmt 0)

10+ Year Member



So a disallow in robots.txt would not work?

sidyadav

5:28 am on May 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Right -- not if it doesn't obey robots.txt.

You can check that by going back to your server logs and seeing which file it requested first. If it requested robots.txt, it should obey it.
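If it does honor the standard, a disallow is the simplest fix. A sketch -- the exact User-agent string is an assumption here and should be confirmed from your logs:

```
# robots.txt -- only effective if the spider obeys the
# Robots Exclusion Standard; "vang.net spider" is assumed
# to be the name it announces itself with.
User-agent: vang.net spider
Disallow: /
```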

An alternative method to ban this is via .htaccess.

Sid

PaulPA

11:38 am on May 29, 2004 (gmt 0)

10+ Year Member



Thanks. I'll give it a try.

webnerd

12:48 pm on May 30, 2004 (gmt 0)

10+ Year Member



An alternative method to ban this is via .htaccess.

Hi SID,
What would you do to ban it in .htaccess if the software is a free download and different people with different IP addresses and/or domains are using it?
What code would I use to ban it in .htaccess?
I'm not much of an .htaccess guru.

sidyadav

7:26 am on May 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, there are heaps of ways of doing it with .htaccess.

Because it's free software -- and could be installed on more than 100,000 computers -- you can't, of course, ban it via all its IPs; that would be impossible.

However, you can add code like the following to your .htaccess file and forbid access to it. I think this is the best approach, because you can block any robot you consider nasty, even if it doesn't obey the Robots Exclusion Standard.

[engelschall.com...]


Blocking of Robots

Problem Description:
How can we block a really annoying robot from retrieving pages of a specific webarea? A /robots.txt file containing entries of the "Robot Exclusion Protocol" is typically not enough to get rid of such a robot.

Problem Solution:
We use a ruleset which forbids the URLs of the webarea /~quux/foo/arc/ (perhaps a very deep directory indexed area where the robot traversal would create big server load). We have to make sure that we forbid access only to the particular robot, i.e. just forbidding the host where the robot runs is not enough. This would block users from this host, too. We accomplish this by also matching the User-Agent HTTP header information.

RewriteCond %{HTTP_USER_AGENT} ^NameOfBadRobot.*
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.[8-9]$
RewriteRule ^/~quux/foo/arc/.+ - [F]
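Adapted for this case, a sketch: since the spider runs from many different IP addresses, we drop the REMOTE_ADDR condition and match only the User-Agent. The string "vang.net" is an assumption taken from the logs earlier in this thread -- check your own logs for the exact User-Agent it sends.

```apache
# .htaccess (requires mod_rewrite)
RewriteEngine On
# Match any User-Agent containing "vang.net", case-insensitively
# (assumed UA substring -- verify against your server logs).
RewriteCond %{HTTP_USER_AGENT} vang\.net [NC]
# Return 403 Forbidden for every request from that agent.
RewriteRule .* - [F]
```

Note this only works as long as the spider keeps sending an identifiable User-Agent; anyone can configure such software to send a fake one.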


Sid

webnerd

5:06 pm on May 31, 2004 (gmt 0)

10+ Year Member



Fantastic post, Sid.
Thanks a bundle.