
LinkSaver and LinkParser

wikimedia bots?

     

dstiles

4:56 pm on Jan 29, 2012 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Just had half a dozen (blocked 403) hits from LinkSaver and LinkParser from a Wikimedia range, all to the same page - possibly repeated because of the 403. It is quite possible the page is linked on wiki - several pages from this web site are.

IP: 208.80.153.192 - rDNS is ns0.wikimedia.(etc...)
UA: LinkSaver/2.0
UA: LinkParser/2.0

Wikimedia IP range: 208.80.152.0 - 208.80.155.255 (208.80.152.0/22)

I have wiki bots whitelisted at 91.198.174.201 - 91.198.174.211 but can't recall the UA - probably "Checklinks ... pywikipedia".
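
For anyone curious, that whitelist amounts to a couple of .htaccess lines roughly like these (the wiki_ok name and the exact regex are illustration only, not lifted from my config):

# Flag requests from the trusted wiki-bot addresses 91.198.174.201 - 91.198.174.211
SetEnvIf Remote_Addr ^91\.198\.174\.(20[1-9]|21[01])$ wiki_ok
# Later blocking rules can then skip anything carrying wiki_ok, e.g. by putting
# RewriteCond %{ENV:wiki_ok} ^$ ahead of the blocking RewriteRule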

Anyone know if this is actually wiki or merely someone accessing through wiki? The only references I can find are to Firefox and similar tools.

dstiles

5:20 pm on Jan 29, 2012 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Also found, on the same IP...

UA: CorenSearchBot/1.7 en libwww-perl/6.03

This was to a different web site, one which is also probably linked on wiki, and it preceded the Link bots above by about 25 minutes. Initial hit 08:14 GMT - no further hits seen as of posting at 17:20 GMT.

Pfui

5:56 pm on Jan 29, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The last time I saw LinkParser/2.0, it also came from Wikimedia's 208.80.153.192 [robtex.com...]

Project Honey Pot's comments section for that IP [projecthoneypot.org...] shows yet another bot: COIBot/2.0

LinkParser
LinkSaver
CorenSearchBot
COIBot

That's four strikes from one IP. Three's more than enough for me from any one IP.

FWIW, I've blocked Wikipedia's bots and such forever because they drill into areas where all bots are denied access in robots.txt (and in other ways, too). Also, I know the linked-to resources are in the same places they've been forever.
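
The brute-force version is a one-liner per range in .htaccess (2.2-style mod_authz_host; the /22 below is my reading of the allocation that covers 208.80.153.192, so check the whois before copying):

# Block the Wikimedia Foundation range discussed above
Deny from 208.80.152.0/22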

Aside: Plus now I'm irked with Wales over his doublespeaking, SOPA-grandstanding blackout. [webmasterworld.com...]

dstiles

9:10 pm on Jan 29, 2012 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Thanks for the extra info, pfui. I blocked the range when I found it, but I'm still keeping the original whitelisted bot. Can't say it's been prominent, and that's the important bit.

lucy24

9:11 pm on Jan 29, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Ugh, 208.80, just what I want to hit me in the face while eating my granola. Do any humans live there? I haven't met .150-anything, but 208.80.192.0/21 sticks in my memory as one of the first ranges I ever, ever blocked. (It's really only ...194.something, but who's counting.)

wilderness

10:01 pm on Jan 29, 2012 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



There are some commonly abused User Agent strings which, used in a blacklist, could easily stop these critters in their tracks.

Most have been common for a decade or more (note: some of the patterns have been shortened so that a single entry catches multiple UAs).

Whatever method floats your boat:

# flag matching UAs (the keep_out variable name is arbitrary), then deny them below
SetEnvIfNoCase User-Agent (capture|crawl|data|harvest) keep_out
SetEnvIfNoCase User-Agent (java|larbin|Library|libww|link) keep_out
SetEnvIfNoCase User-Agent (load|lwp|lynx|MJ12bot|nutch) keep_out
SetEnvIfNoCase User-Agent (Proxy|pyton|reaper|retrieve|spider) keep_out
SetEnvIfNoCase User-Agent (Validator|wget|WinHttp) keep_out
Deny from env=keep_out

or

RewriteCond %{HTTP_USER_AGENT} (capture|crawl|data|harvest) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (java|larbin|Library|libww|link) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (load|lwp|lynx|MJ12bot|nutch) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Proxy|pyton|reaper|retrieve|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Validator|wget|WinHttp) [NC]
RewriteRule .* - [F]
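
One caveat for copy-pasters: those RewriteCond/RewriteRule lines only take effect if mod_rewrite is switched on earlier in the same .htaccess, so make sure this sits above them:

RewriteEngine On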

Pfui

12:44 am on Jan 30, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



P.S. for copy-pasters: Make that python (not pyton:)

adrian20

5:17 pm on Jan 31, 2012 (gmt 0)



Hi, this is my first post in the forum, but I've been reading the tips and experiences here for a long time.

To wilderness: after reading the entire Apache manual, I concluded that to block access with an .htaccess file, it is best (for your server resources) to use access control by environment variable instead of mod_rewrite.

In this case I used, for a while, something like this:

# Browsermatch
BrowserMatchNoCase "capture|crawl|data|harvest|java|larbin" BAT_bot
deny from env=BAT_bot
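
Depending on what else is in the file, that deny may also want the usual Order context around it - a generic 2.2-style sketch, adjust to your own setup:

# Allow everyone except requests flagged as BAT_bot
Order Allow,Deny
Allow from all
Deny from env=BAT_bot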

lucy24

9:58 pm on Jan 31, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



I use mod_setenvif for simple conditionless blocks that involve looking for a single word, or a basic string like "MSIE [1-4]" *

But then I've got another complication, which is that I've got two domains in one userspace. My .htaccess compromise is to use a shared file for environment variables and IP blocks (core-level Deny from...). The individual domains' htaccess files are left for mod_rewrite to deal with more specific circumstances - exact filenames and so on.

This means that someone asking for the wrong file with the wrong referer using the wrong UA from the wrong IP will spend some time waiting on the virtual doorstep before getting the door slammed in their face. But you can't really feel sorry for them can you? :)
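
In outline it looks something like this - a rough sketch of the layering only, with made-up paths and patterns rather than my literal files:

# .htaccess in the shared userspace root: environment variables and IP blocks
SetEnvIfNoCase User-Agent (libwww|wget) bad_bot
Deny from env=bad_bot
Deny from 208.80.152.0/22

# .htaccess in each domain's own directory: mod_rewrite for the fiddly cases
RewriteEngine On
RewriteCond %{HTTP_REFERER} unwanted-site\.example [NC]
RewriteRule ^exact-filename\.html$ - [F]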


* Staggering but true, there exist at least three human beings on this planet who still use MSIE for Mac. That means 5.something. Their UA will say "Mac PPC" even if it's really Intel. (I tested on myself.)
 
