
Search Engine Spider and User Agent Identification Forum

    
LinkSaver and LinkParser
wikimedia bots?
dstiles
msg:4412090
4:56 pm on Jan 29, 2012 (gmt 0)

Just had half a dozen hits (blocked with 403) from LinkSaver and LinkParser, coming from a Wikimedia range. All were to the same page, possibly repeated because of the 403s. It is quite possible the page is referenced on the wiki - several pages from this web site are.

IP: 208.80.153.192 - rDNS is ns0.wikimedia.(etc...)
UA: LinkSaver/2.0
UA: LinkParser/2.0

Wikimedia IP range: 208.80.156.0 - 208.80.159.255

I have wiki bots whitelisted at 91.198.174.201 - 91.198.174.211 but can't recall the UA - probably "Checklinks ... pywikipedia".

Anyone know if this is actually wiki or merely someone accessing through wiki? The only references I can find are to Firefox and similar tools.
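
For anyone wanting to see the shape of that setup in .htaccess form, here is a minimal sketch - the IP range and the two UAs are taken from this post, but the arrangement and the env variable names are only an illustration, assuming Apache 2.2 mod_setenvif and mod_authz_host:

# flag the two Link* agents and the whitelisted pywikipedia range (variable names are illustrative)
SetEnvIfNoCase User-Agent "^Link(Saver|Parser)/" link_bot
SetEnvIf Remote_Addr "^91\.198\.174\.2(0[1-9]|1[01])$" wiki_ok

# deny the flagged agents, but let the whitelisted range through
Order Deny,Allow
Deny from env=link_bot
Allow from env=wiki_ok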

 

dstiles
msg:4412100
5:20 pm on Jan 29, 2012 (gmt 0)

Also found, on the same IP...

UA: CorenSearchBot/1.7 en libwww-perl/6.03

This was to a different web site, one which is also probably referenced on the wiki, and it preceded the Link bots above by about 25 minutes. Initial hit at 08:14 GMT; no further hits seen as of posting at 17:20 GMT.

Pfui
msg:4412106
5:56 pm on Jan 29, 2012 (gmt 0)

The last time I saw LinkParser/2.0, it also came from Wikimedia's 208.80.153.192 [robtex.com...]

Project Honey Pot's comments section for that IP [projecthoneypot.org...] shows yet another bot: COIBot/2.0

LinkParser
LinkSaver
CorenSearchBot
COIBot

That's four strikes from one IP. Three's more than enough for me from any one IP.

FWIW, I've blocked Wikipedia's bots and such forever because they drill into areas where all bots are denied access in robots.txt (and in other ways, too). Also, I know the linked-to resources are in the same places they've been forever.
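
For reference, an IP-level block of that one address, in the same Apache 2.2 style, is a short sketch (the single IP is the one seen in this thread; widening it to the surrounding Wikimedia allocation is a separate decision):

# refuse the Wikimedia address seen above
Order Allow,Deny
Allow from all
Deny from 208.80.153.192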

Aside: Plus now I'm irked with Wales over his doublespeaking, SOPA-grandstanding blackout. [webmasterworld.com...]

dstiles
msg:4412162
9:10 pm on Jan 29, 2012 (gmt 0)

Thanks for the extra info, pfui. I blocked the range when I found it, but I'm still keeping the original bot whitelisted. Can't say it's been prominent, and that's the important bit.

lucy24
msg:4412164
9:11 pm on Jan 29, 2012 (gmt 0)

Ugh, 208.80, just what I want to hit me in the face while eating my granola. Do any humans live there? I haven't met .150-anything, but 208.80.192.0/21 sticks in my memory as one of the first ranges I ever, ever blocked. (It's really only ...194.something, but who's counting.)

wilderness
msg:4412175
10:01 pm on Jan 29, 2012 (gmt 0)

There are some commonly abused user-agent strings that, used in a blacklist, could easily stop these critters in their tracks.

Most have been common for a decade or more (note: some patterns have been shortened so a single entry catches multiple UAs).

Whatever method floats your boat:

# the env variable name (bad_bot here) is arbitrary; the Deny line pairs with it
SetEnvIfNoCase User-Agent (capture|crawl|data|harvest) bad_bot
SetEnvIfNoCase User-Agent (java|larbin|Library|libww|link) bad_bot
SetEnvIfNoCase User-Agent (load|lwp|lynx|MJ12bot|nutch) bad_bot
SetEnvIfNoCase User-Agent (Proxy|pyton|reaper|retrieve|spider) bad_bot
SetEnvIfNoCase User-Agent (Validator|wget|WinHttp) bad_bot
Deny from env=bad_bot

or

RewriteCond %{HTTP_USER_AGENT} (capture|crawl|data|harvest) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (java|larbin|Library|libww|link) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (load|lwp|lynx|MJ12bot|nutch) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Proxy|pyton|reaper|retrieve|spider) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Validator|wget|WinHttp) [NC]
RewriteRule .* - [F]
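
One caveat on the mod_rewrite variant: in an .htaccess file the conditions above only take effect once the engine is switched on, so if that line isn't already present elsewhere in the file it needs to be added:

# assumes the engine is not already enabled elsewhere in this .htaccess
RewriteEngine On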


Pfui
msg:4412192
12:44 am on Jan 30, 2012 (gmt 0)

P.S. for copy-pasters: Make that python (not pyton:)

adrian20
msg:4412744
5:17 pm on Jan 31, 2012 (gmt 0)

Hi, this is my first post in the forum, but I've been avidly reading the tips and experiences here for a while.

To wilderness: after reading the entire Apache manual, I concluded that to block access from an .htaccess file, it is best (for server resources) to use access control by environment variable instead of mod_rewrite.

I have been using something like this for a while:

# Browsermatch
BrowserMatchNoCase "capture|crawl|data|harvest|java|larbin" BAT_bot
deny from env=BAT_bot
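
For completeness, and assuming the Apache 2.2 access-control modules current at the time, that deny line is usually wrapped in an explicit Order block so it doesn't depend on whatever default ordering the host uses:

# explicit ordering so the env-based deny behaves predictably
Order Allow,Deny
Allow from all
Deny from env=BAT_bot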

lucy24
msg:4412866
9:58 pm on Jan 31, 2012 (gmt 0)

I use mod_setenvif for simple conditionless blocks that involve looking for a single word, or a basic string like "MSIE [1-4]" *

But then I've got another complication, which is that I've got two domains in one userspace. My .htaccess compromise is to use a shared file for environment variables and IP blocks (core-level Deny from...), while the individual domains' .htaccess files handle mod_rewrite for the more specific circumstances - exact filenames and so on.

This means that someone asking for the wrong file with the wrong referer using the wrong UA from the wrong IP will spend some time waiting on the virtual doorstep before getting the door slammed in their face. But you can't really feel sorry for them can you? :)
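
One way to picture that layout, assuming the two domains live in subdirectories under one account so the parent .htaccess applies to both (the paths, referer and filename below are made up for illustration):

# /home/account/.htaccess - shared by both domains: env-variable and IP blocks
SetEnvIfNoCase User-Agent "MSIE [1-4]" bad_agent
Order Allow,Deny
Allow from all
Deny from env=bad_agent
Deny from 208.80.153.192

# /home/account/domain1/.htaccess - per-domain mod_rewrite for specific files and referers
RewriteEngine On
RewriteCond %{HTTP_REFERER} unwanted-referer\.example [NC]
RewriteRule ^private/some-file\.html$ - [F]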


* Staggering but true, there exist at least three human beings on this planet who still use MSIE for Mac. That means 5.something. Their UA will say "Mac PPC" even if it's really Intel. (I tested on myself.)
