Search Engine Spider and User Agent Identification Forum

    
Semalt
Referrer spamming gone mad.
blend27




msg:4642188
 8:24 pm on Feb 3, 2014 (gmt 0)

Seems like another one of those pesky SEO/SEM firms from the mother country.

User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36
REF: http:// semalt .com/crawler.php?u=domain.com

Several domains hit, mostly from LACNIC ranges. Greece and Italy IPs in there too.

Methinks it's a bot; I have (almost) never had visitors or competitors from Colombia or Peru.

No, Alex, not good....

 

iomfan




msg:4650565
 10:02 pm on Mar 2, 2014 (gmt 0)

Yes, when they find you they come in droves, and from places all over (such as Brazil) that never send any real traffic.
Referers are of the type
"http://semalt.com/crawler.php?u=http://www.example.com" - no proof that they always identify themselves, but at least the many hosts that use this referer are easily blackholed.
wilderness




msg:4650596
 1:12 am on Mar 3, 2014 (gmt 0)

"they come in droves, and from places all over"

They've been getting turned away in droves; however, I had one get through recently from a Comcast Philly IP, which I assumed was a compromised machine.

lucy24




msg:4653829
 10:17 pm on Mar 13, 2014 (gmt 0)

Huh. Came by to ask about them, since I've been much vexed in recent days by Brazilian humanoids giving semalt+mysite as referer.

Could it be some type of preview? All requests are complete, including the with-javascript version of piwik. The latest attempt was locked out because they had the wrong version of the site name in the referer slot. (I didn't think to include an opening anchor.) With a robot that would have been the last of it. But here it led to paired requests for

shared stylesheet
error stylesheet
piwik js (from the 403 page)
piwik php

Why paired? Because the original request was effectively an auto-referer and got blocked at its originally requested (wrong) hostname. So each request for a supporting file came in to the wrong hostname and was duly redirected.

So either a botnet or a preview.

(Aside: Since the vast majority of robots don't ask for non-page files, I find it more efficient on the whole not to block non-page requests other than general IP blocks. Botnets are a different story.)

Wonder if I've ever had a legitimate human visitor from Brazil? The rest of LACNIC occasionally takes a look at Perez the Mouse, but that wouldn't apply to Brazil. Hm. Maybe I should just start blocking them as the occasion warrants, same as I do with botnets that happen to come from eastern Europe.

ken_b




msg:4653831
 10:21 pm on Mar 13, 2014 (gmt 0)

OK, double checked my feeble memory with someone else.

This sounds like a site that might be listed on the "Sites" list in your AdSense stats if you run AdSense.

I looked at a site for someone today and the .com version was listed there.

Could that be through the preview thing Lucy24 mentioned?

[added] OK, checked my own AdSense "Sites" list and sure enough, there it is.

iomfan




msg:4653846
 11:05 pm on Mar 13, 2014 (gmt 0)

[blog.semalt.com...] :)

keyplyr




msg:4653934
 7:54 am on Mar 14, 2014 (gmt 0)


First I've seen them. Blocked by putting my domain name in their UA (I do allow a couple exceptions) and of course by the obvious buzz term.

112.202.157.196 - - [13/Mar/2014:23:14:22 -0700] "GET / HTTP/1.1" 403 879 "http://semalt.com/crawler.php?u=http://my-domain.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"

lucy24




msg:4653949
 9:09 am on Mar 14, 2014 (gmt 0)

Blocked by putting my domain name in their UA

When your fingers typed "UA" did your brain mean "referer"? Every time I think I've found another useful auto-referer block, I remember that search engines also include my sitename in the referer string :(
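The difference between an anchored and an unanchored match on your own site name can be sketched in a few lines of Python. This is only an illustration: `example.com` stands in for your own hostname, the search-engine referer is a made-up sample, and the function names are hypothetical.

```python
import re

SITE = r"example\.com"  # placeholder: substitute your own hostname

# Unanchored: true for ANY referer that mentions the site name anywhere.
UNANCHORED = re.compile(SITE, re.IGNORECASE)

# Anchored: true only when the referer itself starts at your own site.
ANCHORED = re.compile(r"^https?://(?:www\.)?" + SITE + r"(?:[/:]|$)", re.IGNORECASE)

def mentions_site(referer):
    """Referer contains the site name somewhere (semalt and search engines both do)."""
    return UNANCHORED.search(referer) is not None

def starts_at_site(referer):
    """Referer begins with the site's own URL (a genuine same-site referer)."""
    return ANCHORED.search(referer) is not None

samples = [
    "http://semalt.com/crawler.php?u=http://example.com",  # from the logs in this thread
    "http://www.google.com/search?q=example.com",           # made-up search referer
    "http://example.com/",                                   # same-site referer
]
for ref in samples:
    print(mentions_site(ref), starts_at_site(ref), ref)
```

The unanchored test flags all three samples, which is why a block built on it needs an allow list; the anchored test flags only the last one and so misses the semalt referer entirely.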

Annoying that the Bad Word "crawler" is also in the referer, instead of in the UA where it would do some good.

keyplyr




msg:4653954
 9:22 am on Mar 14, 2014 (gmt 0)



When your fingers typed "UA" did your brain mean "referer"?



Why yes, yes it did :)

lucy24




msg:4654195
 9:53 pm on Mar 14, 2014 (gmt 0)

Awright, that does it. Three more requests-- on my personal site, whose front page rarely gets three humans in a single day. (I am not a front-driven site at the best of times. On this one, humans go straight for the /games/ directory.)

SetEnvIf Referer semalt keep_out

I don't normally use this form-- in fact it looks as if I've never used "SetEnvIf Referer" before and had to go check the wording-- but it's that or add three separate RewriteRules in three separate htaccess files.

Hmph.

keyplyr




msg:4654240
 1:38 am on Mar 15, 2014 (gmt 0)

I already had the rewrite block for "example" accompanied by two different allow lists, one with ^ and one without.

lucy24




msg:4655531
 8:42 pm on Mar 19, 2014 (gmt 0)

Follow-up:

I think it's a botnet. I searched raw logs for the two affected sites. Sample:

189.47.122.156 - - [11/Mar/2014:12:11:29 -0700] "GET / HTTP/1.1" 200 2558 "http://semalt.com/crawler.php?u=http://example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
187.127.119.103 - - [11/Mar/2014:18:01:30 -0700] "GET / HTTP/1.1" 200 2558 "http://semalt.com/crawler.php?u=http://example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.39.167.155 - - [12/Mar/2014:10:03:02 -0700] "GET / HTTP/1.1" 200 2558 "http://semalt.com/crawler.php?u=http://example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
186.244.221.156 - - [12/Mar/2014:12:53:59 -0700] "GET / HTTP/1.1" 200 2558 "http://semalt.com/crawler.php?u=http://example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
200.203.64.88 - - [12/Mar/2014:16:43:48 -0700] "GET / HTTP/1.1" 200 2558 "http://semalt.com/crawler.php?u=http://example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"

et cetera. 29 front-page requests in all, beginning abruptly on 11 March on both sites. Notice the unifying theme? Every single request had the identical UA:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36
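One rough way to check the "identical UA" observation against your own logs is to tally user agents for every request whose referer mentions semalt. The sketch below assumes Apache's combined log format, as in the samples above; the function name and the `needle` parameter are hypothetical.

```python
import re
from collections import Counter

# Apache "combined" log format: ip, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def ua_counts_for_referer(lines, needle="semalt"):
    """Count user agents among requests whose referer contains `needle`."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m and needle in m.group("referer").lower():
            counts[m.group("ua")] += 1
    return counts
```

Feed it the lines of a raw access log; if the result has a single key, every matching request used the same UA string.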

The UA itself-- separate search-- first showed up in the first half of February. I guess that's when Chrome 32 was released; someone will know.

Why do I think a botnet? Because what I pasted above is only what I get in a referer search. If I do an IP search there are five times as many hits, because it looks like this:

177.158.151.67 - - [12/Mar/2014:16:03:28 -0700] "GET / HTTP/1.1" 403 1642 "http://semalt.com/crawler.php?u=http://example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:31 -0700] "GET /sharedstyles.css HTTP/1.1" 301 588 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:31 -0700] "GET /boilerplate/errorstyles.css HTTP/1.1" 301 608 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:32 -0700] "GET /sharedstyles.css HTTP/1.1" 200 4842 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:32 -0700] "GET /boilerplate/errorstyles.css HTTP/1.1" 200 1790 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:32 -0700] "GET /piwik/piwik.js HTTP/1.1" 301 586 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:32 -0700] "GET /piwik/piwik.js HTTP/1.1" 200 22980 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:33 -0700] "GET /piwik/piwik.php?action_name=The 403 Page&idsite=1&rec=1&r=511966&h=20&m=3&s=12&url=http://example.com/&urlref=http://semalt.com/crawler.php?u=http://example.com&_id=a73e83474fb77af2&_idts=1394665392&_idvc=1&_idn=1&_refts=1394665392&_viewts=1394665392&_ref=http://semalt.com/crawler.php?u=http://example.com&cookie=1&res=1366x768 HTTP/1.1" 301 1182 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
177.158.151.67 - - [12/Mar/2014:16:03:33 -0700] "GET /piwik/piwik.php?action_name=The 403 Page&idsite=1&rec=1&r=511966&h=20&m=3&s=12&url=http://example.com/&urlref=http://semalt.com/crawler.php?u=http://example.com&_id=a73e83474fb77af2&_idts=1394665392&_idvc=1&_idn=1&_refts=1394665392&_viewts=1394665392&_ref=http://semalt.com/crawler.php?u=http://example.com&cookie=1&res=1366x768 HTTP/1.1" 200 302 "http://example.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"

(URL-decoded for readability.) Most robots ask only for HTML, don't always follow redirects, and almost never execute JavaScript. This looks like human browser behavior.
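The "referer search versus IP search" comparison can be sketched in Python along these lines. It assumes combined-format logs like the ones above; the function name is hypothetical, and the IP pass will also pick up any unrelated visits from the same addresses.

```python
import re

# Apache "combined" log format, matching the samples in this thread.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def referer_vs_ip_hits(lines, needle="semalt"):
    """Return (hits found by referer search, hits found by IP search)."""
    parsed = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m:
            parsed.append(m.groupdict())
    # Pass 1: requests that carry the spam referer.
    referer_hits = [p for p in parsed if needle in p["referer"].lower()]
    # Pass 2: every request from any IP seen in pass 1, including the
    # follow-up requests for stylesheets and scripts.
    ips = {p["ip"] for p in referer_hits}
    ip_hits = [p for p in parsed if p["ip"] in ips]
    return len(referer_hits), len(ip_hits)
```

A large gap between the two numbers means the visitors are fetching supporting files, not just the page itself.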

See the "res=" part? That's the only thing that changes. Assorted monitor sizes, all in the middling desktop range. Botnet with fake UA, or a vulnerability in one version of Chrome.
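To list the screen sizes reported in the piwik requests, something like the following Python sketch may do. It simply searches each piwik.php log line for the `res=` parameter; the function name is hypothetical, and paired requests (the 301 plus the 200) are each counted.

```python
import re
from collections import Counter

# "res=WIDTHxHEIGHT" as it appears in the piwik.php query string.
RES_RE = re.compile(r"[?&]res=(\d+x\d+)")

def screen_sizes(lines):
    """Tally the res= values found in piwik.php request lines."""
    counts = Counter()
    for line in lines:
        if "piwik.php" not in line:
            continue
        m = RES_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

If the only thing that varies across otherwise identical requests is this tally, that matches the pattern described above.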

Oh, yes, the IPs. I found isolated specimens from the US, Malaysia and Indonesia, and a handful from other South American countries, but the overwhelming majority are Brazil as noted earlier.

Howzitza




msg:4655538
 8:58 pm on Mar 19, 2014 (gmt 0)

Semalt stuffed up my nice stats as well...
Block them out by simply adding this to your .htaccess file in your root folder:
RewriteEngine on
RewriteCond %{HTTP_REFERER} semalt\.com [NC]
RewriteRule .* - [F]
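Before deploying a rule like this, it can be useful to see which past requests it would have caught. The Python sketch below mirrors the RewriteCond test (an unanchored, case-insensitive match on the referer) against combined-format log lines; it is an offline approximation rather than a replica of mod_rewrite, and the function name is hypothetical.

```python
import re

# Same test as: RewriteCond %{HTTP_REFERER} semalt\.com [NC]
SEMALT_RE = re.compile(r"semalt\.com", re.IGNORECASE)

# Apache "combined" log format, as in the samples posted in this thread.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def would_be_blocked(log_line):
    """True if the logged referer matches the semalt pattern."""
    m = LOG_RE.match(log_line.strip())
    return bool(m and SEMALT_RE.search(m.group("referer")))
```

Requests for supporting files that carry your own page as referer return False here, which is consistent with those follow-up requests not being caught by the referer rule.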

lucy24




msg:4655590
 11:29 pm on Mar 19, 2014 (gmt 0)

They're blocked. Unfortunately this leads to more requests, because...

A non-blocked request will be redirected from example.com to the preferred form www.example.com. All supporting files will then be requested from www.

A blocked request is blocked at its originally requested URL. All supporting files are therefore requested from without-www, leading to each separate one being redirected. These requests of course give the 403 page, not semalt, as referer, so they can't be blocked. Well, short of blocking the entire nation of Brazil, which seems overkill. There are humans in Brazil aren't there?

It now occurs to me that the one thing they don't request is the favicon (which typically has no referer at all). Huh.

keyplyr




msg:4655597
 12:10 am on Mar 20, 2014 (gmt 0)



There are humans in Brazil aren't there?

Not during Carnival

Howzitza




msg:4655630
 3:43 am on Mar 20, 2014 (gmt 0)

Lucy24, I am sure you only need to give your stats some time to flush Semalt out... Once you add the code mentioned above to the .htaccess file, it completely blocks Semalt.

keyplyr! And we were not invited! ha ha

lucy24




msg:4655673
 9:14 am on Mar 20, 2014 (gmt 0)

it completely blocks Semalt.

It would block them if they were a robot. But they appear to be running on infected human machines. That means they request all supporting files, not just the html. My error document happens to call on two stylesheets (yes, this is excessive, but I hate having to say the same thing twice-- and I especially hate having to change it twice when I redesign) and also analytics. All of this is intended to detect humans who got locked out by mistake. Botnets are, I guess, collateral damage.

At least the 403 page doesn't have pictures ;)

Edit: I went the SetEnvIf route because this way I can put it in my shared htaccess, protecting all five sites. RewriteRules are site-specific-- so, again, I'd have to give the same rule at least two or three times, depending on how many sites I want to cover.

Howzitza




msg:4655722
 11:59 am on Mar 20, 2014 (gmt 0)

This is not a human visit... It's a spider bot of some sort.

lucy24




msg:4655833
 6:40 pm on Mar 20, 2014 (gmt 0)

:: sigh ::

What we have here is a failure to communicate.

dstiles




msg:4655849
 9:17 pm on Mar 20, 2014 (gmt 0)

Lucy - not necessarily a botnet. Think panscient and other distributed bots: they all run from (mostly) uninfected machines. Well, uninfected apart from the idiot bots themselves.

lucy24




msg:4655913
 3:41 am on Mar 21, 2014 (gmt 0)

uninfected apart from the idiot bots themselves

This may be an academic distinction :)

:: idly wondering how many people would unwittingly sign up for a Distributed Robots venture if it were presented in just the right way ::

dstiles




msg:4656068
 9:26 pm on Mar 21, 2014 (gmt 0)

Majestic MJ12? It used to be popular; maybe still is.

wilderness




msg:4656135
 2:10 am on Mar 22, 2014 (gmt 0)

idly wondering how many people would unwittingly sign up for a Distributed Robots venture


FunWebProducts ;)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved