Forum Moderators: phranque


Blocking visitors

         

qimqim

10:59 am on Dec 18, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm being bombarded with visits from an outfit in Russia that shows in Google Analytics as econom.co / referral. So, I added them to the .htaccess file, but they keep coming.

Maybe I did something wrong in the file:

# block visitors referred from indicated domains

RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} buttons\-for\-website\.com [NC,OR]
RewriteCond %{HTTP_REFERER} make\-money\-online\.7makemoneyonline\.com [NC,OR]
RewriteCond %{HTTP_REFERER} darodar\.com [NC,OR]
RewriteCond %{HTTP_REFERER} econom\.co [NC,OR]

RewriteCond %{HTTP_USER_AGENT} libwww-perl
RewriteRule .* - [F]



Could you have a look, please?

I've seen them listed as Econom.co. Could the capital "E" make a difference?

Thank you

wilderness

5:32 pm on Dec 18, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So, I added them to the .htaccess file, but they keep coming


Just denying access doesn't prevent future requests, nor does denying access remove denied visitors from your access logs.

When you say "they keep coming", are you referring to 403 statuses showing in your logs, or to new 200 requests that are gaining access?

qimqim

11:34 pm on Dec 18, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm getting about 20 visits a day. Most are 100% bounce rate, but once a day they actually go into another page and stay for considerable time.

I would like to make sure that they do not access my site. I have not seen the logs; I just read the GA stats for the day.

phranque

12:20 am on Dec 19, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you need to see what's actually happening in your server access logs before you act on what you see reported in GA.

lucy24

1:14 am on Dec 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The fact that they're showing up in analytics at all implies that it's not a pure robot, it's an infected human browser that "thinks" it's sending in a normal request including all supporting files.

20 visits a day and you're successfully blocking them? We should all be so lucky.

once a day they actually go into another page and stay for considerable time

Now, that is really, really interesting. And unnerving. How do you (or GA) know how long they stay on the second page? Is there a third page, or do they take additional actions on that second page?

Could the capital "E" make a difference?

That's why you have the [NC] flag on each line. But in general it's more efficient to learn the correct casing and omit the [NC]. For example
[Ee]conom\.co
if they themselves can't decide what they're spamming for.

Is it really .co as in Colombia? If you don't get any real referers from there, you could easily block the whole TLD. Express it as
\.co($|/)
to weed out legitimate .co.uk and similar.

:: detour for quick check of own logs ::

Buncha robots claiming to be from t.co and nonews.co; other than that, nothing but google.com.co which you might want to exempt if you've got Spanish-language content.
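Combining the two tips above into one sketch (untested; the domain and patterns are just this thread's examples -- adjust to the referers you actually see):

```apache
# Exact casing instead of [NC], and the TLD anchored with ($|/)
# so legitimate .co.uk referers still pass:
RewriteCond %{HTTP_REFERER} [Ee]conom\.co($|/)
RewriteRule .* - [F]
```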

qimqim

4:37 pm on Dec 19, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi All

Many thanks

Phranque, I tried to read the access logs but they are in a funny format (gz) that I have been unable to read despite using PeaZip.

Lucy: Strangely enough, even though yesterday econom.co kept appearing in GA long after I blocked them in the .htaccess, today they've gone, and instead I'm being bombarded with iloveitaly.com. These appear as 100% bounce, all accessing my index.html. Again, I blocked them in the .htaccess but they're still appearing in GA once an hour.

I found a couple of sites referring to them, but I don't want to change my .htaccess without your approval, after all the trouble you took to help me build it.
http://www.econom.co.ipaddress.com/

[cradlecloud.com...]

What did you mean by
Now, that is really, really interesting. And unnerving.
You got me worried. What could that mean?

lucy24

7:33 pm on Dec 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What could that mean?

My question too. What are they doing on that second page? Is it always the same page, or random? Is the second page visibly linked from the front page?

General question for anyone who knows more about this stuff than I do: If a virus-borne robot happens to do its stuff while the computer's human happens to be sitting in front of the terminal, will the human see the victim site opening up? And is it then possible for the human to click on a link, exactly as if they'd gone to the original page on purpose?

but they are in a funny format (gz)

Someone please help him out here :) My computer opens most standard zip formats transparently; I wouldn't even know how to do it on purpose. You don't read the .gz file directly. It unzips and you read the resulting file. (I've instructed my computer to open anything with .log extension with SubEthaEdit. Don't remember if the alternative is default-to-TextEdit or "Help, I'm just a dumb computer, tell me what to do".)

Again, I blocked them in the .htaccess but they're still appearing in the GA once an hour.

Google Analytics lives on Google's server, so nothing you put in your own htaccess can prevent robots from requesting the analytics file once they know of its existence. What you're seeing is a browser with a cached copy of your page, putting in fresh requests for the js/php component periodically until the cached page runs out. (I don't personally know how GA works in this regard, but someone else will know. I've set a 15-minute expiration on piwik.php so I can tell when someone returns to a page even if they don't explicitly reload.)

wilderness

9:25 pm on Dec 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but they are in a funny format (gz)


Someone please help him out here


Google is your friend: search for 'zip file software'.

phranque

11:46 pm on Dec 19, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I tried to read the access logs but they are in a funny format (gz) that I have been unable to read despite using PeaZip.


on *nix you would use gzip to uncompress that file.
on windows you would need a utility such as 7-Zip.
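On a *nix shell (or under Cygwin on Windows) the round trip looks like this; the file name access.log.gz is hypothetical, so substitute whatever your host calls the rotated log:

```shell
# Read the compressed log without unpacking it:
zcat access.log.gz | head

# Or unpack it into a plain-text access.log you can open in any editor:
gunzip access.log.gz
```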

tangor

1:29 am on Dec 20, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The active .log file is plain TEXT; .gz is a compressed format. In most cases any change you make will first appear in the active .log file.

Have you considered blocking the whole country? You will still get a 403 entry in the log, but it will reduce worry about further or unusual access attempts.

qimqim

5:39 pm on Dec 20, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi

Sorry for the delay

Yesterday I put another block in for iloveitaly.com, but it kept coming for a few hours more. Then it stopped. I wonder if these blocks take some time to take effect. If they do, they seem to be doing their job, and I have not had anything else from these Russian sites for over 12 hours (they used to come once every hour). If they do come back, could I block the IP address for their host? How?

According to [econom.co.ipaddress.com...]

Hostname: www.econom.co
IP Address: 78.110.60.230
Host of this IP: 230-60-110-78.net.hts.ru

not2easy

6:39 pm on Dec 20, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Blocking the IP of a referrer usually won't help because, for the most part, they are not really coming from the domain in the "referer" field. Anyone can manually add a referer to a request, and it is done to litter your logs (and Analytics) with fake links to your site.

That is why people are suggesting that you view the raw access logs and see the IPs that are planting those referrers in your logs. Those IPs you can block.
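If you have shell access, a quick sketch for pulling those IPs out of a raw log (the file name and domains here are just this thread's examples; a simple grep like this can in principle also match a URL path, which is fine for a first look):

```shell
# List the unique client IPs whose requests carried a spam referer:
grep -E 'semalt\.com|darodar\.com' access.log | awk '{print $1}' | sort -u
```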

As suggested, for Windows, try 7-Zip (a free utility) and "save as" the filename you want. I use something like domainLog122014.txt, but you can give it any convenient name after it is unzipped. As lucy24 mentioned, on Mac you just click it and it unzips.

The format is one you can open in MS Excel, OpenOffice, or as a text file.

lucy24

8:40 pm on Dec 20, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wonder if these blocks take some time to take effect.

See above. (Uh.. It was this thread where I explained it, wasn't it? Some recent thread, anyway.) What you're seeing in analytics is only the requests to Google's servers, where the analytics program lives. You need to look at your own raw access logs to see when the 403s start being served. A lockout is instant.

Any given site named in referer spam is probably hosted on a server farm. So it won't do any harm to block the site itself-- but it also probably won't do much good. (To cite the obvious example: I've never even bothered to look up semalt.whatever-it-is. What would be the point? They're not personally visiting, they're just spamming logs.)

qimqim

9:32 am on Dec 21, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi Lucy

If I understand well, what you are saying is that from the moment I put the block in the .htaccess, all attempts by the spammer to access my site are negated and they get a 403; yet it still registers in GA. Is that it?

You say they are just spammers. What's the purpose?

Thanks

lucy24

8:06 pm on Dec 21, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



yet it still registers in GA. Is that it?

It's absolutely possible, depending both on the Analytics setup and the (infected) browser that's making the request. The key is that GA-- or any analytics program-- isn't logging the actual visits to your actual page. Instead it's logging requests to the analytics program's database. For example

:: pulling a random example from own logs, with spaces added manually ::

135.23.abc.def - - [17/Dec/2014:06:27:15 -0800] "GET /piwik/piwik.php? action_name=Duct%20Tape%20Is%20Your%20Friend &idsite=3 &rec=1 &r=001129 &h=9 &m=27 &s=14 &url=http%3A%2F%2Fexample.com%2Fhovercraft%2Fduct_tape.html%23tape_hansard &urlref=http%3A%2F%2Fexample.com%2Fhovercraft%2Fhansard%2Fcold_here.html &{snip,snip}&_ref=http%3A%2F%2Fr.search.yahoo.com%2F{rest-of-request-snipped} HTTP/1.1" 200 303 "http://example.com/hovercraft/duct_tape.html" "{ UA string snipped }"

This will show up in analytics as a visit to the page example.com/hovercraft/duct_tape.html, but the analytics program has no way of knowing whether the page itself was actually requested. It only knows that it received a request for a php file with several miles of parameters. Ordinarily these parameters would be generated by a script that's called by the original page, but you can also put them in manually. Once a robot knows that the supporting files for page such-and-such include php file such-and-such, it may continue putting in the identical request each time.

What's the purpose?

There's a thread from a month or two back that explored this question.

:: search, search ::

Not sure if this [webmasterworld.com] is the one I was thinking of, but it's a start.

Somewhere out there, there's probably an individual human who is The World's Leading Authority on how the robotic mind works.

qimqim

4:11 pm on Dec 22, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi not2easy

I'm afraid I am unable to download 7-Zip for x64 Windows. I get a 500 error. Is there any other programme that would do it easily?

tangor

4:23 pm on Dec 22, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



WinZip also reads gz and tar files. Just a note: you don't have to run 64-bit... the 32-bit will work just fine.

qimqim

4:30 pm on Dec 22, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



In fact I can't download the 32-bit either. Something wrong with their website (or my computer!).

I have PeaZip. Could you tell me if I can use that? I've tried a million ways to no avail. Maybe I'm doing the wrong thing, or unzipping the wrong files...

qimqim

6:06 pm on Dec 22, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi

The mystery deepens...

I've managed to get the log for today. Even though GA shows, so far, 10 visits from Russia, I cannot see anything in the logs. How come?

If you like, and I can, I will post the log in the forum.

lucy24

7:43 pm on Dec 22, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How come?

Because you've successfully blocked them and they've stopped asking. If you'd successfully blocked them and they were still trying, you'd see their requests listed as 403 in your site logs.
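For anyone checking this in a shell, a quick sketch for spotting those 403s (combined log format assumed, where the status code is the ninth whitespace-separated field; the file name is hypothetical):

```shell
# Count blocked (403) requests per client IP, most frequent first:
awk '$9 == 403 {print $1}' access.log | sort | uniq -c | sort -rn
```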

GA may have a setting for "ignore such-and-such IP". By its nature, a shared analytics package (such as GA) can't block IPs on its own behalf. So if your unwanted Russians keep asking for a file that lives on Google's servers, they will keep showing up in analytics even if they're never allowed within spitting distance of your own site.

qimqim

8:46 am on Dec 23, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi Sherlock (Lucy)

Thank you for the excellent explanation but... how do you explain that yesterday I had 2 instances of semalt and 1 of buttons in the log?

You may recall that they have been banned for some time in the htaccess.

lucy24

6:33 pm on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Banning a visitor won't stop them from showing up in logs; it only stops them from receiving the requested page. Do your buttons and semalt spammers show up as 200 or 403? If it's 200 then there's a problem in the Deny structure and we need to take another look. If it's 403 then all is copacetic* and you need take no further action.


* I've only recently learned that this is a real word with a real definition. Always assumed it was a Hollywood invention. Huh.

qimqim

7:19 pm on Dec 23, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi Lucy

Can't find the others right now, as the logs I can see only show the last few hours. Here's one from semalt in yesterday's downloaded log:

84.85.253.15 - - [22/Dec/2014:09:14:53 -0700] "GET / HTTP/1.1" 500 561 "http://semalt.semalt.com/crawler.php?u=http://pintotours.net" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"

lucy24

7:29 pm on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The log entry shows a 500 response, meaning something on the server side. Can you get your error logs? As it is, we know that the visitor's request was not granted. That may be all you need to know. But a look at error logs should shed light on the 500.

qimqim

8:16 pm on Dec 23, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi Lucy

The error log is for everybody on the shared hosting IP, and I only get the results for today; so I don't know how to look at the ones for yesterday. I have to start learning how to download the stats.

lucy24

9:14 pm on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check your host's control panel. There may be an option to keep logs available longer.

Mine defaults to 3 days, but I've changed it to 15 days (which works out to 17 days, because reasons). That's enough for even the subsidiary sites whose logs I only process every two weeks.

qimqim

9:48 pm on Dec 23, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'll look at that.

meanwhile: Merry Xmas

qimqim

4:08 pm on Dec 26, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi

Still battling with these bots...

I saw this on the Net. Is it better than the blocks I have now in the .htaccess?

SetEnvIfNoCase Referer darodar.com spambot=yes
Order allow,deny
Allow from all
Deny from env=spambot

lucy24

7:28 pm on Dec 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The simplest and cleanest lockout is IP-based. These can be done in various ways, but by far the easiest is the mod_authzzzz approach:
Deny from 12.34.45.56


If you need to lock out visitors based on some other element, such as referer or user-agent, the two most common ways are mod_setenvif (SetEnvIf etcetera) or mod_rewrite (RewriteRule). Both will work, but I tend to prefer the SetEnvIf route. It takes up less space in your htaccess / config file, and it's inherited by default.

Look at these two lines:
SetEnvIfNoCase Referer darodar.com spambot=yes
SetEnvIf Referer darodar\.com spambot


#1 Only use NoCase if the text you're matching against can really occur in all possible casings. When possible, an exact-text match makes less work for the server. In some cases (haha), casing itself is significant: for example, I once met an unwanted robot calling itself "GoogleBot" where the real thing is "Googlebot".

#2 mod_setenvif uses Regular Expressions, so any literal periods . should be escaped as \. In the specific case of "darodar.com" the oversight is not likely to cause trouble, but you should do it as a matter of habit.

#3 When you create an environmental variable, you can optionally assign it a value, like "spambot=yes". Otherwise it defaults to 1, so "spambot" is the same as "spambot=1". If you will only be using this environmental variable for blocking, its value doesn't matter. mod_authzthingy (the "Deny from..." line) only checks whether the variable exists; it can't do anything about its exact value.

#4 You can give your environmental variables any name you like: "bad_bot", "spambot", "keep_out", "nasty-ugly-robot". If you're copying and pasting other people's code, make sure you change all variable names to whatever name you've personally chosen to use. Otherwise you'd have to add multiple mod_authwhatsit lines, like
Deny from env=bad_bot
Deny from env=yuk
Deny from env=get-out-of-my-sight-you-horrible-Ukrainian

Note again that "env=suchandsuch" only means "the environmental variable 'suchandsuch' has been defined". It doesn't matter what its value is. Un-defining an environmental variable is different from setting its value to zero.
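Points #1 through #4 combined into one minimal sketch (untested; the variable name and domains are just this thread's examples):

```apache
# mod_setenvif: these are Regular Expressions, so literal periods are escaped
SetEnvIf Referer darodar\.com spambot
SetEnvIf Referer [Ss]emalt\.com spambot

# mod_authz_host (Apache 2.2 syntax): deny anything tagged above;
# only the variable's existence matters, not its value
Order Allow,Deny
Allow from all
Deny from env=spambot
```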

qimqim

3:18 pm on Dec 27, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi Lucy

many thanks for the explanation, but I have to admit that it is too advanced for me and I did not get much of it...

I finally managed to get to the logs and found a couple of interesting things:

semalt, which I successfully got rid of through the .htaccess file, visits on a daily basis but does not show in GA. Here is one example:

93.63.155.227 - - [06/Dec/2014:03:35:47 -0700] "GET / HTTP/1.1" 500 561 "http://semalt.semalt.com/crawler.php?u=http://example.net" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"


The current spam from ilovevitaly, etc. (20 a day in GA) DOES NOT show in the logs, but I get many visits from 217.69.133.xxx. I expect this is the origin.
How can I block this in the .htaccess to see if it stops these people?
I find it interesting that they are looking at my robots.txt file!
217.69.133.251 - - [07/Dec/2014:00:12:05 -0700] "GET /robots.txt HTTP/1.1" 200 365 "-" "Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)"
217.69.133.248 - - [07/Dec/2014:00:12:07 -0700] "GET /Pinto/insurance.html HTTP/1.1" 200 3311 "-" "Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)"
217.69.133.21 - - [07/Dec/2014:00:12:10 -0700] "GET /Asia/Indonesia/RamadaBintangMap.html HTTP/1.1" 200 1614 "-" "Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)"