Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 31 message thread spans 2 pages: [1] 2 >
HTTrack / Virgin Media
cyberdyne · msg:4420614 · 7:00 pm on Feb 22, 2012 (gmt 0)

My site received a visit from a Virgin Media customer during which five hits appeared from the user agent HTTrack, requesting my robots.txt, the site root and one thumbnail image. It initially looked as though the user was attempting to download my site, or perhaps Virgin Media themselves were trying to cache it, but it seems unusual for such a large ISP to use such a tool.

As the user had registered on my site's gallery, I emailed them to ask whether they knew anything about the activity, and they seemed genuinely clueless as to why HTTrack was used, or even what it was.

Has anyone seen this sort of activity before, specifically from Virgin Media, and is it common or uncommon for an ISP to use a tool such as this?

Thanks in advance

77.100.245.xx - - [22/Feb/2012:03:50:44 +0000] "GET /robots.txt HTTP/1.1" 200 6265 "-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
cpc18-nmal16-2-0-custxx.19-2.cable.virginmedia.com - - [22/Feb/2012:03:50:44 +0000] "GET / HTTP/1.1" 403 1343 "-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
77.100.245.xx - - [22/Feb/2012:03:51:48 +0000] "GET /robots.txt HTTP/1.1" 200 6265 "-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
cpc18-nmal16-2-0-custxx.19-2.cable.virginmedia.com - - [22/Feb/2012:03:51:49 +0000] "GET / HTTP/1.1" 403 1343 "-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
cpc18-nmal16-2-0-custxx.19-2.cable.virginmedia.com - - [22/Feb/2012:03:52:26 +0000] "GET /gallery/thumb_001.jpg HTTP/1.1" 403 1412 "-" "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"

 

keyplyr · msg:4420683 · 11:53 pm on Feb 22, 2012 (gmt 0)

It wasn't the ISP itself; it was a user running HTTrack to download your site. HTTrack is a program the user installs on their own machine, a site-downloading tool that's been around a very long time. It requests robots.txt but then ignores it, so most of us block it by UA:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule !^robots\.txt$ - [F]
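The `[NC]` flag above makes the condition case-insensitive, and the pattern is an unanchored substring match against the whole UA string. As a sanity check (a sketch, not Apache's actual engine), Python's `re.IGNORECASE` reproduces the same test against the UA from the logs:

```python
import re

# The user agent string from the access log entries above.
ua = "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"

# RewriteCond %{HTTP_USER_AGENT} httrack [NC] is an unanchored,
# case-insensitive substring match; re.IGNORECASE mirrors the [NC] flag.
blocked = re.search(r"httrack", ua, re.IGNORECASE) is not None
print(blocked)  # True: this visitor gets the 403
```

An ordinary browser UA without the substring would pass the condition and be served normally.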

cyberdyne · msg:4420697 · 12:37 am on Feb 23, 2012 (gmt 0)

Thanks keyplyr.

I'm aware of what HTTrack is; I already had it blocked in .htaccess and received an email from my 403 page letting me know someone had tried to use it.

The reason I was asking about the ISP is that I was personally assured in an email by the user of the IP address that they neither knew what HTTrack was nor had attempted to use it. Knowing vaguely who this person is, I partly believed them, with reservations, and wanted to find out for sure whether there was any chance the ISP's own systems could be responsible.

However, I take it from your reply that you're quite positive this was definitely not action by the ISP.

keyplyr · msg:4420704 · 1:52 am on Feb 23, 2012 (gmt 0)

I was personally assured in an email by the user of the IP address that they neither knew what HTTrack was nor had attempted to use it.

The user lied to you - LOL

I suppose there is an outside chance that Virgin Media would use a basic download tool like HTTrack (although I can't think of a reason), but it makes no sense that they'd use an IP range delegated to private cable customers.

cyberdyne · msg:4420705 · 2:01 am on Feb 23, 2012 (gmt 0)

I thought as much. Disappointing but pleased to have my suspicions confirmed.
Thanks

tangor · msg:4420706 · 2:01 am on Feb 23, 2012 (gmt 0)

Seconding the above... the lie part, that is! The world would be a nicer place if folks would just go to their BROWSER cache and rename those strange-named files... they have ALREADY downloaded it... so why do it twice? (Joking... they won't do that; heck, even I don't do that except for some very unusual file formats...)

Frank_Rizzo · msg:4420798 · 9:10 am on Feb 23, 2012 (gmt 0)

May not be a lie. It could be a proxy, or a DHCP-assigned address.

The downloader had that IP one day; the next day it was used by honest Joe.

lucy24 · msg:4420811 · 9:54 am on Feb 23, 2012 (gmt 0)

Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

Uhm... This is a human being we're talking about?

keyplyr · msg:4420822 · 11:07 am on Feb 23, 2012 (gmt 0)

Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)

Uhm... This is a human being we're talking about?

Guess you missed my point. The UA is a tool. The person running the tool from his/her computer is a human (or a very skilled cat).

wilderness · msg:4420851 · 12:36 pm on Feb 23, 2012 (gmt 0)

Uhm... This is a human being we're talking about?


keyplyr,
Believe this was meant as tongue-in-cheek sarcasm ;)

cyberdyne · msg:4420854 · 12:40 pm on Feb 23, 2012 (gmt 0)

I'm guessing from all the comments that the jury is still out on this one. I think I'll have to give the user the benefit of the doubt for now, unless it can be definitively proved one way or the other. Either way, I do appreciate everyone's opinions.

May not be a lie. It could be a proxy, or a DHCP-assigned address.
The downloader had that IP one day; the next day it was used by honest Joe.


Regarding this post: the use of HTTrack and the 'legitimate' visit were one and the same, i.e. the log data I've quoted above is an extract from the middle of the 'legitimate' visit.

visit from a Virgin Media customer during which five hits from the U-A: HTTrack

wilderness · msg:4420868 · 1:06 pm on Feb 23, 2012 (gmt 0)

cyberdyne,
My widgets people are a very diverse group, not unlike the rest of the www.
As webmasters, we learn that some things are simply not possible without the user's knowledge.

I've had widget people (with whom I email almost daily), whose email IPs I've matched to harvesting in my website visitor logs, actually inquire as to why they were denied access to my sites.
When it was explained how they had violated my long-standing TOS (which they hadn't even bothered to read) by harvesting, they flatly denied it.
Go figure!
I guess, as a lot, webmasters are simply stupid ;)

Perhaps your user inherited a machine that had HTTrack pre-installed by the previous owner?
That would be the only real act of innocence that I could imagine.

FWIW, somewhere here there's a recent thread in which one of the forum participants described HTTrack as the first UA he denied, back in the 1990s.

If you feel your user is legitimate, I suggest a multiple-condition mod_rewrite block that denies the visitor when HTTrack is present in the UA but allows access when another UA is used. It may take some patience and testing to accomplish, but it will hone your skills ;)

cyberdyne · msg:4420869 · 1:18 pm on Feb 23, 2012 (gmt 0)

Good advice, thank you wilderness.

FWIW, I was pretty much already convinced before even starting the thread that it was the user who was responsible, but I guess I was hoping to be proved wrong, as the implications of this particular individual using software like this could be very important to the organisation concerned. 'Hostile take-over' is a phrase that springs to mind (for reasons other than, and including, this incident).

I had HTTrack (and now as many of its cousins as I could find) blocked anyway.

Just for the hell of it:
For the IP addresses 77.96.* to 77.103.*:
#Block Virgin Media IP if U-A is HTTrack
RewriteCond %{REMOTE_ADDR} ^77\.(9[6-9]|1(0[0-3]))\.
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule .* - [F]

wilderness · msg:4420889 · 2:09 pm on Feb 23, 2012 (gmt 0)

RewriteCond %{REMOTE_ADDR} ^77\.(9[6-9]|1(0[0-3]))\.

FWIW, you may change this to

RewriteCond %{REMOTE_ADDR} ^77\.(9[6-9]|1(0[0-3])\.

eliminating the extra parentheses; however, it may have generated a 500 error anyway.

cyberdyne · msg:4420890 · 2:21 pm on Feb 23, 2012 (gmt 0)

It certainly has returned a 500, as opposed to showing my 403.

Why is that? I'm looking at the flags but am unsure if that is the cause.

Actually, looking properly, it is a 403 that was returned, but not my custom 403 page. There was also this error message:
"Additionally, a 500 Internal Server Error error was encountered while trying to use an ErrorDocument to handle the request."

wilderness · msg:4420896 · 2:33 pm on Feb 23, 2012 (gmt 0)

Just remove the extra parentheses at the end and it'll be OK.

cyberdyne · msg:4420898 · 2:36 pm on Feb 23, 2012 (gmt 0)

RewriteCond %{REMOTE_ADDR} ^77\.(9[6-9]|1(0[0-3]))\.
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule . 403.php [L]


This worked better

just remove the extra parentheses at the end and it'll be ok.


I hadn't seen your reply when I posted that. Will try that also, thanks.

cyberdyne · msg:4420916 · 3:15 pm on Feb 23, 2012 (gmt 0)

I'm afraid removing the extra parentheses returned a 500.

Changing the pattern, target and flag returned the required result (my custom 403), but even once those were changed and the rule worked, removing the extra parentheses still returned a 500.

The best solution I've found which works is:
RewriteCond %{REMOTE_ADDR} ^77\.(9[6-9]|1(0[0-3]))\.
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule . 403.php [L]

wilderness · msg:4420927 · 3:35 pm on Feb 23, 2012 (gmt 0)

My apologies.
I missed the trailing parenthesis behind the "1".
Change to:

RewriteCond %{REMOTE_ADDR} ^77\.(9[6-9]|10[0-3])\.
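The corrected pattern can be sanity-checked outside Apache. This is just a sketch in Python's `re` (not Apache's regex engine, though the syntax here is compatible); it confirms the pattern matches exactly the 77.96.* through 77.103.* range described earlier and nothing beyond it:

```python
import re

# wilderness's corrected condition pattern, verbatim.
pattern = re.compile(r"^77\.(9[6-9]|10[0-3])\.")

# Probe every possible second octet; only 96-103 should match.
matched = [o for o in range(256) if pattern.match(f"77.{o}.245.10")]
print(matched)  # [96, 97, 98, 99, 100, 101, 102, 103]
```

Note the trailing `\.` matters: without it, `77.1000...` style octets (impossible in real addresses, but possible in a hostname-like string) could slip through the alternation.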

cyberdyne · msg:4420931 · 3:48 pm on Feb 23, 2012 (gmt 0)

Ah, of course. I should have spotted your intention myself. It makes complete sense now.
That works fine, thank you as always, wilderness.

lucy24 · msg:4421139 · 11:53 pm on Feb 23, 2012 (gmt 0)

... ^77\.(9[6-9]|1(0[0-3]))\.
...
Change to:

RewriteCond %{REMOTE_ADDR} ^77\.(9[6-9]|10[0-3])\.

Is this an apache-specific issue? Would it be happy again if you said

^77\.(9([6-9])|1(0[0-3]))\.

--or anything with matching parentheses on each side of the pipe? Or would that just make it even madder? I'm trying to figure out if it's the potential for null captures-- as opposed to empty captures, which are fine-- that causes the error.
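For what it's worth, the failure is most likely plain syntax rather than null captures: the variant with the dropped parenthesis is simply an invalid expression, while any fully balanced variant (nested or redundant groups included) compiles and accepts the same addresses. Python's `re` is not Apache's engine (Apache 2.x uses PCRE), so this is only a suggestive sketch, but the two agree on this syntax point:

```python
import re

addr = "77.100.245.10"

# All three balanced variants from this thread accept the same address...
for pat in (r"^77\.(9[6-9]|1(0[0-3]))\.",    # cyberdyne's original
            r"^77\.(9([6-9])|1(0[0-3]))\.",  # the fully parenthesised form
            r"^77\.(9[6-9]|10[0-3])\."):     # wilderness's simplification
    assert re.match(pat, addr)

# ...while the pattern with the missing ')' is not a regex at all.
try:
    re.compile(r"^77\.(9[6-9]|1(0[0-3])\.")
except re.error as e:
    print("invalid pattern:", e)
```

So the nested groups were never the problem, only redundant; the 500 appeared once the closing parenthesis went missing.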

RewriteRule . 403.php [L]

Next "why" question: Why are you doing it this way instead of

RewriteRule . - [F]

? That is, why do you want the request to come through as a 200 instead of a bona fide 403?

Samizdata · msg:4421161 · 12:35 am on Feb 24, 2012 (gmt 0)

why do you want the request to come through as a 200 instead of a bona fide 403?

I would guess that the PHP file contains this:

header("HTTP/1.1 403 Forbidden");

which uses fewer bytes than a custom 403 page designed for humans.

...

cyberdyne · msg:4421346 · 2:50 pm on Feb 24, 2012 (gmt 0)

Would it be happy again if you said
^77\.(9([6-9])|1(0[0-3]))\.


Yes, it worked fine when I replaced the ')'.

Next "why" question: Why are you doing it this way instead of
RewriteRule . - [F]


Because for some reason, 'RewriteRule . - [F]' was not returning my custom 403 but throwing a 500. I've not yet had time to work out why.

wilderness · msg:4421710 · 3:44 pm on Feb 25, 2012 (gmt 0)

cyber,
are you using the closing line of:

RewriteRule . 403.php [L]

Or, as lucy suggested:

RewriteRule . - [F]

There are certainly other options.
FWIW, I never used custom 403s until my recently reactivated website.
I've always believed that a plain-Jane 403 served by the visitor's own machine is most appropriate; however, every webmaster has their own preference.

In addition, many of the shared hosts will automatically insert a custom 403 with advertising for their services if the webmaster doesn't have his/her own custom 403 in place.

lucy24 · msg:4421830 · 1:26 am on Feb 26, 2012 (gmt 0)

I've always believed that a plain-Jane 403 served by the visitor's own machine is most appropriate

But the 403 isn't served by the visitor's browser. It's served by, er, the site's server. Insert boilerplate about MSIE and 512K Exception. The browser itself only kicks in when there's an error involving external redirects. The server is no use there, because it has no memory and can't say "Uh, weren't you here just .0000001 seconds ago, and .0000001 seconds before that, and...?"

In addition, many of the shared hosts will automatically insert a custom 403 with advertising for their services

Ugh. Do they really? Yuk. I think mine only does that with the placeholder page.

keyplyr · msg:4421850 · 4:03 am on Feb 26, 2012 (gmt 0)


In addition, many of the shared hosts will automatically insert a custom 403 with advertising for their services

Only if the site does not have its own 403 page. I've never seen a host overwrite the site's own 403 or 404 page.

wilderness · msg:4421857 · 5:06 am on Feb 26, 2012 (gmt 0)

Only if the site does not have its own 403 page.


That is correct, keyplyr.
Perhaps I failed to specify that.

wilderness · msg:4421860 · 6:07 am on Feb 26, 2012 (gmt 0)

I've always believed that a plain-Jane 403 served by the visitor's own machine is most appropriate


But the 403 isn't served by the visitor's browser.


A browser will in fact produce its own plain-Jane 403.
I used to have a link which explained how to make your own browser's 403 appear.

Where does the 403 come from on a site that neither offers a custom 403 nor has a host presenting an advertising 403?

It's really a non-issue to anybody with the exception of yourself ;)

However, since you have all this free time for testing, turn all your custom docs off and give it a run. Unfortunately you're using one of the el cheapo hosts, like myself, and I wager the host's advertising 403s will appear in the absence of your own.

lucy24 · msg:4421894 · 8:41 am on Feb 26, 2012 (gmt 0)

Not Found

The requested URL {made-up name} was not found on this server.

Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

And that was all she wrote. Identical in all browsers except Lynx, which just says "Alert! 404" and MSIE 5, which went to a different custom page before it got as far as the original request.

The second line is specific to my host (they have a set of standard error-document names, and you can also do the ErrorDocument thing); conversely, they leave out the default line that says exactly which Apache version they're using, like:
Apache/2.3.15-dev (Unix) mod_ssl/2.3.15-dev OpenSSL/1.0.0c Server at www.apache.org Port 80
(quoting from Apache itself, since my host won't say).

Or (after a deliberately introduced error):
Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, webmaster@example.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Additionally, a 500 Internal Server Error error was encountered while trying to use an ErrorDocument to handle the request.

That last line is hardly surprising, since my deliberately introduced error was in the very line that references the custom 404 document!

For comparison purposes, it's kinda fun to go over to apache.org and request a bogus filename. I remember pondering this before: on the one hand, you'd expect apache to make up their own error documents-- but on the other hand, if the default error document isn't good enough for them, why would they expect anyone else to use it?

Although most error messages can be overriden [sic], there are certain circumstances where the internal messages are used regardless of the setting of ErrorDocument. In particular, if a malformed request is detected, normal request processing will be immediately halted and the internal error message returned. This is necessary to guard against security problems caused by bad requests.

Couldn't test this because I couldn't arrive at a malformed request-- although the robots seem to have no trouble.

Where does the 403 come from on a site that neither offers a custom 403 nor has a host presenting an advertising 403?

It's part of the Apache installation. Was that a rhetorical question?

Perhaps I failed to specify that.

He said, deadpan.


Hm. Can't help but notice that-- if the error message is truthful-- Apache themselves aren't using 2.4 yet ;)

cyberdyne · msg:4422179 · 8:56 am on Feb 27, 2012 (gmt 0)

Wilderness, to answer your question, all my sites have custom error pages. I've always had that fairly close to the top of my to-do list at the start of any new site. I appreciate that some prefer not to, but I don't particularly like pages which don't aesthetically fit with the rest of a site, even if it is just an error page.

With regards to HTTrack, after checking logs from prior to last month it appears it (and a few of its cousins) have been used on a number of occasions on my other sites, all unsuccessfully due to .htaccess blocks. They all now have a notification system on their error pages too.

Additionally, the owner of one of the sites where attempted use of this type of software now appears to have been fairly frequent has requested that the responsible IP range be blocked permanently. Perhaps a touch of paranoia is setting in, but it seems they're trying to protect their investment. You'd think a new company, regardless of how similar its business is to another's, would want a totally unique web site with individual content and appearance, but apparently that's not always the case.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved