They come from different IP addresses. In a sample of 1045 "no referrer" log lines, 826 unique IP addresses remained after filtering out duplicates.
Only 19 of these had "Firefox" and/or "Gecko" in the user-agent, and only 1 had "Apple" and/or "Safari." All the rest consisted of various types of Windows boxes, with no obvious bias as to the type of installation. This might suggest some sort of Windows-only malware that forces the home page, or 100 percent framing. Or perhaps it's a new type of bot that randomly selects from a list of Windows user-agents.
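The tally described above can be sketched roughly as follows. This is a minimal Python sketch, not the script actually used for these numbers: it assumes the Apache "combined" log format, and the names `LOG_RE` and `summarize_no_referrer` are made up for illustration. Safari is tested before Gecko because Safari user-agents also contain "like Gecko".

```python
import re

# Assumed Apache "combined" format:
# IP ident user [time] "request" status size "referer" "user-agent"
QUOTED = r'"((?:[^"\\]|\\.)*)"'  # quoted field that may contain escaped quotes
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] ' + QUOTED + r' (\d{3}) \S+ ' + QUOTED + ' ' + QUOTED
)

def summarize_no_referrer(lines):
    """Return (unique IP count, user-agent family counts) for no-referrer hits."""
    first_ua_by_ip = {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not parse
        ip, _request, _status, referer, ua = m.groups()
        if referer not in ("", "-"):
            continue  # only blank referrers are of interest here
        first_ua_by_ip.setdefault(ip, ua)  # keep one entry per IP
    counts = {"apple": 0, "gecko": 0, "other": 0}
    for ua in first_ua_by_ip.values():
        if "Apple" in ua or "Safari" in ua:
            counts["apple"] += 1
        elif "Firefox" in ua or "Gecko" in ua:
            counts["gecko"] += 1
        else:
            counts["other"] += 1
    return len(first_ua_by_ip), counts
```

Feeding it the lines of an access log returns the number of unique IPs and a rough breakdown by browser family.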
Of all of the IP addresses, there were zero Tor exit nodes among them.
A country lookup of each IP address, and a count for selected countries, shows that there is a bias. Both of my sites are English-only.
111 from PERU
68 from BRAZIL
51 from RUSSIA
43 from UKRAINE
36 from MEXICO
36 from INDIA
26 from INDONESIA
22 from UNITED STATES
18 from ECUADOR
18 from AUSTRALIA
16 from POLAND
7 from CHINA
2 from UNITED KINGDOM
2 from JAPAN
1 from CANADA
All of the AVG LinkScanner hits, as defined by the .htaccess algo that was used last June and July, are already getting filtered out and don't show up in the above stats.
Does anyone have any idea what might be happening? I'd like to know whether there is a real person with real eyeballs seeing my page, or whether it's some sort of automatic fetch. Is there any way to grab the headers used by the IP address that is fetching the page? I'm using Linux and Apache. I know that .htaccess can examine specific headers, but I'd like to dump them all into a file as the request is made, so that I can see if there are any patterns.
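One way to dump every request header to a file is to have a small script run as part of the page. The sketch below is in Python and assumes the page is served through a CGI- or WSGI-style handler that receives an `environ` dictionary. The function names and the log path are hypothetical, and `REQUEST_URI` is an Apache-supplied variable that may be absent elsewhere. Note that header order and original capitalization are lost by the time they reach `environ`.

```python
import json
import time

def extract_headers(environ):
    """Rebuild header names from CGI/WSGI-style HTTP_* keys."""
    headers = {}
    for key, value in environ.items():
        if key.startswith("HTTP_"):
            # HTTP_USER_AGENT -> User-Agent
            name = key[5:].replace("_", "-").title()
            headers[name] = value
    return headers

def dump_request_headers(environ, path):
    """Append one JSON line per request, containing all received headers."""
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "ip": environ.get("REMOTE_ADDR", ""),
        "uri": environ.get("REQUEST_URI") or environ.get("PATH_INFO", ""),
        "headers": extract_headers(environ),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Each line of the resulting file can then be grouped by IP or user-agent to look for patterns.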
I know that sounds simplistic, but it immediately cuts off scrapers that aren't smart enough to grab the web page too.
Legit bots like Google are now just hitting my web pages for speed reasons (nothing to do with the above).
Sometimes they try to come back and get media files separately (with no referer).
The only other things that try to do this are:
webmail sometimes seems to refer directly to the media.
a/v services sometimes try to hit what their client just hit without referer to test it for malware
google translate service - appears to cache then try to hit jpegs.
I suppose if I somehow added a 'noindex' tag to the jpeg, rather than to the web page, then they wouldn't list it in the index?
I find it real handy because scrapers show up real quick as 306's or 404's in the error log, then I can 403 them.
I've recently had some bizarre hits where there is no referer and no UA either.
There are probably several other common cases that I've forgotten for the moment...
HTTP/1.1 specifically prohibits sending a Referer header unless a single originating HTTP-accessible URL can be provided. So the lack of a referer is to be expected in all but cases 4 and 5 above. And in those cases, the referer is being suppressed out of a concern (right or wrong) for security. Although we might not like that, it is best to remember why we call the requestor the client, and call the machine that runs our Web site the server: On the Web, the client is in charge. They are the restaurant customers, and we are the waiters, and the trick is to give the bad customers the boot without offending --or even disturbing-- the good customers.
Clearly, blocking by blank referrer is not a good idea. However, if both the referrer and the user-agent strings are blank, it's a good bet there's a problem (there are a few exceptions, but not many, and most involve favicon.ico requests).
However, if either the referrer or the user-agent is a literal hyphen ("-"), then you can be sure the request is unfriendly. Some baddies send hyphens in these headers specifically to bypass the "blank referrer and UA" tests, yet still appear as otherwise-normal accesses in the server logs. This works because standard server logs show a quoted hyphen in the log entry if the HTTP Referer or User-agent header is blank. So this is a cute trick, but one that's easy to detect.
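The distinction between a blank header and a literal hyphen can only be made on the raw header values, not from the access log. A minimal Python sketch, assuming the two header values are passed in as strings (or `None` when the header was not sent); the function name and the category labels are made up for illustration:

```python
def classify_request(referer, user_agent):
    """Classify a request by its raw Referer and User-Agent header values.

    None means the header was not sent at all.
    """
    def is_blank(value):
        return value is None or value.strip() == ""

    def is_hyphen(value):
        return value is not None and value.strip() == "-"

    # A literal "-" was actually sent by the client: the log-mimicking trick.
    if is_hyphen(referer) or is_hyphen(user_agent):
        return "literal-hyphen"
    if is_blank(referer) and is_blank(user_agent):
        return "both-blank"
    if is_blank(user_agent):
        return "blank-user-agent"
    if is_blank(referer):
        return "blank-referrer"  # common and usually legitimate
    return "ok"
```

What to do with each category (block, challenge, or just log) is a policy decision left to the site owner.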
Scarecrow, you might want to take a look at some of the other attributes of these requests. Look at all of the HTTP "Accept" headers and make sure they match the expected values for each claimed user-agent. Examining the "X-Forwarded-For" and "Via" proxy headers may be worthwhile as well. This can be done by adding a bit of PHP or SSI code to the pages hit by these user-agents, since these headers are not normally logged in standard server access logs.
I always block the referrer and user-agent when I browse pages over the internet, simply because I do not like my whereabouts to be known. I do not see a problem with any of the sites where I regularly shop or post comments, though. There are cases where I get blank or forbidden pages, but they are a minority. If I cannot shop from site A, I go to site B.
I agree with ways to detect manipulation of the referrer or UA fields, but all of these (client input) can be faked to whatever the site accepts. In my experience, the best way to identify whether someone is running a script or a browser is a careful check of the server logs, followed by processes to identify access patterns and deploy countermeasures. Someone who claims to use a browser will access a page along with all associated resources (images, etc). If you see an image being accessed but not a page, then there is something wrong or a proxy is in place. The same applies when the page is retrieved but not the stylesheet, etc.
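The access-pattern check described above could look something like the following Python sketch. It assumes the log has already been reduced to chronological `(ip, path)` pairs; the extension list and function names are illustrative, and browser caches and proxies will produce false positives, so the output is a list of candidates to inspect rather than a verdict.

```python
from posixpath import splitext

# Assumed set of "associated resource" extensions; adjust for the site.
ASSET_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".css", ".js"}

def is_asset(path):
    """True if the path (query string ignored) looks like a page resource."""
    return splitext(path.split("?", 1)[0])[1].lower() in ASSET_EXTENSIONS

def find_access_anomalies(requests):
    """requests: iterable of (ip, path) in chronological order.

    Returns (orphan_assets, pages_only):
      orphan_assets - IPs that fetched a resource before fetching any page
      pages_only    - IPs that fetched pages but never any resource
    """
    pages_seen, assets_seen, orphan_assets = set(), set(), set()
    for ip, path in requests:
        if path.endswith("/favicon.ico"):
            continue  # browsers request this on their own
        if is_asset(path):
            assets_seen.add(ip)
            if ip not in pages_seen:
                orphan_assets.add(ip)
        else:
            pages_seen.add(ip)
    return orphan_assets, pages_seen - assets_seen
```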
Of course static resources are difficult to restrict, so depending on what language the site is written in, the resources could be served via the programming language. For instance, stylesheet.css or image.jpg could be replaced by stylesheet.php and image_process.php?image_id=1, and the PHP script will do the job and emit the resources with the right headers. There is some programming overhead, but with some cache utilization via 304 responses the bandwidth consumption will be the same. The drawback of this approach is that it requires programming skills and lots of testing.
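The 304 handling mentioned above is what keeps script-served resources cacheable. The post talks about PHP; the sketch below shows the same logic in Python, with a hypothetical function name, and assumes the script knows the resource's last-modified time as epoch seconds and receives the client's If-Modified-Since value as a string.

```python
from email.utils import formatdate, parsedate_to_datetime

def conditional_response(if_modified_since, mtime):
    """Return (status, headers) for a resource last modified at mtime.

    if_modified_since is the raw request header value, or None if absent.
    """
    headers = {"Last-Modified": formatdate(mtime, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through to a full response
        if since is not None and int(mtime) <= since:
            return 304, headers  # client copy is current; send no body
    return 200, headers
```

On a 304 the script sends only headers, so a browser that already has the image or stylesheet costs almost no bandwidth.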
The HTTP Referer header may be legitimately missing for the following reasons:
1) The visitor typed in the address
3) The user is behind a corporate or ISP caching proxy (and has no choice about it)
4) The user is running "internet security" software which blocks referrers (often by default)
5) The user is behind a firewall (home or corporate) which blocks the referrer header
6) The client is a search engine robot
7) The visitor used a bookmark
If you want to block or spoof your user-agent, that's your choice, although your user-agent is fairly useless for tracking your whereabouts, and the referrer only slightly more useful. The one that locates you is your IP address, which you cannot block except through the use of anonymous proxies.
But if you need the unique info from any of my sites, you'll need to provide a valid user-agent along with completely-correct request headers for that user-agent, which is my choice. Each to his own... :)
What I also challenge is alternating referrers or static referrers.
Some attempts at faking the referrer just use the domain name regardless of what page they request and that page may not be linked off the home page so too many static referrers triggers a challenge.
Other idiot scripts and/or referrer spammers change referrers to something different every page and it's never your domain, another trigger for a challenge.
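The two referrer patterns described above could be detected roughly as follows. This is a Python sketch under stated assumptions, not the poster's actual challenge logic: it takes one visitor's hits as `(requested_path, referer)` pairs with blank referrers already excluded, `own_host` is the site's hostname exactly as it appears in referrers, and the `min_hits` threshold is arbitrary.

```python
from urllib.parse import urlsplit

def referrer_pattern(hits, own_host, min_hits=3):
    """Label one visitor's referrer behavior.

    hits: list of (requested_path, referer) pairs, blank referrers excluded.
    """
    if len(hits) < min_hits:
        return "too-few"  # not enough data to judge
    referers = [ref for _, ref in hits]
    paths = {path for path, _ in hits}
    hosts = {urlsplit(ref).hostname or "" for ref in referers}
    # Same referrer for every request, even though different pages were fetched.
    if len(set(referers)) == 1 and len(paths) > 1:
        return "static"
    # A different referrer on every request, and never this site.
    if len(set(referers)) == len(referers) and own_host not in hosts:
        return "rotating-foreign"
    return "normal"
```

A "static" or "rotating-foreign" result would be a trigger for a challenge rather than an outright block.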
FWIW, I've seen SE's used to attack sites by feeding a SE pages full of carefully crafted URIs containing potential vulnerabilities. Would you also block all SEs from crawling your site just because they can be tricked into requesting URIs containing exploits?
[edited by: incrediBILL at 11:11 am (utc) on Nov. 16, 2008]
OK, let me clarify the term "block" from my end. By "block" I mean I block the specific attempt. I do not ban IPs; I find it pointless, as entire packets can be crafted to contain any IP. Theoretically anyone can place a request with any IP (although he would never see the response).
Yes, I am aware of the method of using SEs as the means to do XSS, RFI or any other type of injection: the URL is posted somewhere, then the bot crawls the page and attempts the access. Here is an entry from my log, with the real domain changed, that shows just that.
22.214.171.124 - - [07/Nov/2008:00:56:30 -0500] "GET /\\\"http://mysite321.com/somepage.html\\\" HTTP/1.0" 200 10782 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
The 10782-byte response was just some binary code that was emitted for this attempt. I really don't care if Slurp or any other popular spider would consider the response as cloaking or whatever. That does not mean I blocked the spider or anything else from accessing the site. I just blocked the attempt.
The thing that was so strange was that I had a four-fold increase in home page hits on those two sites of mine almost overnight, and this level of activity has been sustained now for more than two weeks. One home page ranks fairly well for the word "wikipedia" and the other for the word "gmail," just to give you an idea of the general-interest nature of my content.
The other thing that was curious was the fact that they were all various IE configurations in the User-agent field, with only 20 out of 826 that were other than IE.
And unlike search engines, these visitors picked up all the images from the home page. There weren't very many subsequent clickthroughs by these visitors to deep pages, but there were enough to convince me that I'm dealing with real eyeballs.
I decided that this traffic should not be blocked.
Fake UAs are often bog-standard MSIE, FF or Opera and are detected as fakes by other means. A single MSIE UA can include any number of patches and can therefore give very little information to hackers beyond a specific OS and Service Pack level: Despite not using it for browsing I've upgraded IE several times in the past couple of months - thanks MS! - and the UA hasn't changed at all. The addition of things like "media" and toolbars may enhance the possibility of vulnerability detection but that's up to you (if you know what you're doing).
FF gives a bit more information but I doubt most users keep a single version long enough for it to denote vulnerability. In any case I suspect it's easier for a potential exploiter to serve up an MSIE exploit regardless rather than check to see if it's worth the bother: it's likely to hit paydirt in an alarmingly large number of cases.
Suppressing header information on a web browser is generally a sign of a) scraper; b) hacker; c) faulty or idiot-controlled "privacy" software (firewalls, proxies etc). Since a) and b) seem to make up the majority of fake/suppressed headers (in my experience) they get zapped, blank referers excepted. As noted above there are a number of reasons why a referer may be blank so it generally, by itself, gets accepted on my servers unless I detect a scrape pattern.
This also shows up some strange ISPs or bots that use several IPs to request them, though. I had one strange case where, instead of requesting a sub page, it requested the html page from one IP like a user, and a second IP in the same range requested the .jpegs at the same time - without referer - and caused 404 errors.
LACNIC showed it as coming from a small country to our south - Guyana. I don't know if that should make me think it is a bot but the strange behavior does.
Now it seems google-image (if it's not a fake bot) and Yahoo Slurp are causing the same errors regularly. They come back later and try to get my media outside the web page.
Yahoo got strange and started sending 'wrong' URLs that never existed also, possibly to test status but very bizarre. Microsoft had done this also from one address. I guess their 'researchers' are out of control.
I guess they'll just have to immolate themselves. (vision of a spikey haired google bot packet hitting my site head on while listening to its iPod and not looking crossing the street)
The result is they pull-in the main page from one IP then the stylesheet from another, one image from another IP and so forth.
If you're familiar with the programming language of your site (e.g. PHP) then you could deploy a thumbnailer for your images. A thumbnailer can save bandwidth and at the same time give you the ability to verify whether the main page was requested by the same IP before the image request. The same goes for the stylesheet and other side scripts. I was able to isolate several incidents when I tested sites using this approach.
The problem with proxies is that they can be totally transparent, and it's impossible to reliably detect them. Other methods, like following the sequence of resource accesses from a page, may help to a certain extent. Tracking IP access becomes very complicated when, for instance, the browser caches images.