Strange new "no-referrer" traffic
Scarecrow




msg:3777666
 6:33 pm on Oct 31, 2008 (gmt 0)

Two of my sites have seen a massive increase in "no referrer" traffic. If it is bot-driven, it is not a normal bot: each hit picks up the images on the page, but doesn't follow any on-site links from that page. This has been happening for a couple of weeks now.

They come from different IP addresses. Out of a sample of 1,045 "no referrer" log lines, 826 unique IP addresses remained after filtering out duplicates.

Only 19 of these had "Firefox" and/or "Gecko" in the user-agent, and only 1 had "Apple" and/or "Safari." All the rest consisted of various types of Windows boxes, with no obvious bias as to the type of installation. This might suggest some sort of Windows-only malware that forces the home page, or 100 percent framing. Or perhaps it's a new type of bot that randomly selects from a list of Windows user-agents.

None of the IP addresses were Tor exit nodes.

A country lookup of each IP address, with counts for selected countries, shows a clear bias. Both of my sites are English-only.

111 from PERU
68 from BRAZIL
51 from RUSSIA
43 from UKRAINE
36 from MEXICO
36 from INDIA
26 from INDONESIA
22 from UNITED STATES
18 from ECUADOR
18 from AUSTRALIA
16 from POLAND
7 from CHINA
2 from UNITED KINGDOM
2 from JAPAN
1 from CANADA

All of the AVG LinkScanner hits, as identified by the .htaccess algorithm I used last June and July, are already filtered out and don't show up in the stats above.

Does anyone have any idea what might be happening? I'd like to know whether there is a real person with real eyeballs seeing my page, or whether it's some sort of automatic fetch. Is there any way to grab the headers used by the IP address that is fetching the page? I'm using Linux and Apache. I know that .htaccess can examine specific headers, but I'd like to dump them all into a file as the request is made, so that I can see if there are any patterns.
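
A minimal sketch of that kind of header dump, assuming PHP is available as an Apache module; the log path is only a placeholder, and the snippet could be pulled into every page with a php_value auto_prepend_file line in .htaccess rather than editing each page:

<?php
// Append the client IP, the requested URI and every request header to a file.
// /tmp/header-dump.log is a placeholder; move it somewhere private.
$headers = function_exists('apache_request_headers') ? apache_request_headers() : array();
$line = date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' ' . $_SERVER['REQUEST_URI'] . "\n";
foreach ($headers as $name => $value) {
    $line .= "    $name: $value\n";
}
file_put_contents('/tmp/header-dump.log', $line, FILE_APPEND);
?>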

 

Megaclinium




msg:3786088
 4:18 am on Nov 14, 2008 (gmt 0)

I have no-referrer hits blocked from anyone for my media types (e.g. *.jpg) via the control panel's leech-protection function (I'm on a shared host).

I know that sounds simplistic, but it immediately cuts off scrapers that aren't smart enough to grab the web page too.
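
Roughly the same thing can be done in .htaccess on hosts without that control-panel function; a sketch, with example.com standing in for the real domain:

# Forbid image requests unless the referrer points at this site.
# A blank referrer fails the test too, matching the behaviour described above.
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(jpe?g|gif|png)$ - [F,NC]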

Legit bots like Google now hit just my web pages (for speed reasons, nothing to do with the above), and sometimes come back later to fetch the media files separately, with no referrer.

The only other things that try to do this are:
webmail, which sometimes seems to link directly to the media;
a/v services, which sometimes try to hit what their client just hit, without a referrer, to test it for malware;
the Google Translate service, which appears to cache the page and then try to hit the JPEGs.

I suppose if I could somehow add a 'noindex' tag to the JPEG itself, not the web page, then it wouldn't be listed in the index?
I find it really handy because scrapers show up quickly as 306s or 404s in the error log, and then I can 403 them.

I've recently had some bizarre hits where there is no referer and no UA either.

enigma1




msg:3786762
 12:30 pm on Nov 15, 2008 (gmt 0)

In my opinion, you should not rely on the referrer field, because doing so will block legitimate page accesses by people who simply have a firewall or some other web shield that suppresses the referrer.

jdMorgan




msg:3786785
 1:52 pm on Nov 15, 2008 (gmt 0)

The HTTP Referer header may be legitimately missing for the following reasons:
1) The visitor typed in the address
2) The link was loaded by JavaScript on the referring page
3) The user is behind a corporate or ISP caching proxy (and has no choice about it)
4) The user is running "internet security" software which blocks referrers (often by default)
5) The user is behind a firewall (home or corporate) which blocks the referrer header
6) The client is a search engine robot

There are probably several other common cases that I've forgotten for the moment...

HTTP/1.1 specifically prohibits sending a Referer header unless a single originating HTTP-accessible URL can be provided. So the lack of a referrer is to be expected in all but cases 4 and 5 above. In those cases, the referrer is being suppressed out of a concern (right or wrong) for security. Although we might not like that, it is best to remember why we call the requestor the client and the machine that runs our Web site the server: on the Web, the client is in charge. They are the restaurant customers and we are the waiters, and the trick is to give the bad customers the boot without offending, or even disturbing, the good customers.

Clearly, blocking by blank referrer is not a good idea. However, if both the referrer and the user-agent strings are blank, it's a good bet there's a problem (there are a few exceptions, but not many, and most involve favicon.ico requests).

However, if either the referrer or the user-agent is a literal hyphen ("-"), you can be sure the request is unfriendly. Some baddies send hyphens in these headers specifically to bypass the "blank referrer and UA" tests, yet appear as otherwise-normal accesses in the server logs, because standard server logs show a quoted hyphen in the log entry when the HTTP Referer or User-Agent header is blank. It's a cute trick, but one that's easy to detect.
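
A sketch of how that literal-hyphen test might look in .htaccess (not necessarily how Jim implements it):

RewriteEngine On
# A request that sends a literal "-" as its Referer or User-Agent is almost
# certainly a bot trying to blend into the logs; refuse it outright.
RewriteCond %{HTTP_REFERER} ^-$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^-$
RewriteRule .* - [F]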

Scarecrow, you might want to take a look at some of the other attributes of these requests. Look at all of the HTTP "Accept" headers and make sure they match the expected values for each claimed user-agent. Examining the "X-Forwarded-For" and "Via" proxy headers may be worthwhile as well. This can be done by adding a bit of PHP or SSI code to the pages hit by these user-agents, since these headers are not normally logged in standard server access logs.
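
If the main server config (rather than just .htaccess) is accessible, Apache can also log those headers directly. A sketch for httpd.conf or a virtual-host block, since LogFormat and CustomLog are not available at the .htaccess level; the log filename is a placeholder:

# Standard combined-style fields plus Accept, Accept-Encoding, X-Forwarded-For and Via.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept}i\" \"%{Accept-Encoding}i\" \"%{X-Forwarded-For}i\" \"%{Via}i\"" headerdump
CustomLog logs/headerdump.log headerdump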

Jim

incrediBILL




msg:3786803
 3:21 pm on Nov 15, 2008 (gmt 0)

I have no-referrer hits blocked from anyone

Anyone who bookmarks your site will never be able to use their bookmark.

I would rethink that strategy.

enigma1




msg:3786807
 3:39 pm on Nov 15, 2008 (gmt 0)

Hi Jim,

I always block the referrer and user-agent when I browse pages over the internet, simply because I do not like my whereabouts to be known. I do not see a problem with any of the sites where I shop or post comments regularly, though. There are cases where I get blank or forbidden pages, but they're a minority. If I cannot shop from site A, I go to site B.

I agree with ways to detect manipulation of the referrer or UA fields, but all of these (client input) can be faked to whatever the site accepts. In my experience, the best way to identify whether someone is running a script or a browser is a careful check of the server logs, followed by processes to identify access patterns and deploy countermeasures. Someone who claims to use a browser will access a page along with all its associated resources (images, etc.). If you see an image being accessed but not a page, then there is something wrong or a proxy is in place. The same applies when the page is retrieved but not the stylesheet, and so on.

Of course, static resources are difficult to restrict, so depending on what language the site uses, the resources could be served through the programming language. For instance, stylesheet.css or image.jpg could be renamed to stylesheet.php and image_process.php?image_id=1, and the PHP script does the job and emits the resources with the right headers. There is some programming overhead, but with cache utilization via 304 headers the bandwidth consumption stays about the same. The drawback of this approach is that it requires programming skills and lots of testing.
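
A rough sketch of such an image_process.php; the whitelist, the paths and the session flag (which the page script would set with $_SESSION['page_seen'] = true; before emitting its HTML) are assumptions for illustration:

<?php
session_start();

// Whitelist of images served through the script; placeholder paths.
$images = array(1 => 'images/photo1.jpg', 2 => 'images/photo2.jpg');
$id = isset($_GET['image_id']) ? (int)$_GET['image_id'] : 0;

// Refuse the request if the id is unknown or no page view was recorded first.
if (!isset($images[$id]) || empty($_SESSION['page_seen'])) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

$file  = $images[$id];
$mtime = filemtime($file);

// Honor conditional requests so repeat views cost a 304, not the full image.
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $mtime) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}

header('Content-Type: image/jpeg');
header('Content-Length: ' . filesize($file));
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
readfile($file);
?>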

jdMorgan




msg:3786809
 3:47 pm on Nov 15, 2008 (gmt 0)

Yup, I forgot that one, Bill:

The HTTP Referer header may be legitimately missing for the following reasons:
1) The visitor typed in the address
2) The link was loaded by JavaScript on the referring page
3) The user is behind a corporate or ISP caching proxy (and has no choice about it)
4) The user is running "internet security" software which blocks referrers (often by default)
5) The user is behind a firewall (home or corporate) which blocks the referrer header
6) The client is a search engine robot
7) The visitor used a bookmark

Jim

jdMorgan




msg:3786820
 4:09 pm on Nov 15, 2008 (gmt 0)

My intent was to suggest ways to investigate the requests described by Scarecrow in the original post, and to warn against rejecting blank referrers, since they can result from perfectly legitimate requests. Access control topics, such as simple referrer-based anti-hotlinking blocks, and more sophisticated access-control methods such as page-request/included-object-request correlation, cookies-and-script solutions, and many others, have been widely discussed here already.

If you want to block or spoof your user-agent, that's your choice, although your user-agent is fairly useless for tracking your whereabouts, and the referrer is only slightly more useful. The one that locates you is your IP address, which you cannot hide except through the use of anonymous proxies.

But if you need the unique info from any of my sites, you'll need to provide a valid user-agent along with completely-correct request headers for that user-agent, which is my choice. Each to his own... :)

Jim

incrediBILL




msg:3786848
 5:19 pm on Nov 15, 2008 (gmt 0)

I don't get curious about one blank referrer, but on the second hit to the site with a blank referrer I give them a quick challenge to see if they're human.

What I also challenge is alternating referrers or static referrers.

Some attempts at faking the referrer just use the domain name regardless of what page they request, and that page may not even be linked from the home page, so too many static referrers trigger a challenge.

Other idiot scripts and referrer spammers change the referrer to something different on every page, and it's never your domain; that's another trigger for a challenge.
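
A very rough sketch of the first rule only (challenge on the second blank-referrer hit from an IP); the counter location and the challenge_page() helper are invented for illustration, and the static/alternating-referrer checks would need per-IP referrer history on top of this:

<?php
// Count blank-referrer page requests per IP and challenge on the second one.
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

if ($referrer === '') {
    $file  = '/tmp/blankref_' . md5($_SERVER['REMOTE_ADDR']);   // placeholder location
    $count = is_file($file) ? (int)file_get_contents($file) + 1 : 1;
    file_put_contents($file, (string)$count);

    if ($count >= 2) {
        challenge_page();   // hypothetical helper: show a CAPTCHA or JavaScript test
        exit;
    }
}
?>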

enigma1




msg:3787207
 9:32 am on Nov 16, 2008 (gmt 0)

Hi Jim,
The user agent could be very useful from an attacker's point of view. If a server is compromised, say, or intends to do some harm, the original user-agent info could identify vulnerabilities that the server scripts could take advantage of. The way it is done is fairly simple: every so often the browser vendor releases updates along with the reasons for the code changes, and those details can be integrated into server scripts that attempt to compromise the user's browser or system. That is why I block it and simply pass a blank UA; there is not much point faking it and populating the UA with irrelevant info. The same goes for JavaScript, which I tend to block as it can also pass sensitive info to the server end.

incrediBILL




msg:3787233
 11:08 am on Nov 16, 2008 (gmt 0)

I personally think people need to reconsider these types of blocks, because they will impact real visitors, especially people bookmarking your site.

JavaScript is an integral part of the web, with Ajax being used all over the place, so protecting your site from cross-site scripting is good; blocking referrers from JavaScript is bad.

FWIW, I've seen SE's used to attack sites by feeding a SE pages full of carefully crafted URIs containing potential vulnerabilities. Would you also block all SEs from crawling your site just because they can be tricked into requesting URIs containing exploits?

[edited by: incrediBILL at 11:11 am (utc) on Nov. 16, 2008]

enigma1




msg:3787235
 11:33 am on Nov 16, 2008 (gmt 0)

JavaScript, like Flash and any other active content (ActiveX etc.), is just an option for someone browsing pages over the internet. I would strongly recommend having it off unless you really trust the site (at that point in time). You should always have a <noscript> alternative to support people who do not want to use JavaScript.

OK, let me clarify the term "block" from my end. By "block" I mean I block the specific attempt. I do not ban IPs; I find it pointless, as entire packets can be crafted to contain any IP. Theoretically anyone can place a request with any source IP (although they would never see the response).

Yes, I am aware of the method of using the SEs as the means to do XSS or RFI or any other type of injection: the URL is posted somewhere, the bot crawls the page, and then it attempts the access. Here is an entry from my log (with the real domain changed) that shows just that.

74.6.8.92 - - [07/Nov/2008:00:56:30 -0500] "GET /\\\"http://mysite321.com/somepage.html\\\" HTTP/1.0" 200 10782 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

The 10782-byte response was just some binary code emitted for this attempt. I really don't care if Slurp or any other popular spider would consider the response cloaking or whatever. That does not mean I blocked the spider from accessing the site; I just blocked the attempt.
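
One simpler way to refuse that class of request at the .htaccess level, returning a plain 403 rather than the junk response described above; note that it would also catch any legitimate URL that carries a full URL in its query string:

RewriteEngine On
# Reject requests whose path or query string embeds a full http(s):// URL,
# the signature of the injected-URL crawl attempts shown in the log line above.
RewriteCond %{THE_REQUEST} https?:// [NC,OR]
RewriteCond %{QUERY_STRING} https?:// [NC]
RewriteRule .* - [F]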

Scarecrow




msg:3787291
 3:01 pm on Nov 16, 2008 (gmt 0)

I added "Accept-Encoding" to my Apache logging and determined that almost all of them have "gzip, deflate" for this variable. This is normal for most real-eyeball visitors.

The thing that was so strange was that I had a four-fold increase in home page hits on those two sites of mine almost overnight, and this level of activity has been sustained now for more than two weeks. One home page ranks fairly well for the word "wikipedia" and the other for the word "gmail," just to give you an idea of the general-interest nature of my content.

The other curious thing was that the User-agent field showed various IE configurations in almost every case; only 20 out of 826 were anything other than IE.

Further web research revealed that various versions of IE have a habit of dropping the referrer if the link on a web page is encoded with JavaScript. Firefox, by way of contrast, picks up the referrer for those same links.

And unlike search engines, these visitors picked up all the images from the home page. There weren't very many subsequent clickthroughs by these visitors to deep pages, but there were enough to convince me that I'm dealing with real eyeballs.

Given the impressive skew toward southern-hemisphere countries of origin, my best guess now is that some social networking sites (Orkut is wildly popular in Brazil, for example) linked to my two home pages using JavaScript, and social-networking enthusiasts clicked on them to see what was going on. Then they went away, because most of them don't read English and they lost interest.

I decided that this traffic should not be blocked.

dstiles




msg:3787510
 10:25 pm on Nov 16, 2008 (gmt 0)

A blank User-Agent should be blocked, in my opinion. None of the blank UAs that I log have any useful purpose as far as I can tell. I log about 2 new IPs per day for them across about 50 sites. If not obviously from a server (by IP) they have other header parameters suppressed and are usually from a country that probably has no legitimate interest in the target site's content in any case. New ones also tend to come in groups of a few within an hour or so and then none for a few more days. Coupled with many of them coming from IPs supporting a "web server" this suggests compromised IPs rather than "real" people. Within this context I can allow the very occasional "real" visit to be rejected.

Fake UAs are often bog-standard MSIE, FF or Opera and are detected as fakes by other means. A single MSIE UA can include any number of patches and can therefore give very little information to hackers beyond a specific OS and Service Pack level: despite not using it for browsing, I've upgraded IE several times in the past couple of months (thanks, MS!) and the UA hasn't changed at all. The addition of things like "media" and toolbars may enhance the possibility of vulnerability detection, but that's up to you (if you know what you're doing).

FF gives a bit more information but I doubt most users keep a single version long enough for it to denote vulnerability. In any case I suspect it's easier for a potential exploiter to serve up an MSIE exploit regardless rather than check to see if it's worth the bother: it's likely to hit paydirt in an alarmingly large number of cases.

Suppressing header information on a web browser is generally a sign of a) scraper; b) hacker; c) faulty or idiot-controlled "privacy" software (firewalls, proxies etc). Since a) and b) seem to make up the majority of fake/suppressed headers (in my experience) they get zapped, blank referers excepted. As noted above there are a number of reasons why a referer may be blank so it generally, by itself, gets accepted on my servers unless I detect a scrape pattern.

Megaclinium




msg:3788489
 3:56 am on Nov 18, 2008 (gmt 0)

I hope my post didn't make people think I was blocking web pages based on this. I only block direct requests for my media types (JPEGs); they can link to any web page I have. I'd rather they go through my page to pull up the media, which makes it harder for scrapers.

This also shows up some strange ISPs or bots that use several IPs to make the requests. I had one strange case where, instead of requesting a sub-page, one IP requested the HTML page like a user while a second IP in the same range requested the JPEGs at the same time, without a referrer, causing 404 errors.

LACNIC showed it as coming from a small country to our south, Guyana. I don't know if that alone should make me think it is a bot, but the strange behavior does.

Now it seems Google Images (if it's not a fake bot) and Yahoo Slurp are causing the same errors regularly; they come back later and try to fetch my media outside the web page.

Yahoo also got strange and started requesting 'wrong' URLs that never existed, possibly to test status codes, but it's very bizarre. Microsoft had done this too, from one address. I guess their 'researchers' are out of control.

I guess they'll just have to immolate themselves. (vision of a spikey haired google bot packet hitting my site head on while listening to its iPod and not looking crossing the street)

enigma1




msg:3788913
 4:05 pm on Nov 18, 2008 (gmt 0)

If you see them requesting the page and resources from multiple IPs, it probably means they use a proxy, or a dedicated multi-proxy tool.

The result is they pull in the main page from one IP, then the stylesheet from another, one image from yet another IP, and so forth.

If you're familiar with the programming language of your site (e.g. PHP), you could deploy a thumbnailer for your images. A thumbnailer saves bandwidth on the one hand, and at the same time it gives you the ability to verify whether the main page was requested by the same IP before the image request. The same goes for the stylesheet and other supporting files. I was able to isolate several incidents when I tested sites using this approach.
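
A bare-bones sketch of that correlation check, split across the page script and the image/thumbnail script; the marker-file location and the ten-minute window are arbitrary choices for illustration:

<?php
// In the page script: record that this IP has just loaded a full page.
file_put_contents('/tmp/pagehit_' . md5($_SERVER['REMOTE_ADDR']), time());
?>

<?php
// In the image/thumbnail script: serve the file only if the same IP
// requested a page within the last ten minutes.
$marker = '/tmp/pagehit_' . md5($_SERVER['REMOTE_ADDR']);
if (!is_file($marker) || time() - (int)file_get_contents($marker) > 600) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
header('Content-Type: image/jpeg');
readfile('images/example-thumb.jpg');   // placeholder path
?>

As the next paragraph points out, rotating proxies and browser caching will defeat a strict version of this, so it works better as a signal than as a hard block.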

The problem with proxies is that they can be totally transparent, and it's impossible to reliably detect them. Other methods, like following the sequence of resource accesses from a page, may help to a certain extent. It becomes very complicated when the browser caches images, for instance, making IP-based tracking harder.
