
Forum Moderators: phranque

Tracking down how "hidden" URLs showed up on search engine

Yandex seems to be directing visitors to the batcave

     
1:45 pm on Dec 21, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


Hello there everyone!

I've written an e-commerce platform from scratch. There's an admin section, and as a convenience, the template uses switches to display an admin link in the header only to those with sufficient privileges. Recently, I've begun getting visits via Yandex to pages that are supposed to be completely hidden from everyone else, bots included, and I'd love to find out how Yandex came upon these links.

So, here's what I've done so far to try to figure out how the links ended up visible (so far, all of it has been fruitless):

1) The first thing I did was try to decipher the Yandex redirect URL to backtrack. Unfortunately, I can find NO info on how to do this, and the redirect does not match what's produced when I enter a search term on the search engine. For instance, the "text" var is completely empty and all the info seems to be packed into the "etext" var, which I can't work out how to decode. Maybe "encrypted text"? I tried plugging all the various vars into the search URL on the Yandex site, but all my efforts resulted in a blank search page.

The URL in question:
[noparse]http://yandex.ru/clck/jsredir?from=yandex.ru%3Bsearch%3Bweb%3B%3B&text=&etext=1271.RJS9ZfLhVdj6nXam87qy4e0e-DG9BQd_KlyA1gFVBu1uuZOuUSRTgOEasX71Cupm.fe839c38b17c539463c0b2f7d01d86940f4b3320&uuid=&state=_BLhILn4SxNIvvL0W45KSic66uCIg23qh8iRG98qeIXmeppkgUc0YL_nDC5hqtEQ6WayFoZKRZE&data=UlNrNmk5WktYejY4cHFySjRXSWhXUFJiWDhna1NqZnBmd1YzNG43VS13RUpmdUZXdnBLOHdkMFlqUzVDamF1OVBVb2xkMmtvMUxXWUxJM1hSVW5hS2x5R1R6LVpCcGVXZFZZNkprR0JOSUVPc3d0ZnBVOXpDV295ckZDdFpqS3l4WkZSOFF3c0RmVTN2ZkhIYWIwT0JzNVQyWko5ME9vMw&b64e=2&sign=08505d8afebc7cb1b4568d3e92c11ecb&keyno=0&cst=AiuY0DBWFJ7IXge4WdYJQXbYQp9t5VF6sf_IfF4r6pdt0ojCe4cFQNegojWnJn8UToJJyLyR96RrC_bl9mqJxfCjbo3nl3EPqUjNd2ADc0Zxar8tKC1hQd4R3WTMI1AD3dVkg_IhwheNgkWXjuLnig&ref=orjY4mGPRjk5boDnW0uvlrrd71vZw9kp5uQozpMtKCXdCnh-_wii4V8gT36dWFhYdLgT8HVc5IPL1yluhUPYHlzmn9nr8Aaa3y8eC13fJRd5RgTTAPeGmg&l10n=ru&cts=1481853806438&mc=4.32492874929[/noparse]

Next, I downloaded the entire site via wget while using both a browser UA and the Yandex search UA (that's how my site distinguishes bots, to hide logins and human-specific content). Searching through all the downloaded content, I was unable to find any instance of the URLs in question.
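For context, the UA-based bot/human split my site does is essentially this (a simplified Python sketch, not my actual PHP code; the substrings are illustrative, not an exhaustive bot list):

```python
import re

# Illustrative crawler substrings; a real deployment would maintain a fuller list.
BOT_PATTERN = re.compile(r"YandexBot|Googlebot|bingbot|Baiduspider", re.IGNORECASE)

def is_search_bot(user_agent: str) -> bool:
    """Return True when the User-Agent string looks like a search crawler,
    so the site can hide logins and human-specific content from it."""
    return bool(BOT_PATTERN.search(user_agent or ""))
```

That's why I fetched the site twice with wget, once per UA: each variant of the page needed to be searched for the leaked URLs.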

I checked my sitemap.xml just to make sure it didn't get accidentally placed in there. All clean.

Finally, I did tons of searches on the Yandex site to see if I could stumble upon something but I can hardly find the site mentioned in the search engine, much less find the no-no URLs.

So, in the absence of any forward progress with this, I took the step of forbidding all Yandex bots as well as automatically banning any user that either shows the Yandex URL as a referrer or uses Yandex's YaBrowser. This doesn't hurt the site, since it sells product to 'Murrica only and Yandex has been the source of only malicious visits. Another point of interest is that Yandex is the only search engine to be the go-between for these hidden links.
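The ban rule amounts to something like this (a sketch in Python rather than my actual code; the matched substrings are assumptions):

```python
def should_autoban(referer: str, user_agent: str) -> bool:
    """Auto-ban rule sketch: any Yandex referer, or Yandex's YaBrowser UA.
    Both checks are simple case-insensitive substring matches."""
    ref = (referer or "").lower()
    ua = (user_agent or "").lower()
    return "yandex." in ref or "yabrowser" in ua
```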

There are a few scenarios I've imagined that could have been the genesis for these links getting seen. I'm keeping in mind that Yandex might not be the source: the links could have been picked up by Yandex from a malicious site sharing them, or the visitors in question might be using the Yandex search engine to obfuscate the inbound links. At this point, I honestly have no clue. Regardless, here are my thoughts:

1) My code was faulty. Although all the pages check out now, maybe at one point my security checks weren't doing their job when the crawler hit. The fact that only one search engine is showing up with the links makes this somewhat unlikely.
2) Site got hacked. It's not very likely: the site keeps track of all visitors in a 30-day running window, I'm always on the site, and I constantly monitor the visits to see what's going on. They'd have to find a way to bypass the tracking system, which is pretty unlikely.
3) Database got scraped. Maybe they got the links from the database, either on the web server or at the remote backup location.
4) I inadvertently shared them somehow. Often, when I'm asking for help on design or PHP forums, I'll save the generated HTML file on the server so others can see the page in question. I try to be careful to strip out the sensitive bits but perhaps I missed it once.

So that's it, I think. If you either have an idea for deciphering the Yandex redirect URL or one concerning how else I might track down the origination of these links on the web, I'd love to hear it. Thanks for your time!
5:23 pm on Dec 22, 2016 (gmt 0)

New User

joined:Dec 14, 2016
posts:6
votes: 1


Are you using any third-party sources for linked-in javascript files, fonts, or any other files?

In the past, I have seen this type of question asked by a WordPress user who could not understand how Google found non-public pages. The default installation of WordPress downloaded a font from Google; that is a tracking vector and is probably how Google found out about the non-public pages.

If you don't have any third-party files linked-in, is it possible that these URLs could have been found by Yandex crawling your javascript files (if you have any) or robots.txt file? Otherwise, I would chalk it up to your CMS revealing something it shouldn't.
6:57 pm on Dec 22, 2016 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13677
votes: 440


I've got to ask, because it ties in with a long-standing Yandex headscratcher of my own:

Are these visits with apparent Yandex referer human or humanoid? The main difference is that the humanoids don't request image files, just scripts and stylesheets, and they don't seem to execute scripts. (There are also some subtle differences in headers--within the plausible-humanoid range--but I've never been motivated to investigate more closely.)
8:02 pm on Dec 22, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


Hi there folks and thanks a bunch for taking the time to help out :)

Forgive me if my quotes are less snazzy than the forum allows. If the mechanic is there to do it better, I can't find it.

"Are you using any third-party sources for linked-in javascript files, fonts, or any other files? "

jQuery gets loaded and some Google fonts are used. My robots.txt file intentionally contains none of this type of information, specifically to avoid drawing a map for the bad guys.

I've thought about it a bit more over the last few days and I'm leaning more toward a browser exploit of some sort, although I don't have any actual proof of it. The URLs are awfully random and not related to each other, but they are links that would be in my recent history.

"Are these visits with apparent Yandex referer human or humanoid?"

I've not determined this for sure, but it seems doubtful. For one thing, some of the UAs are unlikely and shared across various IPs. It's also odd that they go from the index straight to a page in the hidden section that is impossible to reach directly from the index. There's a hierarchy of link clicking that a mere mortal would have to perform to get to that page.

So I've decided on a course of action. First, I wanted these URLs to never be recordable and reusable later; I wanted a temporary URL. Secondly, I wanted a way to track, to some degree, how the URLs got scraped.

To achieve this, I've decided to alter the URL structure to use a random string that is bound to an admin's IP, SID and a token stored in the DB. If any one of those doesn't match, it requires the user to log in again.

So the URLs now always look different for not just each admin, but for each authorized session. If they go to another computer or move across IPs, it will invalidate the links. This would allow me to find not just the user whose URL token got grabbed but also a timeframe when it happened, what computer they were using, etc.

So currently, my admin link looks something like:

site.com/sdflSDA3890LSDAKJ/thing=someadminjob&id=something

And when that link shows up in my referers, I can immediately see who and what "sdflSDA3890LSDAKJ" was bound to.
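The core of the scheme looks something like this (a Python sketch of the idea, not my actual implementation; the in-memory dict stands in for the real database table):

```python
import secrets

# Hypothetical stand-in for the DB table of active admin sessions.
active_tokens: dict = {}

def issue_admin_token(user_id: int, ip: str, sid: str) -> str:
    """Mint a random URL token bound to this admin's user ID, IP and session ID.
    The token is checked for previous use before being assigned."""
    while True:
        token = secrets.token_urlsafe(12)
        if token not in active_tokens:  # never reuse a token
            break
    active_tokens[token] = {"user_id": user_id, "ip": ip, "sid": sid}
    return token

def validate_admin_token(token: str, ip: str, sid: str) -> bool:
    """All bound factors must match; any mismatch forces a fresh login."""
    rec = active_tokens.get(token)
    return rec is not None and rec["ip"] == ip and rec["sid"] == sid
```

Since the token only ever exists in the DB and in the links rendered for that one authorized session, anyone replaying it from another IP or session gets bounced to the login page.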

I'm still in the process of implementing the new system so any suggestions or thoughts on the matter would be more than welcome :)

Thanks for your time!
8:27 pm on Dec 22, 2016 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3230
votes: 146


Forgive me if my quotes are less snazzy than the forum allows. If the mechanic is there to do it better, I can't find it.

Hi schwim and Welcome to WebmasterWorld [webmasterworld.com] - there are some helpful tips on using the forums at that link that can help you learn how to add quotes here.

they go from the index to a page in the hidden section that is impossible to get directly from the index.

Regarding your investigations, have you checked your raw access logs in relation to visitors who claim to be coming to the hidden URLs from "index"? It is quite common for bots to use http://example.com/ as a referrer, where example.com is your site.

I think that examination of raw access logs is more likely to point to non-human behavior - the tell is in the "coming from the index" part.
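A quick way to spot that pattern in raw access logs is to flag requests whose referer is the site's own bare root (a sketch only; the regex is a simplification of the combined log format, and example.com stands in for the real host):

```python
import re

# Simplified combined-log-format matcher: IP, request path, referer.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" \d+ \d+ "([^"]*)"'
)

def root_referer_hits(log_lines, my_host="example.com"):
    """Return (ip, path) pairs for requests claiming the site's own bare root
    as referer - a value bots commonly fake."""
    hits = []
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, path, referer = m.groups()
        if referer.rstrip("/") in (f"http://{my_host}", f"https://{my_host}"):
            hits.append((ip, path))
    return hits
```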
12:39 am on Dec 23, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


Thanks for the welcome and tips!

Regarding your investigations, have you checked your raw access logs in relation to visitors who claim to be coming to the hidden URLs from "index"? It is quite common for bots to use http://example.com/ as a referrer, where example.com is your site.

I think that examination of raw access logs is more likely to point to non-human behavior - the tell is in the "coming from the index" part.


I took the time to glance at the raw logs and found that the IPs in question did download the page assets (js, css, images). It also hit two pages, but I figured out that the second page load was simply due to the redirect to root that happens for non-admins visiting the admin page.

So to clarify, the IP first hits an admin page with the yandex redirect URL showing as the referer and they are downloading the associated elements for that page.

Still working on the new admin authentication system. It will probably take me another day to wrap it up. I'll share any interesting information that may arise from the new stuff.

Thanks everyone!
3:07 am on Dec 26, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


First, Merry Christmas, everyone (that observes it)!

Secondly, good news! I've already got an example to work with!

I checked tonight and saw one of the new URL styles in the auto-ban list. The new system allows me to associate the URL to a particular session, so I know that the URL in question is bound to my admin login on my desktop machine upstairs that was an active session from Fri, 23 Dec 2016 15:07:47 GMT to Fri, 23 Dec 2016 18:25:40 GMT. I checked the logs and my computer did utilize that URL during that period of time.

So this leads me to think that either an addon or some open JS connection is scraping my usage for malicious reasons. The reason I say malicious is that my admin URLs make up a small percentage of my activity on the site, while 100% of these leaked links have been for the admin section.

I may try disabling some addons to see if it makes a difference but to be honest, there's such a lag between the legitimate usage of the URL and the malicious attempt to visit it that I'm not sure how I will correlate particular changes in the browser with the issue at hand.

Any thoughts on the matter? I'd love to hear any insights or suggestions as how I might track the issue down.

Thanks for your time!
5:58 am on Dec 26, 2016 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8313
votes: 335


Looks to me like a human accessing your first page from a Yandex search (the search parameter looks common for Yandex) then the user turns on a scraping tool in stealth mode (hidden UA) to capture your assets.

This is common, and can be for a couple reasons:

•   The user wants to collect data for an unknown purpose.

•   The user wants to download your site because they like it and may want to use some of the assets for an unknown purpose.

•   The user's ISP charges for bandwidth, so they want to save the site (or some of its pages) locally for offline access to save cost.
6:20 am on Dec 26, 2016 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3230
votes: 146


Have you looked at changing file permissions (CHMOD) for specific resources so that the server does not allow access to "everyone"?
3:08 pm on Dec 26, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


Looks to me like a human accessing your first page from a Yandex search (the search parameter looks common for Yandex) then the user turns on a scraping tool in stealth mode (hidden UA) to capture your assets.


Thanks to the new URL format I'm using, we can rest assured that this isn't the case. I may be explaining it poorly, so I'll use an image to help:

[i.imgur.com...]

If you look at the image, the URLs with the "1" at the end are the old-style URLs. With this static URL system, your scenario very well could have been possible, and it's in fact the reason I wrote the new system. I needed to find a way to take that out of the pool of possibilities.

The URL with the "2" at the end is the URL I spoke of in the previous post. The path behind the domain only existed for me during the timeframe posted above. If anyone else ever visits the site and somehow generates an authorized admin session, they will never see that URL, since it's a randomly generated string and it's checked for previous use before being assigned to the new admin session.

Which brings us to the URL with the "3" at the end. This is another admin session. It's tied to me again, but this time its lifetime was from Fri, 23 Dec 2016 20:37:56 GMT to Fri, 23 Dec 2016 22:27:38 GMT. Compared to the URLs labeled "2", it's a session that started about two hours after that session ended.

Have you looked at changing file permissions (CHMOD) for specific resources so that the server does not allow access to "everyone"?


There's actually only one publicly accessible element on the server; I use htaccess rules to route everything through that file. The rest sit outside of the public folder. Because these URLs are dynamically generated and, as I explained above, bound to a particular user in a particular place at a particular time, we can be confident that they are not being found by crawling or scraping.

I'm going to alter the system a bit more to also capture the UA string. I work across a couple computers and this will help me determine if it's only one computer that these URLs are coming from or whether they are URLs that were bound to more than one. I think that will help me narrow down the issue. I could first try a different browser for a while and if they stop showing up, then I can be pretty sure it's not a js exploit causing the issue and can concentrate on the browser or vice versa.

All in all, really happy with what I'm learning about it. Figuring out those Yandex URLs sure would help but I've accepted the fact that I'm not going to be able to solve this that way so I'm going to do what I can to learn about all the other aspects of how these URLs are getting out in the open. I'm confident that it's either a JS deal or a particular browser element.
6:00 pm on Dec 26, 2016 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13677
votes: 440


Does anyone other than you legitimately visit these URLs? Most of the time, it's enough to categorically block access to specified areas from everywhere but your own IP (assuming it doesn't change too often).

It's obviously nicest if you can stop unwanted visitors from making the request in the first place. But blocking is a solid second-best.
6:32 pm on Dec 26, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


Does anyone other than you legitimately visit these URLs? Most of the time, it's enough to categorically block access to specified areas from everywhere but your own IP (assuming it doesn't change too often).


Others do have legitimate reasons to be in the admin panel, and IPs vary somewhat due to traveling, work and home, etc., so allowing access only from static IPs isn't feasible.

It's obviously nicest if you can stop unwanted visitors from making the request in the first place. But blocking is a solid second-best.


All illegitimate access is being properly blocked which is why I've not got my knickers in a wad about it. At this point, it's turned into a chance to learn something about an exploit I'm not familiar with the mechanics of. Whether it takes me a day or a few months, I'll eventually figure this out.
8:07 pm on Dec 26, 2016 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8313
votes: 335


Compare that to the URLs labeled "2", it's a session that started about two hours after that session
That's when the user could have run the script (scraping tool). Sorry, I still don't see the mystery, other than that you don't know who it is. I see a lot of this.

BTW - Only raw logs, including the admin (if available) and the error logs, will reflect the server activity.
8:27 pm on Dec 26, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


I don't know if I can explain it any better. That path in the URL ( site.com/saOHSDFHNOeonsdf/?do=something ) is not a real path, does not exist anywhere on the server and has never been seen by any browser except the admin that was issued the auth token. It is absolutely impossible for anyone or anything to scrape, stumble upon, accidentally find, click on or see that path. It doesn't actually exist. URLs generated on the fly with that token by the script don't show for any other visitor. The token is stored in the database and is only used to create the links on the page if the user's IP, SID and user ID are correct for that session. The links would never show for anyone else, and the path does not and has never existed on the server to be found.

I might not have explained it any better but I did manage to explain it in a few more ways.
9:35 pm on Dec 26, 2016 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8313
votes: 335


I do understand what your concern is. You have explained it well.
It is absolutely impossible for anyone or thing to scrape, stumble upon, accidentally find, click on or see that path.
What I am saying is the path is created the same way it is when you do it.

It's either someone (thing) else or it is you :)
10:21 pm on Dec 26, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


Maybe the issue is because I failed to explain the following.

1) I have the raw logs. If the visitor managed to spoof my user ID, IP address and PHP session ID (we're already in the realm of ultra-unlikelihood), I would see visits in the raw logs that occurred when I didn't make them. This is not the case: there are no visits from me during times that I wasn't the one making them. If they didn't spoof my info successfully, they just get a 404, like every other page that doesn't exist on the server. There would be no reason to save the path as a valid URL, since it's not a valid URL for them.

2) Whatever is making the visit is hitting that URL first. It never has a prior page visit. There is not a single occurrence of the admin URL being hit by anything with a referer other than the Yandex search engine. Even if they spoof the referer, there are no page loads prior to the admin page in the raw logs. That's their entry page.

There are too many checks in place:

They found the URL = It's a 404. This is logged both in the raw logs and the tracking system.
& they faked my user_id = They get a request to log into the admin panel for a new session. I see the login request on the logs and in the tracking system
& they faked my IP = They get a request to log into the admin panel for a new session. I see the login request on the logs and in the tracking system
& they faked my sessID = If this happens during the narrow window in which the auth token is valid, I see these visits both in the tracker and raw logs
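In rough Python, that cascade of checks looks like this (a sketch only; `db` stands in for the real token table, and the return values stand in for the real responses):

```python
def check_admin_request(token, user_id, ip, sess_id, db):
    """Sketch of the check cascade above: unknown token -> 404,
    any mismatched factor -> forced login, full match -> admin page."""
    rec = db.get(token)
    if rec is None:
        return "404"             # path doesn't exist for them; logged like any 404
    if rec["user_id"] != user_id or rec["ip"] != ip:
        return "login_required"  # token known, but wrong user or IP
    if rec["sess_id"] != sess_id:
        return "login_required"  # token known, but wrong PHP session
    return "ok"                  # all bound factors match during the token's window
```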

None of this is occurring. The traffic on the site is almost non-existent (it's a new site), so it's very easy for me to track the requests made to the server. There's no entity hitting a page and then using a link to get to the admin panel. This is only occurring with referers from Yandex, with no prior or subsequent page requests made to the server from that IP or PHP session.

I feel good about saying these URLs are not being found by crawling. This data is either being siphoned from some active js/ajax session or my browser is sharing the data somehow. If there were a single instance in which it didn't follow the rules above, I wouldn't feel comfortable saying that.
11:45 pm on Dec 26, 2016 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8313
votes: 335


So again... It's either someone (thing) else or it is you.

If you have admin access, check logs to see if anyone is accessing unilaterally, through another account on your server.

If file requests were made, there is a record of it.

Other than that, I would say the utility creating those parameters (your ID) may be creating the additional hits.
11:57 pm on Dec 26, 2016 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


It's become clear to me that we're both just saying the same thing to each other in a loop so I'm going to lay this particular horse to rest. I'll be sure to share any progress I happen to make on the matter!
4:39 am on Dec 28, 2016 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 25, 2005
posts:999
votes: 1


Are these visits with apparent Yandex referer human or humanoid? The main difference is that the humanoids don't request image files, just scripts and stylesheets, and they don't seem to execute scripts.

slightly off-topic but i have to chime in here because that is exactly the issue i stumbled upon the recent days when investigating my traffic.

i have a hand-coded traffic counter which basically works like this: user visits page > page loads a cgi script > cgi script displays png tracking image.

so, on the one hand my cgi script logs a visit each time it is called and on the other hand my analytics program logs a visit each time the tracking image is called. in nearly any case, the number of calls for the cgi script equals the number of calls for the png image, because when the script is called, it displays the image file.
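as a sketch (paths and data made up, not my actual setup), the consistency check between the two counters amounts to:

```python
def pixel_gap(request_paths,
              cgi_path="/cgi-bin/track.cgi",
              png_path="/track.png"):
    """Compare hits on the tracking cgi script vs the png image it serves.
    In normal browser traffic the two counts should be (nearly) equal;
    a surplus of cgi hits points at visitors that never load images."""
    cgi = sum(1 for p in request_paths if p == cgi_path)
    png = sum(1 for p in request_paths if p == png_path)
    return cgi - png
```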

but recently i observed that my stats from the cgi script don't match the stats for the png image. upon investigation, i found out that a certain number of visitors skew my stats. all of these visitors have certain characteristics in common, including:

- they come from various apparently legit ip addresses from different providers
- they all have a long hieroglyphic yandex search url as referrer
- they have an outdated user agent string (msie 8.0)
- they execute any script on my webpages
- they never load any image files

the crazy thing is, all the other apparent yandex traffic has exactly the same characteristics, with the only difference being that their browsers have different user agents and they request images.

so my question: wtf? apparently, these are the "humanoid" yandex visitors that lucy is talking about. would you care to elaborate? any information or clue about the purpose of this traffic? when i block it, do i block humans or bots?

thanks in advance.
7:22 am on Dec 28, 2016 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13677
votes: 440


they execute any script on my webpages

That's interesting. Mine don't. (Or is there a missing "never"?) It may be because on most pages the only script is piwik analytics; for historical reasons this involves a file that lives on a different site (same server) and the humanoid may be under orders not to venture off-site. But yup, most of mine are also MSIE 8. Sometimes an old Opera, again typical for the region. This particular detail may be a red herring, though, if a lot of Russians are stuck with older computers.

Here's another thing you may want to check for: My Yandex humanoids tend to come in clusters. But then within 24 hours or so--either earlier or later, which makes no sense--there's an identical request for the same batch of pages, only this time with images. I've sometimes wondered if there's some kind of remote caching involved.

We don't have a YandexDude, do we?

do i block humans or bots?

Going by headers, they're fully humanoid; that's why they've been getting in. There's a minor quirk in Accept-Language, but nothing that really jumps out at you.
11:49 am on Dec 28, 2016 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 25, 2005
posts:999
votes: 1


i have experimented a little bit more, so here is an update.

Or is there a missing "never"?

well no, i noticed they execute my javascript and load any files apart from images. having said that, i have now tried it the other way around: first shove the png down their throat through javascript, and only after the image has loaded serve them the cgi script.
result is, this time they will only request the image and not the cgi script :) so it seems that they are instructed to dig at most one level deep through your scripts for downloadable resources.

unlike yours, my yandex humanoids never seem to download images (apart from the above experiment). first i thought it was an msie 8.0 glitch (lack of png support, quirky same-origin policy or the like), but the more i dig into it, the more i tend to exclude that. about one third of my yandex traffic comes with an msie 8.0 user agent string (although no two exactly matching) and they all behave the same. i think this is way too much to represent normal browser usage stats, even in russia. and the pattern of appearance as well as the surfing behavior is really strange. my yandex humanoids don't seem to come in clusters. in fact, they are pretty equally distributed throughout day and nighttime, which is also nothing you would expect, but okay..

Going by headers, they're fully humanoid; that's why they've been getting in.

well, i can say that normally i'm good at filtering botlike traffic. but this one bothers me. i'm really not sure. maybe these are users with some kind of infected machines? or something to do with the yandex toolbar? a caching mechanism like you said? or even a technical measure from yandex to exaggerate their traffic numbers for our websites?

We don't have a YandexDude, do we?

it would be really cool to have one. i'd really like to hear the resolution. i still kinda like yandex, but they are doing strange things at times.
9:55 pm on Dec 28, 2016 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13677
votes: 440


I pulled up the last two months' logs and headers, separated the humans (defined as anyone who requests images and piwik) from the humanoids (no images, no piwik), and looked more closely.

#1 The requested pages are the same on both lists. The total number of requests varied, but any page that shows up on one list also shows up on the other. This happens too often to be a statistical glitch, especially since it doesn't correspond to the most popular pages sent by other search engines. Both lists include a redirect, possibly indicating that yandex is slow on the uptake (the URL in question was changed in November 2015).

#2 The humanoid requests include these headers (mentioning only the ones that are different from humans):
Pragma: no-cache
Accept-Language: en-us
Accept: */*
The Pragma: header is pretty rare, though not nonexistent, among humans-in-general; it's far more commonly sent by search engines, although not by the YandexBot. (Interestingly, it is also sent by another humanoid that's been active of late: the one from Drake Holdings at 204.79.180-181, which appears to be doing some kind of Bing-related investigation.)
Claiming to speak English is consistent--and unexpected for the region. Claiming to speak only English (that is, only one language, whatever it may be) is more often a robotic trait, though you also see it in mobiles.

The humans are generally:
Accept-Language: ru-RU
Accept: image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-shockwave-flash, */*
Note the “image/pjpeg, image/pjpeg” duplication. It isn't unique to these humans; it seems to be more common with older browsers and some MSIE versions. Claiming to speak Russian is understandable, since these requests come from bona fide Russian IPs. But, again, it's only Russian.

I checked: the YandexBot itself sends a longer language header, currently
Accept-Language: ru, uk;q=0.8, be;q=0.8, en;q=0.7, *;q=0.01

#3 A unifying feature of all these requests, both human and humanoid, is that they never ask for the favicon. This in fact is what flags them for my attention in the first place.
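A rough way to express that split as a log filter (just a sketch built on the header differences listed above; real traffic would need more signals than these three):

```python
def looks_humanoid(headers: dict) -> bool:
    """Heuristic from the observed humanoid requests: bare Accept,
    English-only Accept-Language, and a Pragma: no-cache header."""
    h = {k.lower(): v.lower() for k, v in headers.items()}
    return (h.get("accept") == "*/*"
            and h.get("accept-language") == "en-us"
            and h.get("pragma") == "no-cache")
```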
11:54 pm on Jan 11, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 25, 2005
posts:999
votes: 1


meanwhile i've investigated these "humanoid" visitors quite extensively, mostly because i hate that they are ruining my visitor stats. turns out that almost all of it is fake traffic from different bots on various ip addresses and only a fraction of them with alleged msie8. practically all of it from russian ips, to name but a few of the top offending hosts: corbina.ru, nationalcablenetworks.ru, netbynet.ru. so maybe yandex has nothing to do with it, these guys are simply piggybacking on the yandex.ru referrer to disguise as legit traffic. possibly some forced human requests in the mix as well.

unlike crawlers or even headless browsers this is a new threat, because you can only spot them by manually examining their surfing behavior in the logs. it's not even referrer spam. i guess - apart from annoying publishers by breaking web analytics programs - they are set up to inflate ad views and dry out advertising budgets?

there's hardly a way to automate the detection, because they mimic human behavior pretty closely. there are ways to identify them which i won't mention here, because they would only get better with this info.

to blacklist them all is kind of a part-time job. well, i guess that's what you guys in the bot and spider identification forum are doing day in, day out, right? ;)

what a mess. most of my current website traffic is fake. check your numbers.
1:53 am on Jan 12, 2017 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


I block all traffic claiming a Yandex referer. It's worked great for cleaning up my tracking view, and it lets me segregate that traffic so I can take a gander at its details and history later.

[imgur.com...]
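For anyone who wants the same blanket block, the usual .htaccess approach (mod_rewrite, Apache; adjust the pattern to taste) looks something like this:

```apache
# Refuse any request whose Referer claims to come from a yandex.* host.
# [NC] makes the match case-insensitive; [F] returns 403 Forbidden.
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^https?://([^/]+\.)?yandex\. [NC]
RewriteRule .* - [F,L]
```

As noted below in the thread, this also catches any legitimate visitors arriving from Yandex search, so it's a deliberate trade-off.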
3:54 am on Jan 12, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 25, 2005
posts:999
votes: 1


well yes, that's the obvious hassle-free way to block in this case.

but it produces false positives as there are also real visitors from yandex. my philosophy is to never block on indicators that can be spoofed, so the only way for me is to block by ip.
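For reference, blocking by IP in Apache 2.4 looks like this (the ranges here are RFC 5737 documentation addresses, stand-ins for whatever hosts you've actually identified):

```apache
# Allow everyone except specific offending addresses/ranges.
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.17
</RequireAll>
```

The downside, as the poster says, is that maintaining the list is effectively a part-time job.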

interestingly this massive fake traffic with yandex referrer initially appeared simultaneously with a really extensive yandex crawl on my sites. so the first unsuspicious impression was naturally that yandex sent me these visitors. tricky.
5:16 am on Jan 12, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13677
votes: 440


I block all traffic claiming a Yandex referer.

Do you also deny the YandexBot, to eliminate legitimate humans who are legitimately using Yandex?

interestingly this massive fake traffic with yandex referrer initially appeared simultaneously with a really extensive yandex crawl on my sites.

That is interesting. Looking back, I've only been flagging the current pattern since March 2016, and I don't keep logged headers very long, but there was definitely Yandex-related sketchiness well before then.

Oh well. Since they don't request images and don't execute piwik, there's really no point to blocking them. It would use about the same amount of resources either way.

I wonder if putting an explicit favicon link in the page HTML would cause them to start requesting the favicon?
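For the record, the explicit declaration being wondered about is a single line in the document head:

```html
<link rel="icon" href="/favicon.ico">
```

Whether these bots parse the HTML far enough to act on it is exactly the open question.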
12:44 pm on Jan 12, 2017 (gmt 0)

New User

joined:Dec 21, 2016
posts: 15
votes: 1


Do you also deny the YandexBot, to eliminate legitimate humans who are legitimately using Yandex?


I do block the crawler, as well as anyone using the Yandex Browser. The site in question sells product to 'Murrica only, and as far back as my records go there has been no legitimate traffic sent by the Yandex site, so I have absolutely no qualms about blocking it.

I've also built protection into the script that can block by locale, using ipgeo. I've been known to "turn off" all of China (or a province therein if it's concentrated) or Russia when big malicious pushes are in effect from those regions. It's fantastic at preventing the mess left behind in my tracking system by some of the brute force attempts that groups like to use.
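The region switch described above can be approximated without any particular geoIP service if you have the CIDR ranges on hand. A minimal Python sketch using only the standard library (the ranges below are RFC 5737 documentation blocks, not real country allocations; in practice you'd load them from a geoIP database dump):

```python
import ipaddress

# Illustrative blocklist; substitute the ranges for the region you
# want to switch off, e.g. from a geoIP database export.
BLOCKED_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_blocked(ip_string):
    """Return True if the client IP falls inside any blocked range."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip in net for net in BLOCKED_RANGES)
```

A linear scan is fine for a handful of ranges; a full country's worth of CIDRs would call for a sorted structure or a radix tree.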
9:21 pm on Jan 12, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13677
votes: 440


I do block the crawler

Well, if you're denying the crawler (Yandex is currently compliant), then physically blocking people who claim to have Yandex referers definitely makes sense, since you already know they're lying ;)

Even with google, there's one easy-to-spot referer format that is only used by malign robots, so I block it.
10:17 am on Feb 18, 2017 (gmt 0)

New User

joined:Feb 18, 2017
posts: 2
votes: 0


Wow! So glad I stumbled upon this thread today.
Last month I blocked all referrals from yandex.

I am convinced that the yandex referrals are spoofed, and I am convinced the originating IP addresses are also spoofed (explained below).

I started noticing the yandex referrals early last year. They're hardly inconspicuous (like this: [yandex.ru...] followed by a very long code of some sort). I didn't know what they were doing, nor could I tell whether it was some sort of malicious code or hacking attempt. Sometimes they went to the root of my site /, and sometimes they went to random articles (the same articles repeatedly). I'm sure that my site is of no interest to anyone in Russia, so it seemed odd that I was starting to get a lot of referrals from Yandex. This activity continued until January this year.

There is only one User account on the site, and only one user. Me!

To access the admin login, I need to enter something like: /administrator/index?secret=password
For a hacker to guess that would be unlikely. Not impossible, but unlikely.

Most of the time the yandex referral would be to root or an article, but sometimes it would try "/administrator" and get a 404. When I noticed that they were also trying to access the admin page, I realized that they were not to be trusted.

Then, one day in January I noticed something like this in the apache log:
<IP Address> <date/time> "GET /administrator/index?secret=password HTTP/1.0" 200 2025 "http://yandex.ru/clck/jsredir?from=yandex.ru%3Bsearch%3Bweb%3B%3B&text=&etext= etc. etc."

What this meant was that they could now access the admin login page. I noticed this about 3 hours after the event. I immediately changed 'secret', 'password' and the admin login just to be sure. The following day they came back to that url and got a 404.

How did they get the secret url? I'm the only one who knows the url, however I did once give it to my hosting company in a Trouble Ticket several months ago. But, I find it hard to believe that the hosting company's Ticketing system had been hacked, just as I find it hard to believe that I have malware (key logger) on my laptop. I ran a Full virus scan with MSE and found nothing. I also downloaded the latest Adwcleaner and it also found nothing, not even one PUP.

As the above hack attempt had the usual yandex referral, it raised the question of "what exactly is that long code at the end of the referral"?

Earlier I stated that I believe that the originating IP addresses are spoofed. Why do I say that?
When the above successful access to my admin url was made, then 14 seconds later there were simultaneous requests from 8 different IPs, all within the same second, for different js files as well as the admin template.css. They showed the referring url as "http://mysite.com/administrator/index?secret=password". The browser details for all 9 IP addresses were exactly the same.
I checked all 9 IP addresses on a couple of IP-block/IP-abuse websites, but none of them were in the hacker databases.

The above is food for thought. I may be onto something or I might not.


But also, around the same time as the above, my Firefox browser suggested that I remove 'My WOT' plugin (Web Of Trust) as it could be dangerous. I removed it immediately. I mention this because it was something else that was happening around the same time. This is where things start to get interesting.
While reading through this thread, I recalled the My WOT issue and decided to do some digging. This is what I found:

"the add-on ‘Web Of Trust’ collects and sells the browsing history of users to third-parties, without even bothering to anonymize the user data"
[news.thewindowsclub.com...]


Whether my attempted hack was due to yandex referrals or My WOT plugin I don't know. But, I don't want to take any chances. My WOT has been removed, and yandex referrals blocked.
10:56 am on Feb 18, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8313
votes: 335


Hi MrKen and welcome to WebmasterWorld [webmasterworld.com]

Yandex is an international search engine, and as such can be a source of visitors from all over the planet. I get legit visitors with yandex referrers daily.

Surely you can be more surgical with your blocking methods than to use such a broad brush.