Cyrillic pagenames

An annoying advertising method?

         

dstiles

10:12 am on May 24, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For some time now I have been noticing URIs in the logs with Cyrillic pagenames, e.g. www.example.com/сезон-смотреть+онлайн-серия-Рик+и+Морти. They seem to vary from week to week, possibly more frequently, and have so far hit two of my web sites. The one here translates as "season-watch+online-episode-Rick+and+Morty"; I assume that is an online video programme of some kind. Others I've translated, courtesy of Yandex, are in a similar vein.

Rather more annoying is that Bing is trying to index those URLs. And as we all know, once Bing (or G) gets a URL it seldom lets it rest. The inference is that some web site somewhere in the world is linking to my sites with the Cyrillic pagenames. Anyone else seeing this?

edit: The Cyrillic seems to be reproduced here as Unicode - sorry about that.

lucy24

3:39 pm on May 24, 2021 (gmt 0)

Yup, when necessary I get around it by copy-pasting into the text editor and hitting HTML Preview, which de-entitizes everything.
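The same de-entitizing can be done without a browser preview. A minimal sketch in Python, assuming the forum emitted decimal numeric character references; `de_entitize` and the sample string are made up for illustration:

```python
import html

def de_entitize(text):
    # html.unescape turns numeric (and named) character references
    # back into the characters they stand for.
    return html.unescape(text)

# Hypothetical sample: the three letters of "Рик" as decimal references.
sample = "&#1056;&#1080;&#1082;"
print(de_entitize(sample))
```

Anything that is not a character reference passes through unchanged, so a whole log line can be fed in as-is.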

Bing does seem to have a habit of attaching one site's paths to another site's hostname. I haven't especially noticed non-Latin names, but there was a spell when they were requesting Norwegian paths from my site. (Surprisingly, I have no Norwegian-language content, barring the odd word here and there.) It does tend to suggest that their computer has the hiccups, leading them to waste a lot of crawl budget requesting nonexistent material. I can only hope for the other site's sake that they're also requesting those paths with the correct hostname ... or is some non-English-language site out there getting flooded with mysterious requests for paths beginning in /ebooks/ or /fun/ ?

blend27

4:00 pm on May 24, 2021 (gmt 0)

-- Bing is trying to index those URLs --

What status code is Bing getting?

I just used & as the first character of a URI on WebmasterWorld and got a custom 404; IIS.NET responded the same way. My local dev server (IIS) throws 400 - Bad Request. So there is a way to potentially drop the request altogether; not sure where to look, though...

боом шакалака

dstiles

5:28 pm on May 24, 2021 (gmt 0)

Lucy:
I rather think these hits are deliberate, not accidental. I get far more from arbitrary IPs than could be from deliberate human links from a site.

blend27:
Bing gets 404 because the page does not exist. I pre-empt the ones from ad hoc IPs and they get 403 plus an entry in my "trying to hack me" log.

lucy24

6:34 pm on May 24, 2021 (gmt 0)

> I just used & as the first character of a URI on WebmasterWorld and got a custom 404; IIS.NET responded the same way. My local dev server (IIS) throws 400 - Bad Request. So there is a way to potentially drop the request altogether; not sure where to look, though...

I assumed that the requests were coming in percent-encoded. They were decoded for posting, but all for naught, since these Forums refuse to venture beyond the Windows codepage.
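For reference, this is roughly what the percent-encoded form of such a path looks like. A small sketch in Python, assuming UTF-8 and using part of the path from the first post; the two helper names are invented here, and nothing in the thread confirms which form the requests actually arrived in:

```python
from urllib.parse import quote, unquote

def as_sent_percent_encoded(path):
    # Encode every non-ASCII character as UTF-8 bytes written %XX,
    # leaving "/" and "+" untouched.
    return quote(path, safe="/+")

def as_seen_after_decoding(raw):
    # Reverse step: %XX sequences back to characters ("+" is left alone).
    return unquote(raw)

path = "/Рик+и+Морти"
encoded = as_sent_percent_encoded(path)
print(encoded)
print(as_seen_after_decoding(encoded) == path)
```

The encoded form is pure ASCII, so a log that shows real Cyrillic is either recording raw UTF-8 bytes or decoding before it writes.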

> I get far more from arbitrary IPs than could be from deliberate human links from a site.
Oh, I don't think they're following links. I think they've got their shopping-list database garbled, resulting in paths from Column A being attached to hosts from Column B.

blend27

10:14 pm on May 24, 2021 (gmt 0)

-- I assumed that the requests are coming in percent-encoded. --

the stuff I typed was: example.com/& or iis.net/&

-- боом шакалака --

is actually "boom shakalaka" typed in Cyrillic and converted by WebmasterWorld on the fly, :)

blend27

10:51 pm on May 24, 2021 (gmt 0)

So I tried messing around with my web.config rules, and this rule simply aborts every matching request. There is not even an entry in the logs, just a connection reset:
<rule name="Block requests with ampersand at the beginning of URI">
  <match url="^&amp;" />
  <action type="AbortRequest" />
</rule>


Not sure how Bing or others will interpret this rule, but Firefox says PR_CONNECT_RESET_ERROR, which I am OK with.

dstiles

9:11 am on May 25, 2021 (gmt 0)

Lucy:
> I assumed that the requests are coming in percent-encoded

To the web site? No. True Cyrillic.

> I think they've got their shopping-list database garbled

Who, Bing? As I said, most of the Cyrillic comes from arbitrary IPs. This morning I checked a couple of the more numerous and traced them to US server farms.

dstiles

8:37 am on Jun 1, 2021 (gmt 0)

Google has joined Bing in asking for non-existent Cyrillic pages. The originals have to be listed on a web site somewhere, I'm convinced.

I have now trapped the Cyrillic pagenames and return a 410. In a couple of years perhaps SEs will stop asking for them.

I tried testing for the %-encoded versions within htaccess, but it seems the conversion from Cyrillic is done in PHP, not unexpectedly. To cover both cases I have included the following in htaccess for the beleaguered site:
# anti-cyrillic
RewriteRule ^%d[01]%b[0-9a-f]% - [NC,R=410,L]
RewriteRule ^[&#1041;-&#1071;&#1073;-&#1103;] - [NC,R=410,L]
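A note for anyone adapting the first rule. In UTF-8, a basic Cyrillic letter percent-encodes to two bytes: a lead byte of D0 or D1 (D2 or D3 for the extended letters) and a second byte anywhere from 80 to BF. The sketch below, in Python rather than mod_rewrite, tries the posted pattern on the first letter of a few words from this thread. `WIDER` is a hypothetical alternative, not something posted here, and `re.IGNORECASE` stands in for [NC]. It models only the regex, not where Apache applies it; the mod_rewrite documentation describes the RewriteRule pattern as matching the %-decoded URL-path, so a %-pattern may have to be tested against %{THE_REQUEST} in a RewriteCond instead.

```python
import re
from urllib.parse import quote

# The posted pattern: lead byte D0/D1, second byte B0-BF, then another escape.
POSTED = re.compile(r"^%d[01]%b[0-9a-f]%", re.IGNORECASE)
# Hypothetical wider pattern: lead byte D0-D3, second byte 80-BF,
# i.e. any letter in U+0400-U+04FF.
WIDER = re.compile(r"^%d[0-3]%[89ab][0-9a-f]", re.IGNORECASE)

def first_letter_report(word):
    """Return (first escaped letter, matched by POSTED?, matched by WIDER?)."""
    encoded = quote(word)  # UTF-8 percent-encoding
    return encoded[:6], bool(POSTED.match(encoded)), bool(WIDER.match(encoded))

for word in ("онлайн", "сезон", "Рик"):
    print(word, first_letter_report(word))
```

Under these assumptions the posted pattern catches a name starting with "о" (D0 BE) but not one starting with "с" (D1 81) or "Р" (D0 A0), since their second bytes fall outside B0-BF.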

lucy24

4:13 pm on Jun 1, 2021 (gmt 0)

> [&#1041;-&#1071;&#1073;-&#1103;]

I don't know the exact rule for the RegEx flavor used by Apache, but there should be a way to express this by character range: 0400-052F (and why does &#1072; have to be excluded? It's not "a")
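To make the range idea concrete, here is a sketch in Python's regex syntax, which is not necessarily how Apache's regex library would want it written. U+0400-U+052F spans the Cyrillic and Cyrillic Supplement blocks; `has_cyrillic` is a name invented for this example. The loop at the end looks up the two code points that the quoted class skips.

```python
import re
import unicodedata

# One range instead of enumerating letters: U+0400 through U+052F.
CYRILLIC = re.compile(r"[\u0400-\u052F]")

def has_cyrillic(path):
    # True if any character of the (already decoded) path is in the range.
    return CYRILLIC.search(path) is not None

print(has_cyrillic("/сезон-смотреть"))
print(has_cyrillic("/ebooks/"))

# The quoted class starts at 1041 and 1073, so 1040 and 1072 are left out.
for code in (1040, 1072):
    print(code, hex(code), unicodedata.name(chr(code)))
```

Both skipped code points are Cyrillic letters in their own right (capital and small A), distinct from the Latin ones they resemble.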

dstiles

9:40 am on Jun 2, 2021 (gmt 0)

Damn! The Cyrillic got auto-converted again! :(

The tests are for specific character ranges: the first line is coded as percent-encoded Unicode, the second uses actual Cyrillic characters. I omitted the "a" since it looked too much like a Latin "a" and I couldn't be sure. The odd missing character is irrelevant anyway, since there are always others in the group.