How do you submit a sitemap to Yandex so their bots will know to crawl your site?
Don't worry, they will crawl, unless you block them deliberately.
Where "block" means htaccess, because they do not obey robots.txt. They read it avidly-- this fooled me for a long time-- but then ignore it.
I agree with lucy24. On top of that, Yandex also shows up regularly as a "search referrer" where the query is my own domain name. Yeah, right.
Yeah right Yandex. But as long as your bot constantly triggers DoS protection on my server you remain blocked. What is the use of hitting the same webpage a dozen times anyway? I do not change my content every 2 seconds.
Lucy, I have allowed yandex bots on my sites for quite a while and they do not hit pages "blocked" by robots.txt - at least, I haven't seen it happen in my security logs.
In all, my experience of yandex bots on validated rDNS is a good one. Yandex look-alikes - now that may be something else.
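For anyone wondering what "validated rDNS" looks like in practice, here is a minimal sketch of a forward-confirmed reverse DNS check: look up the hostname for the IP, check the suffix, then resolve that hostname back and make sure the original IP is in the answer. The suffix list is my assumption of the domains Yandex uses for its crawlers, the function name is made up for illustration, and gethostbyname_ex only covers IPv4. The two lookups are passed in as parameters so the logic can be tried without touching the network.

```python
import socket

# Assumption: hostname suffixes Yandex uses for its crawlers.
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def is_verified_yandex(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include IP."""
    # Lookups are injectable so the logic can be exercised offline.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(YANDEX_SUFFIXES):
            # A look-alike: the PTR record is not under a Yandex domain.
            return False
        # The hostname must resolve back to the address we started from.
        return ip in forward_lookup(host)
    except OSError:
        # No PTR record or a failed lookup: treat as unverified.
        return False
```

A look-alike that only fakes the user-agent string fails the suffix check, and one that fakes its PTR record fails the forward lookup.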
I never realized until I took a closer look that Yandex always uses the identical IP: 220.127.116.11 for the regular textbot, 18.104.22.168 for the imagebot. That makes it easier to check.
On 10 May I closed off directory /bbb/ to robots. Directory /aaa/ has been closed since more or less the day of its creation (at least 5 years ago).
Textbot's activities, omitting other directories:
11 May: robots.txt
12 May: robots.txt
13 May: /bbb/file, robots.txt, /bbb/file
14 May: robots.txt
15 May: /bbb/file, robots.txt, /bbb/file (these are all different files, I think, but I didn't look closer)
At this point I locked out 22.214.171.124. Forgot about 126.96.36.199, so they stuck around a few days longer. Incidentally, it took about this long-- five days-- for the googlebot to stop crawling directory /bbb/
Yandex carries on, in spite of meeting a steady stream of 403 instead of 200:
16 May: /bbb/file, robots.txt
17 May: /bbb/file, robots.txt
18 May: /bbb/file, /aaa/file, robots.txt, /bbb/file
19 May: /bbb/file, robots.txt, /aaa/file
20 May: /bbb/file, robots.txt
21 May: /bbb/file, robots.txt, /bbb/file
22 May: robots.txt
23 May: robots.txt
24 May: robots.txt, /bbb/file, /bbb/file
Two weeks seems an awfully long time for a robot to not get the message that a particular directory is off limits.
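A rough way to do this kind of check without eyeballing it is to run the requested paths through Python's standard robots.txt parser and list the ones that should have been off limits. This is only a sketch: the robots.txt content and the list of hits below are hypothetical stand-ins for the real file and real log entries, and the function name is invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the setup described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /aaa/
Disallow: /bbb/
"""


def disallowed_hits(robots_txt, user_agent, requested_paths):
    """Return the requested paths that robots.txt forbids for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in requested_paths if not parser.can_fetch(user_agent, p)]


# Hypothetical request paths for one day, standing in for real log entries.
hits = ["/robots.txt", "/bbb/file", "/index.html", "/aaa/file"]
print(disallowed_hits(ROBOTS_TXT, "YandexBot", hits))
```

Anything this prints is a request the bot made to a place robots.txt told it not to go.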
Yandex has a much wider range than two IPs. I block the image bot but let in lots of IPs for the text bot.
Oh, I know. My actual blocking has 188.8.131.52/18 and 184.108.40.206/17. It's only when I looked in my raw logs that I realized how consistent they are. Regional, maybe.
My text editor speaks fluent RegEx so I told it to find ^.*?(77\.88\.|Yandex) in the raw logs. The Find All window gives the results in two different colors, making it very easy to eyeball. I'm glad I didn't constrain the search to /bbb/ or I would never have noticed they're still trying to get into /aaa/ even though it has always been roboted.
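The same search can be sketched outside a text editor. The pattern is the one quoted above; the function name is made up, and the input is assumed to be one log entry per line. Returning the matched token alongside the line does roughly what the two colors do: it shows whether the line was caught by the IP prefix or by the name. Note that 77\.88\. is not anchored to the start of the address, so it would also catch something like 177.88.x.x.

```python
import re

# The pattern from the post: any line containing "77.88." or "Yandex".
PATTERN = re.compile(r"^.*?(77\.88\.|Yandex)")


def find_yandex_lines(lines):
    """Return (matched_token, line) pairs for every line the pattern matches."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # group(1) is whichever alternative appears first in the line.
            results.append((match.group(1), line))
    return results
```

Usage would be something like find_yandex_lines(open("access.log")), where access.log is a placeholder for whatever the raw log file is called.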
Wow, didn't realize how old this thread was. Seem to have misplaced a few months.
Around mid-August I decided to give the regular yandexbot (77.88 range) another shot. So far they have been behaving nicely. Then a couple of days ago for arcane technical reasons I had to un-Deny the imagebot (95.108 range). Could have re-blocked it via mod_rewrite but didn't have the energy.
It took the imagebot about 12 hours to realize that it was no longer blocked-- and then it went absolutely hysterical with excitement. It's been picking up everything in sight at a blazing pace, sometimes as little as 1 minute, 15 seconds apart. (I didn't make up this number. 75 seconds seems to be its absolute speed limit. Most of the time it's more like three minutes between hits. Do you suppose Yandex is on dialup? :))
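For anyone who wants to measure a bot's pace rather than estimate it, here is a small sketch that takes the timestamps of its hits and returns the gaps between consecutive requests. It assumes timestamps in the usual Apache log format; the sample values are invented for illustration and are not taken from any real log. Month names are parsed with %b, which depends on the locale being English.

```python
from datetime import datetime

# Assumed Apache-style timestamp format, e.g. 01/Sep/2012:10:00:00 +0000
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def hit_gaps_seconds(timestamps):
    """Return the gaps in seconds between consecutive hits, in time order."""
    times = sorted(datetime.strptime(t, LOG_TIME_FORMAT) for t in timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Invented timestamps, for illustration only.
sample = [
    "01/Sep/2012:10:00:00 +0000",
    "01/Sep/2012:10:01:15 +0000",
    "01/Sep/2012:10:04:15 +0000",
]
gaps = hit_gaps_seconds(sample)
print(min(gaps), gaps)
```

The smallest gap is the bot's effective speed limit over the period covered.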
I've been keeping close track and so far the imagebot has only made one visit to a place it wasn't supposed to go. But I've decided to cut it some slack because the last time it tried to get that specific file, the directory wasn't yet roboted-out. So the url was already on the shopping list.
Truth is, there is no one better than Google. The others just claim to be. If they were better, they would be on TOP! But they are not.
Google's algorithm is far better than any other search engine's.
Have you had any visitors from Yandex yet?
And if so, where were they from ?
The reason I'm asking: I just went through a bunch of log files, and Yandexbot visits often even though it is blocked for the moment.
I am debating whether or not to let it in but my main blocking point is their location.
To get the most from any of my sites, the visitor needs a reasonable understanding of English (a translation service won't do), and I don't want to get just the RU script kiddies as visitors.
"Have you had any visitors from Yandex yet? And if so, where were they from?"
You mean human visitors coming from yandsearch? I've always had a scattering of those.
:: shuffling papers ::
Oh dear God. Someone in Kazakhstan wants to read Grandmother Puss? Untranslated, at that.
I'm glad you asked, because it turns out I've missed a whole group of robots. Call them secondary robots: they go to yandex.ru and search for my domain name. If they get redirected (to www.), they hand off to a brother robot to pick up the page. In general it's something like Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) from assorted places, generally in the 95. area. Three or four times a month, front page only, no images, no skin off my nose.
Scattered amongst Messrs. Roboto are a handful of bona fide human searches. I particularly like the one who asked for сусанна дура and was handed the Susanna Memorial Doorway. Wonder what they were really looking for? Not a hotlink, anyway; I eventually had to block image searching in this area because people kept grabbing names.
Thank you Lucy24, that was enlightening.
As I said higher up in this thread, I have also seen Yandex as referrer for a search for any of my domain names .... as if ;o)
I'll just keep Yandex blocked from crawling since it doesn't seem worth it to let them in.
The point of the OP is that yandex is now available in English so anyone looking for an alternative to a certain other SE now has another choice.
I saw yandexbot coming from Turkey this week, and earlier in August I got a bot IP from the USA. Hopefully, at present, they share all of the information collected, since by far the most yandexbot hits I get are from Russia and they have been crawling (very politely) for a long time.
Thank you dstiles, that's interesting.
I'll keep my eyes open for the bot coming from an IP outside RU.