Search Engine Spider and User Agent Identification Forum

Yanga WorldSearch Bot
keyplyr

Msg#: 3742926 posted 7:03 pm on Sep 11, 2008 (gmt 0)

Read robots.txt, then disobeyed it and requested disallowed files. Russian owned, no contact info. Banned until they conform.

77.91.224.6 - - [11/Sep/2008:09:17:39 -0400] "GET /robots.txt HTTP/1.0" 200 4651 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"


incrediBILL

Msg#: 3742926 posted 1:35 am on Sep 12, 2008 (gmt 0)

It appears to be associated with Webalta:

inetnum: 77.91.224.0 - 77.91.224.255
netname: WEBALTA-NET
descr: WEBALTA / Internet Search Company
descr: Search Engine Servers
country: RU

keyplyr

Msg#: 3742926 posted 6:50 am on Sep 12, 2008 (gmt 0)

Well then, banned by range :)
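
For anyone else doing the same, a minimal .htaccess sketch, assuming Apache 2.x with mod_authz_host (the /24 matches the inetnum above):

# Deny the whole WEBALTA-NET range while letting everyone else through
Order Allow,Deny
Allow from all
Deny from 77.91.224.0/24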

dstiles

Msg#: 3742926 posted 9:44 pm on Sep 19, 2008 (gmt 0)

I've been getting this bot on the home page of most domains on my server over the past couple of hours, but from the range 91.205.124.*, which resolves as:

inetnum: 91.205.124.0 - 91.205.127.255
netname: GIGABASE-NET
descr: Gigabase Ltd
country: RU

Rather odd that it claims in the UA to be a UK domain until one reads the blurb at gigabase.com, which claims it's multi-national...

"The company was registered on August 2008 and is financed by it's foundators solely." (what's a foundator?)

Sounds like it's masquerading as a UK company without actually being one. GIGABASE Ltd isn't registered with Companies House (or CH isn't admitting it is!), and nor is Yanga under that name.

yanga.co.uk is hosted at 92.241.182.*, which is a Russian colo (Wahome colocation). The yanga.co.uk site seems to consist of a single home page with a search box - no links to any other pages saying who/what/why.

Google returns four hits on "GIGABASE Ltd", one of which is Webalta in Russian.

keyplyr

Msg#: 3742926 posted 8:38 am on Sep 20, 2008 (gmt 0)

The old shell game. Webalta/Yanga has been at it for over a year now. Anyone know what they're doing with the data?

91.205.124.6 - - [20/Sep/2008:01:18:56 -0400] "GET /index.html HTTP/1.0" 403 461 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"

Megaclinium

Msg#: 3742926 posted 11:53 pm on Sep 20, 2008 (gmt 0)

I had them, kind of stupidly, hit me from the same address ending in .14.

I say stupidly because they tried to grab my media files directly, without going through my web pages.

And of course, with leech protection for .jpgs enabled, they all failed! Didn't even have to 403 them, though now I have.
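
A minimal sketch of that kind of leech protection, assuming Apache with mod_rewrite (example.com is a placeholder for the real domain):

# Refuse image requests whose Referer isn't one of this site's own pages
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
RewriteRule \.(jpe?g|gif|png)$ - [F]

Note this also 403s direct requests that send an empty Referer - which is exactly what trips up bots grabbing media without visiting the pages first.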

Strangely, I had something claiming to be a Yahoo vertical mail crawler try something similar. Regular Slurp doesn't even touch my media files nowadays, though it did for a while.

Yahoo or not (is it really Yahoo? It seems to be one of their ranges), I had to ban them for being stupid and ignoring 302 errors.

dstiles

Msg#: 3742926 posted 1:21 am on Sep 21, 2008 (gmt 0)

... doing with the data... No idea. I only looked at the Google cache version of the site, so I did no searches.

Being Russian - I don't suppose it's part of the botnet game? Nah. Too public, surely. And if it's going straight for media (can't corroborate; haven't checked the site logs) then there wouldn't be much point. Still, again being Russian, it's banned.

Lord Majestic

Msg#: 3742926 posted 11:23 pm on Sep 22, 2008 (gmt 0)

This is the Webalta search project, purchased by some other Russian entity (a mobile content portal, artfon.com - notice the Webalta logo on the press release they have) who appear to have worldwide search ambitions. I do not think they are botnet-related, even though they are certainly of Russian origin.

ignoring 302 errors

A 302 is not an error - it's effectively a temporary redirect.

--

This is not to defend their practices or intentions; I'm just telling you what I know about this user-agent.

nativenewyorker

Msg#: 3742926 posted 2:31 am on Oct 12, 2008 (gmt 0)

Suspicions about Yanga being malevolent may be justified. There are reports of a Yanga search engine hijacking IE and FF browsers and replacing Google with itself.

yanga search engine - mozilla.feedback.firefox ¦ Google Groups [groups.google.com]

unimaximus

Msg#: 3742926 posted 6:08 pm on Oct 13, 2008 (gmt 0)

Hello guys,

My name is Alexey and I am the owner and CEO of the Yanga project. Right now we have only one search cluster, and if this cluster is down we fall back on the Yahoo API as the next cluster. Sorry, but we don't have the money for two clusters yet :( At the moment we use 100% our own results.

We also have backlinks for SEO [yanga.co.uk...] with text links.

I don't have any botnets; we plan to start a partnership program with toolbar traffic (like Ask, Google, Miva, Yahoo, etc.).

If you have any questions - write me :)

P.S. wahome.ru is the biggest Russian datacenter, with 6000 servers.

jdMorgan

Msg#: 3742926 posted 7:45 pm on Oct 13, 2008 (gmt 0)

Hello Alexey, and welcome to WebmasterWorld!

It seems that the logic in your robots.txt parser needs some improvement. Although our robots.txt file is whitelist-based and Yanga does not appear on the whitelist, it still attempts to fetch pages:
91.205.***.8 - - [13/Oct/2008:07:06:08 -0700] "GET /robots.txt HTTP/1.1" 200 3157 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"
91.205.***.8 - - [13/Oct/2008:07:06:09 -0700] "GET / HTTP/1.1" 403 666 "-" "Yanga WorldSearch Bot v1.1/beta (http://www.yanga.co.uk/)"

As you might understand, webmasters become very suspicious when a robot violates robots.txt. As an example, since I use a whitelist, I observe new robots that appear in our access log file, and only take action to "Allow" those that seem to offer some advantage (that is, search-driven traffic) and obey the initially-denied state expressed in our robots.txt file. Unfortunately, Yanga failed this test.

To be clear, our robots.txt is constructed like this (simplified example):

# Whitelisted user-agents are allowed
User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
Disallow: /admin
Disallow: /cgi-bin

# Disallow all others
User-agent: *
Disallow: /


As you can see, the four named user-agents are allowed to fetch everything except two URL-paths, while all other user-agents are disallowed. This construct is in full compliance with the Standard for Robot Exclusion (Martijn Koster).

Yanga does not parse this file correctly, and attempts to fetch resources from the site. All of the "allowed" user-agents parse this robots.txt file correctly, as do many other "disallowed" user-agents.

I strongly suggest fixing this problem before your robot's reputation is destroyed by threads like this one, many of which will be less-informed and more suspicious.

Jim

incrediBILL

Msg#: 3742926 posted 2:24 am on Oct 14, 2008 (gmt 0)

Jim,

Are you aware that the Live robots.txt validator doesn't like that format?

Error: MSNBOT isn't allowed to crawl the site.
**************************************************
Line #3: User-agent: slurp
Error: 'user-agent' tag should be followed by a different tag.
**************************************************
Line #4: User-agent: msnbot
Error: 'user-agent' tag should be followed by a different tag.
**************************************************
Line #5: User-agent: teoma
Error: 'user-agent' tag should be followed by a different tag.
**************************************************

Google likes it, and it should be valid, but Live's validator doesn't.

Don't know if msnbot reads that right or wrong.

Anyway, not trying to hijack the thread away from Yanga, just pointing out the SEs have disagreements on that particular file implementation.

That's another reason I went to a dynamic robots.txt, served up on demand, so there's no room for misinterpretation of my exact intent.
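
The idea, as a minimal sketch assuming Apache with mod_rewrite and a hypothetical robots.php script that emits "Disallow: /" for any user-agent not on the whitelist:

# Route every robots.txt request through a script that tailors
# the response to the requesting User-Agent
RewriteEngine On
RewriteRule ^robots\.txt$ /robots.php [L]

The script can then emit a single simple record per bot, instead of one multi-record file that every parser must interpret correctly.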

jdMorgan

Msg#: 3742926 posted 2:41 am on Oct 14, 2008 (gmt 0)

Live's 'bot handles it fine, but their validator apparently doesn't use the same parser. The validator is just wrong in this case -- it's perfectly valid to construct policy records that apply to more than one user-agent, and this has been in the Standard from Day One.

On this particular site, I don't have the option of doing a dynamic robots.txt -- the host aliases robots.txt to a script of their own that apparently checks that the shopping-cart scripts they provide are Disallowed (or some such check), and if so, pipes the customer's robots.txt through that script. As a result, we're into the content-handling phase of the server API -- no SSI, no scripts, and no mod_rewrite available any more. It's the only thing I really dislike about this particular host. If I move, that'll be why.

Nevertheless, this construct was included in the original Standard, and I continue to take search engines to task if they mishandle it. I've already gone one round with Live, and the result was that they fixed that aspect of their parser, as well as the previous (very annoying) problem of not differentiating their various user-agent strings when parsing robots.txt. So they do listen and act. Their support group is aware that the validator's parser needs to be updated and synced with the real one -- hopefully that will be acted upon, too.

I tend to make a lot of noise at the search engines themselves, and only gripe here if they do nothing... Next up is Cuil; I've tried just about everything to make Twiceler aware that it can fetch some pages, but it's just not very smart.

Jim

koan

Msg#: 3742926 posted 8:03 am on Nov 11, 2008 (gmt 0)

On the homepage, their example search is: "Example: escorts in virginia". I really don't think they're legit.

keyplyr

Msg#: 3742926 posted 10:27 am on Nov 12, 2008 (gmt 0)

Six months ago I decided to allow Yanga to crawl. While I'm not seeing measurable traffic from them yet, they do return my pages in their SERPs for the appropriate search terms. I'm also not seeing spammy SERPs for my topics. Time will tell.

enigma1

Msg#: 3742926 posted 12:52 pm on Nov 15, 2008 (gmt 0)

When I check my logs, they come from an IP that does not resolve properly. I check for them by user-agent and send them unknown binary content if detected. As far as I could see, they had a single page from my site indexed, and several on "how to crack" my applications.
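
A minimal sketch of that kind of user-agent trap, assuming Apache with mod_rewrite; the "Yanga" pattern and decoy.bin are placeholders:

# Send matching user-agents a decoy binary file instead of real content
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/decoy\.bin$
RewriteCond %{HTTP_USER_AGENT} Yanga [NC]
RewriteRule .* /decoy.bin [L]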

soothsayer

Msg#: 3742926 posted 7:07 am on Jan 19, 2009 (gmt 0)

Found out about this bot when they were slurping up my websites wholesale. I tried blocking them: first llnw.net, then the bots switched to 77.91.224.*, next to a telenet.be address; there's also a 91.205.124.* range that they use. All these addresses pass through the llnw.net network.

[edited by: incrediBILL at 1:12 pm (utc) on Jan. 23, 2009]
[edit reason] removed comment, see TOS #26 [/edit]

KenB

Msg#: 3742926 posted 2:28 pm on Feb 16, 2009 (gmt 0)

What I'm finding with bots like this is that they consume way too much of my server's resources compared to the amount of traffic they generate. All of these "me too" bots spend so much time indexing my site that bots now account for over half of my traffic. At times, hits from the "SE" bots come in so fast from different organizations that they effectively create a DoS that prevents real users from accessing my content for short periods.

The only recourse I see to protect the availability of my site for human users is to ban some of these "me too" SE bots, especially when they don't bring enough traffic to justify their existence and/or they fail to obey the "crawl-delay" directive.

The Yanga bot has been put on my ban list for the above reasons, and because it is on a Russian IP address while claiming to be a UK company. So many spambots are coming out of Russia now that banning Russian-based bots is a necessity.
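
For the record, a minimal sketch of that kind of user-agent ban, assuming Apache 2.x with mod_setenvif (the pattern is just an example):

# Flag requests whose User-Agent contains "Yanga", then deny them
SetEnvIfNoCase User-Agent "Yanga" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot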

One solitary post by the "supposed" owner, without substantiation, is not enough to convince me that their game is legit.
