
Rogue Yahoo-MMAudVid crawler

Yahoo crawler hammers website, ignores robots.txt


tangent

8:22 pm on Sep 3, 2004 (gmt 0)

10+ Year Member



A Yahoo crawler called Yahoo-MMAudVid downloaded eleven of my MP3 files in full at the end of July, using up 3% of my monthly bandwidth (50 Mbytes) in nine days. That's rather greedy compared with Google, which downloads less than 2 Mbytes each month. Why would Yahoo want MP3 files? Is it a mistake in their robot program?

Not only that, but it's now downloading more MP3 files while still ignoring the robots.txt file. And the log entries contain a false email address that bounces:

mms-mmaudvidcrawler-support@yahoo-inc.com

I've written to Yahoo to complain but of course they haven't answered. What can I do to stop it?

volatilegx

1:56 am on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds like a fake Yahoo bot to me. I'd ban it.

tangent

7:31 pm on Sep 6, 2004 (gmt 0)

10+ Year Member



When you say a fake Yahoo bot, you mean a bot that's pretending to come from Yahoo but doesn't? No, it appears to be coming from a genuine Yahoo server address, which makes it all the more surprising, because you wouldn't expect a large corporation to run a rogue bot (with a fake email address in its log entries) that doesn't conform to bot standards.

It's stopped now, so maybe my email had some effect after all.

You say to ban it but how would I do that if it ignores robots.txt?

wilderness

11:40 pm on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but how would I do that if it ignores robots.txt

A simple beginning
[webmasterworld.com...]

tangent

1:37 pm on Sep 7, 2004 (gmt 0)

10+ Year Member



Hey, that's not fair. You said simple beginning. As in simple moon buggy, huh?

It would take me 40 years to fathom that out.

So where do I start? What language is it written in, and what's the minimum I have to specify to prohibit server 111.222.33.44?

wilderness

4:51 pm on Sep 7, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So where do I start? What language is it written in, and what's the minimum I have to specify to prohibit server 111.222.33.44?

The Apache guys/girls can answer any inquiries you might have regarding "language" terms.

The link previously provided uses "htaccess", which is an Apache feature (methinks).

Here are some other explanations for you to explore:
(The first is likely the simplest answer.)

[webhelpinghand.com...]
[baremetal.com...]
[edginet.org...]
[dimi.uniud.it...]
[webhelpinghand.com...]

and if none of that provides enough depth?
You may begin where I did, before joining WebMaster World:
[google.com...]

I became aware of htaccess through monitoring Usenet threads in alt.www.webmaster 5-6 years ago.
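
For reference, here's a minimal sketch of the kind of block those links describe, using Apache's access-control directives in an .htaccess file (111.222.33.44 is the placeholder address from your question, not a real Yahoo server):

# Allow everyone except the one offending server address
Order Allow,Deny
Allow from all
Deny from 111.222.33.44

Requests from that address then get a 403 Forbidden response, regardless of what the bot does with robots.txt.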

tangent

8:54 am on Sep 8, 2004 (gmt 0)

10+ Year Member



This is exactly what I need to get started, thanks. Protecting directories, custom error pages, denying access and redirecting pages are all subjects I need to know. I've skimmed through all the references and can see myself being quite busy over the next few evenings.

YahooMMS

1:19 am on Sep 16, 2004 (gmt 0)

10+ Year Member



Before you go out and wholesale ban our bot, a few comments:

Yahoo-MMAudVidCrawler is a real Yahoo bot. It's seeking out audio and video files (streaming and downloadable) for indexing. The audio and video files are analyzed for ranking, keyframes, and other meta information.

I apologize for the bounced email address. We are human as well, and it seems we were a little sloppy and forgot the mms-mm prefix when setting up the crawler response mailing list.

-MMS Team, Yahoo

tangent

3:09 am on Sep 16, 2004 (gmt 0)

10+ Year Member



Thanks Yahoo, I appreciate the response.

volatilegx

4:20 am on Sep 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not only that but it's now downloading more MP3 files and at the same time ignoring the robots.txt file.

Ignoring the robots.txt convention is a good reason to ban even a real search engine bot. YahooMMS, is the Yahoo-MMAudVid crawler designed to obey robots.txt?

Does this bot have anything to do with Yahoo's purchase of MusicMatch?

jcoronella

6:19 am on Sep 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can we have a crawler string to add to robots.txt, please, YahooMMS? I'm sure it's a great tool, but my files have no metadata for you to look at.

xx.xx.x.x - - [06/Sep/2004:00:00:00 -0500] "GET /somepath/somefile.ext HTTP/1.1" 200 164236 "-" "Yahoo-MMAudVid/1.0 (mms dash mmaudvidcrawler dash support at yahoo dash inc dot com)"

Do you really need all of those big files?

Added to robots.txt:
User-Agent: Yahoo-MMAudVid/1.0
Disallow: /

Will watch... and ban thereafter.

volatilegx

1:34 pm on Sep 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



YahooMMS,

Welcome to WebmasterWorld and thank you for posting about your bot :)

bull

4:23 pm on Sep 16, 2004 (gmt 0)

10+ Year Member



Would jcoronella's robots.txt entry (=the UA) be correct, or would

User-agent: Yahoo-MMAudVid
Disallow: /

fit too?

tangent

5:08 pm on Sep 16, 2004 (gmt 0)

10+ Year Member



Would jcoronella's robots.txt entry (=the UA) be correct, or would
User-agent: Yahoo-MMAudVid
Disallow: /

fit too?


Not at the moment, because the bot doesn't look at robots.txt, but hopefully it will in the future. (For what it's worth, the original robots.txt standard recommends that robots match the User-agent field as a case-insensitive substring of their name, without version information, so the versionless form should be the safer of the two once it does.)

jdMorgan

6:50 pm on Sep 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, zero tolerance. No robots.txt compliance, no admittance.

# Return 403 Forbidden to anything identifying itself as Yahoo-MMAudVid
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Yahoo-MMAudVid/
RewriteRule .* - [F]

Of all companies, Yahoo should understand this by now. If you're a household name and you're visiting our households, be polite. Unfortunately, it appears that Yahoo has a bunch of broken technology that it bought from others and does not seem to take the problems seriously, viz. the ongoing 301-redirect problem.

Hopefully the word is getting around that robots.txt compliance is not optional. Personally, I don't mind missing a few crawls until a 'bot proves that it is working properly.

Everybody likes to gripe about MS, but I sent them a problem log about MSNBOT/0.1, got a personal reply, and they fixed the problem within three weeks with msnbot/0.11. Beat that.

The potential for media bots to cause problems is high. Take a look at the log entry that jcoronella posted: roughly 160 Kbytes for one file. What if Yahoo hit his site 10,000 times within a few days, and many of his media files were 100 times that big? That's around 16 Mbytes per file, or on the order of 160 Gbytes of transfer. That could get an unsuspecting Webmaster booted by his host, or subject him to monetary bandwidth overage penalties.

Jim

YahooMMS

2:40 pm on Sep 17, 2004 (gmt 0)

10+ Year Member



Hello Jim,

We take robots.txt and the impact our crawlers have on WebMasters very seriously. Coming from one of those companies you speak of (the one deeply involved in setting up the original robots.txt standard), we recognize that our engine is nothing without content, and that we can be 'banned' at a moment's notice.

Unfortunately, a mistake in setting up email aliases for this particular crawler has kept us in the dark (and frustrated webmasters who got a 'bounce') for the past several weeks. The email address provided by the bot should now be working (mms dash mmaudvidcrawler dash support at yahoo dash inc dot com), and webmasters who notice a problem should now at least be able to vent their frustration and get some relief.

The current AudVid system has a delay in responding to robots.txt. While we adhere to robots.txt in choosing the content, it may take several weeks for this to propagate to the content previously scheduled for crawling. My recommendation is to email the above support alias and we will work to remove the host from the scheduling queue. We are, of course, working to reduce the delay in applying robots.txt.

Note that the image crawler (MMCrawler, not MMAudVid) and Slurp are separate systems that do not suffer from the same robots.txt delay and are better behaved at the moment. Also, the 301 redirect problem has been fixed (many thanks for the often hilarious emails sent to the MMCrawler alias regarding this bug), and the new content being scheduled does not suffer from this problem.

- MMS Team, Yahoo

jdMorgan

2:29 am on Sep 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The current AudVid system has a delay in responding to robots.txt. While we adhere to robots.txt in choosing the content, it may take several weeks for this to propagate to the content previously scheduled for crawling. My recommendation is to email the above support alias and we will work to remove the host from the scheduling queue. We are, of course, working to reduce the delay in applying robots.txt.

I don't like to give people a hard time; it's not my nature at all, and I don't mean to do that here. But the answer above reflects the problem pretty clearly. I've got the knowledge and the technology to keep myself out of trouble with respect to spiders causing bandwidth overages on my sites by way of uncontrolled spidering. But a lot of Webmasters have neither.

With respect, I'd suggest that a robot that has a weeks-long delay in handling robots.txt should be taken off-line and fixed. The current non-robots.txt-compliant to-be-fetched queue should be purged. Not because those few of us who know how might block (403 response) the spider for bandwidth-limitation reasons -- that wouldn't even make a dent in the 'quality' of your content. But because the thing is simply not ready to go.

Spiders should fetch robots.txt first, and obey it when crawling. If the primary problem is one of hosting bandwidth quotas, it does sites absolutely no good if you spider the whole site first and then drop the content later -- a la <meta robots> exclusion. That's my main point: Multimedia files are very large compared to straight HTML pages. With robots.txt latency or compliance problems, the potential for getting small-time Webmasters booted off their hosts is considerable, and therefore the due-diligence burden on multimedia spiders is higher.

A note -- Some spiders examine the 'expires' response header of robots.txt and re-fetch it accordingly. That's a pretty good idea if the header is present and indicates a very short expiry time such as a few hours or less. If the expiry time exceeds the normal crawl cycle, you can just ignore it. But if it's very short, then several re-fetches within the crawl might be needed. (I discovered this accidentally when I inadvertently set the expiry time on a robots.txt file to 60 seconds instead of 60 minutes while doing some heavy changes. The pattern in the log files indicated that the robots were re-fetching robots.txt before each and every content page... Doh!) Anyway, in that light, it's pretty clear that the current robots.txt-handling would need some serious re-design to handle 60-second robots.txt expiry. Hopefully, this is a 'limit' case, but it should be handled.
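
For anyone who wants to experiment with that, here's a minimal sketch of setting a short expiry on robots.txt with Apache's mod_expires (assuming the module is loaded; the one-hour value is only an illustration):

# Send robots.txt with an Expires header one hour after each access
<Files "robots.txt">
ExpiresActive On
ExpiresDefault "access plus 1 hour"
</Files>

A compliant spider that honors the header would then re-fetch robots.txt at most hourly during a long crawl.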

The 301 redirect problem I was referring to is the one with Slurp; it has been mishandling 301 redirects for many months, and AFAIK we're still waiting for word from Tim on this one. Sorry for the ambiguity.

Thanks for the response and good luck with your project.

Jim

wilderness

2:56 am on Sep 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



With respect, I'd suggest that a robot that has a weeks-long delay in handling robots.txt should be taken off-line and fixed. The current non-robots.txt-compliant to-be-fetched queue should be purged. Not because those few of us who know how might block (403 response) the spider for bandwidth-limitation reasons -- that wouldn't even make a dent in the 'quality' of your content. But because the thing is simply not ready to go.

I agree with Jim, and I'm neither as tolerant nor as polite as he is.
"403's"

YahooMMS

5:43 am on Sep 18, 2004 (gmt 0)

10+ Year Member



Point taken Jim,

Speeding up the robots.txt refresh is a high priority for us (robots.txt hammering is less of an issue, but still a problem, for audvid files).

A further note: we have put in place an IP-based, breadth-first crawl policy to limit bandwidth exposure (that is, for a given IP, we download at most one audvid file at a time). We have also taken pains to design a system that will try not to download a file again if it has not changed (though this can be difficult with streaming files and live streams). The goal is that websites will never need to institute a last-minute robots.txt block to limit their bandwidth consumption if they want their content indexed. Of course, some hosts are served by multiple IPs, so there are still some problems.

The issue remains regarding image/audio/video and the relative gains/negatives for making this content indexable.

- Late night MMS Team

tangent

8:21 am on Sep 18, 2004 (gmt 0)

10+ Year Member



We take robots.txt and the impact our crawlers have on WebMasters very seriously.

erm... I've just checked the log entries, and there are no requests for robots.txt from 206.190.43.xyz or containing MMAudVid during the past six months.

Here are a couple of sample log entries:

206.190.43.101 - - [20/Jul/2004:19:26:32 +0100] "GET /sounds/filename.mp3 HTTP/1.1" 200 4701571 "-" "Yahoo-MMAudVid/1.0 (mms dash mmaudvidcrawler dash support at yahoo dash inc dot com)"
206.190.43.49 - - [02/Sep/2004:20:47:17 +0100] "GET /sounds/filename.mp3 HTTP/1.1" 200 5782428 "-" "Yahoo-MMAudVid/1.0 (mms dash mmaudvidcrawler dash support at yahoo dash inc dot com)"

Ironically, the second one's a sermon entitled "Dealing with Disappointment".

... the potential for getting small-time Webmasters booted off their hosts is considerable...

Indeed it is. A different robot once used up 10% of my monthly bandwidth in three days by hammering MP3 files, but in that case a quick email got the problem fixed and their robot blocked from my site within seven hours.

YahooMMS

4:41 am on Sep 21, 2004 (gmt 0)

10+ Year Member



tangent,

The robots.txt is not checked by the same machine or agent that does the audvid download. But yes, robots.txt is looked at when constructing the crawl list. If you have a problem with this bot, please send an email to the support address listed.

bull

9:07 am on Sep 21, 2004 (gmt 0)

10+ Year Member



The robots.txt is not checked by the same machine or agent that does the audvid download.

You surely know you are the only one in the search industry doing this? Even Googlebot/Test fetched robots.txt with its correct User-agent. I think the consensus is that each distinct User-agent has to fetch robots.txt itself.