FAST Grabbing Massive amounts of mp3s

From the "403 - Denied" dept.

djv

1:47 am on May 25, 2001 (gmt 0)

If there's one thing I HATE about the FAST crawler, it's the fact that it grabs mp3 files when it crawls. What use is this? What indexable information is there in an mp3 file? I blocked FAST from my site today in robots.txt for just this reason (it was swamping my bandwidth by downloading 6 mp3s at once, and I got ticked off), but the spidering and grabbing of all my mp3s just continued. In fact, the crawler has attempted to initiate over 530 downloads of mp3s from my site today since 2:00 pm CDT without ONCE checking robots.txt. I've had to deny access via my .htaccess file instead. Can anyone explain exactly WHY they're grabbing mp3s? Are they in cahoots with the RIAA? Is someone at fast.no trying to build the world's largest mp3 library? I see no logical reason for this monumental waste of bandwidth.
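
For anyone following along, here is a rough sketch of the two kinds of block mentioned above. The post does not show the actual rules used, so everything here is an assumption: "ExampleFastBot" is a placeholder user-agent token (not FAST's real one), and the .htaccess part uses the classic Apache Order/Allow/Deny directives matched on User-Agent, which may not be how the block was really done. robots.txt only asks a crawler to stay away; the .htaccess rule is what makes the server answer 403.

```shell
#!/bin/sh
# Write both files into a scratch directory; on a real server they would
# live in the site's document root. "ExampleFastBot" is a placeholder token.
site=$(mktemp -d)

# Polite request: well-behaved crawlers read this file and stay away.
cat > "$site/robots.txt" <<'EOF'
User-agent: ExampleFastBot
Disallow: /
EOF

# Enforcement: Apache answers matching requests with 403 Forbidden.
# Classic Order/Allow/Deny syntax from older Apache releases; assumes the
# server config permits these directives in .htaccess (AllowOverride).
cat > "$site/.htaccess" <<'EOF'
SetEnvIfNoCase User-Agent "ExampleFastBot" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
EOF

# Show what was written.
ls -A "$site"
```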

Brett_Tabke

2:21 am on May 25, 2001 (gmt 0)

Fast has spidered for the FTP and multimedia search market for a few years. The primary user is Lycos:

[mp3.lycos.com]
[ftpsearch.lycos.com]

Also at the bottom of the AllTheWeb/Fast page is Mp3 Search and FTP search.

Not sure who else is using the data.

djv

2:33 am on May 25, 2001 (gmt 0)

Well, for that reason I can see them spidering pages that link to mp3s, or maybe even doing a "HEAD foo.mp3" to make sure the file exists, but to download the whole thing? I don't think that's necessary to build an effective database of sites with mp3s to use in a search engine. It still makes no sense to me.
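
To illustrate the HEAD idea: a HEAD request returns only the response headers, so a client can confirm a file exists and learn its size without transferring the body. A minimal sketch using curl's -I option follows; the helper name head_size and the example.com URL are made up for illustration, and this is not a claim about how any crawler actually works.

```shell
#!/bin/sh
# head_size URL: print the Content-Length reported for URL.
# curl -I sends a HEAD request, so the file body is never downloaded.
# (Hypothetical helper; redirects are not followed.)
head_size() {
    curl -sI "$1" \
        | tr -d '\r' \
        | awk -F': ' 'tolower($1) == "content-length" { print $2 }'
}

# Hypothetical usage; example.com stands in for a real host:
# head_size "http://example.com/music/foo.mp3"
```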

EDIT:
Plus, that still doesn't explain the poor behavior of the spider, in that it has now made almost 600 attempts at downloading mp3s without once looking at robots.txt. As proof:

[djv@djandwes djv]$ grep Fast /etc/httpd/logs/access_log | grep 403 | wc -l
594

In English, that command asks how many lines in my server access log contain both the strings "Fast" and "403" (as in 403 - Forbidden). The answer? 594

[djv@djandwes djv]$ grep Fast /etc/httpd/logs/access_log | grep 403 | grep robots.txt | wc -l
0

In English, that command asks how many lines in my server access log contain all three strings "Fast", "403" and "robots.txt". The answer? 0
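
The same two counts can be tried on a small sample to see how the pipeline behaves. The log lines below are invented for illustration (the addresses, paths, and the "ExampleFastBot" user agent are all placeholders), written in Apache's combined log format. Note that grep 403 matches "403" anywhere on the line, not just in the status field, so on a real log the count is an approximation.

```shell
#!/bin/sh
# Three made-up access-log lines in Apache combined format.
# "ExampleFastBot" is a hypothetical user agent, not FAST's real one.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.10 - - [24/May/2001:14:05:01 -0500] "GET /music/a.mp3 HTTP/1.0" 403 289 "-" "ExampleFastBot/0.1"
192.0.2.10 - - [24/May/2001:14:05:02 -0500] "GET /music/b.mp3 HTTP/1.0" 403 289 "-" "ExampleFastBot/0.1"
192.0.2.20 - - [24/May/2001:14:06:00 -0500] "GET /robots.txt HTTP/1.0" 200 64 "-" "OtherBot/1.0"
EOF

# Lines containing both "Fast" and "403": denied requests from the crawler.
denied=$(grep Fast "$log" | grep 403 | wc -l | tr -d ' ')

# Of those lines, the ones that also mention robots.txt.
robots=$(grep Fast "$log" | grep 403 | grep robots.txt | wc -l | tr -d ' ')

echo "denied=$denied robots=$robots"   # prints: denied=2 robots=0
rm -f "$log"
```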

Brett_Tabke

11:17 am on May 25, 2001 (gmt 0)

We do have several Fast guys who read here. Any help on this one?

djv

6:24 pm on May 25, 2001 (gmt 0)

One of the admins from FAST did e-mail me back today (yesterday was a holiday for them, they weren't in the office), and the problem was a new spider they were testing. He apologized, and the spider has stopped molesting my server. Thanks to the FAST guys for reading their e-mail and acting on it.

Mike_Mackin

6:41 pm on May 25, 2001 (gmt 0)

Welcome to WmW djv

"molesting my server"
I like that phrase!

rc should make a note :)

JuniorHarris

5:28 am on May 27, 2001 (gmt 0)

Was the spider corrected, or did they just exclude your site? If the spider was fixed, does it now obey the robots.txt exclusion directive? <just curious>