
Search Engine Spider and User Agent Identification Forum

    
Singingfish extractor not obeying robots.txt
sound files hosted at different domain than website
Finder




msg:398002
 5:42 pm on Jun 9, 2003 (gmt 0)

We host a small number of sound files for a friend in a directory that is excluded by robots.txt. The website that links to them, however, is on a different domain that is not excluded, e.g.:

www.foomusic.com (the website that links to the files)
www.mydomain.com/protected/soundfile.ra (a sound file in the excluded directory)

Singingfish.com's spider sees the original website and declares it open season on the sound files. The extractor then blindly hits those sound files without checking the robots.txt file of the hosting domain.
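For illustration, here is a minimal Python sketch of the check one would expect an extractor to make before requesting a media file: consult the robots.txt of the host that actually serves the file, not the site that links to it. This is not Singingfish's code. The function name `may_fetch`, the user agent string `ExampleExtractor`, and the rule `Disallow: /protected/` are assumptions made up for the example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def may_fetch(media_url, robots_txt_by_host, user_agent="ExampleExtractor"):
    """Return True only if the host serving media_url permits user_agent."""
    host = urlsplit(media_url).netloc
    parser = RobotFileParser()
    # A real extractor would download http://<host>/robots.txt here. This
    # sketch reads the file contents from a dict so it runs without network
    # access; a host with no entry is treated as having an empty robots.txt.
    parser.parse(robots_txt_by_host.get(host, "").splitlines())
    return parser.can_fetch(user_agent, media_url)


# Hypothetical robots.txt for the hosting domain (assumed rule).
robots = {
    "www.mydomain.com": "User-agent: *\nDisallow: /protected/\n",
}

print(may_fetch("http://www.mydomain.com/protected/soundfile.ra", robots))
print(may_fetch("http://www.foomusic.com/index.html", robots))
```

The key point is that the lookup is keyed on the host in the media URL, so a link from www.foomusic.com does not override the exclusion on www.mydomain.com.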

I sent them an email. In the meantime, I am banning their extractor:

extractor.singingfish.com
63.251.169.234

ADDED: Got a reply to my email. They do not seem concerned that their extractor ignores robots.txt. Instead, they said they would add my domain to their exclusion list and run a script to remove our files from their db. Not what I was hoping for...

 

ritch_b




msg:398003
 3:28 pm on Jun 11, 2003 (gmt 0)

Adding your domain to an exclusion list seems a pretty inefficient way of sorting things out. Surely getting the spider to read robots.txt is the way to do things, but hey, who are we to ask? ;)

As an aside, did you consider blocking the spider via .htaccess?

R.

Finder




msg:398004
 12:51 am on Jun 14, 2003 (gmt 0)

The extractor uses a generic Real Media UA, so I blocked them by IP instead.
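As a rough sketch of the logic behind that choice: a rule keyed on a generic user agent would also catch ordinary players, while a rule keyed on the source address catches only the extractor. In practice the block would live in the web server or firewall configuration; the Python below only illustrates the decision, and the names `BLOCKED_NETWORKS` and `is_blocked` are made up for the example.

```python
import ipaddress

# The extractor address reported earlier in this thread, as a single-host network.
BLOCKED_NETWORKS = [ipaddress.ip_network("63.251.169.234/32")]


def is_blocked(remote_addr):
    """Return True if remote_addr falls inside any blocked network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)


print(is_blocked("63.251.169.234"))
print(is_blocked("192.0.2.10"))
```

Storing networks rather than bare addresses means the list can be widened to a range later if the extractor starts coming from neighbouring addresses.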

I replied to their email, explaining the situation and even pointing them to this thread. Unfortunately I got no response.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved