Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
msn search violating robots.txt!
`disallowed` images appear in preview on result page
DoppyNL




msg:1527898
 2:52 pm on Jun 26, 2004 (gmt 0)

Hi,

in my robots.txt I've got the following:

User-agent: *
Disallow: /dl/
Disallow: /css/

(and some more lines, but it's about this one)

In the folder /dl/ all images and non-html files are stored.
In the folder /css/ css sheets are stored. (really! ;-) )
This would result in all crawlers not requesting those pages.
This has been in robots.txt for over a year now.
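
For anyone who wants to check what these two rules cover, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It assumes only the two `Disallow` lines quoted above (the rest of the file was not posted), and the bot name and file paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the rules quoted in the post; the real file had more lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dl/
Disallow: /css/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a compliant robot may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "AnyBot" and the paths below are hypothetical examples.
    for path in ("/", "/dl/photo.jpg", "/css/site.css"):
        print(path, is_allowed(ROBOTS_TXT, "AnyBot", path))
```

Under the `*` group, a compliant robot may fetch the home page but nothing below `/dl/` or `/css/`.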

--
When you've got the msn-toolbar installed, it allows a user to enter a search query, and it will display a list of results on the left side of the window, and on the right side it will display preview images of the pages that may be interesting.

[search.msn.nl...]
This URL is responsible for the part that displays the preview. Replace "query+string" with some keywords that will identify the site. (no specifics :S)
So, I used 2 keywords that identify one of my sites, and it turns up in first position.

Strange thing is, msn is able to display a preview image of the page, including all images and use of css-sheets.
But I've told them not to request those files!

I haven't got any chance to look at the log files of the server, but it may not be in there anymore (logfiles are cleaned up after some time).

Anyone else seen something like this?
I consider this bad behavior and am thinking of banning msn-search completely....

Any thoughts?

 

DoppyNL




msg:1527899
 2:58 pm on Jun 26, 2004 (gmt 0)

And I just found another thing with another site of mine with their crawler:

I moved the site to a new domain 2 months ago, and replaced all html files with a "site has moved to a new domain" html-file to redirect the user in a friendly way.
In those files I put this code:

<meta name="robots" content="noindex, follow" />

But that specific page now comes up in their results!
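
As a rough illustration of what a crawler is expected to read from that tag, here is a small Python sketch that extracts the robots meta directives from a page. The class and function names are hypothetical, and this is not a description of how any particular search engine implements it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
    found = robots_directives(page)
    print("may index:", "noindex" not in found)
    print("may follow links:", "nofollow" not in found)
```

A crawler that honors the tag would keep following the links on the page but leave the page itself out of its index.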

I'm getting closer to a complete ban for them now.....

DoppyNL




msg:1527900
 8:52 am on Jul 8, 2004 (gmt 0)

*bump*

Nobody else seen this?
Nobody that has any comment on this?

Mr_Brutal




msg:1527901
 9:04 am on Jul 8, 2004 (gmt 0)

Just a quick thought - if the images and css are available to the robot when it accesses a normal page (one that is allowed in your robots.txt) then it's gonna be able to see the pictures - I thought robots.txt disallow rules prevented the robot from following links to pages in the folders specified.

So because the images are viewable on a normal page the robot 'should' be allowed to view them.

This is my view anyway. When I wrote my robot I think I implemented it like this - it was 3 years ago, mind, and it was never used other than testing for my dissertation, but that's how I remember the rules.

Span




msg:1527902
 9:49 am on Jul 8, 2004 (gmt 0)

Mr_Brutal I think you're right.
For the same reason thumbnails of images show up at image searches, no matter if hotlinking is allowed or not.

py9jmas




msg:1527903
 10:00 am on Jul 8, 2004 (gmt 0)

Another possibility - when you requested the preview, MSN returned the HTML. The HTML still had the links to the images and css on your server. I expect they were pulled directly from your server and never went anywhere near MSN.

Try it again and immediately check your server logs. I expect you'll see the requests with the referrer set to something like [search.msn.nl...]

DoppyNL




msg:1527904
 10:35 am on Jul 8, 2004 (gmt 0)

Well, I always thought that folders and files disallowed in robots.txt are never to be retrieved by a robot.

Saying that it is allowed to fetch the image because another file (which isn't disallowed) links to it would be ridiculous, as that would mean all files are free to get: there is always some file somewhere that links to them. (perhaps on another site?!?)

Google follows my robots.txt as far as I can see; there aren't ANY images stored in the google-image-search.

I think that when I say "Don't fetch anything from that folder", a crawler shouldn't get anything from it! I don't say it just for fun!

It's possible the browser fetched the CSS-file though.
But it isn't possible that the browser fetched the images, as I've got a referer-check in place, which will return a "not allowed" image if the referer isn't from the site itself.
Taking a look at the result page it seems that the preview is an image generated by the crawler and stored on the server.
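
A referer check of the kind described here could look roughly like the Python sketch below. The host names, the placeholder image path and the decision to treat a missing referer as "not from the site" are all assumptions; the post does not say how the real check is implemented.

```python
from urllib.parse import urlparse

# Hypothetical host names standing in for the poster's own site.
OWN_HOSTS = {"www.example.com", "example.com"}

# Hypothetical path of the "not allowed" placeholder image.
PLACEHOLDER = "/not-allowed.gif"


def referer_is_local(referer, own_hosts=OWN_HOSTS):
    """True only if the Referer header points at one of the site's own hosts."""
    if not referer or referer == "-":
        return False  # assumption: a missing referer counts as foreign
    host = urlparse(referer).hostname
    return host is not None and host in own_hosts


def image_to_serve(requested_path, referer):
    """Serve the requested image for local referers, the placeholder otherwise."""
    return requested_path if referer_is_local(referer) else PLACEHOLDER


if __name__ == "__main__":
    print(image_to_serve("/dl/logo.gif", "http://www.example.com/index.html"))
    print(image_to_serve("/dl/logo.gif", "http://search.msn.nl/preview"))
```

Note that a check like this only looks at the header the client chooses to send, so any fetcher that fills in the page's own URL as the referer passes it.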

py9jmas said:
Try it again and immediately check your server logs. I expect you'll see the requests with the referrer set to something like [search.msn.nl...]

There isn't one in the log. (only from the clicks in the results-pages).

is there an "MSNGuy" on webmasterworld.com? :P

bird




msg:1527905
 10:38 am on Jul 8, 2004 (gmt 0)

DoppyNL said:
Well, I always thought that folders and files disallowed in robots.txt are never to be retrieved by a robot.

That's the correct interpretation.

Mr_Brutal




msg:1527906
 10:48 am on Jul 8, 2004 (gmt 0)

I was working on the idea that when something like directory browsing is turned off on an images folder you can't view any of the images in it, but when you look at a page that calls the images, that page has rights to the folder and can of course display them.

I see your point and of course if you type a URL that links directly to an image in a "no browsing directory" you can still see it.

So using "browsing" as a framework the robot could be able to view the image.

You're right though: if you wanna stop the robot from using bandwidth by downloading images then sticking the folder in robots.txt is the way to do that - different framework from "browsing" after all. Like I said - just a thought - just not a good one :-)

Leosghost




msg:1527907
 11:23 am on Jul 8, 2004 (gmt 0)

At the moment MSN is ignoring robots.txt....
There's a thread about it somewhere here that's been running a few days ..

fiestagirl




msg:1527908
 11:59 pm on Jul 15, 2004 (gmt 0)

I was under the impression that Girafa provided the thumbnail shots to MSN for the preview.

Press Release [girafa.com]
..."technology enables the immediate display of a Web site's thumbnail preview alongside its textual URL, thereby enhancing and improving the entire search experience."

Remove a url [girafa.com]

jdMorgan




msg:1527909
 1:14 am on Jul 16, 2004 (gmt 0)

Right underneath those page images in the MSN search results is a link that says: "Website owners: prevent your page from being previewed." If you click on it, you are advised to use:
User-agent: searchpreview 
Disallow: /

in robots.txt, or
<meta name="robots" content="noimageindex, nomediaindex" />

The robots.txt method works for me.
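
To see how an agent-specific group like this interacts with a general `*` group, here is a minimal Python sketch using `urllib.robotparser`. The combined file is an assumption that places the `searchpreview` group next to the two rules from the first post, and `OtherBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Assumed combination of the suggested group and the rules from the first post.
ROBOTS_TXT = """\
User-agent: searchpreview
Disallow: /

User-agent: *
Disallow: /dl/
Disallow: /css/
"""


def can_fetch(user_agent: str, path: str) -> bool:
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # The named group replaces the "*" group for the agent it names.
    print("searchpreview /        ->", can_fetch("searchpreview", "/"))
    print("OtherBot      /        ->", can_fetch("OtherBot", "/"))
    print("OtherBot      /dl/a.gif ->", can_fetch("OtherBot", "/dl/a.gif"))
```

The `Disallow: /` line only applies to the agent named in its own group, so other crawlers are still governed by the `*` rules and the rest of the site stays crawlable for them.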

Jim

DoppyNL




msg:1527910
 10:05 am on Jul 16, 2004 (gmt 0)

I don't want to disallow my entire site!
I don't want them to get the images!

My current robots.txt file does just that for ALL crawlers by disallowing the dir where the images are stored.

And their crawler simply isn't listening to that!
The * also includes their crawler (in fact: it includes ALL crawlers), so they should listen to that!

So they are still in violation of the robots.txt;
Not a very good thing.....

fiestagirl




msg:1527911
 10:25 pm on Jul 17, 2004 (gmt 0)

The Girafa technology creates a graphic screen shot of your page. Girafa does not go to your image file and does not go to the css file. It DOES ask for the robots.txt and will obey it. I have had it in my robots.txt for 2 years now and NEVER have a screen shot on MSN. The screen shots have been available for about 3 years now and are not provided by Microsoft.

June 18, 2003

"Today, MSN Search takes an approach that utilizes both internal technology, as well as the technology of third-party companies, including Looksmart, Inktomi, Overture and Girafa."
[microsoft-watch.com...]

jdMorgan




msg:1527912
 10:48 pm on Jul 17, 2004 (gmt 0)

DoppyNL,

Whether the screenshot was gathered by MSN or by Girafa, I agree that your "*" User-agent exclusion should have worked. There are two possibilities that I can think of:

1) There is a subtle problem with your robots.txt (validate it here [searchengineworld.com]).
2) MSN searchpreview or Girafabot does not recognize "*" as applying to their robot.

If you believe there is a bug in their robot, then report it to MSN. If you fully document the problem and describe it precisely and concisely, you may even get a reply. (I have reported an msnbot problem to MSN, and received a reply after 3 days. A few weeks later, the problem was fixed with a new robot version release.)

Since there is no way to tell if they are using Girafa or their own image 'bot, I'd report it to MSN first.

Jim

fiestagirl




msg:1527913
 12:08 am on Jul 18, 2004 (gmt 0)

I agree that it is very likely that Girafa does not believe robots.txt applies to them, except for a complete disallow, since they are not spidering per se but converting HTML to JPG.
Microsoft is the place to start. They probably haven't cared previously but they are flying above the radar now and should be called on it.

sprinttotal




msg:1527914
 11:05 pm on Jul 22, 2004 (gmt 0)

Just one quick answer...

That's a printscreen/thumbnail you are seeing there... Nothing related to robots.txt. It doesn't spider anything, it just makes a thumbshot of your website.

So, I don't see the point of this topic at all.

Just like at whois dot sc or thumbshots (which provides thumbs for DMOZ), you only see a shot of your website, nothing more.

DoppyNL




msg:1527915
 8:21 am on Jul 23, 2004 (gmt 0)

I haven't found the time yet to send a complete report to msn, but I will be doing that!

sprinttotal said:
Just one quick answer...

That's a printscreen/thumbnail you are seeing there... Nothing related to robots.txt. It doesn't spider anything, it just makes a thumbshot of your website.

So, I don't see the point of this topic at all.

Just like at whois dot sc or thumbshots (which provides thumbs for DMOZ), you only see a shot of your website, nothing more.


And how is a crawler supposed to make a printscreen/thumbnail when it is NOT ALLOWED to fetch both the images and the css files? You say it doesn't spider anything: but then how does it know what everything looks like?
That IS related to robots.txt, as I've told the robots they should not fetch a thing from the locations where my images are stored!

sprinttotal




msg:1527916
 11:00 am on Jul 23, 2004 (gmt 0)

A normal user doesn't spider your website, but sees your website just like that thumbnail. It's almost like downloading your website, printscreening it and sending it to a website like MSN. Nothing related to robots.txt ;)

blaze




msg:1527917
 11:04 am on Jul 23, 2004 (gmt 0)

This is broaching copyright infringement.

I think I'm going to start mass generating some websites which take 'thumbnails' and ignore robots.txt.

DoppyNL




msg:1527918
 12:34 pm on Jul 23, 2004 (gmt 0)

sprinttotal said:
A normal user doesn't spider your website, but sees your website just like that thumbnail. It's almost like downloading your website, printscreening it and sending it to a website like MSN. Nothing related to robots.txt

Do you even understand what is happening here?
What is the `normal user` doing in your story? He is only viewing the results, NOT GENERATING THE IMAGE!

MSN search results return something that would not be possible because of robots.txt!

The results contain a small image which is generated from all files that are fetched from my server.
IT IS NOT GENERATED BY THE PERSON WHO SEARCHES BY FETCHING THOSE FILES; THE IMAGE ALSO COMES FROM MSN!

I've got the idea you haven't read the topic or didn't understand the problem here.
MSN somehow managed to create the thumbnail, but that shouldn't be possible as I've told them that they may not retrieve the images and css.
So it should be impossible for them to create the thumbnail!

My apologies if I sound a little rude; the point is that I've got the idea that you don't understand the problem here.

sprinttotal




msg:1527919
 1:47 pm on Jul 23, 2004 (gmt 0)

Yes, I DO UNDERSTAND the problem... ;) Also, I can read, but thanks for the caps anyway. ;) Go to MSN and search for your images and css. Do they appear on the search index? I suppose not ;) So, MSN obeys the robots.txt

On the other hand, MSN uses Girafa (I think) to generate thumbnails for the websites indexed in the MSN search engine (even Alexa does that, and yes, I have blocked my images too in robots.txt).

My point is, robots.txt prevents your files from being indexed by some spiders. MSN spiders feed a search engine. Girafa isn't a search engine.

Anyway if you go to the Girafa website and remove your url that will be resolved. Also, have you been at ****/yourdomain.com or even alexa.com/data/details/main?url=http://www.domain.com?

It's like saying "Hey, I don't allow Google in my robots.txt but I'm in their directory! I only submitted to DMOZ!"

So again, what's the problem with MSN and robots.txt? The files you blocked don't appear in their search index, do they? Also, it's a thumbnail of your website; if it looks good, it's another reason for visitors to click on that search result.

DoppyNL




msg:1527920
 2:41 pm on Jul 23, 2004 (gmt 0)

OK, you did understand. Can't read minds over here so I couldn't know that ;)

What is robots.txt for?

To prevent some or all robots from fetching certain or all files from a web server.

robots.txt is NOT specifically for search engines, to tell them what they can put in their index and what not.
It's to tell a robot what it is not allowed to fetch from the server.

It just happens to do the job when you don't want certain files in a search engine:
If a crawler isn't allowed to fetch the file, it can't turn up in the results, now can it! :)

www.robotstxt.org describes robots.txt files and how to use them.
It states clearly that you can exclude robots from crawling your website by using a robots.txt file; it does not say this is only the case for robots that are building an index for a search engine.
ALL ROBOTS must follow the robots.txt, it doesn't matter what they want to do with the results, they must follow it.
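
Applied to a thumbnail robot, that reading would mean checking every embedded resource against robots.txt before fetching it, not just the page itself. Below is a minimal Python sketch of such a check; the bot name `ThumbBot`, the `www.example.com` URLs and the sample HTML are invented for illustration, and only the two `Disallow` rules come from this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Only the two rules quoted in the first post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dl/
Disallow: /css/
"""


class ResourceCollector(HTMLParser):
    """Collects the URLs of images and linked stylesheets on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.resources.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            self.resources.append(urljoin(self.base_url, attributes["href"]))


def split_resources(html, base_url, robots_txt, user_agent):
    """Return (allowed, blocked) resource URLs for a robots.txt-compliant renderer."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    collector = ResourceCollector(base_url)
    collector.feed(html)
    allowed, blocked = [], []
    for url in collector.resources:
        (allowed if rules.can_fetch(user_agent, url) else blocked).append(url)
    return allowed, blocked


if __name__ == "__main__":
    page = (
        '<link rel="stylesheet" href="/css/site.css">'
        '<img src="/dl/logo.gif"><img src="/public/ok.gif">'
    )
    ok, skip = split_resources(page, "http://www.example.com/", ROBOTS_TXT, "ThumbBot")
    print("fetch:", ok)
    print("skip: ", skip)
```

A renderer built this way would produce a thumbnail without the disallowed images and stylesheet, which is the behavior being argued for here.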

So Girafa didn't follow the robots.txt and I emailed MSN about it. (Girafa indeed supplies the preview images).
So far, no response.
I will post about their response when I get it.

DoppyNL




msg:1527921
 5:19 pm on Jul 30, 2004 (gmt 0)

Since my server logs are purged on a regular basis, I couldn't show the log entries of Girafa actually getting files that it wasn't allowed to get.

But now, I can. Since it came by again, here is the log.
I changed my domain to something that doesn't point you to my site.

[27/Jul/2004:09:17:03 +0200] "GET / HTTP/1.0" 200 2812 "http://my-old-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:03 +0200] "GET /robots.txt HTTP/1.0" 200 548 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /css/101.css HTTP/1.0" 200 4554 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /dl/sv.jpg HTTP/1.0" 200 17587 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /dl/HappyEyes.gif HTTP/1.0" 200 3877 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /dl/CoolGlasses.gif HTTP/1.0" 200 514 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /dl/extremehappy.gif HTTP/1.0" 200 185 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /dl/banner468x60.gif HTTP/1.0" 200 39352 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
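
A log excerpt like this can be checked mechanically. The Python sketch below parses lines in the trimmed format shown above and lists the requests by a given robot that fall under a `Disallow` rule. It assumes only the two rules quoted at the top of the thread (the full robots.txt was never posted), and the regular expression is written for this excerpt rather than for any standard log format.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumption: only the two rules from the first post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dl/
Disallow: /css/
"""

# Matches the trimmed format in the excerpt (no client address column).
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Three of the lines from the excerpt, copied as posted.
SAMPLE_LOG = '''\
[27/Jul/2004:09:17:03 +0200] "GET / HTTP/1.0" 200 2812 "http://my-old-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /css/101.css HTTP/1.0" 200 4554 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
[27/Jul/2004:09:17:04 +0200] "GET /dl/sv.jpg HTTP/1.0" 200 17587 "http://my-domain.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; Girafabot; girafabot at girafa dot com; http*//www.girafa.com)"
'''


def disallowed_hits(log_lines, robots_txt, agent_token):
    """Return the paths requested by `agent_token` that robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not in the expected format
        if agent_token.lower() not in match["agent"].lower():
            continue  # request from some other client
        if not rules.can_fetch(agent_token, match["path"]):
            hits.append(match["path"])
    return hits


if __name__ == "__main__":
    for path in disallowed_hits(SAMPLE_LOG.splitlines(), ROBOTS_TXT, "Girafabot"):
        print("disallowed but fetched:", path)
```

The same function can be pointed at a full log file by passing its lines in place of `SAMPLE_LOG.splitlines()`.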

Still no response from MSN; I did notice a visit from them to my site though. They fetched the home page and not even the robots.txt file.
Mailed MSN again, also sent one to Girafa as well this time.

DoppyNL




msg:1527922
 3:43 pm on Aug 1, 2004 (gmt 0)

Got a reply from MSN stating that it's not their bot so they can't do anything about it.
They redirected it to Girafa (I already sent that mail there too, but that's OK.)

Girafa responded to the mail MSN forwarded to them....
Funny, did they ignore my mail which arrived sooner, or haven't they processed that one yet?

Anyway, here is their reply:
Snipped. Direct email quotes are NOT permitted, TOS item 9. Paraphrased emails encouraged.

My thought on this is that they are incorrect. They are fetching files that are disallowed in the robots.txt; the fact that the files are referenced in files that are not disallowed doesn't really matter.

Does anyone else have any thoughts on this?

[edited by: DaveAtIFG at 12:06 pm (utc) on Aug. 2, 2004]

sprinttotal




msg:1527923
 4:15 pm on Aug 1, 2004 (gmt 0)

It's really what I told you. The bot acts like a normal user: it goes to / and gets all of the page (includes, frames, css, js, images), makes a thumbnail of it and deletes all the data.

What you have to do is block girafabot from / in your robots.txt.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved