Forum Library, Charter, Moderators: martinibuster

Yahoo Search Engine and Directory Forum

How do I stop Y from cracking open PDF files and publishing them as HTML?

10+ Year Member

Msg#: 2714 posted 3:23 am on Sep 14, 2004 (gmt 0)

These are private files on my site for members and customers--not for the general public. I use fully locked PDFs for a good reason--so that no one can extract the content.

Yet Y does it.

I have avoided using a robots.txt file because Google doesn't like them.

Any way to stop Y from doing this?



10+ Year Member

Msg#: 2714 posted 4:55 am on Sep 14, 2004 (gmt 0)

A robots.txt will do it. Make sure it is a valid file, and then Google will have no problem with it.
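As a rough illustration of the robots.txt suggestion, here is a minimal Python sketch that checks a rule set before you upload it. The `/members/` directory and the example.com URLs are hypothetical, and `Slurp` is assumed to be the user-agent token for Yahoo's crawler. The parser is the one in Python's standard library, so it only approximates how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Yahoo's crawler (assumed token "Slurp") out of
# /members/ while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /members/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed(ROBOTS_TXT, "Slurp", "http://example.com/members/report.pdf")` shows whether the PDF directory is blocked for that crawler. Keep in mind that robots.txt is only a request to well-behaved crawlers; it does not protect the files from anyone who knows the URL.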

Your other option is to require the user to log in and store them on a secure part of your site.


10+ Year Member

Msg#: 2714 posted 5:17 am on Sep 14, 2004 (gmt 0)

The "view as" functionality is part of the page cache, so you can use the NOARCHIVE tag described in the help text below to prevent it. If you send me some examples, I will also have someone investigate why we are cracking open these files.

How do I keep my page from being cached in Yahoo! Search?
Our search engine contains "snapshots" of the majority of pages discovered during the crawl on the Web and caches them. This enables us to highlight the search terms on text-heavy pages so you can find relevant information quickly. And if the site's server temporarily fails, you can still see the page.
If you run a web site and do not want your content to be accessible through the cache, you can use the NOARCHIVE meta-tag. Place this in the <HEAD> section of your documents:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">


This tag will tell robots not to archive the page. Our crawler will continue to index and follow links from the page, but it will not display a cached page in search results.

Please note that the change will occur the next time the search engine crawls the page containing the NOARCHIVE tag (typically at least once per month).

Also, the NOARCHIVE tag controls only whether the cached page is shown. To prevent the page from being indexed, use the NOINDEX tag. To prevent the crawler from following links, use the NOFOLLOW tag.
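To double-check which of these directives a page actually declares, something like the following Python sketch could be used. It relies only on the standard library's `html.parser`; the class and function names are made up for illustration, and it says nothing about how Yahoo's crawler itself parses the tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

For example, `"noarchive" in robots_directives(page_source)` tells you whether a page asks not to be cached. Note that this only applies to HTML pages, since the tag lives in the `<HEAD>` section of the document.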


WebmasterWorld Senior Member 10+ Year Member

Msg#: 2714 posted 3:12 pm on Sep 14, 2004 (gmt 0)

AFAIK, if you reference them via "ftp", rather than "http", this will also prevent them being spidered.

Someone will no doubt correct me if I'm wrong, but it seems to have worked for me.


10+ Year Member

Msg#: 2714 posted 3:18 pm on Sep 14, 2004 (gmt 0)

You could also put the PDFs in a directory and use .htaccess to require a username/password.
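For anyone curious what a username/password requirement amounts to on the wire, here is a small Python sketch of an HTTP Basic authentication check. It is not Apache's implementation, and the function name and credentials are hypothetical. It only shows the idea: a request without valid credentials in its `Authorization` header is refused, and a crawler does not send any.

```python
import base64
import hmac


def check_basic_auth(header_value, expected_user, expected_password):
    """Return True if an Authorization header carries the expected credentials."""
    if not header_value:
        return False  # no header at all, e.g. a search engine crawler
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        return False  # malformed base64 or undecodable bytes
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok
```

A server that gets `False` here would normally answer with a 401 status, so the PDF never reaches the crawler and cannot be cached or converted.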

