
Forum Moderators: martinibuster


How do I stop Y from cracking open PDF files and publishing them as HTML?

3:23 am on Sep 14, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 27, 2004
votes: 0

These are private files on my site for members and customers--not for the general public. I use fully locked PDFs for a good reason--so that no one can extract the content.

Yet Y does it.

I have avoided using a robots.txt file because Google doesn't like them.

Any way to stop Y from doing this?

4:55 am on Sept 14, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Nov 5, 2003
votes: 0

A robots.txt will do it. Make sure it is a valid file, and then Google will have no problem with it.

Your other option is to store the files in a secured part of your site and require users to log in to reach them.
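
For anyone wanting to try the robots.txt route, here is a rough sketch of what such a file might look like and a quick way to check that it parses as intended. The `/members/` directory, the example.com URLs, and the `is_allowed` helper are assumptions for illustration, not details from this thread. Slurp is the name Yahoo's crawler goes by.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every compliant crawler out of /members/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs; substitute paths from your own site.
    for url in (
        "https://www.example.com/members/guide.pdf",
        "https://www.example.com/index.html",
    ):
        for agent in ("Slurp", "Googlebot"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Keep in mind that robots.txt only asks well-behaved crawlers to stay away. It does not stop anyone who has the URL from downloading the file, so for truly private PDFs the login option is the one that actually protects them.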


5:17 am on Sept 14, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 27, 2002
votes: 0

The "view as" functionality is part of the page cache, so you can use the NOARCHIVE tag described below to prevent it. I will also have someone investigate why we are cracking open these files if you send me some examples.

How do I keep my page from being cached in Yahoo! Search?
Our search engine contains "snapshots" of the majority of pages discovered during the crawl on the Web and caches them. This enables us to highlight the search terms on text-heavy pages so you can find relevant information quickly. And if the site's server temporarily fails, you can still see the page.
If you run a web site and do not want your content to be accessible through the cache, you can use the NOARCHIVE meta-tag. Place this in the <HEAD> section of your documents:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">

This tag will tell robots not to archive the page. Our crawler will continue to index and follow links from the page, but it will not display a cached page in search results.

Please note that the change will occur the next time the search engine crawls the page containing the NOARCHIVE tag (typically at least once per month).

Also, the NOARCHIVE tag controls only whether the cached page is shown. To prevent the page from being indexed, use the NOINDEX tag. To prevent the crawler from following links, use the NOFOLLOW tag.

3:12 pm on Sept 14, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 6, 2003
votes: 0

AFAIK, if you reference them via "ftp", rather than "http", this will also prevent them being spidered.

Someone will no doubt correct me if I'm wrong, but it seems to have worked for me.

3:18 pm on Sept 14, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Aug 2, 2003
votes: 0

You could also put the PDFs in their own directory and use .htaccess to require a username and password.
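
A minimal sketch of that .htaccess approach, assuming Apache with basic authentication available and overrides permitted for the directory. The realm name and the password file path are placeholders, not anything taken from this thread.

```apache
# .htaccess placed in the directory that holds the PDFs (hypothetical layout)
AuthType Basic
AuthName "Members only"
# Assumed path; keep the password file outside the web root
AuthUserFile /path/to/.htpasswd
Require valid-user
```

The password file can be created with Apache's `htpasswd` utility, for example `htpasswd -c /path/to/.htpasswd username`. With this in place a crawler requesting a PDF gets a 401 response instead of the file, so there is nothing for it to convert or cache.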