
Sitemaps, Meta Data, and robots.txt Forum

    
Disallow pdf
FiRe

5+ Year Member



 
Msg#: 3741109 posted 10:00 am on Sep 9, 2008 (gmt 0)

We upload PDFs to our server and they are linked in the following way:

[download.ourwebsite.com...]

(download. is a subdomain and /pdf/ is a folder within it).

The PDF directory is closed; you just see a forbidden error. If you go to [download.ourwebsite.com...] it redirects you to our main site.

For some reason Google has picked up one of our customers' PDF files. The only explanation I can think of (as these links are only shared through email) is that the customer has posted the link somewhere public on the web.

Is it possible in robots.txt to stop Google from picking this up? If so, where do I place robots.txt: in the root of the subdomain or within the /pdf/ folder? Also, what do I put in robots.txt?

Thanks in advance :-)

 

goodroi

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3741109 posted 1:05 pm on Sep 9, 2008 (gmt 0)

Hi FiRe,

Google could have found that URL if someone posted it online or if someone visited it with the Google Toolbar installed. It is hard to prevent Google from knowing about any URL. Knowing about a URL is different from actually accessing it.

To prevent Google from accessing the PDFs you should upload a robots.txt file to the root of your subdomain. You can use a wildcard to prevent Google from accessing any PDF file, or use the standard folder exclusion.

Whichever you choose, make sure to validate it so you know it is doing the right thing. To be extra safe you should monitor it a few times for the first 2-3 weeks just to make sure everything is how you want it.
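
For example, either of these at the root of the subdomain (using the hostname from your post, so download.ourwebsite.com/robots.txt) is roughly what I mean. The folder exclusion:

User-agent: *
Disallow: /pdf/

Or the wildcard form, which relies on Google's pattern-matching extension to block by file extension:

User-agent: *
Disallow: /*.pdf$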

janharders

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3741109 posted 1:29 pm on Sep 9, 2008 (gmt 0)

Is it possible in robots.txt to stop Google from picking this up? If so, where do I place robots.txt: in the root of the subdomain or within the /pdf/ folder? Also, what do I put in robots.txt?

put the robots.txt in the root of the subdomain (it won't be read in a subdirectory)
and put
User-agent: *
Disallow: /pdf/

in there.

davesnyder

5+ Year Member



 
Msg#: 3741109 posted 3:00 am on Sep 23, 2008 (gmt 0)

You can also use the X-Robots-Tag to keep these documents out of the index. This tag is sent in the HTTP header. You can find info at NoArchive.net.

You are also going to want to request that the URLs be taken out of the index.
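
If the PDFs are served by Apache, a minimal sketch of that header (assuming mod_headers is enabled; the file pattern is just an example) would be:

# Add an X-Robots-Tag header to every PDF response (requires mod_headers)
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>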

chewy

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3741109 posted 2:31 am on Dec 6, 2008 (gmt 0)

OK, this helps me part of the way.

How does one ask to have many files removed from the index?

I know I can do this through Webmaster Tools, but are there other ways to do it besides naming all the discrete files?

ZydoSEO

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3741109 posted 12:02 am on Dec 7, 2008 (gmt 0)

Through Google's Webmaster Tools you can remove individual URLs, pages in a directory and all of its subdirectories, or an entire site.

chewy

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3741109 posted 4:29 am on Dec 21, 2008 (gmt 0)

OK, so I remove the PDF files using robots.txt and/or GWMT. (Current tests show this is not yet working and it is a couple of weeks along, but hey, I can wait.)

Let's say there are dozens, maybe hundreds of these PDF files that are linked to from other sites.

Does disallowing PDF files from being spidered stop the off-site links from passing link juice to my site?

bouncybunny

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3741109 posted 9:25 am on Dec 21, 2008 (gmt 0)

I've been using

User-agent: *

Disallow: /*.PDF$

User-agent: Googlebot

Disallow: /*.PDF$

Which is supposed to work. Unfortunately Google and Yahoo appear to be ignoring this, which is a shame. MSN obeys it though.

enigma1

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3741109 posted 9:39 am on Dec 21, 2008 (gmt 0)

Unfortunately Google and Yahoo appear to be ignoring this, which is a shame. MSN obeys it though.

The robots.txt is more of a "guidelines" thingy. If you want to properly block them you should set up a script that sends the PDF to the client instead of allowing a direct download, and in it check whatever is necessary (e.g. session, headers, customer permissions).
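
A minimal sketch of that idea in Python with Flask (the route, session key, and file path are placeholders, not anything from the original site):

# Hypothetical sketch: only serve PDFs to visitors with an authenticated
# session, and mark the response as non-indexable as a second line of defence.
from flask import Flask, abort, make_response, send_file, session

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder; use a real secret in production

@app.route("/download/<name>.pdf")
def download_pdf(name):
    # Reject anyone without a logged-in customer session.
    if not session.get("customer_id"):
        abort(403)
    # A real implementation should validate 'name' against a whitelist
    # of known files rather than building the path directly.
    resp = make_response(send_file("/srv/pdfs/" + name + ".pdf",
                                   mimetype="application/pdf"))
    # Ask compliant crawlers not to index the document even if they reach it.
    resp.headers["X-Robots-Tag"] = "noindex, noarchive"
    return resp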

bouncybunny

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3741109 posted 9:50 am on Dec 21, 2008 (gmt 0)

Well, in my example, the rule is intended to disallow the main three search engines from indexing PDF files, which 'should' also help the opening poster.

robots.txt may indeed be only a guideline, but it is a guideline that Google, at least, claims to obey.

phranque

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 3741109 posted 12:20 pm on Dec 21, 2008 (gmt 0)

@bb
Could it be a file name case issue?

In general you should list the records for specific bots before the wildcard user agent record.
In your case those disallows appear redundant.
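
For example, something like this is the ordering I mean (the lowercase variants are just a guess at the case issue above; keep only the lines that match how the files are actually named):

User-agent: Googlebot
Disallow: /*.PDF$
Disallow: /*.pdf$

User-agent: *
Disallow: /*.PDF$
Disallow: /*.pdf$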

jimbeetle

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3741109 posted 5:02 pm on Dec 21, 2008 (gmt 0)

Google and Yahoo appear to be ignoring this...

If your actual robots.txt is written as you posted it above, they well might. Blank lines in robots.txt indicate "end of record," so some bots might choke on the blank line between the user agent and the directive. All you should need is...

User-agent: *
Disallow: /*.PDF$

...with a blank line following it. But keep in mind that many bots don't follow the non-standard "standards" that the majors have implemented, so your PDFs might get into the wild anyway.

bouncybunny

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3741109 posted 11:34 am on Dec 22, 2008 (gmt 0)

could it be a file name case issue?

Do you mean .PDF or .pdf? I didn't know this was an issue. Otherwise, I don't understand.

If your actual robots.txt is written as you posted it above, they well might

No, that's just my BB code formatting going awry. ;) The full robots file is as follows.

User-agent: *
Disallow: /*.PDF$
Disallow: /*.DOC$

User-agent: Googlebot
Disallow: /*.PDF$
Disallow: /*.DOC$

keep in mind that many bots don't follow the non-standard "standards" that the majors have implemented

I'm only concerned with Google, and as I understand it the above disallows should be obeyed by Googlebot.
