Forum Moderators: goodroi
(download. is a subdomain and /pdf/ is a folder within it).
The pdf directory is closed; you just get a Forbidden error. If you go to [download.ourwebsite.com...] it redirects you to our main site.
For some reason Google has picked up one of our customers' PDF files. The only explanation I can think of (as these links are only shared through email) is that the customer has posted the link somewhere public on the web.
Is it possible with robots.txt to stop Google from picking this up? If so, where do I place robots.txt: in the root of the subdomain or within the /pdf/ folder? And what do I put in robots.txt?
Thanks in advance :-)
Google could have found that URL if someone posted it online or if someone visited it with the Google Toolbar installed. It is hard to prevent Google from knowing about any URL, but knowing about a URL is different from actually accessing it.
To prevent Google from accessing the PDFs you should upload a robots.txt file to the root of your subdomain. You can use a wildcard rule to block every PDF file, or the standard folder exclusion.
Whichever you choose, make sure to validate it so you know it is doing the right thing. To be extra safe, check it a few times over the first 2-3 weeks just to make sure everything is how you want it.
Is it possible with robots.txt to stop Google from picking this up? If so, where do I place robots.txt: in the root of the subdomain or within the /pdf/ folder? And what do I put in robots.txt?
put the robots.txt in the root of the subdomain (it won't be read from a subdirectory)
and put
User-agent: *
Disallow: /pdf/
in there.
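If you want to sanity-check that rule before uploading it, Python's standard-library parser handles a plain folder exclusion like this one. A quick sketch (the hostname below is just a placeholder, not your real subdomain):

```python
from urllib import robotparser

# The folder-exclusion rule suggested above.
rules = """\
User-agent: *
Disallow: /pdf/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Anything under /pdf/ is blocked for every user agent;
# everything else stays crawlable.
print(rp.can_fetch("Googlebot", "https://download.example.com/pdf/file.pdf"))
print(rp.can_fetch("Googlebot", "https://download.example.com/index.html"))
```

This only confirms the syntax is understood by a spec-compliant parser; it doesn't replace checking the live file in Google's own robots.txt tester.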
Let's say there are dozens, maybe hundreds, of these PDF files that are linked to from other sites.
Does disallowing PDF files from being spidered disable the off-site links from passing link juice to my site?
Unfortunately Google and Yahoo appear to be ignoring this, which is a shame. MSN obeys it though.
Google and Yahoo appear to be ignoring this...
User-agent: *
Disallow: /*.PDF$
...with a blank line following it. But keep in mind that many bots don't follow the non-standard "standards" that the majors have implemented, so your PDFs might get into the wild anyway.
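As a concrete example of that: Python's own standard-library robots.txt parser implements only the original spec, so it treats a wildcard pattern as a literal path and the rule blocks nothing. A sketch (the URL is a placeholder):

```python
from urllib import robotparser

# A wildcard rule like the one above, fed to a parser that only
# implements the original robots.txt spec (no * or $ extensions).
rules = """\
User-agent: *
Disallow: /*.PDF$
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The pattern is treated as a literal path, so a real PDF URL
# is not blocked at all by this parser.
print(rp.can_fetch("SomeBot", "https://example.com/files/report.PDF"))
```

Any bot built on a parser like this would happily fetch the PDFs, which is why the wildcard form only helps against the major engines that document support for it.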
Could it be a file name case issue?
Do you mean .PDF or .pdf? I didn't know this was an issue. Otherwise, I don't understand.
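It is an issue: Google documents its wildcard matching as case-sensitive, so Disallow: /*.PDF$ would not block a file whose name ends in lowercase .pdf. A rough illustration of those documented semantics (this is a hypothetical matcher for demonstration, not Google's actual code):

```python
import re

def google_rule_matches(rule: str, path: str) -> bool:
    # Translate a robots.txt Disallow pattern into a regex:
    # '*' matches any run of characters, and a trailing '$'
    # anchors the end of the path. Matching is case-sensitive.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

print(google_rule_matches("/*.PDF$", "/files/report.PDF"))  # uppercase matches
print(google_rule_matches("/*.PDF$", "/files/report.pdf"))  # lowercase does not
```

So if the files on the server are .pdf, the rules would need a lowercase variant as well (or both).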
If your actual robots.txt is written as you posted it above, they well might.
No, that's just my BB code formatting going awry. ;) The full robots file is as follows.
User-agent: *
Disallow: /*.PDF$
Disallow: /*.DOC$
User-agent: Googlebot
Disallow: /*.PDF$
Disallow: /*.DOC$
keep in mind that many bots don't follow the non-standard "standards" that the majors have implemented
I'm only concerned with Google, and as I understand it the above disallows should be obeyed by Googlebot.