Google appears to be ignoring robots file

No change after three months


Lokutus

8:51 pm on Dec 16, 2005 (gmt 0)

10+ Year Member



Three months ago I put up a robots.txt file to stop Google from indexing the PDF files on my site.

Today I did a search on the topic of one of these PDFs and, sure enough, it still comes up on Google.

The robots.txt file reads:

Disallow: /*.pdf
Disallow: *.pdf

Is this correct? Or is there a mistake in the file?

Lokutus

8:54 pm on Dec 16, 2005 (gmt 0)

10+ Year Member



What's even worse is that they crack open the PDFs and offer HTML versions of them.

jimbeetle

9:11 pm on Dec 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, according to Google's robots.txt information [google.com], the correct syntax should be something along the lines of:

User-agent: Googlebot
Disallow: /*.pdf$

But, and this is a very big but, this is non-standard robots.txt syntax. As far as I know, only Googlebot is supposed to support wildcard patterns like that. So even if you do manage to block Google, all the other bots that can fetch PDF files will keep right on doing so.

The best bet is to block by directory or, if that isn't feasible, by individual file:

User-agent: *
Disallow: /thisdirectory/

User-agent: *
Disallow: /thisdirectory/thisfile.pdf
Disallow: /thatdirectory/thatfile.pdf

Robotstxt.org [robotstxt.org] has the standards for the robots exclusion protocol.
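
If you want a quick local sanity check of the standard (non-wildcard) rules, here's a minimal sketch using Python's urllib.robotparser, which only implements the original exclusion protocol, so it's a rough proxy for what the average bot will honor. The example.com URLs and directory names are just placeholders:

from urllib.robotparser import RobotFileParser

# Rules mirroring the directory-based approach above (names are placeholders)
rules = """\
User-agent: *
Disallow: /thisdirectory/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Both of these fall under /thisdirectory/ and should print False (blocked)
print(parser.can_fetch("*", "http://www.example.com/thisdirectory/thisfile.pdf"))
print(parser.can_fetch("SomeOtherBot", "http://www.example.com/thisdirectory/report.pdf"))

# A PDF outside the blocked directory is still allowed -- prints True
print(parser.can_fetch("*", "http://www.example.com/otherdirectory/file.pdf"))

Note that robotparser, like most crawlers, won't understand the /*.pdf$ pattern at all, which is exactly why the directory or per-file approach is the safer bet.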

Lokutus

9:30 pm on Dec 16, 2005 (gmt 0)

10+ Year Member



Thanks. I'll try blocking the files individually.

Another thread here suggests that any files spidered before the robots.txt file went up remain indexed. So maybe I need to change their names?

jimbeetle

9:50 pm on Dec 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, they might remain indexed for a while. You might consider G's URL removal tool, but be very careful when using it; maybe try removing one page at a time to see how it works.