
Google SEO News and Discussion Forum

    
Blocking PDFs For A Step Towards Better Quality
Pjman



 
Msg#: 4522602 posted 3:52 pm on Nov 26, 2012 (gmt 0)

I have a number of data-only PDFs that users find highly valuable on a few of my sites. All of those sites were hit by Panda 1. The only real low-quality content on those sites is these data-only (numbers) PDFs, thousands of them.

Would a robots.txt disallow of the PDF directory re-establish quality?

With HTML pages I usually just add a "noindex" meta tag plus a robots.txt disallow, but PDFs don't allow for that option.

Any ideas guys?

 

bwnbwn




 
Msg#: 4522602 posted 5:11 pm on Nov 26, 2012 (gmt 0)

I have a number of data only PDFs that users find highly valuable

Why do you consider them low quality when users find them highly valuable? Or are a few of them high quality while the vast majority are low?
To block them with robots.txt you would need to move the PDFs into a dedicated folder, e.g. one called PDF, and then the whole folder can be blocked, if this is the way you want to go.
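As a minimal sketch, assuming the PDFs were all moved into a hypothetical /pdf/ folder, the robots.txt entry would look like this:

User-agent: *
# block crawling of everything under /pdf/ for all compliant bots
Disallow: /pdf/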

lucy24




 
Msg#: 4522602 posted 5:35 pm on Nov 26, 2012 (gmt 0)

Do you want no-crawl or no-index? If no-index, add this simple package to your htaccess or config file, changing the endings as appropriate:

<FilesMatch "\.(js|txt|xml)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

Also good for preventing them from indexing your robots.txt and sitemap, since you obviously can't block them from crawling those ;)
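Adapted to the PDFs in question, a minimal sketch of that same approach (assuming Apache with mod_headers enabled) might be:

# requires mod_headers; sends a noindex directive with every PDF served
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

Keep in mind that Googlebot has to be able to fetch the PDFs to see this header, so it shouldn't be combined with a robots.txt disallow on the same files.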

Pjman



 
Msg#: 4522602 posted 4:09 pm on Nov 27, 2012 (gmt 0)

Great ideas. Thanks.

If you block the bot from crawling files via robots.txt, will Google ever count those towards the quality score of your site? I'm just worried about getting hit by Panda again sometime down the line if 1/10 of my site is data-driven PDFs.

tedster




 
Msg#: 4522602 posted 3:46 am on Nov 28, 2012 (gmt 0)

Google also obeys wildcard pattern matching in a robots.txt file, even though that syntax was not part of the original robots.txt standard. This means you can use:

User-agent: Googlebot
Disallow: /*.pdf$

1) The asterisk character [*] stands for any number of characters including directory names.
2) The dollar sign character [$] stands for "end of the character string".

This approach disallows any PDF file no matter which directory holds it, and the trailing $ guards against accidentally blocking a URL that merely contains the character string ".pdf" somewhere other than at the end of the path.
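For illustration, here is how that pattern would treat a few hypothetical URLs:

/reports/q3-figures.pdf - blocked (ends in .pdf)
/data/archive/table-12.pdf - blocked (the * matches any directory path)
/help/opening-a.pdf-file.html - not blocked (".pdf" is not at the end of the URL)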

If you block the bot from crawling files via robots.txt, will Google ever count those towards the quality score of your site?

If the files are not even crawled, then they cannot be directly evaluated for quality.

[edited by: tedster at 3:38 pm (utc) on Nov 29, 2012]

Pjman



 
Msg#: 4522602 posted 3:13 pm on Nov 28, 2012 (gmt 0)

@tedster

Thank you! As usual you save the day!

We need to have an appreciation day for you.
