Google seems to be indexing PDF versions of my Joomla pages. Bad?

cmendla

2:22 am on May 25, 2009 (gmt 0)

I ran a site:www.mysite.com search in Google and noticed that PDF versions of my pages were listed alongside the pages themselves.

That is probably an artifact of the PDF icon that Joomla 1.5 attaches to articles.

That seems like it could possibly cause a duplicate content penalty.

My questions are:

1. Will this cause a duplicate content issue with Google or other search engines?

2. If so, what is the best way to handle this? I'm not sure I can block a particular extension in robots.txt (I'll check into that). I am thinking about turning off the PDF option, as it doesn't add much to the content on this particular site.

thanks

spadilla

2:41 am on May 25, 2009 (gmt 0)

The PDF URLs are indexable in the default Joomla installation and will cause you to have duplicates in the SEs. One way to handle it is to turn them off completely, or, if you would rather keep the PDF function, you can block it via robots.txt. If you have the mail icon and print icon published, these can also cause issues. Assuming you're on J!1.5 and aren't using a SEF extension, you can put this in your robots.txt:

User-agent: *
Disallow: /index.php?view=article*&format=pdf
Disallow: /index.php?view=article*&print=1*
Disallow: /index.php?option=com_mailto*
Disallow: /component/mailto/*

This should work with SEF URLs on or off. Once the URLs are blocked, you can do a site: search and remove the bad ones from Google via Webmaster Tools.
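To sanity-check rules like these before relying on them, here is a minimal Python sketch of wildcard matching as Google documents it for Googlebot: a rule is a prefix match, `*` matches any run of characters, and a trailing `$` anchors the end. The helper names and the sample URLs are made up for illustration and are not taken from a real Joomla install. Other crawlers may not honor wildcards at all.

```python
import re

# The Disallow paths from the post above.
DISALLOW = [
    "/index.php?view=article*&format=pdf",
    "/index.php?view=article*&print=1*",
    "/index.php?option=com_mailto*",
    "/component/mailto/*",
]

def rule_to_regex(rule):
    """Translate a Disallow path with Google-style wildcards into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything literally, then turn each escaped '*' into '.*'.
    body = re.escape(rule).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path):
    """Return True if any Disallow rule matches the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in DISALLOW)

# Hypothetical URLs, for illustration only.
for path in [
    "/index.php?view=article&id=5&format=pdf",
    "/index.php?view=article&id=5&print=1&tmpl=component",
    "/component/mailto/?tmpl=component",
    "/index.php?view=article&id=5",
    "/index.php?option=com_content&view=article&id=5&format=pdf",
]:
    print(is_blocked(path), path)
```

Note that the first rule only matches when view=article comes right after "index.php?", so a PDF URL with its parameters in a different order slips through. It is worth checking the patterns against the URLs that actually show up in your site: search.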

ergophobe

4:35 pm on May 26, 2009 (gmt 0)

Keep in mind that blocking with robots.txt will keep the pages out of the index, but it won't do anything to stop bleeding off link juice to those pages and thus won't fully address the dupe content issue.

A better solution would be if you could have the links to the PDFs be nofollow in addition to the robots.txt fix.

cmendla

5:14 pm on May 26, 2009 (gmt 0)

THANKS! In addition to this question, you just solved something that has been bugging me for a while with another site. I hadn't thought about link juice being passed to a part of your site that you had blocked in robots.txt. I didn't give it enough thought and assumed that blocking something in robots.txt took care of everything.

Anyway, that piece of info has helped me understand why a robots.txt block didn't have the effect I anticipated.

As for Joomla, I removed the PDF and Print icons in the article manager parameters. I don't really need either. I also set the robots.txt to block the PDF, print and email URLs for Slurp and Googlebot.

(Note to anyone reading this. I'm not an expert on this stuff so please verify things before applying anything to your site).

Anyway, thanks again ergophobe. Your reply was a real Two-Fer for me.

ergophobe

8:29 pm on May 26, 2009 (gmt 0)

Happy :-) It's fun to realize I've learned a thing or two that can be helpful to someone else. I think there is a huge amount of misunderstanding about robots.txt.

You'll still see people recommending it as a way to keep hackers and illegal scrapers out of your site. But the issue you were having is a lot more subtle and comes down to the difference between noindex and nofollow, which is not at all obvious.

I don't know if you're a WebmasterWorld subscriber, but I just helped someone out (at least I hope I did) with a similar issue where his category pages were outranking content pages on a WordPress site. There might be something in the WordPress SEO Basics [webmasterworld.com] that would help, though of course, it's not Joomla-specific and the parts germane to your situation are already discussed here.

>>Note to anyone reading this. I'm not an expert on this stuff

Yeah, I always say the same - if this works for you though, I'm going to quit disclaiming my SEO advice ;-) No reason it shouldn't work, I'm just saying, I don't guarantee my results!

[edited by: ergophobe at 6:39 pm (utc) on June 18, 2009]

Robert Charlton

9:58 pm on Jun 13, 2009 (gmt 0)

>>Keep in mind that blocking with robots.txt will keep the pages out of the index, but it won't do anything to stop bleeding off link juice to those pages and thus won't fully address the dupe content issue.
>>A better solution would be if you could have the links to the PDFs be nofollow in addition to the robots.txt fix.

Note that Google's treatment of the rel="nofollow" link attribute has changed since the above was posted...

Google Changes Treatment of PR 'Saved' by rel=nofollow Sculpting
[webmasterworld.com...]

What this apparently means (no official word yet) is that while the use of nofollow will block the spiders going to the pdfs, a mathematical equivalent of the "bleeding off (of) link juice" will continue.

Uneasy consensus seems to be that the PageRank distribution from a page will be divided among all outbound links on the page, rather than reserved just for those which aren't nofollowed.

Where you offer PDF versions of your pages, I'd recommend, as suggested above, using the nofollow attribute anyway.

You'll effectively lose the link juice in any case, but the nofollow attribute will keep the links to the URLs of the PDFs from being displayed in the visible index, which might happen if there isn't a rel="nofollow" attribute on the link.

ergophobe

4:43 pm on Jun 15, 2009 (gmt 0)

I have to say, I'm not sure these are "new" rules or whether it's dispelling a long-standing misconception.

Anyway, effectively, it means that there's little difference between using a nofollow tag and a robots.txt disallow.

nofollow is still different from noindex in that noindex will keep your page out of the index (obviously).

One thing I haven't seen in all the PR discussions lately is how it affects the crawl. In other words, if you have a site that is not that active and does not get deep crawled, does having nofollow tags help focus the crawl?

So if Google has decided that your site is worth following for X pages, will you get more quality pages indexed if you take service pages and nofollow + disallow with robots.txt?

Robert Charlton

1:31 am on Jun 18, 2009 (gmt 0)

>>I have to say, I'm not sure these are "new" rules or whether it's dispelling a long-standing misconception.

"Dispelling a misconception" is in fact more accurate....

Matt Cutts on PageRank Changes
[webmasterworld.com...]

To quote the quote that was quoted...

So what happens when you have a page with "ten PageRank points" and ten outgoing links, and five of those links are nofollowed? Let's leave aside the decay factor to focus on the core part of the question. Originally, the five links without nofollow would have flowed two points of PageRank each (in essence, the nofollowed links didn't count toward the denominator when dividing PageRank by the outdegree of the page). More than a year ago, Google changed how the PageRank flows so that the five links without nofollow would flow one point of PageRank each.
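The arithmetic in that quote can be written out as a tiny sketch. It ignores the decay factor, as the quote does, and the function name and parameters are hypothetical, chosen only for illustration.

```python
def points_per_followed_link(page_points, total_links, nofollowed_links, old_model):
    """PageRank points flowing through each followed link (decay ignored).

    old_model=True:  nofollowed links are left out of the denominator.
    old_model=False: every outgoing link counts toward the denominator,
                     so the share assigned to nofollowed links is simply lost.
    """
    followed_links = total_links - nofollowed_links
    denominator = followed_links if old_model else total_links
    return page_points / denominator

# The example from the quote: ten points, ten links, five of them nofollowed.
before = points_per_followed_link(10, 10, 5, old_model=True)   # 10 / 5
after = points_per_followed_link(10, 10, 5, old_model=False)   # 10 / 10
print(before, after)
```

Under the second model the five followed links pass on five points in total, which is the "mathematical equivalent" of the bleed-off mentioned earlier in the thread.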

The crawl question ergophobe raises here was brought up briefly in the above discussion, but was considered off-topic enough that it's a question for another thread.

I had vaguely remembered one other method that had been discussed for blocking pdfs, and I just found it via site search....

X-Robots-Tag - controlling Googlebot via HTTP headers
[webmasterworld.com...]

...this is very useful for non-HTML content such as PDF, Word or plain text files, where you cannot insert meta elements....

...If your only intention is to disallow access to a file or files then a robots.txt would work just fine.
However, you can't use noarchive, nofollow, nosnippet, or unavailable_after in a robots.txt file. The header X-Robots-Tag is a much more powerful tool. It allows us to use these directives without needing to edit files. It also allows us to use these directives for media files, pdf files, etc, that can't have meta tags directives inserted in them....
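As a rough illustration of the idea, here is a sketch in Python of a WSGI wrapper that adds the header to PDF responses only. This is an example under stated assumptions, not how Joomla (which is PHP) does it: the function names, the demo app and the choice of "noindex, nofollow" as the header value are all mine, and on a real server the header would more likely be set in the web server or application configuration.

```python
def add_x_robots_tag(app, value="noindex, nofollow"):
    """Wrap a WSGI app so that PDF responses carry an X-Robots-Tag header."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Only touch responses whose Content-Type says they are PDFs.
            is_pdf = any(
                name.lower() == "content-type"
                and val.lower().startswith("application/pdf")
                for name, val in headers
            )
            if is_pdf:
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped

def demo_pdf_app(environ, start_response):
    """Stand-in for a handler that serves a generated PDF (hypothetical)."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF-1.4 placeholder"]

def response_headers(app):
    """Call a WSGI app with a dummy request and return its headers as a dict."""
    captured = {}
    def start_response(status, headers, exc_info=None):
        captured.update(headers)
    list(app({"REQUEST_METHOD": "GET", "PATH_INFO": "/"}, start_response))
    return captured

headers = response_headers(add_x_robots_tag(demo_pdf_app))
print(headers.get("X-Robots-Tag"))
```

Unlike a robots.txt Disallow, the crawler has to be allowed to fetch the PDF in order to see this header, so the two approaches should not be combined on the same URL.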