Forum Moderators: Robert Charlton & goodroi


Will Panda Slap This Internal Linking Strategy?

         

Pjman

1:29 pm on Jul 3, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I have a site that is mostly PDFs.

I have HTML pages that point to about 15 PDFs each.

The first 5 PDFs listed on each HTML page are crawlable (not blocked by robots.txt), and I write original descriptions for them on the HTML pages.

The remaining 10 PDFs are blocked from being crawled by robots.txt. For their descriptions on the HTML pages, I use an excerpt from the PDF files themselves.

Will Big G (Panda) eventually see the last 10 descriptions as duplicate since they are part of the robots.txt blocked PDF files?

Doing it this way saves me tons of time.
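For illustration, the split described above could be expressed in robots.txt along these lines (the directory names here are hypothetical, not from the post):

```txt
# Hypothetical layout: freely crawlable PDFs live in /pdfs/free/,
# the 10 blocked ones in /pdfs/members/
User-agent: *
Disallow: /pdfs/members/
```

A directory-level Disallow like this is simpler to maintain than listing each PDF individually.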

Planet13

7:14 pm on Jul 3, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since no one has responded, I hope you don't mind if I ask this question:

Why do you block - via robots.txt - crawling of the last 10 pdf files that are linked from your html files?

Is it specifically so that you don't have to write unique descriptions for those .pdf files and then post them on your .html files (that link to those .pdf files)?

Or is there some other reason for blocking crawling of the .pdf files?

~~~~

I apologize if this is an off topic question for you.

Pjman

10:50 pm on Jul 3, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



When I start membership sites, I make them completely free at first.

I allow all users to download and see the PDFs. Once I set up the paywall, those robots.txt-blocked files are only available to paying members.

I receive tons of links because of the link bait.

This way Googlebot sees zero changes when I set up a paywall.

Planet13

12:05 am on Jul 4, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the explanation.

I really don't know how Google would react if it were to somehow "see" both the descriptions on the HTML pages and the same descriptions in the .pdf files.

tangor

1:19 am on Jul 4, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If G can't see the "last 10 pdfs" there should be no problem, as they have NOT indexed them, thus no duplicate content.

Just make sure G never sees them!

I presume these protected pdfs are in a password protected directory, right?

Pjman

2:18 am on Jul 4, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



The PDFs that have the content are currently blocked only by robots.txt. Once I switch the site over to a membership, I block them via .htaccess.
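As a sketch of that kind of lockdown (assuming Apache; the file paths and realm name are hypothetical), an .htaccess file placed in the members-only PDF directory might look like:

```apache
# Hypothetical .htaccess for the members-only PDF directory.
# Requires a password file created with, e.g.:
#   htpasswd -c /home/example/.htpasswd someuser
AuthType Basic
AuthName "Members Only"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

Unlike robots.txt, this actually denies access: any crawler or visitor without credentials gets a 401, not just a polite request to stay out.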

Pjman

2:30 am on Jul 4, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



The only thing that I find funny about this whole thing, and Googlers should take note, is this:

I'm worried about showing my users the exact content they will see when they click a link to view my own content on my own site. (i.e., taking away any doubt about what happens when they click. "Better user experience.")

Because G may slap me for it.

Google, on the other hand, does this to every page of my site when they show a link to it in the results, unless I give them a meta description, and a lot of the time they ignore those.

Planet13

2:33 pm on Jul 4, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I really don't know if you have to worry so much about "duplicate" content.

Think of how excerpts work for wordpress sites.

The CATEGORY level page USUALLY contains the first paragraph (or two) of THE EXACT SAME TEXT as the blog posts that are linked from it.

Most bloggers DON'T write a separate description for their category-level page that differs from what appears on the blog post itself.

Heck, many wordpress websites show THE ENTIRE BLOG POST on the home page (or archive page), and then have it available as a separate blog post as well.

I don't think you have to worry so much about this.

Pjman

9:40 pm on Jul 4, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



That makes perfect sense. I never thought about it that way. Makes me feel better, thanks.

vandelayweb

5:35 am on Jul 5, 2014 (gmt 0)

10+ Year Member



I agree. I don't see much of a worry with duplicate content here, but you may want to make sure those pages get purged from Google's search results once they are added to robots.txt, or manually request their removal in GWT.

I had a similar setup where I had files on a site active for some time, then I blocked them via robots.txt. Unbeknownst to me, they were still hanging around in the index a year+ after they were added to robots.txt. Just something to keep an eye out for.

tangor

6:05 pm on Jul 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just a reminder: G (or Bing and the others, for that matter) never forgets a URL it has met. Disallowing an already-indexed URL via robots.txt does not prevent that known URL from being accessed by others. G may honor robots.txt, but way too many others won't.

If content is meant for paywall stuff, lock it down and prevent any access except via approved passwords.

Robert Charlton

10:04 pm on Jul 5, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Beyond the question of what Google sees now....

My emphasis added...
This way google bot sees zero changes when I setup a pay wall.

I'm thinking that what Google might see when you set up the paywall is a change in user behavior, probably time on page.

I've consulted for sites that have added registration requirements after several visitor views. I predicted there would be roughly a 50% loss of traffic past that point. It turns out that it was much larger than 50%. If you continue to keep the first 5 PDFs user-accessible, you may avoid this problem.

When setting something like this up, you need to be careful that you're not intentionally frustrating users in an attempt to get them to pay. I've seen a site that offered users links to paid PDFs, but when they clicked they were presented with tiny, unreadable thumbnails and a message that they could see more by paying. IMO, this is an extremely poor approach.

Regarding just keeping the PDF content out of the index, you might look into using X-Robots-Tag: noindex. Though I don't offhand see how you can use this tag for some PDFs on a page and not others, it's very possible you can figure something out. See this Google Developers article...

Robots meta tag and X-Robots-Tag HTTP header specifications
[developers.google.com...]
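As a rough sketch of how the header could be scoped to only some PDFs (assuming Apache with mod_headers enabled; the per-directory placement is an assumption), an .htaccess file dropped into just the directory holding the PDFs you want kept out of the index might look like:

```apache
# Hypothetical .htaccess inside only the directory of PDFs to be kept out
# of the index. Requires mod_headers; PDFs elsewhere are unaffected.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```

Note that Googlebot can only see this header if it is allowed to fetch the files, so this approach would replace a robots.txt Disallow rather than sit behind one.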

Pjman

1:30 am on Jul 7, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks for the input, guys. I have run something similar on previous sites. As long as I blocked the PDF directory of the paid content in robots.txt, confirmed G saw the robots.txt change, and added the paid PDFs after that, they were never seen. When I implemented the paywall, traffic never dropped; it only increased as I added new content and natural links started coming in.