Forum Moderators: open

Message Too Old, No Replies

Are PDFs Dead End Links?

Does Google follow the links in PDFs?


geckofuel

3:11 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



I've got a website with over 100 PDFs and was wondering whether Google follows the links in PDFs?

If so, then I'll need to go back through and place some links in the PDF files. Right now they are just dead ends.

takagi

4:33 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try this [google.co.jp] and you will see some links from a PDF to google.com, so links in PDFs can be found by G.

In the original document [www-db.stanford.edu] about PageRank, you can find this formula:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where C(T1) is the number of links going out of page T1.

It is likely that C(T1) will be set to 8 if page T1 has 10 links, of which 2 point to dead ends (pages with no outbound links). In that case the PR will be spread over the other 8 outbound links.
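The effect of that conjecture can be put in numbers with a small sketch. Everything here is an assumption for illustration: the toy graph, the damping factor d = 0.85, and the "skip dead ends" rule itself; this is not Google's actual implementation.

```python
# Toy illustration of the quoted formula PR(A) = (1-d) + d*(PR(T1)/C(T1) + ...),
# comparing two readings of C(T1): count every outbound link, or (the
# conjecture above) count only links to pages that themselves have
# outbound links. The graph and d = 0.85 are invented for the example.

d = 0.85

# T1 has 10 outbound links; 2 of them point to dead ends.
targets = ["A%d" % i for i in range(8)] + ["dead1", "dead2"]
outlinks = {"T1": targets}
for t in targets:
    outlinks[t] = []              # dead1/dead2 stay empty: no outbound links
for i in range(8):
    outlinks["A%d" % i] = ["T1"]  # the live pages link somewhere

def effective_outdegree(page, ignore_dead_ends):
    """C(page): number of outbound links, optionally skipping dead ends."""
    links = outlinks[page]
    if ignore_dead_ends:
        links = [t for t in links if outlinks[t]]
    return len(links)

pr_t1 = 1.0  # assume PR(T1) = 1 for the comparison

# PR that each live target receives from T1, i.e. d * PR(T1) / C(T1):
naive = d * pr_t1 / effective_outdegree("T1", ignore_dead_ends=False)
conjectured = d * pr_t1 / effective_outdegree("T1", ignore_dead_ends=True)

print(f"{naive:.3f}")        # share when C(T1) counts all 10 links
print(f"{conjectured:.3f}")  # share when C(T1) counts only the 8 live links
```

With C(T1) = 10 each target gets 0.085 of d·PR(T1); with the conjectured C(T1) = 8 each live target gets about 0.106, so ignoring dead-end links makes every remaining link pass on more PR.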

It is better to place a link in a PDF. Especially if other sites link directly to this PDF.

doc_z

5:19 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know if Google follows links in PDFs, but the example 'filetype:pdf link:www.google.com' just shows PDFs with "link:www.google.com" in them (you will get the same results searching for 'filetype:pdf "link:www.google.com"').

Also, I don't see any evidence that dead ends are treated differently from any other link.

To avoid PR being lost in dead ends you can use JavaScript links. However, if you want the PDFs to be indexed by Google, I would put all the links on a single page.

takagi

7:01 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello doc_z,

You are right, my example of links to Google from PDFs was incorrect. Sorry, I didn't look closely enough at the SERP. Since G doesn't allow you to combine a search for "link:" with a keyword (or filetype, inurl, etc.), the search command automatically changes into a search for the word "link".

I now found another example, but I'm not sure if the moderator will allow it. Search in G for

link:www.w3.org/People/Jacobs

and look at #9 in the SERP. It is a PDF that contains a link to the page of Mr. Jacobs.

I hope you can accept this as proof that G can find a link in a PDF.

For your second remark (about dead ends) I need some more time to find additional information.

doc_z

7:51 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



takagi

I found a PDF within the results. Therefore, it seems that you are right: Google follows links in PDFs.

I think you can easily find examples of dead ends with PR if you are looking for sites using frames.

takagi

9:14 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi doc_z,

Sure, dead end pages can show PR. But I wanted to say something else.

Think about this:
A new site has 8 pages. On the homepage are 7 links to the 7 sub pages. G spiders the homepage and 2 sub pages before making a new index. I read somewhere that in such a case the PR of the homepage is not distributed over all the linked pages (G found all the links, so that number is known), but only over the pages that were really spidered. The other 5 are dead ends.
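Put in numbers, the difference between the two readings is large. This is only a toy sketch: the thread itself is unsure whether Google ever worked this way, and d = 0.85 and PR(home) = 1 are assumptions.

```python
# Toy numbers for the scenario above: a homepage whose 7 links are all
# known to Google, but only 2 of the linked sub pages were actually
# spidered before the index was built. d = 0.85 and PR = 1 are assumed.

d = 0.85
pr_home = 1.0  # assumed PR of the homepage

known_links = 7   # links found on the homepage
spidered = 2      # sub pages actually fetched

# Share of d * PR(home) passed along per link, under the two readings:
share_all_known = d * pr_home / known_links  # split over every known link
share_spidered = d * pr_home / spidered      # split over spidered pages only

print(f"{share_all_known:.3f}")  # what each of the 7 links would pass on
print(f"{share_spidered:.3f}")   # what each spidered sub page would get
```

If only spidered pages count, the two fetched sub pages would receive about 0.425 each instead of about 0.121, so which reading Google uses matters a lot for a partly-crawled site.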

Or something similar: a page has some thumbnail images that link to bigger images. If the user clicks a thumbnail, the big image is shown. But the link does not point to an .html page but directly to a GIF file. By their very nature, GIF files cannot contain links and are therefore dead ends.

But I cannot find that information anymore. Maybe some other member on this forum knows where to find it. Or maybe my statement about dead ends is just incorrect.

doc_z

10:20 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



takagi

I would expect that the first scenario doesn't occur: Google updates the index only after following all links. That could be the reason why there is only one update a month. (But this is speculation.)

If these cases do occur, I don't know how G handles them.

Indeed, images are dead ends and it seems that they don't get any PR. Therefore, you are probably right in this case and these links don't count. However, I believe that PDFs are handled in the same way as HTML pages and that they get PR as normal even if they contain no links.

takagi

11:35 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi doc_z,

In June 2000, G claimed to have 1.06 billion pages in the index. But only 53% of them were really spidered. G is doing much better now, but you can still sometimes see one or more HTML results in the SERP with only a URL in blue and the text "Similar pages" in gray. Try this [google.com].

This is more likely for new sites with very few inbound links where not all pages are linked from the homepage. But sites with a low PR on the homepage and lots of dynamic pages can also have this problem.

I will stickymail you a better example that I don't want to show here.

doc_z

12:37 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi takagi

Indeed, there are a number of pages which are not spidered by Google.

However, there is still the open question if these pages have a PR or not. This is hard to see since Google sometimes guesses the PR.

I'm looking for a page which is not spidered but shows a high PR. If this page doesn't show any backlinks, I would assume that it doesn't have any real PR. On the other hand, if G shows some backlinks, I would believe the PR is real. Unfortunately, I haven't found such a site so far.

I see three possibilities:

1) During the monthly deep crawl G is spidering all pages. (Therefore, the case that there are unspidered pages will not occur.) The unspidered pages result from fresh spidering between the crawls. In this case these pages have no PR - since they didn't exist during the last calculation - and no backlinks will be shown.

2) There are pages which are not spidered during the monthly crawl and
a) they are left out of the PR calculation
b) they are treated as dead ends but still get a PR

takagi

1:43 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi doc_z,

Option 1 turns out to be wrong, since G does not spider all known links (see my example in message 8 and the stickymail). On this forum, members sometimes complain that the number of indexed pages of their site has gone down.

I think I'd go for your option 2a. Pages that are not spidered should neither influence how PR is spread over a page's links nor get any PR themselves. G will try to spider pages that deserve a high PR, so it will be very difficult to find such a page. But option 2b is also a possibility.

By the way, in my view the monthly update has basically 3 phases:
1. spidering
2. processing the data (recalculating PR, which links exist, what keywords are found on what page, etc.)
3. spreading the data over the data centers
This has to be done in this order with no (or almost no) overlap. IMHO it is not phase 1 (cf msg #7) but phase 2 that takes most of the time. (But this is also speculation.)

To go back to geckofuel's question: it is better to add links to the PDFs.

doc_z

2:04 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, either prevent G from spidering the PDFs or add at least one link to each.