Forum Moderators: Robert Charlton & goodroi


Why can't Google index this page?

         

sohail009

5:25 am on Jul 14, 2009 (gmt 0)

10+ Year Member



I have a website and I am wondering why Google hasn't indexed most of its pages; < there's one page in particular >. Google has already indexed its PDF version, but the HTML version is not being indexed. Could you please guide me toward a solution?

I suspect the problem is that the text in the body is inside a <pre> tag. Is it possible that Google does not index the page because of the <pre> HTML tag?

[edited by: tedster at 6:43 am (utc) on July 14, 2009]

tedster

6:57 am on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello sohail009, and welcome to the forums.

No, the <pre> tag is not a problem. But if there is a crawling problem, you will often see reports about it in your Webmaster Tools account.

Google does not guarantee to index every page that they spider - in fact they usually don't. This is especially true when the same text is available at more than one address, such as the situation you report with both HTML and PDF versions of the information.

How many total pages does your site have, and how many pages does Google show they've indexed?

sohail009

7:07 am on Jul 14, 2009 (gmt 0)

10+ Year Member



Actually the site has more than 1,000 pages, and the sitemap.xml contains 1,075 URLs. The URL is already listed in the XML sitemap, but I am still wondering why Google doesn't index it or show it in search results.

The URL also does not appear among Google's successfully crawled URLs:

< snip >

[edited by: Robert_Charlton at 7:40 am (utc) on July 14, 2009]
[edit reason] removed specifics [/edit]
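As an aside on the sitemap mentioned above: listing a URL in sitemap.xml only makes it known to Google; it does not guarantee crawling or indexing. For reference, a minimal sitemap entry looks like this (the URL and date are placeholders, not taken from the poster's site):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/page1.html</loc>
    <lastmod>2009-07-01</lastmod>
  </url>
</urlset>
```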

tedster

7:45 am on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are three different processes to be very precise about when you analyze a situation:

crawling >> indexing >> ranking

  1. You can best see what is crawled from your own server logs

  2. You can get an idea of how many URLs are indexed by entering site:example.com into the Google Search box - where example.com is your actual domain name

  3. You can get a general idea of your best rankings from within Webmaster Tools. You also get more information from the Google search referers in your server logs.
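Point 1 above can be sketched with a few lines of Python, assuming your server writes logs in the common combined log format (the log lines here are invented samples, not real data):

```python
import re
from collections import Counter

# A minimal sketch of step 1: counting which URLs Googlebot has
# requested, from access-log lines in combined log format.
# The sample lines below are made up; in practice you would read
# your real server log file instead.
sample_log = """\
66.249.66.1 - - [14/Jul/2009:05:25:01 +0000] "GET /page1.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [14/Jul/2009:05:25:05 +0000] "GET /page1.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.168.0.5 - - [14/Jul/2009:05:26:00 +0000] "GET /page2.html HTTP/1.1" 200 2048 "-" "Mozilla/4.0 (compatible; MSIE 7.0)"
"""

googlebot_hits = Counter()
for line in sample_log.splitlines():
    if "Googlebot" in line:
        match = re.search(r'"GET (\S+) HTTP', line)
        if match:
            googlebot_hits[match.group(1)] += 1

# URLs Googlebot requested, most-crawled first
print(googlebot_hits.most_common())  # [('/page1.html', 2)]
```

If a URL never shows up in this kind of report, Googlebot has not fetched it, and the question becomes a crawling problem rather than an indexing one.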

Why any URL gets skipped at each of these three stages can be quite complex. But from what you described, I'd say the duplicate text is the major issue.

sohail009

7:56 am on Jul 14, 2009 (gmt 0)

10+ Year Member



Well, I also don't know why URLs are skipped, although I have already gone through steps 1 and 2. The interesting thing is that in step 2, Google says:

Results 1 - 10 of about 3,040 from example.com. (0.15 seconds)

When I reach the last page, which is page 37, Google now says:

Results 361 - 370 of 370 from example.com. (0.25 seconds) :) First they said 3,040 pages, but on browsing to the last page it says 370...

About duplicate text: no, I don't have any duplicate text on my own site. But yes, another domain does have the same content - and unfortunately their page is not indexed either.

[edited by: Robert_Charlton at 8:31 am (utc) on July 14, 2009]
[edit reason] changed to example.com [/edit]

tedster

8:10 am on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Perhaps I misunderstand - I was responding to what you wrote here: "...although Google had already indexed its PDF version but HTML version is not being indexed." If the same text is included in the .html document and the .pdf document, then that is a duplicate content situation.

----

The site: operator usually begins by giving only a rough estimate of the total number, and sometimes it's a VERY rough estimate. That's why the word "about" is included until you drill down toward the final pages of results.

Over the years, several Google people have mentioned this - the way that Google shards the data and stores those fragments across several hundred thousand servers makes an immediate exact count very difficult to return to the user.

sohail009

8:15 am on Jul 14, 2009 (gmt 0)

10+ Year Member



No, you understood it right... my mistake. So your suggestion is that removing the PDF version will resolve the issue?

tedster

8:25 am on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, that's a step you can take. You don't need to remove the file, you can just use a Disallow rule in your robots.txt file. That way you still have it online for your visitors' convenience, but the search engines will not keep that URL in their public index.
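A minimal sketch of such a robots.txt rule (the PDF path here is hypothetical - substitute the actual location of your file):

```
User-agent: *
Disallow: /downloads/report.pdf
```

One caveat: a Disallow rule stops future crawling, but a URL that is already indexed may take some time to drop out of the results; Webmaster Tools also offers a URL removal request if you want it gone sooner.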

g1smd

7:09 pm on Jul 15, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do change the search URL to add the &num=100 parameter so that you see 100 results per page - for example:

http://www.google.com/search?q=site%3Aexample.com&num=100

It will save a lot of work.