Why isn't Google indexing all of my pages?

Forum Moderators: Robert Charlton & goodroi



wakkawakka

1:59 pm on Aug 19, 2008 (gmt 0)

I have a site with 500,000 pages, but only about a third of them show up in Google's site: search. Why is this? My sitemaps are split into neat files of 50,000 links each, and still only a third of the pages are found. I have tried descriptive URLs, but that had no effect. Can anyone help?

Receptional Andy

8:33 pm on Aug 19, 2008 (gmt 0)

Hi wakkawakka, welcome to WebmasterWorld :)

Most likely, you have a problem related to your site's links - either internal or external.

If your pages are not linked together appropriately, Google will either not index all of the content or will not show you the majority of pages via a site: search. Note also that there are quirks with the site: search operator [webmasterworld.com] you should be aware of.

The other (perhaps more common) problem is that you may not have enough external sites referencing you to support the number of URLs on your site. You're going to need a fair number of solid links to get half a million pages indexed.

Another possibility is that Google does not perceive much of your content to be of high enough quality to be included in common searches - this can be due to 'stub' pages, low volumes of content and a whole host of other problems related to content quality.

g1smd

10:55 am on Aug 20, 2008 (gmt 0)

They never index everything. I am looking at a 204 page site where Google only ever lists 196 to 202 pages. There are always at least two missing. Scale that up to big sites and the percentage missed is likely to be a lot bigger.

That said, there are many ways that you can help the bot get round your site, and many on-page things you can do to improve the chances of stuff being indexed.

wakkawakka

11:05 am on Aug 20, 2008 (gmt 0)

Hi guys, thanks for the help. It's not a links issue, as we have a fairly solid PR5 and the navigation is static on every page, with breadcrumbs. Google Webmaster Tools shows that it has found all of the sitemaps but has only indexed a small portion of the links.

We could do with a few more backlinks, but... isn't that always the problem?

tedster

7:05 pm on Aug 20, 2008 (gmt 0)

Half a million urls is likely to require more than a PR5 home page to get anywhere near everything in the index. By the time that link juice gets split up and circulated, you're very likely to have MANY urls that Google's algo decides not to include.

Yes, more links will help, and especially deep links. But more than that, make sure your link structure and information architecture support the most important pages extremely well. Then accept that not all of your deep pages are likely to be indexed right now - and even if they do get indexed, they are not likely to rank.

Make sure your server responses are good and fast, and that you respond to googlebot's If-Modified-Since requests appropriately. Don't force a full spidering of any url if its content hasn't changed; just send a 304 status. This helps Google economize your crawl budget and spend more of each crawling cycle allotted to your site on discovering deeper pages.

Then watch your server logs to see which urls Google is requesting. You'll want to separate two different situations: urls that have been spidered but not put in the index, and those that are not yet spidered at all.
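As a rough illustration of that log check, here is a minimal Python sketch. It assumes the server writes Apache-style Combined Log Format and that you can supply the full list of paths from your sitemaps; the names `googlebot_paths` and `never_crawled` are made up for this example. Matching on the user-agent string alone can be fooled by spoofed bots, and logs cannot tell you whether a spidered url was actually indexed, so this only isolates the "not yet spidered" group.

```python
import re

# Assumed format (Combined Log Format):
# ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(log_lines):
    """Collect every path requested by a client identifying as Googlebot."""
    crawled = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            crawled.add(match.group("path"))
    return crawled

def never_crawled(all_paths, crawled):
    """Paths listed on the site that Googlebot has not requested yet."""
    return sorted(set(all_paths) - crawled)
```

The paths that were spidered can then be spot-checked in the index by hand, which gives you the second group: spidered but not indexed.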

union_jack

10:31 am on Aug 21, 2008 (gmt 0)

Tedster: you talk about a 304 status. How do you implement this? Also, I was wondering about the sitemap <changefreq> field, and whether it could be used more effectively. It is hard to know what to set it to on a large site.

tedster

4:35 pm on Aug 21, 2008 (gmt 0)

304 status. How do you implement this?

The technical details will vary with particular servers, so you should consult your own server's documentation. That discussion would go far outside the Google Search forum's topical area.

Here are two good threads to get you started:

Apache Forum: [webmasterworld.com...]
IIS Forum: [webmasterworld.com...]
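
Independent of any particular server, the decision itself is simple. Below is a minimal Python sketch of it, not a drop-in handler: `conditional_response` is a hypothetical helper, and it assumes you know each page's last-modified time as a timezone-aware UTC datetime. In practice the web server or framework usually handles this for static files, and the logic only needs writing by hand for dynamic pages.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) for a GET request.

    if_modified_since: raw If-Modified-Since header value, or None.
    last_modified: timezone-aware UTC datetime of the page's last change.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # unchanged: send headers only, no body

    return 200, headers  # changed or unconditional: send the full page
```

The key point is that a 304 response carries no body, so an unchanged page costs the bot almost nothing to recheck.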

tedster

4:38 pm on Aug 21, 2008 (gmt 0)

Also, I was wondering about the sitemap <changefreq> field, and whether it could be used more effectively. It is hard to know what to set it to on a large site.

The thing is that <changefreq> is only a suggestion to Google, not a requirement such as a robots.txt disallow rule. The Google crawl team still establishes the logic for crawl frequency, taking your suggestion into account as one of many factors.

So if you can't see how you want to modify that factor for your site, then you're probably best just to let Google have its way.
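
For a large site, that can simply mean leaving the field out. Here is a small Python sketch of a sitemap writer where <changefreq> is optional; `build_sitemaps` is a hypothetical helper, and the 50,000-url limit per file comes from the sitemaps.org protocol. It produces only the <urlset> documents, without an XML declaration or a sitemap index file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file limit in the sitemaps.org protocol

def build_sitemaps(urls, changefreq=None):
    """Return a list of sitemap XML strings, one per 50,000 urls.

    changefreq is optional; when None, the element is omitted and the
    crawl frequency is left entirely to the search engine.
    """
    documents = []
    for start in range(0, len(urls), MAX_URLS_PER_FILE):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc in urls[start:start + MAX_URLS_PER_FILE]:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = loc
            if changefreq is not None:
                ET.SubElement(entry, "changefreq").text = changefreq
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents
```

Calling `build_sitemaps(urls)` leaves the hint out entirely, while `build_sitemaps(urls, "weekly")` applies one value to every url in the list.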

union_jack

7:16 pm on Aug 21, 2008 (gmt 0)

Thanks, tedster.