Forum Moderators: Robert Charlton & goodroi
Most likely, you have a problem related to your site's links - either internal or external.
If your pages are not linked together appropriately, Google will either not index all of the content or will not show you the majority of pages via a site: search. Note also that there are quirks with the site: search operator [webmasterworld.com] you should be aware of.
The other (perhaps more common) problem is that you may not have enough external sites referencing you to support the number of URLs present on your site. You're going to need a fair number of solid links to get half a million pages indexed.
Another possibility is that Google does not perceive much of your content to be of high enough quality to be included in common searches - this can be due to 'stub' pages, low volumes of content and a whole host of other problems related to content quality.
That said, there are many ways that you can help the bot get round your site, and many on-page things you can do to improve the chances of stuff being indexed.
We could do with a few more backlinks but... isn't that always the problem...
Yes, more links will help, and especially deep links. But more than that, make sure your link structure and information architecture support the most important pages extremely well. Then accept that not all of your deep pages are likely to be indexed right now - and even if they do get indexed, they are not likely to rank.
Make sure your server responses are good and fast, and that you respond to Googlebot's If-Modified-Since requests appropriately. Don't force a full spidering of any URL if its content hasn't changed, just send a 304 status. This will help Google to economize your crawl budget and spend more of each crawling cycle they've allotted to your site on discovering deeper pages.
Then watch your server logs to see which urls Google is requesting. You'll want to separate two different situations: urls that have been spidered but not put in the index, and those that are not yet spidered at all.
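As a rough starting point for that log review, here is a minimal Python sketch. It assumes your access log is in the common "combined" format and that you can supply your own list of site paths; the function names are made up for illustration. Note that it matches on the user-agent string only, which can be spoofed, and that logs can only tell you what has been requested. Whether a spidered URL made it into the index has to be checked separately.

```python
import re
from collections import Counter

# Matches the request line, status, size, referrer and user agent
# of a combined-format access log line (an assumption about your log format).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_requests(log_lines):
    """Count requests per path from clients whose user agent mentions Googlebot."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("path")] += 1
    return counts


def not_yet_spidered(all_paths, log_lines):
    """Return the site paths that never show up in Googlebot's requests, sorted."""
    seen = googlebot_requests(log_lines)
    return sorted(path for path in all_paths if path not in seen)
```

Feed it the lines of your log file plus the full list of paths you expect to be crawled, and the second function gives you the "not yet spidered at all" group.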
304 status. How do you implement this?
The technical details will vary with particular servers, so you should consult your own server's documentation. That discussion would go far outside the Google Search forum's topical area.
Here are two good threads to get you started:
Apache Forum: [webmasterworld.com...]
IIS Forum: [webmasterworld.com...]
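For anyone who just wants to see the logic before digging into server specifics, here is a server-agnostic sketch in Python. It is not how Apache or IIS implement it, only an illustration of the decision: compare the If-Modified-Since header against the date the page last really changed, and answer 304 if nothing is newer. The function name and return shape are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) for a GET request.

    if_modified_since: raw If-Modified-Since header value, or None.
    last_modified: timezone-aware datetime of the page's last real change.
    """
    # HTTP dates carry whole seconds only, so drop microseconds before comparing.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # unchanged: send headers only, no body

    return 200, headers  # changed, or not a conditional request: send the page
```

The important part is that last_modified reflects a genuine content change. If your pages are generated on the fly and stamped with the current time on every request, the comparison never succeeds and you never send a 304.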
Also, I was wondering about the sitemap <changefreq> field, and whether it could be used more effectively. It is hard to know what to set it to on a large site.
The thing is that <changefreq> is only a suggestion to Google, not a directive like a robots.txt disallow rule. The Google crawl team still establishes the logic for crawl frequency, taking your suggestion into account as one of many factors.
So if you can't see how you want to modify that factor for your site, then you're probably best just to let Google have its way.
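If you do want to set it on a large site, one option is to derive the value from each page's real last-modified date rather than guessing by hand. The Python sketch below does that. The age thresholds are arbitrary assumptions for illustration, not anything Google has published, and the function names are hypothetical. The values it emits (daily, weekly, monthly, yearly) are among those the sitemap protocol allows.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def suggest_changefreq(last_modified, today):
    """Hypothetical heuristic: map days since the last change to a changefreq value."""
    age = (today - last_modified).days
    if age <= 1:
        return "daily"
    if age <= 14:
        return "weekly"
    if age <= 90:
        return "monthly"
    return "yearly"


def build_sitemap(pages, today):
    """Build sitemap XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
        ET.SubElement(url, "changefreq").text = suggest_changefreq(last_modified, today)
    return ET.tostring(urlset, encoding="unicode")
```

Since it is only a hint either way, an honest <lastmod> is probably worth more effort than fine-tuning <changefreq>.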