Forum Moderators: open
We recently made extensive changes to improve crawling of our site, but we've seen no improvement after two weeks. Here's a summary of what we've done:
1. Added a Site Map (see link at bottom of left navigation) to direct spiders to all of our 7,000+ product pages and all of our hundreds of technical documents.
2. Added meta robots tags to all pages to steer spiders through the site map--simpler for spiders--and to keep them from following links through our category pages, which would result in too much cross-linking and confusion for spiders. Specifically, the robots tag is set to INDEX, FOLLOW on our home page, site map, and secondary site map pages, and to INDEX, NOFOLLOW on our product and category pages so that they get indexed but not followed through, leaving the site map as the only path for spiders to find our product pages--this was recommended to us in the past by SEO "experts." On certain pages not appropriate for search engine indexes, we have set the robots tag to NOINDEX, NOFOLLOW.
3. Redesigned links to make them simpler by removing unnecessary query string parameters. Now we have only one query string parameter for maintaining state, UID.
4. We hide our state parameter (UID) when the visitor is a search engine spider by checking the user agent. So, for all intents and purposes, our links are *very* spider-friendly. The only query string parameters spiders see are ones that indicate different content, like T1 for product and SKW for search keyword.
5. Page titles in the <title> tag on product pages now read the same as the product name/description in the copy of the product page. (Before, the title was [snip] on every page.)
6. Meta descriptions on product pages now contain the same product copy that visitors see in the page's copy, except with all HTML tags removed, like bold, italics, bullets, etc.
7. The robots.txt file has been updated to more effectively steer search engine spiders toward what we want crawled and block them from what we don't. We have left open the folders necessary for our product pages and technical documents to be crawled--evidenced by the fact that a small number of those pages are being crawled. You can see the robots.txt file that search engines use by following this link:
[snip]
8. Added H1 (headline) tags to product pages to emphasize product names/descriptions in the copy, giving them more weight in search engine indexes. We used our style sheet to restyle the H1 tag so that it does not appear the way a headline normally does (huge text), but instead matches our product page layout.
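For reference, the robots tags described in point 2 use the standard meta robots syntax; on our pages they look roughly like this (illustrative markup, not our actual source):

```html
<!-- Home page, site map, and secondary site map pages: index and follow -->
<meta name="robots" content="INDEX, FOLLOW">

<!-- Product and category pages: index, but don't follow links out -->
<meta name="robots" content="INDEX, NOFOLLOW">

<!-- Pages we want kept out of the search engine indexes entirely -->
<meta name="robots" content="NOINDEX, NOFOLLOW">
```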
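Points 5 and 6 together mean each product page's head now mirrors its on-page copy, along these lines (the product name here is a made-up example):

```html
<head>
  <!-- Title matches the product name/description shown in the page copy -->
  <title>Heavy-Duty Widget, Model 500</title>
  <!-- Description carries the same product copy, with HTML tags stripped -->
  <meta name="description" content="Heavy-Duty Widget, Model 500. Rugged steel housing for industrial use.">
</head>
```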
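The robots.txt in point 7 follows the usual exclusion format--anything not listed under a Disallow line stays crawlable. A simplified sketch (the folder names are placeholders, not our real paths):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /checkout/
# The folders holding product pages and technical documents are not
# listed here, so they remain open to crawling.
```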
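The H1 restyling in point 8 is plain CSS; something along these lines (the class name and sizes are illustrative):

```html
<style type="text/css">
  /* Tone the headline down so it matches the product page layout
     instead of rendering at the browser's default huge size */
  h1.product-name { font-size: 1em; font-weight: bold; margin: 0; }
</style>

<h1 class="product-name">Heavy-Duty Widget, Model 500</h1>
```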
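For anyone curious how the UID hiding in point 4 works in principle, here is a rough sketch (in Python, purely illustrative--our actual server code differs, and the bot substrings and function names are made up for this example):

```python
# Sketch: drop the session-state parameter (UID) from generated links
# when the requesting user agent looks like a search engine spider.
# Content parameters like T1 (product) and SKW (search keyword) are kept.

BOT_SUBSTRINGS = ("googlebot", "slurp", "msnbot")  # hypothetical list

def is_spider(user_agent: str) -> bool:
    """Very rough user-agent sniffing, as described in point 4."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BOT_SUBSTRINGS)

def build_link(base: str, params: dict, user_agent: str) -> str:
    """Build a link, hiding UID from spiders so URLs stay clean for them."""
    if is_spider(user_agent):
        params = {k: v for k, v in params.items() if k != "UID"}
    if not params:
        return base
    query = "&".join(f"{k}={v}" for k, v in params.items())
    return f"{base}?{query}"
```

So a spider requesting a product link would see only the content parameters, while a normal visitor's link still carries the UID for state tracking.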
Please review www.example.com as a case study and provide explanations, advice, etc., regarding why we are not being adequately crawled by Googlebot and Yahoo! Slurp, among others.
Additionally, please answer the following questions if you happen to know:
1. How often does Google index an entire site? Does Google index the entire site every time it visits, or does it index only a portion each day and pick up where it left off the next day?
2. Does Google only re-index a page if its last-modified date has changed since the last indexing?
3. How can we find out why we are not getting indexed?
4. Is there a contact at Google and/or Yahoo! who can answer these questions?
5. As for Yahoo!: does our participation in Yahoo!'s Overture SiteMatch limit Yahoo!'s crawling to just what we pay for through SiteMatch? We are actually using [snip] to manage our SiteMatch submissions, and they still refer to this indexing as "Inktomi." If we quit [snip] (and therefore SiteMatch), will Yahoo! automatically start/continue to spider our website?
Any advice/info would be greatly appreciated.
- John (Webmaster)
[edited by: pageoneresults at 3:43 pm (utc) on April 12, 2004]
[edit reason] Removed Specifics - Please Refer to TOS [/edit]