Google indexing
Getting Google to crawl my entire site.

 8:09 pm on Apr 4, 2005 (gmt 0)

I have had a website up for a while and Google has finally decided to crawl/index it.

I have noticed in my logfiles for the past few months that every time Googlebot visits my site it only makes 2 hits. This happens almost twice a month. I have a few questions, and if anyone has the time to provide me with some insight I would greatly appreciate it. Not sure that this matters, but MSN and Slurp (Yahoo) index my site like crazy. Why doesn't Google like me?

Is there a reason that it only makes 2 hits on my site per visit?

Is my site not 'special' enough yet for it to index the whole site?

What will it take to get google to go past my front page?



 1:27 pm on Apr 7, 2005 (gmt 0)

What makes Google come to a page is the presence of links to that page, and in particular, links to that page from pages that are themselves well-linked.

Think of the Internet as being like a web of corridors, with a new page on the end of every corridor. The more corridors to your home page, the more often Google will stumble down one.

If there are a hundred times as many corridors to your page, Google will come by 100 times as often, or at least, that'll do for a first approximation.
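
As a rough illustration of the "more corridors, more visits" approximation, here is a toy random-surfer model in Python. It is only a sketch: the page names, the damping value and the model itself are assumptions made for illustration, not a description of how Google actually schedules its crawler.

```python
def visit_share(links, damping=0.85, rounds=50):
    """Estimate how often a random surfer lands on each page.

    links maps a page name to the list of pages it links to.
    Pages with no outgoing links are treated as jumping to any page.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    share = {page: 1.0 / len(pages) for page in pages}
    for _ in range(rounds):
        # Every page gets a small baseline share from random jumps.
        nxt = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page) or pages
            for target in targets:
                nxt[target] += damping * share[page] / len(targets)
        share = nxt
    return share


# Hypothetical web: three directories link to "busy", only one links to "quiet".
toy_web = {
    "dir1": ["busy", "quiet"],
    "dir2": ["busy"],
    "dir3": ["busy"],
    "busy": [],
    "quiet": [],
}
shares = visit_share(toy_web)
```

In this toy web the page with more inbound links ends up with the larger share of visits, which is the point of the corridor picture.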

If you're being visited so infrequently, it suggests there are few decent links to your site. Try DMOZ and other directories and get links from authoritative sites too.

As regards Google getting into your site, well, deep-links from outside will help, but so will more links to your home page.
The more often your homepage is visited, the more likely Google will be interested in those pages inside your site that have even fewer paths to them. Do check with the W3C validator that your HTML is error-free - different errors can derail some bots and not others...

Get more links, then get more links, and finally, get more links...


 2:03 pm on Apr 7, 2005 (gmt 0)

What does googlebot do? 2 hits on the home page and go away? What 2 pages does it request?
Do you have a sitemap?
This can help the bot 'get around' if it is running into a wall somewhere. Nothing fancy just a plain HTML page linking to the major areas.
Is your site straight HTML or is it all dynamic and other fancy stuff?
Some strange scripts can make googlebot choke.
Have any scripts?
Do you use session id's?
googlebot may choke on that
Do you have a robots.txt file?
Some entry in there not quite right?

Try looking at your site with a Lynx browser and see what googlebot sees.
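
If Lynx is not at hand, a rough substitute is to list the plain links a text-only client could follow. The sketch below uses Python's standard html.parser; the sample page and its URLs are made up for illustration, and a real crawler's link discovery may well differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, roughly what a text browser can follow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip script-only links, which a text-only client cannot follow.
            if name == "href" and value and not value.lower().startswith("javascript:"):
                self.links.append(value)


def crawlable_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical home page: one static link, one script-only link, one session-id link.
sample_page = """
<html><body>
<a href="/sitemap.html">Site map</a>
<a href="javascript:openForum()">Forum</a>
<A HREF="/forum?sid=abc123">Forum (plain link)</A>
</body></html>
"""
found = crawlable_links(sample_page)
```

If the list comes back empty or contains only script links for your own home page, that would be one reason a bot stops at the front door.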


 5:09 pm on Apr 7, 2005 (gmt 0)

DerekH and Reid thank you both for your time in responding.

I did validate my HTML and other than a few things about how something is not a valid attribute for a tag, there is nothing major.

I had considered going to do some link exchange programs but I have heard that some of the bots don't like that and lower your ranking for having done it.

I do have quite a few link on my own site pointing back to my home page. For example in my site the header graphic is always a link back to the home page. Do the internal links I have make any difference?

In answer to some of Reid's questions.

What does googlebot do? 2 hits on the home page and go away? What 2 pages does it request?
Yes, googlebot will register 1 visit with 2 hits. This will happen about every 20-30 days. I am unclear as to which 2 pages it hits. Judging by the cache that Google shows for my site it definitely hits my front page, and who knows what the second hit is.

Do you have a sitemap?

Is your site straight HTML or is it all dynamic and other fancy stuff? Have any scripts?
It is all dynamically generated using ZOPE, Python, and postgreSQL. Lots of Python scripts, but they all run backend before transferring data to the user agent/browser

Do you use session id's?

Do you have a robots.txt file? Some entry in there not quite right?
I do have the file although it is just empty.

At any rate MSN and Yahoo hit me at a very high rate. MSN bot this month (it's the 9th) so far has made 1841 hits. Slurp is not far behind. It is frustrating to be top 10 with them (MSN, Yahoo) for a few niche pages that I have, while with the same keywords Google doesn't even have me 20 pages deep, and the top results for what I am searching for are commercial/news sites that have very little to do with my search criteria.

Another reason this gets me going is that there are maybe a total of 5 pages on the entire web that have the same information on a very specific topic as I do in these pages. Google's results return none of them. I don't mind being outranked by pages that contain the same or similar information, but when I am outranked by 100s of pages completely unrelated to my search criteria it makes me wonder why Google remains the favorite search engine among web users.

Before posting these pages to my site I did a little SEO investigating and did some things I thought would help. I made the pages dedicated. No query strings passed in, no auth required. No incoming arguments, nothing. The titles and the ids of the pages share all the same common words. I think I did a great job and so do MSN and Yahoo. I think Google hates me. Also these pages have been up for over 10 months now.


 8:23 pm on Apr 7, 2005 (gmt 0)

[quote]I did validate my HTML and other than a few things about how something is not a valid attribute for a tag, there is nothing major.[/quote]

That could be something right there. What invalid attribute?

Did you try using lynx or a googlebot sim?

poodle predictor is a good tool to try.

It really sounds to me that googlebot is not able to crawl your site for some reason. Really, it should be gobbling it up.

Here is another possible scenario (somebody correct me if I'm wrong):
you have a dynamically generated homepage and for some reason the page changes every visit.

For a new site.
google will come and fetch the index page first.
It will make a few determinations about your site.
it exists for instance.
Then another googlebot will come and do the crawling.
Once you get to that stage then you're ok, but if googlebot 2 keeps getting a different (updated) page than googlebot 1 then it will resend googlebot 1 and start the process all over.
So if the page is different every visit (this is for the first few visits only) then google has trouble OK'ing your initial crawl. It goes into an endless loop of refetching the homepage.
the other hit is probably robots.txt


 8:26 pm on Apr 7, 2005 (gmt 0)

instead of an empty robots.txt try

user-agent: googlebot
allow: /

Maybe it doesn't like the "empty"
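
One way to sanity-check robots.txt variants locally is Python's standard urllib.robotparser. This is a sketch of how that parser reads each variant; Googlebot's own parser may behave differently, and the test URL is a made-up path. The third variant (an empty Disallow line) is the conventional "allow everything" form in the original robots.txt convention.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url="/some/page.html"):
    """Return True if Python's parser says Googlebot may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


variants = {
    "empty file": "",
    "allow line": "user-agent: googlebot\nallow: /",
    "empty disallow": "User-agent: googlebot\nDisallow:",
    "block everything": "User-agent: *\nDisallow: /",
}
results = {name: googlebot_allowed(text) for name, text in variants.items()}
```

Under this parser the first three variants all permit crawling, so an empty file is unlikely to be the blocker by itself; only the last variant shuts the bot out.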


 9:23 pm on Apr 7, 2005 (gmt 0)

What invalid attribute?
There are a bunch like this:

there is no attribute "BACKGROUND"
<table class="Azcat_Top_Bar" cellspacing="0" background="/images/header_repeat.jpg">

Nothing to do with bad head or body tags. No complaints on the meta tags either.

I think what you said may have hit the nail on the head, in regard to:
Google will come and fetch the index page first.
It will make a few determinations about your site.
it exists for instance.
Then another googlebot will come and do the crawling.
Once you get to that stage then you're ok, but if googlebot 2 keeps getting a different (updated) page than googlebot 1 then it will resend googlebot 1 and start the process all over.

The front page of my site is a 'What's New' styled page which has the last 10 message posts from a forum and the 10 most recent news articles posted. From hour to hour you could get completely different page sources as far as the main content area goes on the front page. I will point out though that the skin and left/right hand navigation links remain the same. You don't think they would do something as silly as a straight string comparison to determine if it is the same page or not, do you?
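
To see what a straight comparison would make of such a page, here is a small Python sketch. Nothing in this thread confirms how Google compares fetches; the comment markers and page text below are invented for illustration. It only shows that a whole-page fingerprint changes whenever the content area changes, while a fingerprint of the surrounding skin stays put.

```python
import hashlib


def fingerprint(text):
    """Hash a string so two fetches can be compared cheaply."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def split_page(page, start_marker, end_marker):
    """Return (shell, content): the skin around the markers and the text between them."""
    start = page.index(start_marker) + len(start_marker)
    end = page.index(end_marker, start)
    return page[:start] + page[end:], page[start:end]


# Hypothetical markers wrapping the 'What's New' area of the front page.
START, END = "<!-- content -->", "<!-- /content -->"
morning = "<nav>Home | Forum</nav>" + START + "10 posts from 9am" + END + "<footer/>"
evening = "<nav>Home | Forum</nav>" + START + "10 posts from 6pm" + END + "<footer/>"

whole_page_same = fingerprint(morning) == fingerprint(evening)
shell_same = fingerprint(split_page(morning, START, END)[0]) == fingerprint(
    split_page(evening, START, END)[0]
)
```

Here whole_page_same comes out False and shell_same comes out True, so any check based on the full page source would see a "different" page on every visit.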

Interestingly enough, based on this information from you I requested Google's cached version of my page and its content is at least 6 months old. Which says to me that even though I have logs of googlebot hitting my site in the months following the time that cache seems to have been made, their cache has not been updated. Grrrr.

Any ideas for a possible workaround? This behavior you are describing seems a little counterintuitive. I can think of many sites whose front page content is different from day to day. Isn't that what makes a good site? Current content?

I will try your suggestion of adding in the allow code for googlebot.

Is this thing you are describing known as the 'sandbox'? I have seen the 'sandbox' mentioned often in posts while I was looking for similar complaints, although I don't know what it is.


 4:26 am on Apr 8, 2005 (gmt 0)

sandbox - a mythical place where new websites go before entering the google index.
Many new websites will get crawled by google and then get sent to the proverbial 'sandbox' where they may get 10 visits a day for up to a year no matter what SEO they do. Google denies the existence of the sandbox but it is a well known phenomenon among webmasters.

Another thing - in Google's guidelines

They say 'every page should have at least one static link to it'

Make sure you have a sitemap of static links and a static link to the sitemap on your homepage.

'dynamic url's should be kept as short as possible.'

Googlebot can follow dynamic links but can also become confused by them - esp if 2 different dynamic url's point to the same page. - make the sitemap a nice static path thru your site - no dynamic links in the path. I read somewhere that &src= can make googlebot puke. same way session id's can send it into an endless loop (it gets a new id each visit).
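
To illustrate the duplicate-URL problem, here is a sketch of collapsing dynamic URLs that differ only by a session id into one canonical form, using Python's urllib.parse. The parameter names in SESSION_PARAMS and the example URLs are assumptions; substitute whatever your own platform appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust to match your own site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def canonical_url(url):
    """Drop session parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


first_visit = canonical_url("http://Example.com/forum?sid=abc&topic=7")
second_visit = canonical_url("http://example.com/forum?topic=7&sid=zzz")
```

Both visits reduce to the same URL, which is the form worth linking to from a static sitemap.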

I may be wrong about the changing content thing. Maybe put a base href tag on the home page just to make the point that 'yes this is the page'. If it can get that static link (early on) then it can find the static sitemap and be on its way.


 4:29 am on Apr 8, 2005 (gmt 0)

From Google's guidelines: 'Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site. Supporting this feature saves you bandwidth and overhead.'

This could be why the cache is not updating.


 5:16 am on Apr 8, 2005 (gmt 0)

also that robots.txt entry - I got it from the Google guidelines but it is not really proper.

here is a proper entry

user-agent: googlebot


 3:14 pm on Apr 8, 2005 (gmt 0)

We can support the If-Modified-Since HTTP header on the server machine, although I am unclear on how to use it on the site.

Thanks for the robots.txt code. I think I am going to spend some time putting some code in there.

Anyone know if the bot1, bot2 loop I am stuck in is called the sandbox? What is the sandbox? I see posts about it but I am unclear as to exactly what it entails.


 8:11 pm on Apr 8, 2005 (gmt 0)

'If-Modified-Since' is handled on the server. Make sure it's enabled - googlebot uses this feature.

I already explained the 'sandbox' .
That happens 'after you get crawled' (maybe).

Google may or may not put you in the sandbox after you get crawled, if they do it may take up to a year before you get any serious traffic, this mainly happens to new domains so you might not qualify for the sandbox. Either way the trick is to get crawled first.


 9:36 pm on Apr 8, 2005 (gmt 0)


Thanks for your help I am going to play around with the If-Modified-Since header, and hope that the next visit from googlebot goes a little deeper. I do have a sitemap with nice links with no query string args or anything like that. It is only in my message board and photo gallery that query strings are passed to the pages.

I am going to sticky you the URL of the site in question, and the niche pages I get top results for in MSN. Anyway, if you want to check it out I welcome any more input you may have. This domain name is about 15 months old.

Again I really appreciate the time you have given me regarding this.


 1:32 am on Apr 9, 2005 (gmt 0)

That's a pretty cool site, I don't see any reason it shouldn't rank well (and you are already successful in Yahoo and MSN).
I think your only problem is getting crawled by googlebot in the first place.
Did you find any problems that you had to fix since starting this thread or was everything ok already?


 9:40 am on Apr 9, 2005 (gmt 0)

ok Demaestro I did a little digging.

If-Modified-Since is not working.
page header:

Response: 200 OK
Last Modified: No data returned
Content Type: text/html; charset=ISO-8859-1
Last Cached (Google): 28 Mar 2004 00:57:26 GMT

The cached page is a placeholder page.

Get the IF-MODIFIED-SINCE working and it should work.
Or the 'last modified' is not returning data for some other reason.
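
A check like the one above can be approximated with Python's standard http.client. This is a sketch, not the tool used in this thread: the function names are made up, and fetch_headers is shown but not run here because it needs a live host.

```python
import http.client


def fetch_headers(host, path="/", user_agent="header-check-sketch"):
    """Send a HEAD request and return (status, headers). Needs network access."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        return response.status, dict(response.getheaders())
    finally:
        conn.close()


def header_report(status, headers):
    """Summarise the headers a crawler would care about; missing ones come back as None."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "status": status,
        "last_modified": lowered.get("last-modified"),
        "content_type": lowered.get("content-type"),
    }


# The headers reported in this thread: a 200 with no Last-Modified at all.
report = header_report(200, {"Content-Type": "text/html; charset=ISO-8859-1"})
```

To run it against a real site you would call header_report(*fetch_headers("www.example.com")), with your own host name in place of the placeholder.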


 7:51 pm on Apr 9, 2005 (gmt 0)

Thanks for the compliment. If you play any XBL you should add my gamertag. Anyway, the change I made since this thread started was to change my robots.txt to not be blank and to have a blanket allow statement.

For the header code I have to get my systems guy to do that. Apache config files scare me. I am a code geek, not really a great sys guy. I have one at my disposal. I mentioned I wanted to try it out and after a small eye roll he said ok.

I think it has been over 30 days since my last googlebot hit, I will make another post to report on what happens when it hits me again. Hopefully I will get that header change before the hit is made.

Again thanks Reid


 9:34 pm on Apr 9, 2005 (gmt 0)

Yeah no problem

We should do another test after you get the last modified thing fixed.

The googlebot simulator I was using on your web page was returning a 500 server error.

I suspect that is because it is not returning last modified data (or something related to that).

After you get that fixed I should try to access it again with that sim just to make sure it works.
Unless you want to do it. Google "poodle predictor" for the sim.
One more thing you want to ask your sys guy to check:
make sure your server is not blocking user-agents with no referrer. This would block googlebot.


 12:30 am on Apr 11, 2005 (gmt 0)

Hi Reid,

I'm confused about the IF-MODIFIED-SINCE command/code. I'm a newbie (so I'm easily confused by all this) but is the IF-MODIFIED-SINCE added to your webpages and if so, where exactly - in the header area? If it's something you need to add/modify in your cPanel, what section would I find it in?



 8:13 am on Apr 11, 2005 (gmt 0)

EandA - it's not something you put in a web page.

When a user-agent requests a file from the web-server, the server has a low level data exchange with the user-agent.


user-agent (internet explorer): request certain file
held in cache(history) date:XXXXXX IF-MODIFIED-SINCE my cache date.

date on my file is earlier than cache date-
304 the resource you have is current

date on my file is later than cache date
200 page found

This saves bandwidth because the browser doesn't keep reloading files it already has.
Googlebot also uses this to update its cache.

So the answer is no - you can't affect this through cPanel - the technician running the server does this.
If you are having trouble getting googlebot to come or it won't update your pages you can use a 'server header checker' to make sure this feature is enabled.
If it's not enabled you can ask your host to turn it on.
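
The exchange described above can be sketched in a few lines of Python. This is a simplified model of the server's decision only, under the assumption that the server knows the resource's last-modified time; it is not how Apache or any particular server implements it, and the function name is made up.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still current, else 200.

    last_modified: timezone-aware datetime of the resource on the server.
    if_modified_since: the raw header value sent by the client, if any.
    """
    if if_modified_since is None:
        return 200  # no cached copy mentioned, send the full page
    try:
        cached = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unreadable date, fall back to a full response
    if cached.tzinfo is None:
        cached = cached.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution.
    if last_modified.replace(microsecond=0) <= cached:
        return 304
    return 200


file_date = datetime(2005, 4, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh_cache = conditional_status(file_date, "Sat, 09 Apr 2005 09:40:00 GMT")
stale_cache = conditional_status(file_date, "Mon, 28 Mar 2005 00:57:26 GMT")
```

A client whose cache date is later than the file date gets the 304, and one whose cache is older gets the full 200 response. A server that never sends a last-modified date has nothing to compare against and can only ever answer 200.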


 5:04 pm on Apr 11, 2005 (gmt 0)


Well, it seems that in order for Apache to get/use the HTTP_LAST_MODIFICATION_DATE the files must be executable. The problem is that with the Zope platform I am using, this is not something you want to do to all your files, unless you feel like being exploited/hacked. This is not an Apache problem, this is a Zope problem with the way it is serving up dynamic content. If you know how this model works it will make sense why the files must all be executable to retrieve this data.

So this is for anyone using Zope who wishes to have this header. Someone added a method as a workaround that will return the last mod date, by calling:


Using a meta tag for last modified in your header, and populating it with the bobobase call you may dynamically add in the last mod time to all your zope pages.

Reid, do you know if Google will 'like' a metatag to get this info, or is it expecting the server to return it in the header? From the info I have found it doesn't look like indexers will. However, the thing I am getting this info from is old, and with all the changes SEs make maybe they will look to the metatags now to get this info. Although as I am typing this, it would seem like a waste because the user agent would have to request the whole page to grab any metatags. This would not help with saving bandwidth and whatnot.


 6:10 pm on Apr 11, 2005 (gmt 0)

Well it seems in order for Apache to get/use the HTTP_LAST_MODIFICATION_DATE the files must be executable

I don't understand that - I know nothing about zope.

That is why googlebot cannot access your site.

You can ask around WW forum for a workaround

check with the pages, discussion boards dedicated to zope

if all else fails send a letter to google support and ask if they have a solution for zope.

somebody must have had this problem and figured it out or else zope ain't much of a web platform if it can't be crawled by google.


 7:11 pm on Apr 11, 2005 (gmt 0)

Well the reason behind it is Zope by default will only assign that header to objects of type file or image.

The reason is that most other Zope pages are dynamically generated, and assigning this header would only serve to have a page cached when there is new content to view. Despite the fact that no one has actually gone in and made a change to that page, its content may be new.

I have been doing lots of looking and there are several workarounds. I won't get into them too much because the solutions vary depending on what Zope portal types you are working with. If you want more info there is a lot to look at with a simple site search of 'last modified header' on Zope.org.

I will play around and see what works best. The problem is Zope is almost too dynamic for its own good in this case. Even in a CMS framework, which I use, dynamic footers and headers, along with dynamic side widgets that live in the skin, are so easy to add that almost all my sites contain them. How do I get around not having these cached along with the main content? I already deal with about 10 calls a month from clients who say they have made a DB change and it is not showing up in their dynamic widget/page, only for me to tell them they have to force the browser to get a new version of the page because their browser has cached it, i.e. 'Shift + refresh' or 'Control + F5'.

Not sure what I want to do yet or if there is an answer that solves both my issues.

Any input would be valued.


 11:12 pm on Apr 11, 2005 (gmt 0)

ok Demaestro
well at least you have the problem identified.
let me know how it works out in the end.
That poodle predictor is a good test, you notice now it returns a 500 on your site.
If that can crawl your site then googlebot can. It will also show you how googlebot will see it.

If you go to some of the server threads on WW you will find some guys that know their stuff if you need any help.


 8:35 am on Apr 12, 2005 (gmt 0)

Hmm- for some reason I can't seem to post a new topic- but this one is also about indexing- let me ask here:

I have a new CMS under development, and the programmer is using .vc pages for topics and author pages.

Do the bots index these as well as they do html?

Anyone using them?

If they're going to interfere with SEO, forget it!



 3:19 pm on Apr 12, 2005 (gmt 0)

Yes I think we have found the problem. Now I get to come up with a solution. Yay

I will post what I come up with here. Reid thanks for all your input, it really is appreciated.

I was up thinking about it last night and I was wondering: if I set a last modified date to a time in the future, would that upset googlebot? It occurs to me that I could dynamically set this header to always be 11:59:59PM. That way it would always expect new content from pages that really haven't been modified, but whose content is new.

Anyone know if this will 'upset' indexers? I am thinking because of all the time zones that a site can live in, that setting a time in the future for last modified would be ok. Thoughts?


 7:43 am on Apr 13, 2005 (gmt 0)

bbcarter - sorry I know nothing about .vc

demaestro - if you did that, at best googlebot would come asking for IF_MODIFIED_SINCE a future date.

also if you tricked googlebot into always re-caching the page you would be here asking "how do I get googlebot to stop eating up my bandwidth"?

that little sucker is relentless once it gets in.

c'mon, you can't tell me that this web-server CAN'T return a last modified date.

you should approach google support with this one - see what they say. They can be very helpful.


 3:10 pm on Apr 13, 2005 (gmt 0)

Well you are right I can't tell you that the server can't do it. The server can return a last_modified date. The trick is getting it to know when to alter the last modified date when the content has changed through DB changes and not because someone has come to modify the file itself.
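
One possible way to approach that, sketched in Python: derive the page's last-modified time from the newest thing that feeds into it (the template plus the database rows it displays) rather than from the file alone. The function names and timestamps are hypothetical, and this is not a Zope API; in practice the timestamps would come from your own forum and news tables.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def page_last_modified(template_changed, item_timestamps):
    """Newest of the template's own change time and the content it displays."""
    newest = max([template_changed, *item_timestamps])
    return newest.replace(microsecond=0)  # HTTP dates have one-second resolution


def last_modified_header(template_changed, item_timestamps):
    """Format the value for a Last-Modified header (datetimes must be UTC-aware)."""
    return format_datetime(
        page_last_modified(template_changed, item_timestamps), usegmt=True
    )


# Hypothetical data: the template is weeks old, but a forum post arrived recently.
template_changed = datetime(2005, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
recent_items = [
    datetime(2005, 4, 11, 17, 4, 0, tzinfo=timezone.utc),   # news article
    datetime(2005, 4, 13, 15, 10, 0, tzinfo=timezone.utc),  # forum post
]
header_value = last_modified_header(template_changed, recent_items)
```

With this approach the date only moves forward when something on the page really changes, so there is no need to fake a date in the future.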

I think that I am just going to try to get Google 'in'. Once I get it indexing me I will maybe modify some things to prevent the bandwidth issue.

No matter what, I am going to need a robust solution because the CMS framework I am using for the site in question we also license to clients for their sites. Almost all of them had existing domains and have been indexed by Google for years. The clients who use this for a new domain though are going to run into the same problem I have now, and in fact one has mentioned that after 6 months Google still hasn't paid him a visit and he, like me, is seeing lots from MSN and Yahoo.


 7:42 pm on Apr 13, 2005 (gmt 0)

what if you just re-upload your site every month?
That way google will get a new modified date and crawl the new dynamic content. This won't change anything except the file dates.

If you search site:w*w.yoursite in google you will see only one page. A placeholder from when you first got the IP. It has no links on it.
If you can even upload that one file to give it a new date then googlebot might recrawl that page and get links and have no problem after that.

take a look at that page cached by google - hover your mouse over it and look at the dynamic url. That is the page google is looking for. W*w.yoursite.com&e=10141


 8:52 pm on Apr 13, 2005 (gmt 0)

I don't see the '&e=10141' part of the URL when I hover my mouse over. Regardless, with the way my platform works, it isn't as simple as re-uploading files. Although I will point out that my current front page is the same file as the one you are viewing in the Google results, I just modified its content to be what it is today.

I will have something mocked up by this weekend as far as getting some value being returned in the last_modified header, and I will use that 'poodle predictor' (great tool BTW, thanks for the tip) to see what I get returned. If bandwidth becomes a problem with Google at some point I will deal with that then. For now I would love to see Google trash my server if it means getting indexed. The T1 will survive.

I have had a visit from 'Mediapartners-Google/2.1' although same behavior. 2 hits to the root, and that was it. Still no googlebot this month.

What I find interesting is that there is usually a link with Google's results that says 'cached'. It shows you the cached version of the page along with some disclaimer header about how Google is not the author of the content and it could be old, blah blah, click here to see the live version. No such link appears with the result it is showing for my site. Is it just an issue of this cache being so 'stale' that it doesn't even want to provide the cached link with my result?


 11:13 pm on Apr 13, 2005 (gmt 0)

I'm using IE 6 so when I hover the mouse I see the url in the bar at the bottom of the screen.

That's interesting, when I first saw it on April 9 the cached page was there. it was a pic of a typical placeholder page. Parked domain type page.

