In my experience Google will not crawl the MAJORITY of dynamic content. They will crawl specific common file types such as asp, jsp, etc., but (and here's the real kicker) they do not appear to crawl any Java servlets or CGI files, or any lesser-known or proprietary file types. So even pages built on common frameworks such as Struts for Java are not indexed.
This is a completely lacking aspect of Google indexing in my opinion and a pet peeve of mine. >:(
And, if you try '.jsp' you'll see that they index those just fine, too.
Perhaps next time, you should try searching - see if you are right - and then post before pouncing in a negative fashion - waiting to tell the next bloke he's not right.
Cheers, & happy new year.
By the way, the advanced search on Google is here [google.com]
As for CGI files and servlets not being indexed, I've been watching the Google toolbar for months and I have yet to see a single example of a CGI page or servlet being indexed ... always the gray bar. Regardless of what the Google docs say, this appears very consistent. If it is just coincidence, it is a strong one ... my sample size from a couple of months is quite high, and thus far it's been 100% accurate.
However, every time I see a servlet in use (i.e. referenced in the URL) it always seems to have an *evil* amount of data passed to it through the querystring - normally some sort of sessionid plus whatever extras its designers intended. Google, like most modern SEs, doesn't *like* URLs which use multiple large querystring parameters, so this might be the stumbling block.
The only other stumbling block would be if your servlet required the user to deal with either a session cookie or a sessionid in the url in order to access subsequent "pages" within a servlet.
My advice is that if you think the content is indexable but aren't getting indexed, then do like everyone in the same situation does - create URLs the SEs will like (using mod_rewrite, custom 404 etc)...
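To make that concrete, here's a rough Python sketch of the kind of URL I mean. It only shows the mapping from a querystring URL to a path-style one, dropping anything that looks like a session id. The parameter names in SESSION_PARAMS and the example URL are made up for illustration, and you'd still need a rewrite rule on the server (mod_rewrite or similar) to map the friendly path back to the real servlet.

```python
from urllib.parse import urlsplit, parse_qsl, quote

# Hypothetical session parameter names; adjust for whatever your app uses.
SESSION_PARAMS = {"jsessionid", "sessionid", "sid"}

def friendly_path(url):
    """Rewrite a querystring URL as a path with no '?', skipping session ids."""
    parts = urlsplit(url)
    segments = [parts.path.rstrip("/")]
    for key, value in parse_qsl(parts.query):
        if key.lower() in SESSION_PARAMS:
            continue  # session ids make every crawl see a "different" URL
        # Encode each piece so it is safe to use as a single path segment.
        segments.append(quote(key, safe=""))
        segments.append(quote(value, safe=""))
    return "/".join(segments)

print(friendly_path("/servlet/catalog?cat=books&id=42&sid=abc123"))
```

The idea is just that the link you publish carries no querystring and no session id; how the server unpacks it again is up to your own setup.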
- Tony
[edited by: Dreamquick at 9:35 pm (utc) on Jan. 1, 2003]
Well, I'm frankly puzzled then. I first noticed this because on my site all my pages were getting indexed *EXCEPT* the files with a .mdlx extension. .mdlx is the file extension for my proprietary framework, which I wrote in Java. It maps an XSLT template to a servlet per "page". Anyway, none of those were getting indexed, so I started looking around to find what was being indexed, and so help me, not a single CGI file or servlet that I saw was getting indexed. So I presumed Google only indexed dynamic files if they were of the types on some finite list that they would index. It seemed to make sense and it sure looked consistent ... until I saw those two servlets in the list just now. How interesting ... and puzzling.
Does anyone know what header info Google looks for? I presume it needs Content-Type ... does it need Content-Length? Others?
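In the meantime, this is roughly how I'd check what my own server actually sends for one of these pages. It's only a sketch: it says nothing about what Google requires, the User-Agent string is made up, and it assumes the servlet answers HEAD requests (if not, switch the method to GET).

```python
from urllib.request import Request, urlopen

def fetch_headers(url, user_agent="header-check-sketch/0.1"):
    """Send a HEAD request and return the response headers, names lower-cased."""
    req = Request(url, method="HEAD", headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        return {name.lower(): value for name, value in resp.headers.items()}

def summarize(headers):
    """Pick out the two headers asked about; None means the header is absent."""
    return {
        "content-type": headers.get("content-type"),
        "content-length": headers.get("content-length"),
    }

# Usage, with your own page URL in place of the placeholder:
# print(summarize(fetch_headers("http://www.example.com/page.mdlx")))
```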
Thanks!
I've been watching the google toolbar for months and I have yet to see a single example of a cgi page or servlet being indexed ... always the gray bar.
Cabbagehead, I think your problem may be that you are using the toolbar to determine if a page is indexed. I have found that there are just some file types that the toolbar always displays a grey bar for.
For example, my entire site was developed with Microsoft's ASP.NET technology and almost all of my pages end in .aspx. My site ranks very well in Google but the toolbar displays a grey bar for every .aspx page within my site.
Based on what I have seen, I don't believe, however, that Google prefers any one extension over another. I have seen it have problems though with pages that depend on the use of a querystring.