Googlebot, dynamic pages and SSI...

Does Google index *.php, *.asp, *.shtml pages?

         

gutabo

8:54 pm on Dec 31, 2002 (gmt 0)

10+ Year Member



Does Googlebot crawl/index/recognize/like/dislike *.php, *.asp, *.shtml pages?
Thanks in advance!

jeremy goodrich

8:56 pm on Dec 31, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google will index / rank / etc. pages in just about any dynamic format there is - just keep the # of variables to a minimum, and don't use session IDs, so Googlebot can crawl the whole site.

cabbagehead

10:46 pm on Dec 31, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ha! That's not true at all!

In my experience Google will not crawl the MAJORITY of dynamic content. They will crawl specific common file types such as asp, jsp, etc. ... but (and here's the real kicker) they do not appear to crawl any Java servlets or CGI files. Or any lesser-known or proprietary file types. So even pages built on common frameworks such as Struts for Java are not indexed.

This is a completely lacking aspect of Google indexing in my opinion and a pet peeve of mine. >:(

jeremy goodrich

11:42 pm on Dec 31, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try the 'advanced search' on google. :) You'll see that, yes indeedy, they do index files ending in .cgi.

And, if you try '.jsp' you'll see that they index those just fine, too.

Perhaps next time, you should try searching - see if you are right - and then post before pouncing in a negative fashion - waiting to tell the next bloke he's not right.

Cheers, & happy new year.

by the way, the advanced search on google is here [google.com]

cabbagehead

8:39 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ahem ... Jeremy ... I didn't say that they don't support JSP files (read the original post) ... I said SERVLETS. Do you know the difference?

As for CGI files and SERVLETS not being indexed, I've been watching the Google toolbar for months and I have yet to see a single example of a cgi page or servlet being indexed ... always the gray bar. Regardless of what the Google docs say, this appears very consistent. If it is just coincidence, it is a strong one ... my sample size from a couple of months is quite high and thus far it's been 100% accurate.

Dreamquick

9:27 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As I understand it (well, I just read what the Sun site says about them :) so forgive any mistakes), servlets are essentially server-side applets. If your servlet just spits out HTML then there is no reason why Google shouldn't be able to index the content it produces (aside from it having a dislike of servlets).

However, every time I see a servlet in use (i.e. referenced in the URL) it always seems to have an *evil* amount of data passed to it through the querystring - normally some sort of sessionid plus whatever extras its designers intended. Google, like most modern SEs, doesn't *like* URLs which use multiple large querystring parameters, so this might be the stumbling block.

The only other stumbling block would be if your servlet required the user to deal with either a session cookie or a sessionid in the url in order to access subsequent "pages" within a servlet.

My advice is that if you think the content is indexable but aren't getting indexed, then do like everyone in the same situation does - create URLs the SEs will like (using mod_rewrite, custom 404 etc)...
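To make the "create URLs the SEs will like" idea concrete, here is a minimal sketch in Python of what a rewrite rule does: a short, parameter-free path is mapped internally onto the real querystring URL. The `/product/...` layout, the `/servlet/Product` target and the `cat` / `id` parameter names are all made up for illustration; a real site would express the same rule in mod_rewrite or in its own front controller.

```python
import re

# Hypothetical rule: the public URL /product/<category>/<id> is served
# internally by /servlet/Product?cat=<category>&id=<id>.
FRIENDLY = re.compile(r"^/product/([a-z0-9-]+)/(\d+)$")

def rewrite(path):
    """Return the internal URL for a friendly path, or the path unchanged."""
    match = FRIENDLY.match(path)
    if match is None:
        return path  # not a friendly URL, leave it alone
    category, item_id = match.groups()
    return f"/servlet/Product?cat={category}&id={item_id}"
```

The crawler only ever sees `/product/books/42`; the querystring exists only on the server side.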

- Tony

[edited by: Dreamquick at 9:35 pm (utc) on Jan. 1, 2003]

bcc1234

9:31 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



they do not appear to crawl any java Servlets or CGI files

Search for "/servlet/" (with quotes).
80% of the results are servlets dispatched through invoker.

GoogleGuy

9:32 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



General rule of thumb is that Googlebot is willing to ingest just about anything. The corollary is to keep the number of parameters small and to keep those parameters short (no session IDs, for example).
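As a rough illustration of that rule of thumb, the Python sketch below drops session-style parameters from a URL and counts what is left. The parameter names in `SESSION_KEYS` are assumptions picked for the example, not a list Google has published, and nothing here says how many parameters is "too many".

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names for session parameters; adjust to whatever your site uses.
SESSION_KEYS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def strip_session_params(url):
    """Return the URL with any session-style querystring parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SESSION_KEYS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def query_param_count(url):
    """Count the querystring parameters in a URL."""
    return len(parse_qsl(urlsplit(url).query, keep_blank_values=True))
```

Running your internal links through something like this shows which URLs would still be long after the session ID is gone.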

cabbagehead

10:11 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, "applet" is more contextual than anything. Yes, a servlet is basically a server-side application. It can be invoked by an HTTP request when the servlet is mapped to a specific URL, and most often people just use keywords without file extensions for that mapping (e.g. www.xyz.com/abc). A JSP is a convention Sun tacked on afterwards to make template-based dynamic webpages easier to create. You write one as a *.jsp file and save it in the directory that you want the user to navigate to when the page is to be called. The JSP is then churned into a servlet on first request and the app server takes care of the mapping behind the scenes.

cabbagehead

10:15 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting ... I searched for "/servlet" as suggested and you are right! There are a couple of apparent servlets in this list! I'm shocked. Of course, only 2 of the 10 on the first page are that ... but nonetheless.

Well, I'm frankly puzzled then. I first noticed this because on my site all my pages were getting indexed *EXCEPT* the files with a .mdlx extension. .mdlx is the file extension for my proprietary framework, which I wrote in Java. It maps an XSLT template to a servlet per "page". ... Anyway, none of those were getting indexed ... so I started looking around to find what was being indexed, and so help me, not a single cgi file or servlet that I saw was getting indexed, and so I presumed Google only indexed dynamic files if they were of the types on their finite list. Seemed to make sense and it sure looked consistent ... until I saw those two servlets in the list just now. How interesting ... and puzzling.

cabbagehead

10:18 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You know what ... I wonder if it has to do with certain header info. Perhaps by default CGI and servlets aren't emitting certain header info that Google requires for indexing, whereas other more template-based technologies such as asp and jsp take care of this for the user. Perhaps that's the missing link.

Does anyone know what header info Google looks for? I presume it needs Content-Type ... does it need Content-Length? Others?
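One way to test that hunch without knowing what Google actually requires is to compare the headers of a page that does get indexed against one that doesn't. The Python sketch below is only that: a comparison helper. `fetch_headers` sends a plain HEAD request (some servers answer HEAD differently from GET, so treat the result as a hint), and both function names are invented for this example.

```python
from urllib.request import Request, urlopen

def fetch_headers(url):
    """Send a HEAD request and return the response headers, names lower-cased."""
    with urlopen(Request(url, method="HEAD")) as response:
        return {name.lower(): value for name, value in response.headers.items()}

def missing_headers(reference, candidate):
    """List header names present in `reference` but absent from `candidate`.

    The comparison ignores case, since HTTP header names are case-insensitive.
    """
    present = {name.lower() for name in candidate}
    return sorted(name.lower() for name in reference
                  if name.lower() not in present)
```

For example, `missing_headers(fetch_headers(jsp_url), fetch_headers(servlet_url))` would list whatever the JSP page sends that the servlet page does not.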

Thanks!

OffTheRadar

10:23 pm on Jan 1, 2003 (gmt 0)

10+ Year Member



I've been watching the google toolbar for months and I have yet to see a single example of a cgi page or servlet being indexed ... always the gray bar.

Cabbagehead, I think your problem may be that you are using the toolbar to determine if a page is indexed. I have found that there are just some file types that the toolbar always displays a grey bar for.

For example, my entire site was developed with Microsoft's ASP.NET technology and almost all of my pages end in .aspx. My site ranks very well in Google but the toolbar displays a grey bar for every .aspx page within my site.

cabbagehead

11:17 pm on Jan 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was wondering about aspx.

How do those individual aspx pages place in Google? Are they actually showing up and ranking reasonably well?

OffTheRadar

11:31 pm on Jan 1, 2003 (gmt 0)

10+ Year Member



I am extremely happy with my rankings. I am top 5 for most of the keywords that I am targeting and all of those pages are .aspx.

Based on what I have seen, I don't believe that Google prefers any one extension over another. I have, however, seen it have problems with pages that depend on a querystring.

jeremy goodrich

12:11 am on Jan 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hey cabbagehead, now that GoogleGuy has jumped in and confirmed what I said, remember: read before posting that somebody else is wrong :)

Cheers. And - the other folks here gave a good tip about the toolbar - it's a useful tool - you just need to know how to use the thing.

brotherhood of LAN

1:46 am on Jan 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



/delete this post, it was already said :P

gutabo

2:38 pm on Jan 2, 2003 (gmt 0)

10+ Year Member



WOW. Thanks, guys! (and happy new year :p )
And what about SSI?

jamesyap

3:04 pm on Jan 2, 2003 (gmt 0)

10+ Year Member



My site is all .shtml files and every one of them gets crawled; it ranks 3rd for one of the most popular search terms.