Does Google spider .cgi pages?

No spidering of .cgi pages

         

bytesize

5:42 pm on Oct 28, 2002 (gmt 0)

10+ Year Member



Hello all!

I seem to be getting daily visits from the Google freshbot to my website's portal page, which is static HTML, but it never attempts to retrieve the entry link, which points to a CGI script.

Does Google ignore any URL that points to a page such as www.domain.com/script.cgi? <url snipped>

Thanks in advance for any advice!

Regards,

John

[edited by: NFFC at 6:23 pm (utc) on Oct. 28, 2002]
[edit reason] No site reviews as per TOS [/edit]

frontpage

2:19 am on Oct 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello bytesize

Can't answer for all websites...but my site has been listed in Google forever. It is a mix of php and html. Googlebot devours it all except one section that ends with www.domain.com/cgi-bin/example.cgi.

This dynamic page gets thousands of hits a day but it is not in google. It is in dmoz directory but not google directory. Googlebot never calls on it despite it being linked to my index page for years.

Many other search engines and spam bots have it listed such as MSN, Askjeeves, etc. So...in summation...I don't know why google has a problem with .cgi when it indexes my php that ends with .php?cat=blah...

andreasfriedrich

10:37 pm on Oct 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld [webmasterworld.com], bytesize.

I believe Google does index documents with cgi in their URI as this search [google.com] shows.

However, there is really no need to let anybody know that you are running server-side scripts via CGI. There are a number of techniques for making your server treat certain filenames or directories as CGI scripts. Parameters can be passed as path information in the URL instead of in the query string.
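As a rough illustration of the path-information idea, here is a minimal Python sketch. It assumes the script is reached via a URL like www.domain.com/script/cat/blah/page/2 and that the server puts everything after the script name into the standard CGI variable PATH_INFO. The function names and the key/value URL layout are my own invention for the example, not anything a particular server requires.

```python
def params_from_path_info(path_info):
    """Turn '/cat/blah/page/2' into {'cat': 'blah', 'page': '2'}."""
    # Drop empty segments produced by leading, trailing or doubled slashes.
    parts = [p for p in path_info.split("/") if p]
    # Pair segments up as key/value; a trailing unpaired segment is ignored.
    return dict(zip(parts[0::2], parts[1::2]))


def handle_request(environ):
    """Build a tiny CGI-style response from an environment mapping.

    In a real CGI script you would pass os.environ here.
    """
    params = params_from_path_info(environ.get("PATH_INFO", ""))
    body = ", ".join(f"{k}={v}" for k, v in params.items())
    return "Content-Type: text/plain\r\n\r\n" + body
```

The URL then carries no "?" and no ".cgi", so nothing in it gives away that a script is behind it.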

Hope this helps.

Andreas

bytesize

10:11 am on Oct 30, 2002 (gmt 0)

10+ Year Member



Thanks to you both for the replies!

I found another thread which covers this topic and tried a few tricks from that. My problem is that the server is a shared IIS with no admin access (yeuck!) so any fancy rewriting with Apache is out of the question; a lesson learnt for next time!

My apologies for posting my site url; guess I need to read the TOS next time ;)

Regards,

John

GoogleGuy

5:28 pm on Oct 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We'll crawl it all. We love cgi, asp, jsp, php, cfm, swf, ummm...

(help me out here folks; what are some other file extensions?)

Let's see. html/htm/txt of course. wml? Yup, we crawl wireless markup language too, although that's for our wireless search. Then of course there's doc, xls, ppt, ps, ps.gz, pdf, wp (WordPerfect), wri (Write), tex, mdb (Access)...

Okay, I'm running out of file extensions I can think of. Maybe it would be easier to make a list of filetypes that we don't crawl? :)

bobriggs

5:31 pm on Oct 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Didn't see .pl in there, GG.

ciml

8:19 pm on Oct 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> list of filetypes that we don't crawl?

I would have thought that any file extension would be OK if the content was sent as text/html or another content-type that's known to Google.
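To make that concrete, here is a small Python sketch of deciding by media type rather than by file extension. The list of "known" types is purely an assumption for illustration; I have no idea what Google's real list looks like.

```python
# Hypothetical set of media types a crawler might be willing to index.
INDEXABLE_TYPES = {"text/html", "text/plain", "application/pdf"}


def media_type(content_type_header):
    """Return the bare media type from a Content-Type header value."""
    # Strip parameters such as '; charset=ISO-8859-1' and normalise case.
    return content_type_header.split(";", 1)[0].strip().lower()


def looks_indexable(content_type_header):
    """True if the response's media type is in the assumed indexable set."""
    return media_type(content_type_header) in INDEXABLE_TYPES
```

Under this scheme a URL ending in .calum served as text/html would pass, while anything served as text/css would not, whatever its extension.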

I've just looked for a well known URL ending in .rob and it doesn't seem to be in Google. Maybe a fluke, I'll look further. If this means I can't put up pages called example.com/something.calum and get them crawled then I'll be quite upset:).

nell

8:34 pm on Oct 30, 2002 (gmt 0)

10+ Year Member



.js

jatar_k

8:36 pm on Oct 30, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I think .js and .css would definitely go into the not crawled.

duckhunter

8:49 pm on Oct 30, 2002 (gmt 0)

10+ Year Member



I haven't had much luck with Googlebot crawling pages that use CGI query strings (e.g. mypage.asp?param1=abc). It seems to drop everything after the .asp and hit the page without the query string.

ciml

9:21 pm on Oct 30, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jatar_k:
> I think .js and .css would definitely go into the not crawled

I think that the application/x-javascript and text/css media types should not be indexed.

I know, sometimes an engine has to be imperfect in order to deal with an imperfect Web.

</hobbyhorse>

jatar_k

10:47 pm on Oct 30, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I agree totally ciml, I was suggesting differently. They are the only ones I can really think of that aren't crawled.

duckhunter, there are sometimes glitches but for the most part googlebot crawls query strings exceptionally well. We just have to keep them to the shortest version possible and not go overboard.
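One way to keep query strings short is to strip parameters that carry no value before you write the link out. A minimal Python sketch using the standard urllib.parse module; the function name and the example parameters are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def trim_query(url):
    """Return the URL with blank-valued query parameters removed."""
    parts = urlsplit(url)
    # parse_qsl skips parameters with empty values by default
    # (keep_blank_values=False), which is exactly the trimming we want.
    pairs = parse_qsl(parts.query)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )
```

For example, mypage.asp?param1=abc&sid=&ref= would be linked as mypage.asp?param1=abc.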

IanKelley

11:21 am on Nov 21, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From experience (on various websites)

GoogleBot does not like .cgi.

.cgi pages (even root ones without query strings) seem to receive a lower pagerank than a html page at a similar depth with a similar number of links pointing to it.

If this is no longer the case then it is a recent change. I've watched Google avoid CGI (not completely, just seems to like it a lot less than anything else) for over 2 years.

Grumpus

12:19 pm on Nov 21, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've developed sites that get CGI scripts crawled fine. It won't go into a /cgi-bin/ directory, but if you rename it, it's fine. (It also won't cross into a secure socket, so if you want your shopping cart indexed, you'll have to make it browsable on the non-secure side).

Folks are right, though - in the past CGI pages, like any other dynamic pages, just didn't get the PR of other pages. That's a thing of the past, though. Google goes through just fine.

I believe that if people are still having problems getting CGI pages crawled, it's more likely the navigation layout that is the problem. (Google doesn't seem to like "dead ends" or "pockets"; it wants to be able to sweep through and come out the other side with lots of new things to look at.) When people put up their CGI pages, they tend to give them completely different navigation controls from the rest of their site. The main site will link to all the key areas of the site on just about every page, while the CGI pages often link back only to the homepage, thus creating a "pocket".

Fix the navigation inside the CGI pages so that it is set up just like the rest of the site, get it out of the /cgi-bin/, and get a link to it from many/most of the rest of the pages on your site and it'll get crawled in a month or two.
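If you want to check your own site for pockets, something like the following Python sketch would do it. It assumes you already have a map of which page links to which (how you build that map is up to you), and the definition of a "pocket" here is just my reading of the above: a page that links nowhere, or only back to the homepage.

```python
def find_pockets(links, home):
    """Return pages whose only outgoing links lead back to the homepage.

    links: dict mapping each page URL to the set of URLs it links to.
    home:  the homepage URL.
    """
    pockets = []
    for page, targets in links.items():
        if page == home:
            continue
        # Ignore self-links; a pocket has no targets other than the homepage.
        if set(targets) - {page} <= {home}:
            pockets.append(page)
    return sorted(pockets)
```

Any page this returns is a candidate for getting the same navigation as the rest of the site.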

G.

Kerrin

1:47 pm on Nov 21, 2002 (gmt 0)

10+ Year Member



> I think that the application/x-javascript and text/css media types should not be indexed

Totally agree with you there. Neither should be indexed, but I've always thought Google should crawl .css files to determine whether tags are being altered in a spammy way.