I seem to be getting daily visits from the Google freshbot to my website's portal page, which is static HTML, but it never attempts to retrieve the entry link, which points to a CGI script.
Does Google ignore any url which links to a page such as www.domain.com/script.cgi? <url snipped>
Thanks in advance for any advice!
Regards,
John
[edited by: NFFC at 6:23 pm (utc) on Oct. 28, 2002]
[edit reason] No site reviews as per TOS [/edit]
Can't answer for all websites...but my site has been listed in Google forever. It is a mix of PHP and HTML. Googlebot devours it all except one section that ends with www.domain.com/cgi-bin/example.cgi.
This dynamic page gets thousands of hits a day but it is not in google. It is in dmoz directory but not google directory. Googlebot never calls on it despite it being linked to my index page for years.
Many other search engines and spam bots have it listed, such as MSN, Ask Jeeves, etc. So...in summation...I don't know why Google has a problem with .cgi when it indexes my PHP pages that end with .php?cat=blah...
I believe Google does index documents with cgi in their URI as this search [google.com] shows.
However, there is really no need to let anybody know that you are running server-side scripts via CGI. A number of techniques exist to make your server treat certain filenames or directories as CGI scripts. Parameters can be passed as path information in the URL instead of in the query string.
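To make the path-information idea concrete, here is a minimal sketch in Python of a CGI script that reads its parameters from the standard PATH_INFO environment variable rather than from the query string. The function name params_from_path_info and the parameter names "cat" and "page" are made up for illustration, and how a URL gets mapped to the script depends on your server setup.

```python
import html
import os
from urllib.parse import unquote


def params_from_path_info(path_info, names):
    """Map the path segments after the script name to parameter names, in order.

    The names list is a hypothetical convention chosen by the script author.
    """
    segments = [unquote(s) for s in path_info.split("/") if s]
    return dict(zip(names, segments))


def main():
    # CGI servers pass the part of the URL after the script name in PATH_INFO.
    path_info = os.environ.get("PATH_INFO", "")
    params = params_from_path_info(path_info, ["cat", "page"])

    # Minimal CGI response: headers, a blank line, then the body.
    print("Content-Type: text/html")
    print()
    print("<html><body><p>cat=%s</p></body></html>"
          % html.escape(params.get("cat", "")))


if __name__ == "__main__":
    main()
```

With this in place, a request for something like /script/books/2 would be handled as cat=books and page=2, with no "?" in the URL at all.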
Hope this helps.
Andreas
I found another thread which covers this topic and tried a few tricks from that. My problem is that the server is a shared IIS with no admin access (yeuck!) so any fancy rewriting with Apache is out of the question; a lesson learnt for next time!
My apologies for posting my site url; guess I need to read the TOS next time ;)
Regards,
John
(help me out here folks; what are some other file extensions?)
Let's see. html/htm/txt of course. wml? Yup, we crawl wireless markup language too, although for our wireless search. Then of course there's doc, xls, ppt, ps, ps.gz, pdf, wp (wordperfect), wri (write), tex, mdb (Access)..
Okay, I'm running out of file extensions I can think of. Maybe it would be easier to make a list of filetypes that we don't crawl? :)
I would have thought that any file extension would be OK if the content was sent as text/html or another content-type that's known to Google.
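As a rough sketch of what "sent as text/html" means on the server side, here is a small WSGI application in Python that labels every response as text/html whatever extension the requested URL ends in. The name app is arbitrary, and whether a crawler actually follows such URLs is exactly the open question here; the sketch only shows that the content type is set by the server, not by the file extension.

```python
import html


def app(environ, start_response):
    """Serve an HTML page for any path, regardless of its extension."""
    path = environ.get("PATH_INFO", "/")
    body = ("<html><body><p>You asked for %s</p></body></html>"
            % html.escape(path)).encode("utf-8")
    # The Content-Type header, not the URL's extension, tells the client
    # what kind of document this is.
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() will serve it on port 8000; a request for /something.calum then comes back as text/html.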
I've just looked for a well known URL ending in .rob and it doesn't seem to be in Google. Maybe a fluke, I'll look further. If this means I can't put up pages called example.com/something.calum and get them crawled then I'll be quite upset:).
duckhunter, there are sometimes glitches but for the most part googlebot crawls query strings exceptionally well. We just have to keep them to the shortest version possible and not go overboard.
GoogleBot does not like .cgi.
.cgi pages (even root ones without query strings) seem to receive a lower PageRank than an HTML page at a similar depth with a similar number of links pointing to it.
If this is no longer the case then it is a recent change. I've watched Google avoid CGI (not completely, just seems to like it a lot less than anything else) for over 2 years.
Folks are right, though: in the past CGI pages, like any other dynamic pages, just didn't get the PR of other pages. That's a thing of the past. Google goes through them just fine now.
I believe that if people are still having problems getting CGI pages crawled, it's more likely the navigation layout that is the problem. (Google doesn't seem to like "dead ends" or "pockets"; it wants to be able to sweep through and come out the other side with lots of new things to look at.) When people put up their CGI pages, they tend to give them completely different navigation controls from the rest of their site. The main site will link to all the key areas of the site on just about every page, while the CGI pages often link back only to the homepage, thus creating a "pocket".
Fix the navigation inside the CGI pages so that it is set up just like the rest of the site, get it out of the /cgi-bin/, and get a link to it from many/most of the rest of the pages on your site and it'll get crawled in a month or two.
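If you want to check your own site for this, here is a minimal sketch in Python. It assumes you already have a map of each page to the pages it links to (the links dictionary and the URLs in it are made-up examples), and it flags pages that either link nowhere or link only back to the homepage. This is just one simple reading of "pocket", not anything Google has published.

```python
def find_pockets(links, home):
    """Return pages whose only outgoing links (if any) point at the homepage."""
    return sorted(
        page
        for page, targets in links.items()
        if page != home and set(targets) <= {home}
    )


# Hypothetical site map: page -> pages it links to.
links = {
    "/": ["/about.html", "/products.html", "/cgi-bin/example.cgi"],
    "/about.html": ["/", "/products.html"],
    "/products.html": ["/", "/about.html"],
    "/cgi-bin/example.cgi": ["/"],
}

print(find_pockets(links, "/"))
```

Here the CGI page is the only one flagged, because it links back to the homepage and nowhere else.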
G.