I'm optimizing a large, non-profit site with essentially all CGI pages. It's a well-established site, showing a lot of incoming links on all engines, and it ranks well on some very competitive stuff even though it's essentially not optimized. I attribute the ranking to a lot of relevant links coupled with chance page and title content.
When I started to look at the urls, I was completely puzzled, because there basically aren't any page names or extensions as I understand them. For the most part, the question marks in the query strings have been replaced by "/"s followed by the query (though not, I was told, for SEO purposes), and the "urls" look like:
www.domain.org/modperl/go/ca
or:
www.domain.org/cgi-bin/ca/district_profile/286
I'm wondering whether, when I try to create links and page content for optimizing the interior pages, I can keep the pages and their urls in this form, or whether we need to create dupes of our most important pages with conventional extensions... or, whether there's any other trick that might work, if in fact a trick is necessary.
Google clearly can follow the existing links and can see these urls. I'm not sure about the other engines.
I just need to get enough of a handle on how the spiders will see things before making suggestions to the technical people on the site. Obviously, they'd like to leave things alone if they can... it's a big site.
If it would help, I can explain the "url" naming conventions as they were explained to me, but I suspect most of you are already ahead of me on this. It's definitely not what I'm used to.
www.domain.org/cgi-bin/ca/district_profile/286
^-- the "ca" right after /cgi-bin/ is the program, without a .cgi extension
>>No extensions are necessary as you aren't calling a real page but rather sending info to a program.
This is the heart of my question. Is it actually necessary to create the fake pagename with the html extension, as in the "Eliminating allergic symbols" thread? The urls without a filename seem to work just fine to take me to pages from my browser. The question is whether "sending info to a program" includes spiders as programs, i.e., will spiders follow these links too?
I've seen posts on the forums suggesting that the engines might favor html pages over others, and Everyman's post also suggests an html extension is desirable. How true is that these days?
Creating the "fake" pages with extensions would be a lot of work... could conceivably mess up hundreds of external links... and if the pages were created as dupe pages for spidering only, there might be a dupe content problem as well.
Here's the explanation I was sent by the engineer on the site:
The normal way to pass parameters to a cgi program looks something like this:
www.domain.org/cgi-bin/cs_compare?state=ca
where the name of the program is 'cs_compare' and we are passing a variable that has a name of 'state' and a value of 'ca'. The web server takes everything after the question mark as name/value pairs and passes that info into the cgi program. Instead we do something like this:
www.domain1.domain.org/cgi-bin/cs_compare/ca
This looks kind of like a directory, but actually it is a cgi program named 'cs_compare'. The web server knows that it is supposed to execute anything in the cgi-bin directory, so it executes the 'cs_compare' cgi program and ignores the following slash and everything after it. Then the program parses the url itself and extracts the 'ca' and knows that that is supposed to be the state.
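To make the two styles concrete, here is a minimal sketch in Python of how a cgi program could read the state either way. It is not the site's actual code (which is presumably Perl), and the function names are made up for illustration. It assumes the web server follows the usual CGI convention of handing the part after the question mark to the program as QUERY_STRING and the part after the program name as PATH_INFO; a mod_perl setup may expose these differently.

```python
import os
from urllib.parse import parse_qs


def state_from_query(query_string):
    """Conventional style: /cgi-bin/cs_compare?state=ca

    query_string is the text after the '?', e.g. 'state=ca'.
    (Hypothetical helper name, for illustration only.)
    """
    values = parse_qs(query_string).get("state", [])
    return values[0] if values else None


def state_from_path_info(path_info):
    """Path style: /cgi-bin/cs_compare/ca

    path_info is the text after the program name, e.g. '/ca'.
    (Hypothetical helper name, for illustration only.)
    """
    segments = [s for s in path_info.split("/") if s]
    return segments[0] if segments else None


if __name__ == "__main__":
    # Assumption: the server sets these CGI environment variables.
    state = (state_from_path_info(os.environ.get("PATH_INFO", ""))
             or state_from_query(os.environ.get("QUERY_STRING", "")))
    print("Content-Type: text/plain")
    print()
    print(f"state={state}")
```

With this sketch, state_from_query("state=ca") and state_from_path_info("/ca") both give 'ca', which is the sense in which the two url forms carry the same information to the program.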
I should probably add that the actual form of the url described above for 'ca' ends up as:
www.domain.org/modperl/go/ca