Forum Moderators: open
I work with a company that has developed a Content Management System (CMS) that uses dynamic URLs and a database to build a web site. A typical URL generated by our software is: default.asp?mn=23.32.85.12 and this URL points to the same page every time unless the user updates the content. Also, the CMS builds simple standard links to this content.
From all my reading I have determined that Google (and other search engines) are able to handle URLs with a "?" in them. But what about the "." that I've used? If the "." is a no-no, what can I use instead? It takes such a long time to show up in the search engines that testing takes forever. I'd like someone, somewhere, to tell me definitively what various spiders will or will not handle.
These rules seem to be such a mystery, but I don't understand why this question is so hard to get answered. It's not like we are asking for their rules for indexing pages, etc. I just want a rule set for determining whether or not the spider will follow a URL. Can't Google and other search engines post this information on their sites so we can build URLs that can be followed safely?
A content management system by any other name is still a cloaking utility. If they post info such as which URLs and styles they will follow, several thousand sites will crank up the doorway-page generators and lay waste to the G index within a few days.
Here is what we know/assume/ymmv/cobbled together about G's dynamic url behavior.
- shorter is better. Keep the total length of the URL under 120 characters (best to keep it under 40 where possible).
- number of parameters: keep it under 3. URLs with more than 4-5 parameters have a hard time getting indexed.
- don't use Unicode or encoded parameters - keep it to ASCII if possible.
- the page that is generated had best be close to a perfect match if Googlebot downloads it twice within a short time frame.
- leave the session ids back in the high school comp sci class where they belong.
- do nothing based upon cookies or session ids - or at least present the same page of content over and over.
My preference for a CMS (and I'm in the market) would be that it offers an option to write static pages or provides an easy means to rewrite URLs to appear static.
Your single-variable URL isn't too forbidding and should be readily spiderable with good linking, although I'd tend to substitute dashes for the dots if possible. Dots may or may not matter in the query string, but if you rewrite the URL, "23-32-85-12.html" would definitely be better.
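To make the dashes-for-dots suggestion concrete: if the host happens to run Apache with mod_rewrite enabled (an assumption - many Windows/ASP hosts won't have it), a rule along these lines could map the static-looking URL back onto the real dynamic one. The filenames and pattern here are only illustrative:

```apache
# .htaccess -- sketch only; assumes Apache with mod_rewrite enabled.
RewriteEngine On
# Map e.g. /23-32-85-12.html to the CMS's dynamic URL, turning the
# dashes back into the dots that default.asp expects in "mn".
RewriteRule ^([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\.html$ /default.asp?mn=$1.$2.$3.$4 [L]
```

On IIS, a third-party filter such as ISAPI_Rewrite offers similar functionality, though as noted later in the thread that requires server access that shared-hosting clients often don't have.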
I appreciate your reply, however, I really don't think I agree with your comment about CMS = Cloaking Facility or maybe I just don't understand it. Our clients are not trying to hide anything, they are trying to provide their marketing staff with the tools to quickly and easily maintain information on their web site so the information is current and accurate.
I also do not understand why providing content to the Google engine is considered "laying waste to the index". Isn't the engine's ultimate task to index ALL content on the Internet to make searching for information easier? Why should they want to ignore content just because it cannot be consumed via a static HTML page? If "default.asp?mn=23.32.85.12" and "company/management/team.html" return the same information, why should one be any more important to Google (or any other search engine) than the other?
URL rewriting to solve my problem, as rogerd suggests, might be fine if you have control over the hosting environment, but many of our clients run their sites in a purchased hosting environment where they do not have control over the server to install URL filters/rewriters. We have developed a way for our clients to provide a promotional URL to particular content areas (e.g. www.mysite.com/promoURL/ ) even though the directory does not really exist, by utilizing a redirect. I suspect, though, that the Googlebot does not follow redirects.
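For what it's worth, spiders generally do follow redirects, but a permanent (301) redirect is a clearer signal than the temporary (302) that classic ASP's Response.Redirect sends by default. If the promo page is itself an ASP file (an assumption - the target URL below is just an example), it could be sketched like this:

```asp
<%
' promo.asp -- sketch only; sends a permanent (301) redirect
' instead of the 302 that Response.Redirect would produce.
' The Location URL below is illustrative.
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", "/default.asp?mn=23.32.85.12"
Response.End
%>
```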
All I am asking for is a set of rules to follow (and you've given me some, thank you) to help Google index my clients' sites safely without the sites becoming a "spider trap".
Regards, Iguanasan
PS: What does "ymmv" mean?
Googlebot does index deeper than it did a year or two ago, including many more 'dynamic' looking URLs. If you have plenty of PageRank, the "?" characters probably won't cause you a problem.
If you want your dynamic URLs to look static, then it might be time to go talk to your programmers/software vendors and the hosting companies to see how much they want your business, or your clients' business.
I think that Brett's list really is what you're looking for. It is easier to get short, ASCII-only URLs indexed, without many "&" characters (and preferably no "?"), that are stable over time.
YMMV = Your Mileage May Vary
- the page that is generated had best be close to a perfect match if Googlebot downloads it twice within a short time frame.
Now I am worried. All of my pages are dynamically generated because it is very easy for me to update across the entire site.
On my last update, I made some fairly heavy changes, and it took me nearly 15 minutes to do error corrections. When I went to look at my logs, GB had hit the page 3 times, each time just after a change. GB hasn't been back. :(
Is there a way to do includes with HTML files so that I can get away from the dynamic pages where they really are not necessary?
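One option, if your host supports Server Side Includes (an assumption - many hosts enable SSI only for files with a .shtml extension, so check first), is to assemble otherwise-static pages from shared fragments. The filenames below are just examples:

```html
<!-- page.shtml -- sketch only; assumes the host has SSI enabled
     (often only for .shtml files) -->
<html>
<body>
<!--#include virtual="/includes/header.html" -->
<p>Static page content here.</p>
<!--#include virtual="/includes/footer.html" -->
</body>
</html>
```

Classic ASP has its own include directive (<!--#include file="header.asp"-->), but that keeps every page running through the ASP engine; SSI lets you update a header or footer in one place while the URLs stay static-looking.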