Forum Moderators: open
I work with a company that has developed a Content Management System (CMS) that uses dynamic URLs and a database to build a web site. A typical URL generated by our software is: default.asp?mn=23.32.85.12 and this URL points to the same page every time unless the user updates the content. Also, the CMS builds simple standard links to this content.
From all my reading I have determined that Google (and other search engines) are able to handle URLs with a "?" in them. But what about the "." that I've used? If the "." is a no-no, what can I use instead? It takes such a long time to show up in the search engines that testing takes forever. I'd like someone, somewhere, to tell me definitively what various spiders will or will not handle.
These rules seem to be such a mystery, but I don't understand why this question is so hard to get answered. It's not like we are asking for their rules for indexing pages, etc. I just want a rule set for determining whether or not the spider will follow a URL. Can't Google and other search engines post this information on their sites so we can build URLs that can be followed safely?
A content management system by any other name is still a cloaking utility. If they post info such as which URLs and styles they will follow, several thousand sites will crank up the doorway-page generators and lay waste to the G index within a few days.
Here is what we know/assume/ymmv/cobbled together about G's dynamic url behavior.
- shorter is better. Keep the total length of the URL under 120 characters (best to keep it under 40 where possible).
- number of parameters: keep it under 3. URLs with more than 4-5 parameters have a hard time getting indexed.
- don't use Unicode or encoded parameters - keep it to ASCII if possible.
- the page that is generated had best be close to a perfect match if Googlebot downloads it twice within a short time frame.
- leave the session ids back in the high school comp sci class where they belong.
- do nothing based upon cookies or session ids - or at least present the same page of content over and over.
My preference for a CMS (and I'm in the market) would be that it offers an option to write static pages or provides an easy means to rewrite URLs to appear static.
Your single-variable URL isn't too forbidding and should be readily spiderable with good linking, although I'd tend to substitute dashes for the dots if possible. Dots may or may not matter in the query string, but if you rewrite the URL, "23-32-85-12.html" would definitely be better.
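To make the dashes-for-dots suggestion concrete: if the host happens to run Apache with mod_rewrite enabled (an assumption - many Windows/ASP hosts won't have it), a rule along these lines could map the static-looking URL back onto the real dynamic one. The filenames and pattern here are only illustrative:

```apache
# .htaccess -- sketch only; assumes Apache with mod_rewrite enabled.
RewriteEngine On
# Map e.g. /23-32-85-12.html to the CMS's dynamic URL, turning the
# dashes back into the dots that default.asp expects in "mn".
RewriteRule ^([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)\.html$ /default.asp?mn=$1.$2.$3.$4 [L]
```

On IIS, a third-party filter such as ISAPI_Rewrite offers similar functionality, though as noted later in the thread that requires server access that shared-hosting clients often don't have.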
I appreciate your reply, however, I really don't think I agree with your comment about CMS = Cloaking Facility or maybe I just don't understand it. Our clients are not trying to hide anything, they are trying to provide their marketing staff with the tools to quickly and easily maintain information on their web site so the information is current and accurate.
I also do not understand why providing content to the Google engine is considered "laying waste to the index". Isn't the engine's ultimate task to index ALL content on the Internet to make searching for information easier? Why should they want to ignore content just because it cannot be consumed via a static HTML page? If "default.asp?mn=23.32.85.12" and "company/management/team.html" return the same information, why should one be any more important to Google (or any other search engine) than the other?
URL rewriting to solve my problem, as rogerd suggests, might be fine if you have control over the hosting environment, but many of our clients run their sites in a purchased hosting environment where they do not have control over the server to install URL filters/rewriters. We have developed a way for our clients to provide a promotional URL to particular content areas (e.g. www.mysite.com/promoURL/ ) even though the directory does not really exist, by utilizing a redirect. I suspect, though, that the Googlebot does not follow redirects.
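For what it's worth, spiders generally do follow redirects, but a permanent (301) redirect is a clearer signal than the temporary (302) that classic ASP's Response.Redirect sends by default. If the promo page is itself an ASP file (an assumption - the target URL below is just an example), it could be sketched like this:

```asp
<%
' promo.asp -- sketch only; sends a permanent (301) redirect
' instead of the 302 that Response.Redirect would produce.
' The Location URL below is illustrative.
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", "/default.asp?mn=23.32.85.12"
Response.End
%>
```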
All I am asking for is a set of rules to follow (and you've given me some, thank you) to help Google index my clients' sites safely without the sites becoming a "spider trap".
Regards, Iguanasan
PS: What does "ymmv" mean?
Googlebot does index deeper than it did a year or two ago, including many more 'dynamic' looking URLs. If you have plenty of PageRank, the "?" characters probably won't cause you a problem.
If you want your dynamic URLs to look static, then it might be time to go talk to your programmers/software vendors and the hosting companies to see how much they want your business, or your clients' business.
I think that Brett's list really is what you're looking for. It is easier to get short, ASCII-only URLs indexed, without many "&" characters (and preferably no "?"), that are stable over time.
YMMV = Your Mileage May Vary
- the page that is generated had best be close to a perfect match if Googlebot downloads it twice within a short time frame.
Now I am worried. All of my pages are dynamically generated because it is very easy for me to update across the entire site.
On my last update, I made some fairly heavy changes, and it took me nearly 15 minutes to do error corrections. When I went to look at my logs, GB had hit the page 3 times, each time just after a change. GB hasn't been back. :(
Is there a way to do includes with HTML files so that I can get away from the dynamic pages where they really are not necessary?
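One option, if your host supports Server Side Includes (an assumption - many hosts enable SSI only for files with a .shtml extension, so check first), is to assemble otherwise-static pages from shared fragments. The filenames below are just examples:

```html
<!-- page.shtml -- sketch only; assumes the host has SSI enabled
     (often only for .shtml files) -->
<html>
<body>
<!--#include virtual="/includes/header.html" -->
<p>Static page content here.</p>
<!--#include virtual="/includes/footer.html" -->
</body>
</html>
```

Classic ASP has its own include directive (<!--#include file="header.asp"-->), but that keeps every page running through the ASP engine; SSI lets you update a header or footer in one place while the URLs stay static-looking.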