Also, while I'm on the subject, do spiders have problems with DHTML and SHTML?
Thanks in advance :)
The only gotchas relate to how the site is coded rather than what it is coded with. For example, stuffing lots of data into the query string, or not allowing users to browse without a session cookie, are easy ways to keep search engines from indexing your pages.
That said, most crawling problems on dynamic sites can be solved with a little thought and a little more code...
- Tony
In addition to making sure your ASP spits out clean HTML code that the spiders can digest, the query string is really the trick.
1) Don't put any tracking data into the string. In other words, session IDs, referrer data, etc. are bad. Put only the variables that are needed to generate a unique page into the URL.
2) Be consistent. "...page.asp?ID1=2&ID2=1" will generate the same page as "...page.asp?ID2=1&ID1=2", but Google will look at them as separate pages. It will then start to hate both of them because they are the same. Make sure your query strings maintain a consistent order. I still find a few pages on my site from time to time that have this problem, and I kick myself with each Google update.
3) Keep it Short and Sequential: Google hates things like "...page.asp?ID=1625te53632271ths6". It may not be the length so much as the "Oh, that's a RANDOM number and not a sequence. That means it's a RANDOM page and not a real page". I have NEVER been able to get Google to index my pages that are generated using product UPC, ISBN, or Amazon ASIN numbers. They are long and non-sequential. Not sure what kills it, but if I generate a page using my own sequential database numbers, or even by calling the product by its name (which is LONGER than the number, but at least it's identifiable), then it crawls it fine.
4) Make good use of the Server.URLEncode and Server.HTMLEncode functions before passing and parsing those strings. Google's pretty good at it, but FAST and ALLTHEWEB sometimes have problems with "page.asp?dude=Joe%20Blow". When you run the value through URLEncode, it becomes "page.asp?dude=Joe+Blow" and it works fine. (AOL, though, often converts the ? and = to their ASCII codes. I haven't figured out what causes this and it doesn't happen all the time, but when it DOES do it, you can watch your 404s go through the friggin' roof. I HATE that!)
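To make points 1, 2 and 4 concrete, here is a rough sketch of the idea. It's in Python rather than ASP just to keep it short and runnable; the same logic should port to VBScript. The function name canonical_url and the TRACKING_PARAMS list are made up for illustration, so swap in whatever parameter names your own site actually uses. It drops tracking parameters, puts the remaining ones in a fixed (alphabetical) order, and re-encodes the values so a space comes out as "+".

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameter names (an assumption, not from any real site).
TRACKING_PARAMS = {"sessionid", "sid", "ref", "referer"}


def canonical_url(url):
    """Return url with tracking parameters dropped and the rest in a fixed order."""
    parts = urlsplit(url)
    # parse_qsl decodes each value, so "Joe%20Blow" becomes "Joe Blow" here.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Point 1: keep only the parameters that identify the page.
    kept = [(k, v) for k, v in pairs if k.lower() not in TRACKING_PARAMS]
    # Point 2: always emit parameters in the same (alphabetical) order.
    kept.sort(key=lambda kv: kv[0].lower())
    # Point 4: urlencode re-encodes values and writes a space as "+".
    query = urlencode(kept)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(canonical_url("page.asp?ID2=1&ID1=2"))
print(canonical_url("page.asp?dude=Joe%20Blow"))
print(canonical_url("page.asp?ID=7&sessionid=abc123"))
```

If every internal link on the site is built through one helper like this, both orderings of ID1 and ID2 end up as the same URL, so the spider only ever sees one version of each page.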
I guess that's it for the major stuff. The only other thing to deal with (if the site is big) is getting googlebot to index the important stuff first and leave the older, less important stuff until later. I've got about 3 million pages now, but only 3000-4000 are really hot topics (it's movies and soundtracks so some are hot, some are not). For months, googlebot would go through and pick up 40K - 50K pages - but it picked whichever ones it felt like. Now I've got it better where it's getting the hot stuff first and then filling the rest with whatever it happens to like. Let me know if you ever get to that point and I'll post a thread on how to do that. (Or at least how I THINK I managed to do that). ;)
G.