Forum Moderators: open
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
If you decide to use dynamic pages (i.e., the URL contains a '?' character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them small.
---------------------------8<------------------------------------------
taken from [google.com...]
Note: he also uses '@', not only '&'; I missed that too the first time.
I don't even see a reason for using '?' either ;)
I recommend URL hashing:
replace each (dynamic) URL with a hash and use mod_rewrite to translate the incoming requests back.
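A minimal sketch of the idea in Python. The function names, the `/p/<hash>.html` layout, and the in-memory lookup table are my own illustration, not the poster's CMS; a real implementation would store the mapping in a database and feed it to mod_rewrite (or resolve it in the request handler).

```python
import hashlib

# hash token -> original query string (a real CMS would persist this)
table = {}

def hash_url(query: str) -> str:
    """Map a dynamic query string to a short, stable hash token."""
    return hashlib.md5(query.encode("utf-8")).hexdigest()[:16]

def publish(query: str) -> str:
    """Return the static-looking URL the CMS emits in its pages."""
    token = hash_url(query)
    table[token] = query          # remember how to translate it back
    return f"/p/{token}.html"

def resolve(path: str):
    """Translate an incoming hashed path back to the original query."""
    token = path.removeprefix("/p/").removesuffix(".html")
    return table.get(token)

url = publish("page=products&id=42")
# url looks like /p/<16 hex chars>.html and is stable across requests,
# so spiders see one clean, static-looking address per page
assert resolve(url) == "page=products&id=42"
```

Because the hash is deterministic, every link to the same content produces the same URL, which is exactly the property that lets spiders collapse duplicates.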
Also, implement If-Modified-Since. Google likes it, and so do browser caches ;)
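A sketch of the conditional-request logic, framework-free (the `respond` helper and the example date are mine, purely for illustration):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(last_modified: datetime, if_modified_since):
    """Return (status, headers) honoring If-Modified-Since.
    A 304 saves bandwidth for Googlebot and browser caches alike."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None
        # HTTP dates have one-second resolution, so drop sub-second parts
        if since and last_modified.replace(microsecond=0) <= since:
            return 304, headers   # client's copy is still fresh
    return 200, headers           # send the full page

page_time = datetime(2003, 3, 1, 12, 0, tzinfo=timezone.utc)
status, hdrs = respond(page_time, "Sat, 01 Mar 2003 12:00:00 GMT")
# status == 304: nothing changed since the client last fetched it
```

The key detail is always sending `Last-Modified` on the 200 response; without it, clients have no validator to echo back.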
I'm also a CMS developer, and every single sub-page of our customers' sites gets indexed by Google :)
Before I took the above steps, things were different :|
A nice side effect is that people don't fiddle with the query strings.
You can even disallow any query string that didn't originally come from the CMS.
Hope it helps.
[ircache.net...]
Enter your URL there.
It will tell you how cacheable it is.
AFAICT, cacheable (dynamic) sites have a much higher chance of being indexed by spiders.
cu
I tend to disagree. Most sites I deal with do not want to be cached, due to the dynamic nature of their content, and use things like must-revalidate to stay fresh.
I've never seen this impact a site's ability to be spidered. The URLs may have an impact (albeit not as much as years ago), but I doubt the cacheability has any impact.
Do you hash the entire URL, or just the query string?
ie:
www.mydomain.com/index.aspx?2439876345fhs324
or
www.mydomain.com/2439876345fhs324.aspx
which could also be done?
I think I'm having a similar problem. If it's OK, could you give more detail, i.e. what happened before, and how long did it take to change?
Thanks.
Sorry, I meant to say: how long did it take Google to see the change and spider properly?
[edited by: hitchhiker at 11:29 pm (utc) on Mar. 15, 2003]
OK, then maybe it's not the caching that the spiders like, but the presence of freshness validators.
Caching doesn't hurt, though :)
Of course, you should fine-tune things so that pages that MUST always be fresh are not cached.
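One way to picture that fine-tuning (the helper and the one-hour lifetime are illustrative choices, not a rule from the thread): even must-be-fresh pages keep a validator, so conditional requests can still answer 304.

```python
def cache_headers(always_fresh: bool) -> dict:
    """Sketch: pick cache directives per page type."""
    if always_fresh:
        # revalidate on every request, but 304s remain possible
        # when the page also carries a Last-Modified validator
        return {"Cache-Control": "no-cache, must-revalidate"}
    # safe to reuse briefly; proxies and browsers may cache it
    return {"Cache-Control": "public, max-age=3600"}
```

`no-cache` here means "don't reuse without revalidating", not "don't store", which is why it pairs naturally with If-Modified-Since.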
I would recommend putting the hash in a subdirectory, so that the template's filename can still be seen, e.g. when you want (dynamic) images to be spidered too.
Actually, I have another subdirectory one level above that, because I didn't want the hashes at top level; but now that I think about it, it seems unnecessary.
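A hypothetical mod_rewrite rule for that layout (the `/h/` prefix, the `state` parameter, and the file names are made up for the example; per-directory `.htaccess` syntax, where the pattern has no leading slash):

```apache
# /h/3f2a9c1d90ab45ef/products.html  ->  /products.php?state=3f2a9c1d90ab45ef
# The hash sits in a subdirectory, so the template name stays visible.
RewriteEngine On
RewriteRule ^h/([0-9a-f]+)/([A-Za-z0-9_-]+)\.html$ /$2.php?state=$1 [L]
```

Keeping the real filename at the end also means relative links to images and stylesheets inside the page still resolve against a sensible path.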
BTW: wasn't there an issue with Google PR and subdirectories?
Maybe this is even better (although it is too ugly in my eyes).
Hopefully you have direct control over your Apache.
You can put mod_rewrite code in .htaccess files, but maximum performance is only achieved using the conf files.
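For comparison, the same kind of rule in the server config (hostname and paths are placeholders). Rules here are parsed once at startup, whereas `.htaccess` files are re-read on every request; note the leading slash in the pattern, which per-directory context strips:

```apache
<VirtualHost *:80>
    ServerName www.example.com
    RewriteEngine On
    # server-context pattern keeps the leading slash
    RewriteRule ^/h/([0-9a-f]+)/([A-Za-z0-9_-]+)\.html$ /$2.php?state=$1 [L]
</VirtualHost>
```

If you control the conf files you can also disable `.htaccess` lookups entirely with `AllowOverride None`, which avoids the per-request filesystem checks.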
cu
We also mailed Google and asked if there were penalties on our site, because it might look like cloaking; they told us there were no penalties and that it could be related to Everflux.
After 3 weeks or so, the fresh crawl picked up the pages and everything is OK now.