Forum Moderators: DixonJones


Session IDs, and creating alternate paths for crawlers

Trying to create an alternate path with no session IDs for a bot to crawl


xbase234

7:35 pm on Nov 17, 2003 (gmt 0)

10+ Year Member



I'm working on a site that uses session ID's, but would like to ensure that the pages get crawled.

Can anyone recommend a method of creating a path specifically for a crawler, and whether or not new pages should be created for this?

My concern is that if duplicate 'spiderable' pages are created, and the bot somehow picks up a page with a session ID, there may be a duplicate content penalty assessed, or the pages may get kicked out of the index.

Anyone have any experience creating a "bot-friendly" path for a site that uses session ID's? Is there a way to use existing pages and 'turn-off' id's for the bots?

Also, the information is essentially static (never changing, no variables), but the pages are dynamically delivered (.jsp). Is it possible that these pages could still get indexed with the session ID, considering the relatively small number of pages (fewer than 100)?

xbase234

4:30 pm on Nov 18, 2003 (gmt 0)

10+ Year Member



post removed

xbase234

6:07 pm on Nov 18, 2003 (gmt 0)

10+ Year Member



Hmm, no bites. Mod, could this post be moved to Google News?

jatar_k

9:05 pm on Nov 18, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



We use session IDs on our site and the spiders have no problem crawling them. It's a PHP site, and there are settings for what type of session handling it needs to do. It also depends on what happens to the pages when there is no session variable.

Are there no settings for how the session is handled?
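For a PHP site like the one jatar_k describes, the settings in question are likely PHP's standard session directives. The directive names below are real php.ini settings, but whether they suit any particular site is an assumption; this sketch simply keeps session IDs out of URLs entirely so crawled links stay clean:

```ini
; php.ini (or set per-directory via .htaccess / ini_set)

; Never rewrite links to embed the session ID in the URL
; (no PHPSESSID query strings for anyone, bots included)
session.use_trans_sid = 0

; Propagate the session only via cookies; clients that refuse
; cookies (most crawlers) simply get no session
session.use_only_cookies = 1
```

With these set, a crawler that rejects cookies sees plain URLs with no session token, which avoids the duplicate-URL problem without any user-agent detection.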

could this post be moved to Google News

It's not really Google News, though, is it?

Also, have you looked in your logs to see what urls spiders are requesting? Have you seen them requesting urls with a session?

xbase234

6:05 pm on Nov 19, 2003 (gmt 0)

10+ Year Member



No, this is a new site, and I'm mainly concerned about getting indexed in Google. I've run into several instances where Google will index URLs containing session IDs, but I would prefer not to take a chance if there is a better solution.

There are fewer than 100 pages on this site, so I may just take a chance with the IDs. If it doesn't get indexed, Plan B will be a site rewrite into HTML. I'm still wary of a rewrite because of the potential for a duplicate content penalty if the session ID strings get indexed.

Have any members ever "turned off" cookies and sessions through one directory of a site to allow the bots to crawl? Is this even viable?

Also, has anyone ever delivered relevant content pages by IP? It seems like IP delivery may be the only viable solution, even though there is no intention of deceiving searchers. This is still probably too risky for me to implement. Just wondering if anyone has tried it.

count_zer0

3:56 pm on Nov 25, 2003 (gmt 0)

10+ Year Member



You could also just turn off session IDs for known search engine bots. Identify these with the User-Agent string. Voila, no duplication issues.
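A minimal Java sketch of count_zer0's suggestion (Java, since the site in question is JSP). The bot signatures and method names here are illustrative, not a complete or authoritative list; check your own logs for the User-Agent strings that actually visit:

```java
// Sketch: suppress URL-based session IDs for known crawlers,
// identified by substrings in the User-Agent header.
public class BotDetector {

    // Example crawler User-Agent fragments (not exhaustive).
    private static final String[] BOT_SIGNATURES = {
        "googlebot", "slurp", "msnbot", "teoma"
    };

    public static boolean isKnownBot(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase();
        for (String sig : BOT_SIGNATURES) {
            if (ua.indexOf(sig) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Append a session ID to a link only for ordinary visitors;
    // crawlers get the clean URL, so no duplicate-content risk.
    public static String rewriteUrl(String url, String sessionId,
                                    String userAgent) {
        if (isKnownBot(userAgent)) {
            return url;
        }
        return url + ";jsessionid=" + sessionId;
    }

    public static void main(String[] args) {
        System.out.println(rewriteUrl("/catalog.jsp", "ABC123",
                "Googlebot/2.1 (+http://www.google.com/bot.html)"));
        System.out.println(rewriteUrl("/catalog.jsp", "ABC123",
                "Mozilla/4.0 (compatible; MSIE 6.0)"));
    }
}
```

The usual caveat applies: serving different markup to bots than to users is cloaking, but serving the *same* content at a cleaner URL, as above, is generally considered safe.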

mipapage

3:47 pm on Nov 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Check this [webmasterworld.com] out xbase234, maybe it can help you...

richmondsteve

4:06 am on Dec 1, 2003 (gmt 0)

10+ Year Member



xbase234, three options I've used are: 1. whitelisting the IPs of bots that I don't want session IDs passed to; 2. disabling query-string-based URLs and using session variables only when a cookie is accepted (which bots don't/shouldn't accept); and 3. only starting a session after a specific action (a POST method submit on the login page), which a bot will never perform.
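Options 2 and 3 above can be sketched as a single policy check. This is a hypothetical outline (method and parameter names are illustrative), assuming the site sets a test cookie on the first hit and only treats a client as session-capable once that cookie comes back:

```java
// Sketch: decide whether to start a session at all, so that
// cookie-refusing clients (most crawlers) never see a session ID.
public class SessionPolicy {

    // testCookieReturned: the test cookie set on a previous request
    //   came back with this one, proving the client accepts cookies.
    // isLoginPost: this request is a POST to the login form.
    public static boolean shouldStartSession(boolean testCookieReturned,
                                             boolean isLoginPost) {
        // Option 3: a login POST always gets a session;
        // crawlers never submit forms.
        if (isLoginPost) {
            return true;
        }
        // Option 2: otherwise require proven cookie support, and never
        // fall back to query-string session IDs, so crawled URLs stay clean.
        return testCookieReturned;
    }

    public static void main(String[] args) {
        System.out.println(shouldStartSession(false, false)); // crawler: false
        System.out.println(shouldStartSession(true, false));  // cookie user: true
        System.out.println(shouldStartSession(false, true));  // login POST: true
    }
}
```

The design point is that the session is gated on an action a bot will not take, rather than on recognizing the bot itself, so it needs no IP or User-Agent list to maintain.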