I can understand that since most products have a direct link to this (checkout) page, G would attempt to index it, but why the repeated hammering on the same page? You'd think that once it grabbed the page it wouldn't continue to read it over and over again, sometimes as often as once per second.
Unfortunately, we can't use a NOINDEX meta tag on this main checkout page because we are using a storefront that controls the page header.
The main question now is, will this affect how we get indexed / ranked? If there is a negative effect, are there any suggestions as to how to fix the problem given our lack of control over the header?
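One route that doesn't depend on the page header at all: robots.txt is served from the site root, independently of the storefront software, so crawling of the checkout and account URLs could be blocked there. A sketch, assuming the pages live under their own path segments (the paths below are placeholders; substitute your store's actual URLs):

```
# robots.txt - served from the site root, independent of the storefront header
# (the /checkout and /account paths are hypothetical examples)
User-agent: *
Disallow: /checkout
Disallow: /account
```

Note that standard robots.txt matching is by URL-path prefix, so this only helps if the checkout page has a distinct path rather than just a distinct query string on a shared script.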
Based on the unique Session_ID, it looks like googlebot is trying to follow the "checkout" or "buy one now" option.
I've gone to the log files and tracked back to where any given Session_ID first appeared and it was when googlebot was first looking at a product.
A normal user would have one session ID for all of their activities, but each access from Google apparently looks like a new user, so our storefront issues a unique session ID for every Googlebot request.
About 4 hours after getting the Product info, googlebot then hits the "checkout" page -- ONCE for EACH ONE of the products it looked at earlier (each hit to the checkout page having a different Session_ID, associated with the earlier Product hit).
Is this something to be concerned with? Is it something googlebot should be smart enough to know about, or maybe it does know about it but finds looking at the same page important for some reason?
From Google's Technical Guidelines for Webmasters [google.com]
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
Essentially, you are creating a situation where a potentially infinite number of urls are possible for a single bit of content -- and as the bot starts trying these, the pages actually active in the index will start to erode. Track this kind of thing with a cookie if you must, but best practice is to keep it out of the url.
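If the site runs on Apache with mod_rewrite available (an assumption; the Session_ID parameter name is taken from this thread and the bot list is illustrative), one common workaround is to 301-redirect crawler requests that carry a session ID back to the session-free URL. A sketch only, not tested against a live Miva install:

```apacheconf
# .htaccess sketch: strip the Session_ID parameter for known crawlers
# (assumes Apache + mod_rewrite; adjust the parameter name to your store)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteCond %{QUERY_STRING} ^(.*?)&?Session_ID=[^&]*&?(.*)$ [NC]
RewriteRule ^(.*)$ /$1?%1%2 [R=301,L]
```

Because this is a redirect to the same content everyone else sees, rather than serving bots different content, it is generally not regarded as cloaking.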
Unfortunately for us, our storefront software (do I dare say Miva?) does the Session_ID assignment. When invoking a page (e.g. product info retrieval), you usually don't see or use the Session_ID as a URL parameter; it only appears when https:// is invoked (such as for account setup or purchase-processing "checkout").
On each store page, there are links to store functions: "Storefront", "Account", "Search", "product List", "Basket Contents" and "Checkout". The "Account" and "Checkout" links use https and use Session_ID. So, whenever G looks at a product, it sees the "Account" and "Checkout" link (each is a single page and would only "look" different to G because of the Session_ID being different).
Given the above, do you think that G will be "..able to eliminate URLs that look different but actually point to the same page."?
Is it time to change our storefront software?
Unfortunately, the offending links are part of the storefront code and we have no control (other than the image displayed) over their structure. The specific links are for setting up a username/account and for checkout, and on these links the SessionID gets set (again, by the storefront code).
We can set the main/base part of the URL for the two functions, but if you try to add anything, it gets embedded before the options (one of which is the &Session_ID).
If there were no workaround, then no Miva sites would succeed in the search engines. Although they have had trouble, approaches have been found that allow Miva sites to rank.
My understanding about the session ID was that it was required to maintain the integrity of the shopping basket and ensure security during the checkout / purchase process. I will give them a call to ask about any options for using cookies vs. a URL parameter, and will post back here if I find out anything new.
Knock on wood, we're doing OK in rankings with this m store (albeit we're not happy with the last week or so of flux / update or whatever you want to call it). It's been a real PAIN to do some of the things we have done to get m store rankings.
It's just that G hammering away on the same page had us concerned about what might happen to the rest of our rankings if we can't get them to ignore the two pages with the problem.
Trying to keep on topic: will G hurt our store's current rankings if we have a situation where all (or almost all) product pages in our store point to a "checkout" page using a unique session ID, and where G is now repeatedly crawling that same "checkout" page because of the unique session IDs?
tedster pointed earlier to G's technical guidelines for webmasters and if I read correctly: "Allow search bots to crawl your sites without session IDs or arguments that track their path through the site...".
Well, in our case G CAN crawl our site without the session IDs (i.e. they can use the URL and just not include the session ID). Actually, we'd PREFER that they not use the Session_ID as part of the URL they crawl. So I think in our case, since we're not PREVENTING them from crawling without Session_IDs, we're OK w.r.t. G's guidelines?
I think that's where the 3rd party consulting comes in, IF the problem can be fixed.
The session ID comes in as follows: every store page that's viewed has a store navigation bar near the top of the page. In the navigation bar there are 2 "offending" links: "Account" and "Checkout". The store software (beyond our control) builds the page and, as part of that build, adds a Session_ID to the "Account" and "Checkout" links.
We can set the URL for these links in our administrative area for the store, but we cannot change the options/parameters added to the URL or add any link-specific attributes (such as the rel="nofollow" suggested earlier).
There's also one other area: on some product listings there's a "buy one now" option, and that as well has a Session_ID added to the link. But I think the most offensive links are the "Account" and "Checkout" links found on virtually EVERY page of the store, since they're part of the store-function navigation bar.
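Since Googlebot's robots.txt parser understands * wildcards in Disallow lines (a nonstandard extension that other bots may ignore or misread), one way to keep it off every session-ID URL at once, regardless of which link carries it, might be (the Session_ID= parameter name is taken from this thread; verify it against your store's actual URLs):

```
# Googlebot-specific section: the * wildcard is a Google extension,
# not part of the original robots.txt standard
User-agent: Googlebot
Disallow: /*Session_ID=
```

This would cover the "Account", "Checkout", and "buy one now" links in one rule, while leaving the session-free product URLs crawlable.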
For a forum, for example, you would stop robots trying to crawl any "log in" and "new reply" and "new post" and "edit post" and "edit profile" and "send PM" and "bookmark this thread" and "print friendly" pages as they should never appear in search results anyway.
For a shopping cart, bots should index product pages, but should never be allowed to try to access "buy" links.
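For the forum example above, that policy might look like the following robots.txt (the paths are illustrative; real forum software will use its own URL scheme):

```
# Block action/utility pages that should never appear in search results
User-agent: *
Disallow: /login
Disallow: /newreply
Disallow: /newpost
Disallow: /editpost
Disallow: /editprofile
Disallow: /sendpm
Disallow: /printthread
```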
With all the hammering away on the "checkout" page (which is [oursite.com...]), a site search on G now returns [oursite.com...]; i.e., site:oursite.com returns [oursite.com...] The only way I can think that's possible is from all the [oursite.com...] links followed to "checkout" from our product pages.
In checking mcdar, the https version shows up as #1 in about 80% of the DCs (datacenters), with http on the others.