Forum Moderators: Robert Charlton & goodroi
When a search engine bot comes by, I have the session manager leave out the session id. It simply looks at the user-agent: if it contains "bot", no session id is appended. This should work for Google.
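For anyone following along, here is a minimal sketch of that kind of check, written in Python for illustration only. The parameter name `sid` and the helper names `is_bot` and `build_link` are made up for this example, and the case-insensitive match is my assumption; the actual session manager may work differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAM = "sid"  # hypothetical name of the session id query parameter


def is_bot(user_agent):
    """Crude check: treat any user-agent containing 'bot' as a crawler."""
    return "bot" in (user_agent or "").lower()


def build_link(url, session_id, user_agent):
    """Append the session id to a link unless the visitor looks like a bot."""
    if is_bot(user_agent):
        return url  # bots get the clean URL
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((SESSION_PARAM, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

One thing worth checking with a substring test like this: any crawler whose user-agent string does not contain "bot" (for example the AdSense crawler, which identifies itself as Mediapartners-Google) would be treated as a normal visitor and get the session id.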
To my big surprise, however, a great many pages of the gallery have their session-id appended in the Google cache.
Of course this can happen when someone links to a picture from elsewhere with its session-id appended.
In my case, however, the (many) cached pages with session-id all originate from a certain period, March to April 2006.
I wonder how this may have happened....
As far as I can see I haven't touched the session code since 2005, so this is not the problem.
Googlebot didn't identify itself as bot; very unlikely.
Many people linked to my pictures with session-id appended; very unlikely.
Someone has been busy entering pages with session-id appended to deliberately create duplicate content; my paranoia perhaps, but I can't think of another way this could have happened.
I recently created a sitemap for the gallery, and will think of a solution to prevent this from happening in the future, something like: if ((user-agent contains "bot") AND (session-id present)) then {301 to same page without session-id}
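That redirect rule could be sketched roughly as follows, again in Python purely as an illustration. The parameter name `sid` and the function names are hypothetical, and the function only decides what to send; wiring it into the gallery's actual request handling is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAM = "sid"  # hypothetical name of the session id query parameter


def is_bot(user_agent):
    """Crude check: treat any user-agent containing 'bot' as a crawler."""
    return "bot" in (user_agent or "").lower()


def bot_redirect(url, user_agent):
    """Return (301, clean_url) when a bot requests a URL carrying a session id.

    Returns None when no redirect is needed (not a bot, or no session id).
    """
    if not is_bot(user_agent):
        return None
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == SESSION_PARAM for key, _ in pairs):
        return None  # already clean, so no redirect loop
    kept = [(key, value) for key, value in pairs if key != SESSION_PARAM]
    clean_url = urlunsplit(parts._replace(query=urlencode(kept)))
    return 301, clean_url
```

Returning None for URLs that are already clean matters here: without that check a bot would be redirected to the same page over and over.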
So if mediabot grabs a URL for an AdSense page that is also in the main search index, then regular Googlebot will not waste your bandwidth and theirs by grabbing the same URL in the same time interval.
However, if the robots.txt excludes that particular URL, then the cached content should not cross over to regular search.