Forum Moderators: Robert Charlton & goodroi


Session ID Spam?

Or plain paranoia?

Vienix

3:40 am on Oct 2, 2006 (gmt 0)

10+ Year Member



On one of my sites I have a picture gallery which uses session IDs.

If an SE bot comes by, I have the session manager skip the session ID: it simply looks at the user-agent, and if it contains "bot", it doesn't append the session ID. This should work for Google.
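The check described above can be sketched as follows. This is a minimal illustration in Python, not the actual session-manager code from the thread; the function name is hypothetical:

```python
def should_append_session_id(user_agent):
    """Return True if a session ID should be appended to URLs for this visitor.

    Crawlers are detected with a simple case-insensitive substring test:
    any user-agent containing "bot" is treated as a spider and gets
    session-free URLs. Note this also matches Mediapartners-Google,
    because its UA string includes the URL http://www.googlebot.com/bot.html.
    """
    return "bot" not in user_agent.lower()
```

A substring test this broad catches Googlebot and, as the later posts point out, the AdSense crawler as well, which is relevant to the cached-session-ID mystery.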

To my big surprise, however, many, many pages of the gallery have their session ID appended in the Google cache.

Of course this can happen when a user links to a picture from somewhere with its session ID appended.

In my case, however, the (many) cached pages with session IDs all originate from a certain period, March to April 2006.

I wonder how this may have happened....

As far as I can see I haven't touched the session code since 2005, so this is not the problem.

Googlebot didn't identify itself as a bot: very unlikely.

Many people linked to my pictures with the session ID appended: very unlikely.

Someone has been busy requesting pages with session IDs appended to deliberately create duplicate content: my paranoia? But I can't think of another way this could have happened...

I recently created a sitemap for the gallery, and will think of a solution to prevent this from happening in the future, something like: if ((user-agent == bot) AND (session-id present)) then { 301 to same page without session-id }
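That pseudocode could look something like the sketch below. This is one possible interpretation in Python (the function name and the `sid` parameter name are assumptions, not from the thread): when a bot requests a URL carrying a session ID, compute the same URL without it, so the server can answer with a 301 to that canonical address.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def canonical_url_for_bot(url, user_agent, session_param="sid"):
    """Return the 301 target for a bot request carrying a session ID.

    Returns the same URL with the session parameter stripped, or None
    if no redirect is needed (not a bot, or no session ID in the URL).
    """
    if "bot" not in user_agent.lower():
        return None
    parts = urlparse(url)
    query = parse_qsl(parts.query)
    cleaned = [(k, v) for k, v in query if k != session_param]
    if len(cleaned) == len(query):
        return None  # no session ID present, nothing to strip
    return urlunparse(parts._replace(query=urlencode(cleaned)))
```

Serving the result as a 301 (rather than a 200) tells Google the session-ID URL is not the canonical one, which should gradually clean the duplicate entries out of the index.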

g1smd

10:35 am on Oct 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Does the MediaPartners UA include the letters "bot"?

Both that and normal Googlebot spidering are used to make the SERPs.

Vienix

10:59 am on Oct 2, 2006 (gmt 0)

10+ Year Member



As far as I know, the AdSense bot's user-agent is:

Mediapartners-Google/2.1 (http://www.googlebot.com/bot.html)

So it contains "bot" ....

Vienix

2:59 am on Oct 3, 2006 (gmt 0)

10+ Year Member



g1smd...

Are you sure that the "AdSense bot" is also used for listing pages in the SERPs?

According to the robots.txt analysis tool, the Mediapartners bot doesn't obey the ordinary Googlebot entries in robots.txt...
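That matches how robots.txt group matching works: a crawler follows the most specific `User-agent` group that matches it, so Mediapartners-Google needs its own block rather than inheriting the Googlebot rules. A sketch (the `/gallery/` path is just an example, not from this site):

```
User-agent: Googlebot
Disallow: /gallery/

User-agent: Mediapartners-Google
Disallow: /gallery/
```

Without the second block, Mediapartners-Google would fall back to the `User-agent: *` group (or no restrictions at all, if none exists), even though its UA string happens to contain "bot".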

regards,

Bert

tedster

3:16 am on Oct 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can be sure of it. That was one of the spidering changes that came along with the new Big Daddy infrastructure -- the spiders all share a common cache. This was discussed a lot at the last PubCon.

So if the mediabot grabs a URL for an AdSense page that is also in the main search index, then regular Googlebot will not waste your bandwidth and theirs by grabbing the same URL in the same time interval.

However, if robots.txt excludes that particular URL, then the cached content should not cross over to regular search.

g1smd

10:57 pm on Oct 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Matt Cutts did a post on his blog about the new Google proxy caching several months ago.

I'd guess it goes at least as far back as May or June, if asked.