Forum Moderators: Robert Charlton & goodroi


My httpd rewrites URLs for googlebot, but duplicates are indexed


helpnow

1:04 pm on Jul 7, 2008 (gmt 0)

10+ Year Member



I know googlebot is officially the only bot that crawls for the SERPs...

However...

My httpd.conf rewrites URLs for googlebot to strip off session ids, etc.

Now, I see all sorts of URLs WITH session ids showing up in the SERPs again.

I tested my httpd.conf by including my own IP as a bot - my rewrites work perfectly.
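[For context, a minimal sketch of the kind of user-agent-conditional rewrite being described, assuming Apache mod_rewrite and a session id carried in a `sid` query parameter - both assumptions for illustration, not the poster's actual config:]

```apache
RewriteEngine On

# Only act when the client claims to be Googlebot...
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# ...and the query string is nothing but a session id
# (hypothetical "sid" parameter name).
RewriteCond %{QUERY_STRING} ^sid=[^&]+$ [NC]
# 301 to the same path with the query string dropped
# (the trailing "?" discards it).
RewriteRule ^(.*)$ /$1? [R=301,L]
```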

So, the only thing I can think of is that Google SERPs are using bots other than googlebot to gather data.

Is anyone else noticing this? Perhaps Mediapartners-Google is also feeding the SERPs. Maybe AdsBot-Google too? I don't rewrite URLs for these 2 bots, but I guess I have to now...

I cannot figure out any other explanation for how rewritten googlebot URLs are getting into the SERPs in a pre-rewritten format...

P.S. I am a victim of the June 4 ranking debacle, and have been searching for an explanation. Not sure if anyone else suffering from June 4 may discover this is a problem for them too.

tedster

4:50 pm on Jul 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Perhaps Mediapartners-Google is also feeding the SERPs. Maybe AdsBot-Google too? I don't rewrite URLs for these 2 bots, but I guess I have to now...

Ever since Google built the Big Daddy infrastructure, all Googlebots use a shared crawl cache. Here's a post from Matt Cutts [mattcutts.com] about that.

Now according to that post, when you are crawled by a spider from another Google service, that crawling "doesn't queue up pages to be included in our main web index." However, Matt's blog post is also over two years old - and things may have changed, or at least become crossed up somewhere.

helpnow

5:03 pm on Jul 7, 2008 (gmt 0)

10+ Year Member



Thank you, tedster, for your reply. Between posting this and your reply, I kept researching wherever I could. This isn't often spoken of, but I did see a few other webmasters who posted proof elsewhere that it occurs.

I don't understand why this isn't a bigger issue. I lost my rankings on June 4th, and have spent 5 weeks trying to figure out why, and fixing things.

Today, I tripped over this, and I am now more certain than with any past possible issue that THIS is the cause. It has a bigger effect than anything else I've looked at. I am unwillingly filling the SERPs with hundreds of thousands of pages of duplicate content. That's bad for Google, though I know they will filter it out of the SERPs anyway, but it sure as hell is awful for me and the people I've had to lay off.

I've fixed it now - a 5 minute fix to add ADSBOT and MEDIAPARTNERS all through my httpd.conf. Now I wait 10-15 days for Google to pick up the fix. But it really should be addressed better by Google in their help topics, and I wish more people here knew of it so it could surface from time to time as a possible problem for other webmasters to consider.
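[For anyone making the same change, the user-agent match can be widened in a single condition. A hedged sketch, again assuming mod_rewrite and a hypothetical `sid` parameter rather than the poster's exact file:]

```apache
RewriteEngine On

# Match Googlebot, Mediapartners-Google and AdsBot-Google
# with one alternation, so a new bot only needs one edit.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Mediapartners-Google|AdsBot-Google) [NC]
# Query string is only the session id (hypothetical "sid" name).
RewriteCond %{QUERY_STRING} ^sid=[^&]+$ [NC]
# 301 to the bare URL, discarding the query string.
RewriteRule ^(.*)$ /$1? [R=301,L]
```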

I will try to keep this problem alive as a possible thing for other webmasters in trouble to look at.

From where I am standing, this is HUGE!

g1smd

5:50 pm on Jul 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would rely on a solution that is acceptable to all bots, and hides session IDs from users too (because users can cut URLs from their browser address bar and paste them into the content of other sites). Cookies might be the way forward for you.

helpnow

6:26 pm on Jul 7, 2008 (gmt 0)

10+ Year Member



g1smd - thank you for your thoughts! I've always shied away from cookies because my shopping cart depends on them - if someone has cookies disabled, a cookie-dependent cart won't work. I've always used session ids, and tried to work out all the issues that come with that. <sigh> Right now, I am unsure which is the lesser of the two evils.

It's not just session ids. I have other navigational features that are great for the user but duplicate content for Google, like breaking category subsubs into user-generated product formats, etc. Again, without cookies, I have to bury them in the URL as /xyz/ so the user can carry them around the site - and I strip out /xyz/ if I detect it is a bot. My system won't let a bot get a session id or /xyz/, and if a bot does get one from another source, my httpd strips it off and 301s to the bare URL.

It seemed to be working lovely, and has been for a couple of years now - except now Google is letting non-googlebot bots do its SERP work, and now I need a new httpd.conf. You're right: the next time Google adds a new bot and lets it into the SERPs, I am in trouble again. Cookies would solve that so I never deal with this again...

tedster

6:50 pm on Jul 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You might also be able to use robots.txt to disallow any URLs that include the tracking codes - if the pattern is easy to identify using wildcard pattern matching. The major search engines all support pattern matching in robots.txt now.
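[A sketch of that robots.txt approach, assuming the session id travels in a hypothetical `sid` query parameter:]

```
User-agent: *
# Block any URL whose query string carries a session id,
# whether it is the first parameter or a later one.
Disallow: /*?sid=
Disallow: /*&sid=
```

[One caveat: Disallow stops crawling, but a disallowed URL that is linked from elsewhere can still appear in results as a URL-only listing, so this complements rather than replaces the 301 approach.]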