Forum Moderators: Robert Charlton & goodroi

googlebot crawling my site just after my subscribers?

         

thefa

12:03 pm on Dec 29, 2007 (gmt 0)

10+ Year Member



Hi,

I have noticed a strange behaviour of googlebot:

I have a newsletter that I send weekly to subscribed members. The links inside this newsletter pass parameters back to the site in this simple format:
?ref=#*$!&id=yyy where
- ref=#*$! identifies the issue of the newsletter
- id=yyy identifies the member who received the newsletter, which allows me to build stats on who is clicking on what link...

I then log every click on the newsletter links and have noticed records like the ones below:
05:30:58 /index.php yyy my-visitor's-IP-here dummyhere@hotmail.com #*$!
05:31:35 /index.php yyy crawl-66-249-66-145.googlebot.com dummyhere@hotmail.com #*$!

For every click, I seem to get a googlebot visit a few seconds later, reusing the same parameters as the ones passed by the member who clicked on the link in the newsletter!

Is this normal? How does this work? Any way for me to avoid that?

Thanks in advance for your help,

tedster

4:08 pm on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have exactly the same situation - and I've been using robots.txt to disallow googlebot from urls with that query string parameter. You could also use a script to insert a noindex robots meta tag if the query parameter is present in the requested url.
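A minimal sketch of the noindex idea, assuming the tracking parameter is called "ref" as in the original post. The helper name robotsMetaTag is hypothetical and only for illustration:

```php
<?php
// Hypothetical helper: returns a robots meta tag when the tracking
// parameter (assumed here to be "ref") is present in the query string,
// and an empty string otherwise.
function robotsMetaTag(array $query, string $param = 'ref'): string
{
    return array_key_exists($param, $query)
        ? '<meta name="robots" content="noindex">'
        : '';
}

// Inside the page's <head> section:
echo robotsMetaTag($_GET);
```

Keep in mind that a crawler can only see this tag if it is allowed to fetch the page, so the meta tag has no effect on urls that robots.txt already disallows.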

Just yesterday, I also found other websites publishing the email I send out - not the actual website content, but the email with a link including the query string. To get some value from those backlinks, I am considering changing over to a 301 redirect that removes the query string. I just need to be sure that my analytics program will still give me the data I need.

thefa

6:58 pm on Dec 29, 2007 (gmt 0)

10+ Year Member



Hi tedster,

Thanks for the quick reply.

I wonder both how and why googlebot behaves this way... In any case, I too need to take some countermeasures.

The robots.txt way would prevent googlebot from crawling pages when such a parameter is present.

Do I understand correctly that your second idea would be to process the parameters passed in the URL and then simply 301 redirect to the same URL without those parameters?
Is this correct?

In that case, I might not even need the robots.txt rule, as the redirect would take care of everything?

Thanks.

Robert Charlton

7:10 pm on Dec 29, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



How does this work?

Google could be picking up these urls with tracking strings from server logs or a stats page somewhere. See this discussion...

Why is Google indexing my entire web server?
[webmasterworld.com...]

In addition to implementing the fixes suggested by tedster, you might want to check to see whether your stats pages are open to spidering. As tedster points out, though, there are many places where these links might be picked up, eg....

...other websites publishing the email I send out - not the actual website content, but the email with a link including the query string. To get some value from those backlinks, I am considering changing over to a 301 redirect that removes the query string. I just need to be sure that my analytics program will still give me the data I need.

With tagging-based analytics programs, I've lost tracking information when using .htaccess rewrites to redirect these. How would you preserve tracking info with tag-based tracking?

Hoping this additional question regarding such redirects isn't too far afield... on sites where a lot of such urls are floating around in the wild (and now going to landing pages that no longer exist), I've been tempted to redirect them to the home page (ie, to the default canonical). I've hesitated, though, wondering whether or not Google would see the sudden redirection of many urls to the home page as "spammy." Thoughts?

Robert Charlton

7:15 pm on Dec 29, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



PS... Was posting at the same time as the previous poster.

...your second idea would be to process the parameters passed in the URL and then simply 301 redirect to the same URL without those parameters?

Can this be done if you're using .htaccess to do the redirect?

thefa

7:34 pm on Dec 29, 2007 (gmt 0)

10+ Year Member



I was thinking about something like this:
<?php
if (paramsAreSet()) {
    processParam(...); // log what needs to be logged
    // Send the 301 status together with the Location header
    header("Location: http://www.mysite.com/thepage.php", true, 301);
    exit();
}
?>

Would that work?
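If the page can also receive other query parameters, the redirect target could be built by removing only the tracking ones. A rough sketch, assuming the parameters are named "ref" and "id" as in the first post. The helper name stripTrackingParams is hypothetical, and it returns only the path and remaining query string, not the host:

```php
<?php
// Hypothetical helper: removes the tracking parameters (assumed to be
// "ref" and "id") from a URL and leaves any other parameters in place.
// Returns the path plus whatever query string remains.
function stripTrackingParams(string $url, array $tracking = ['ref', 'id']): string
{
    $parts = parse_url($url);
    $query = [];
    parse_str($parts['query'] ?? '', $query);

    foreach ($tracking as $name) {
        unset($query[$name]);
    }

    $clean = $parts['path'] ?? '/';
    if ($query !== []) {
        $clean .= '?' . http_build_query($query);
    }
    return $clean;
}
```

After the click has been logged, the result could be passed to the redirect, for example header('Location: ' . stripTrackingParams($_SERVER['REQUEST_URI']), true, 301);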

Coming back to how it works:
What is strange is that googlebot is not visiting days after the first click occurred from my subscribers.
It is occurring a few seconds after. In my logs the records are nearly next to each other: one of my visitors followed by his shadow googlebot... How can they find this so quickly?

Robert Charlton

8:02 pm on Dec 29, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



What is strange is that googlebot is not visiting days after the first click occurred from my subscribers.
It is occurring a few seconds after.

A similar question occurred in a recent thread, and the conjecture was that Google might have seen the site in question as a news site, and that some sort of "burstiness" of terms might have been involved....

Very fast indexing on a specific page
What does all this mean about my site, if anything?
[webmasterworld.com...]

If you're being considered a topical news source, it may be that Google is watching traffic to you closely. I was initially skeptical, but the fact that you have supporting log evidence makes the discussion in that thread perhaps applicable here.

thefa

8:53 pm on Dec 29, 2007 (gmt 0)

10+ Year Member



I checked my logs and found the following:

1. I started to see googlebot visiting my site with my newsletter URLs from August 11th, 2007 on. Before that date, nothing; after it, systematically.
Yes, it is only now that I am taking the time to investigate...

2. It is not related to the freshness of my pages or to whether my pages have already been indexed. There is a systematic googlebot visit after (nearly) every click from my visitors - even several days after my newsletter has been sent.
I say "nearly" because I cannot say whether really every single visitor click is doubled by a googlebot visit; I'd need to spend more time on the logs to be sure.

If anyone is interested I can provide some very simple daily logs with:
click time; page visited; IP; couple of parameters passed;

I am starting to wonder whether I could have done something that triggers this strange behaviour?

One additional note: unlike in the thread you refer to, Robert, I don't know whether my pages are added to the G index that fast. What I know is that there is a visit from googlebot.
Ironically, some time ago I had an issue with Google taking FOR EVER to index my new pages - gone now. I never understood why I got punished like this, nor why I got out of the mess...

thefa

10:04 pm on Dec 29, 2007 (gmt 0)

10+ Year Member



Ah, one thing: could this simply be the crawling from AdSense?

Actually, I did notice this some time ago and tried to filter those visits out by checking for this:

$_SERVER['HTTP_USER_AGENT']!= "Mediapartners-Google/2.1"

And by looking at the server's log, I just found out that the user agent now seems to be: "Mediapartners-Google"...
That could explain the reappearance in August, if they changed the name of the user agent around that time?
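A filter that does not depend on the exact version suffix could look for the "Mediapartners-Google" token anywhere in the user agent instead of comparing the whole string. A minimal sketch, with a hypothetical helper name isAdSenseCrawler:

```php
<?php
// Hypothetical helper: true when the user agent contains the
// "Mediapartners-Google" token, with or without a version suffix.
function isAdSenseCrawler(string $userAgent): bool
{
    return stripos($userAgent, 'Mediapartners-Google') !== false;
}
```

The click logging could then be skipped when isAdSenseCrawler($_SERVER['HTTP_USER_AGENT'] ?? '') returns true. This only checks the string the client sends, so it is an assumption that the visits in the logs really carry this user agent.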