Googlebot hits site 75,000 times in two days?

Swamper

4:15 am on May 20, 2003 (gmt 0)

10+ Year Member



Has anyone seen Googlebot get stuck indexing a site, hitting it over 75,000 times in two days, and still going? The site contains approximately 150 pages.

I wonder if this is going to affect my rankings at all.

It is also slowing down my server. I have emailed Google a couple of times now with no reply; I imagine they are quite busy at times.

Anon27

4:16 am on May 20, 2003 (gmt 0)

10+ Year Member



When did this happen?

Critter

4:17 am on May 20, 2003 (gmt 0)

10+ Year Member



75K on 150 pages?

You, my friend, have set a spider trap.

Did you email googlebot@google.com? That's the email to use for this type of issue.

Peter

Swamper

4:17 am on May 20, 2003 (gmt 0)

10+ Year Member



It started Sunday and is still ongoing.
I have looked at some of the other posts, and they say that indexing is slow. I am just wondering if a few bots are doing the same thing and slowing down the rest of the bots.

Swamper

4:18 am on May 20, 2003 (gmt 0)

10+ Year Member



Emailed them twice.
What is a spider trap?

Critter

4:25 am on May 20, 2003 (gmt 0)

10+ Year Member



A spider trap is a logical quirk in a web site that causes a spider to re-crawl the same pages because it perceives them as new pages.

Peter

Swamper

4:29 am on May 20, 2003 (gmt 0)

10+ Year Member



I take it that this is not a good thing, obviously.

Is there anything in particular I can look for to correct the problem?
The site has been online for a year and a half. It was redesigned over two months ago, and I have never seen this happen before.

It is a PHP site. I have some links at the bottom of the pages to stories that would otherwise have long URLs. The pages they link to are simple URLs, e.g. /53.php, that use includes to pull in the content from the longer URLs.

Any ideas?

Clark

5:34 am on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you so choose, sticky me the URL and I'll check it out and see if I can find something.

Skylo

6:14 am on May 20, 2003 (gmt 0)

10+ Year Member



Swamper, is your code validated? I read quite a cool thread the other day. Make sure your code is clean; otherwise there are more likely to be spider traps.

Swamper

2:43 pm on May 20, 2003 (gmt 0)

10+ Year Member



Do you have the URL for the thread that you saw?

trillianjedi

2:45 pm on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds to me like you are issuing googlebot with a session ID.....

TJ

creative craig

2:54 pm on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Invalid HTML wouldn't cause a spider trap; loads of sites have invalid code and this type of thing does not happen to them. The problem would lie elsewhere.

trillianjedi

3:03 pm on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds to me like you are issuing googlebot with a session ID.....

Sorry, I'll clarify now that I have a little more time, as from what you said in your original post I'm 99% certain this is your problem.

Your rewritten short URLs contain session IDs (are you using PostNuke?).

This means that each hit from a bot creates a new SID in the URL, so the bot thinks each time that it's a new link...

You need to amend your PHP scripting to check the user agent (in this case it's "Googlebot") and remove the SID from those rewritten URLs.
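To give a rough idea, here is a minimal sketch of that user-agent check. This is an assumption-laden example, not phpNuke's actual code: it assumes a fairly standard PHP 4 setup and that it runs before session_start() is called, and the exact file to patch depends on your install.

```php
<?php
// Sketch only: if the visitor looks like Googlebot, stop PHP from
// propagating the session ID through rewritten URLs.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stristr($ua, 'googlebot')) {
    // Don't append ?PHPSESSID=... to links for this request...
    ini_set('session.use_trans_sid', 0);
    // ...and accept session IDs from cookies only, so a SID pasted
    // into a crawled URL can't resurrect or spawn a session.
    ini_set('session.use_only_cookies', 1);
}
?>
```

Normal browser visitors still get sessions as usual; only requests whose user agent contains "googlebot" lose the URL-based SID.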

hth,

TJ

Swamper

3:25 pm on May 20, 2003 (gmt 0)

10+ Year Member



You are close with the PostNuke guess; we are actually using phpNuke. The only thing I can see that may have happened is that the Amazon block that came with the install somehow got turned on. It wasn't on before this.

Once we turned the block off, the bot seemed to subside and is not going crazy anymore.

When viewing the logs, there is a session ID on each of the bot's crawled pages, and it was trying to crawl all of the pages. It was even trying to crawl extinct pages.

I don't think it is our short-URL pages. It looks to be the Amazon affiliate block that may have done the job.

I would be interested in learning how to turn off session IDs for Googlebot. Is there a web site (or quick explanation) that may help us with this?

Thanks

trillianjedi

3:34 pm on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Keep an eye on it anyway; sometimes with spider traps it subsides and then goes crazy again.

I'd be very surprised if the Amazon block in phpNuke caused this, unless you had rewritten the URLs. How many variables are in the Amazon block URLs?

The way you described the problem, I'm still convinced it's a SID issue.

I can't remember now how to remove the SIDs for a particular user agent, but I do know that I got the answer in these very forums!

It may be worth looking at my profile and the threads I have been involved in, going back about a month (if you can do that?). Alternatively, search this forum for "googlebot session ID" or something similar.

I had a similar problem: PostNuke with some rewritten URLs. Although our problem was subtly different from yours, the basic cause was session IDs.

The rest of your phpNuke site, including the Amazon block, I don't believe can cause this problem, because those URLs contain at least two variables.

TJ

Swamper

3:44 pm on May 20, 2003 (gmt 0)

10+ Year Member



Looking deeper into the server logs, it looks like the bot got stuck on the phpBB2 hack for phpNuke.
It is something with the session ID and the forum.

Does that make more sense?

trillianjedi

3:49 pm on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, makes perfect sense.

I guess the phpBB2 hack (I don't actually know it) makes short spiderable URLs like:

www.domain.com/forum_index_view_2.html

Or something similar?

It's also serving up a session ID, and Googlebot is going round in circles.

I'm not a PHP coder so I can't really help you from a coding point of view, but I know if you search the net or this forum you will find the way of removing session IDs from URLs by user agent.

That's what you need to do. And if you pay for your bandwidth, do it fast, because I can guarantee that the bots will be back; they think they have not finished the job yet!

TJ

Swamper

4:20 pm on May 20, 2003 (gmt 0)

10+ Year Member



Thanks for all the help.
Fortunately, I work at the ISP where the site is hosted; I am lucky that way. I reported the problem to the boss last night, so he is well aware that it wasn't actual bandwidth.

I will post back here when I find a good solution.

Thanks again

trillianjedi

4:26 pm on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, do. Or just link to the thread on here where I got the solution a month ago, when you find it... ;-)

TJ

Swamper

7:17 pm on May 20, 2003 (gmt 0)

10+ Year Member



After much searching, I did find something unrelated that seems to have fixed the problem.

It looks like the phpBB2 mod for phpNuke was the cause.
This is the first time I have seen Googlebot actually get stuck on session IDs there.

The server log was showing Google running around in the forum, each hit with a new session ID. I found a post on phpBB.com that seems to have somewhat fixed the problem. I had to modify a file so that it checked whether Googlebot was the user agent, then disabled session IDs for it.

[phpbb.com...]

Although the hack is only supposed to make it easier for Google to index the forums, it seems to have worked in my case to stop the trap.

I am still looking for a possible .htaccess hack that might do the same thing as the above, though, as it would be nice to make the whole site Google-friendly with no session IDs.

Thanks

Swamper

7:36 pm on May 20, 2003 (gmt 0)

10+ Year Member



It has also been brought to my attention that Google might look at the changes as a cloaking device.

My reply is a question: if Google penalizes my site for getting stuck with 80,000 session-ID'd pages of indexing, and bogs down the site at the same time, am I better off having my forums penalized for cloaking instead?

trillianjedi

7:57 pm on May 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You won't get penalised for the Session ID thing so I wouldn't worry about that - you fixed it and that's the main thing.

I can't answer the cloaking question though I'm afraid.

TJ

Swamper

7:59 pm on May 21, 2003 (gmt 0)

10+ Year Member



Well, I got a reply from Google. They apologized for the server load and asked if I wanted Googlebot to stop indexing my site. Of course I told them not to stop!

Here is their reply about session IDs and stopping Googlebot from calling a session ID.

Shutting off session IDs for Googlebot will allow for a more efficient crawl of your site by our robots. In addition, shutting off session IDs should have no effect on the ranking or the inclusion of your site in the Google index.

In addition, the Google index does include pages that have question marks in their URLs, including dynamically generated pages. However, these pages comprise a very small portion of our index. Pages that contain question marks in their URLs can cause problems for our crawler and may be ignored. If you suspect that URLs containing question marks are causing your pages not to be included, you may want to consider creating copies of those pages without question marks in the URL for our crawler. If you do this, please be sure to include robots.txt on the pages with the question marks in the URL that block our crawler to ensure that these pages are not seen as having duplicate content.

I hope that this will help some of you. I am still trying to see if I can find some code to put into an .htaccess file that will disable PHP session IDs for the Googlebot user agent.
If anyone knows of a way to do this, please let me know.
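For what it's worth, a sketch of what that .htaccess might look like. This assumes PHP runs as an Apache module (php_flag is ignored under CGI), that mod_rewrite is loaded, and that the host's AllowOverride settings permit these directives:

```apache
# Site-wide: stop PHP from rewriting URLs to carry the session ID.
php_flag session.use_trans_sid off

# .htaccess alone can't make php_flag conditional on the user agent,
# but mod_rewrite can flag Googlebot requests with an environment
# variable for the PHP side to inspect:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [E=NO_SID:1]
# ...then in PHP, before session_start():
#   if (getenv('NO_SID')) { ini_set('session.use_trans_sid', 0); }
```

The first directive on its own gives the "site-wide Google-friendly, no session IDs in URLs" behaviour; the mod_rewrite part is only needed if you want to restrict it to Googlebot.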

shrirch

10:08 am on May 22, 2003 (gmt 0)

10+ Year Member



The Amazon block is *NASTY*.

For every book you've added, the block shows another three "also purchased by other users" books. It also allows you to search for other books by the name of the author or publisher.

This ends up wasting a ton of bandwidth, and not just on Googlebot, which sucks down every book, every author and their books, and more books purchased by people who bought those books... you get the picture.

See if you can use robots.txt to nail this one down. I'm not sure an *Amazon*-type disallow would stop it.
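Something along these lines might be worth trying, assuming the block's pages live under a module path you can identify in your access logs. The path below is a guess, not the real phpNuke Amazon block URL:

```
# Hypothetical robots.txt entry -- replace the path with whatever
# the Amazon block actually shows in your access logs.
User-agent: *
Disallow: /modules.php?name=Amazon
```

Classic robots.txt matching is a simple prefix match, so whether this catches every query-string variant depends on the crawler; check the logs afterwards to see if Googlebot is still fetching those URLs.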

Jesse_Smith

1:17 am on May 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



:::In addition, the Google index does include pages that have question marks in their URLs

I think that is outdated BIG time! Google has 8,420 listings of my message boards, and about half of the URLs have at least one ? in them. GoogleGuy himself says they index cgi/pl and php files.

Swamper

3:56 pm on May 30, 2003 (gmt 0)

10+ Year Member



Well, I thought we had it figured out, but Googlebot completely brought down our site and the server today.
I shut the site down completely and moved all the pages into a different folder.

C'est la vie.

trillianjedi

4:01 pm on May 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds to me like Googlebot still has those session IDs stored and is coming back to crawl them.

Can you check your logs (I'm sure they'll be huge by now) and look for session IDs in the hits from Googlebot?

Then let us know,

TJ