I wonder if this is going to affect my ratings at all.
It is also slowing down my server. I have emailed Google a couple of times now with no reply. I imagine they are quite busy at times.
Anything in particular I can look for to correct the problem?
The site has been online for a year and a half. It was redesigned over two months ago, and I have never seen this happen before.
It is a PHP site. I have some links at the bottom of the pages to stories that would otherwise have long URLs. The pages they link to are simple URLs, e.g. /53.php, that include the longer URLs.
Any ideas?
Sounds to me like you are serving Googlebot a session ID...
Sorry I'll clarify now I have a little more time, as from what you said in your original post I'm 99% certain this is your problem.
Your rewritten short URLs contain session IDs (are you using PostNuke?).
This means that each hit from a bot creates a new SID in the URL, so each time the bot thinks it has found a new link...
You need to amend your PHP scripting to check the user agent (in this case "googlebot") and remove the SID from those rewritten URLs.
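To illustrate the idea, here is a minimal PHP sketch - the placement before session_start() and the single-agent check are my assumptions, not phpNuke's actual code:

```php
<?php
// Sketch, assuming your own script controls when the session starts:
// skip sessions entirely for known crawlers so no PHPSESSID is ever
// appended to the rewritten URLs.
$agent  = isset($_SERVER['HTTP_USER_AGENT'])
        ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
$is_bot = (strpos($agent, 'googlebot') !== false);

if ($is_bot) {
    // Belt and braces: even if something starts a session later,
    // stop PHP from rewriting URLs to carry the SID.
    ini_set('session.use_trans_sid', '0');
} else {
    session_start(); // normal visitors keep their session
}
?>
```

Since real bots don't accept cookies anyway, skipping session_start() for them costs you nothing.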
hth,
TJ
Once we turned the block off, the bot seemed to subside and is not going crazy anymore.
Viewing the logs, there is a session ID on each of the bot's crawled pages, and it was trying to crawl all of the pages. It was even trying to crawl extinct pages.
I don't think it is our short URL pages. It looks to be the Amazon affiliate block that may have done the job.
I would be interested in learning how to turn off session IDs for Googlebot. Is there a web site (or quick explanation) that may help us with this?
Thanks
I'd be very surprised if the Amazon block in phpNuke caused this - unless you had rewritten the URLs. How many variables are in the Amazon block URLs?
The way you described the problem, I'm still convinced it's a SID issue.
I can't remember now how to remove the SIDs for a particular user agent, but I do know that I got the answer in these very forums!
It may be worth looking at my profile and the threads I have been involved in, going back about a month (if you can do that?). Alternatively, search this forum for "googlebot session ID" or something similar.
I had a similar problem - PostNuke with some rewritten URLs. Although our problem was subtly different to yours, the basic cause was session IDs.
I don't believe the rest of your phpNuke site, including the Amazon block, can cause this problem, because those URLs contain at least two variables.
TJ
I guess the phpBB2 hack (I don't actually know it) makes short spiderable URLs like:
www.domain.com/forum_index_view_2.html
Or something similar?
It's also serving up a session ID, and Googlebot is going round in circles.
I'm not a PHP coder so can't really help you from a coding point of view, but I know that if you search the net or this forum you will find a way of removing session IDs from URLs by user agent.
That's what you need to do. And if you pay for your bandwidth - do it fast, because I can guarantee that the bots will be back - they think they have not finished the job yet!
TJ
It looks like the phpBB2 mod for phpNuke was the cause.
This is the first time I have seen Googlebot actually get stuck on session IDs there.
The server log was showing Google running around in the forum, each request with a new session ID. I found a post on phpBB.com that seems to have somewhat fixed the problem. I had to modify a file so that it checked whether googlebot was the user agent, then disabled session IDs for it.
[phpbb.com...]
Although the hack is only supposed to make it easier for Google to index the forums, in my case it also worked to stop the trap.
I am still looking for a possible .htaccess hack that would do the same thing, though, as it would be nice to make the whole site Google-friendly with no session IDs.
Thanks
My reply is a question. If Google penalizes my site for getting stuck indexing 80,000 session-ID pages, and the crawl bogs down the site at the same time, am I better off having my forums penalized for cloaking?
Here is their reply about session IDs and stopping Googlebot from being given one.
Shutting off session IDs for Googlebot will allow for a more efficient crawl of your site by our robots. In addition, shutting off session IDs should have no effect on the ranking or the inclusion of your site in the Google index.
In addition, the Google index does include pages that have question marks in their URLs, including dynamically generated pages. However, these pages comprise a very small portion of our index. Pages that contain question marks in their URLs can cause problems for our crawler and may be ignored. If you suspect that URLs containing question marks are causing your pages not to be included, you may want to consider creating copies of those pages without question marks in the URL for our crawler. If you do this, please be sure to include robots.txt on the pages with the question marks in the URL that block our crawler to ensure that these pages are not seen as having duplicate content.
I hope that this will help some of you. I am still trying to see if I can find some code to put into an .htaccess file that will disable PHP session IDs for the googlebot USER_AGENT.
If anyone knows if there is a way to do this, please let me know.
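For what it's worth, one blunt sketch - untested on your setup, and it assumes Apache running PHP as a module so php_flag works in .htaccess - is to stop PHP putting SIDs into URLs for everyone, since bots don't accept cookies anyway:

```apache
# .htaccess sketch: never embed PHPSESSID in URLs, use cookies only.
# Crawlers reject cookies, so they simply get no session at all.
# Note: these lines apply to ALL visitors -- .htaccess alone cannot
# vary PHP settings by USER_AGENT; a per-agent check has to live in PHP.
php_flag session.use_trans_sid off
php_flag session.use_only_cookies on
```

If your sessions actually rely on URL rewriting for cookie-less human visitors, this will log those visitors out, so test before deploying.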
With every book you've added, the block shows another three "also purchased by other users" books. It also allows you to search for other books by the name of the author / publisher.
This ends up with a ton of wasted bandwidth, and not just for Googlebot, which sucks down every book / author and their books and more books purchased by people who bought those books... you get the picture.
See if you can use robots.txt to nail this one down. I'm not sure an *Amazon*-type disallow would stop it.
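If you do try robots.txt, something along these lines might work - the module path is a guess on my part, so check your logs for the real Amazon block URLs first:

```
User-agent: *
# Hypothetical path -- replace with whatever your Amazon block links to.
Disallow: /modules.php?name=Amazon
```

robots.txt rules match by URL prefix, so this blocks every URL starting with that string; make sure no legitimate pages share it.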