

Googlebot using excessive bandwidth

Googlebot is using a third of my bandwidth


nigelt74

2:39 am on Sep 5, 2006 (gmt 0)

10+ Year Member



Hi all

The problem
Listed below are my site stats for spiders/robots (this is just for the last 5 days):

Spider --Hits --Bandwidth
Googlebot --12192+16 --428.12 MB
MSNBot --60+31 --1.05 MB
Unknown robot (identified by 'spider') --39+33 --626.64 KB
Inktomi Slurp --25+23 --353.72 KB
Unknown robot (identified by hit on 'robots.txt') --0+10 --2.27 KB
Unknown robot (identified by 'crawl') --3+1 --64.23 KB
AskJeeves --1+1 --6.51 KB

As you can see, Googlebot is eating up a lot of bandwidth, and I would like to reduce this.

The site
First off, the site is ...
has a dynamically generated store
has phpBB on it
has a Coppermine photo gallery
and a web calendar

What I have done to fix the problem
To try to solve the problem:
I have used my robots.txt to block access to all unnecessary parts of the site (including images and the web calendar).
I have also put a separate robots.txt file in the phpBB directory to cut access to everything apart from those files I want indexed (index.php, viewforum.php, viewtopic.php).
I have also done the hack to remove the session ID from phpBB.
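
For reference, the two files look roughly like this (the directory and script names are simplified examples, not my exact paths). The root robots.txt:

User-agent: *
Disallow: /images/
Disallow: /calendar/

And the one in the phpBB directory:

User-agent: *
Disallow: /posting.php
Disallow: /profile.php
Disallow: /search.php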

I am at my wits' end. I don't want to affect my ranking, but this bandwidth usage is really quite high, accounting for about a third of the site's bandwidth on average (although this month it is about 50%).

Things I have heard about but am too chicken to try
Contacting Google
Apparently you can contact Google and they can dial down the crawling, but I have read that this can really adversely affect rankings.
Setting a crawl delay
How does this work? If I set it to a week, does that mean Googlebot will index the whole site and then disappear for a week, or does it mean it will crawl one page and then wait a week before crawling the next page? Sorry, this delay thing confuses me.

Any help would be greatly appreciated

Cheers
Nigel


tedster

3:42 am on Sep 5, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have also done the hack to remove the session ID from phpBB.

How long ago did you take this step? Is much of the spidering bandwidth still going to "sessionID" urls?

nigelt74

4:18 am on Sep 5, 2006 (gmt 0)

10+ Year Member




How long ago did you take this step? Is much of the spidering bandwidth still going to "sessionID" urls?

A bit over a day ago, and yes, a lot of the spidering is still going to URLs with session IDs. How long should it take for these sessions to die out? I had assumed it would be virtually instantaneous, because I thought Googlebot wouldn't retain the phpBB cookies (and therefore the sessions) from my site.

Thanks

nigelt74

6:15 am on Sep 5, 2006 (gmt 0)

10+ Year Member



OK, I have now purged my sessions table, so hopefully that will make a difference.
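
(In case it helps anyone else: by "purged" I just mean I emptied the table, something like the following, assuming the default phpBB table prefix:)

TRUNCATE TABLE phpbb_sessions;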

nigelt74

9:48 am on Sep 6, 2006 (gmt 0)

10+ Year Member



Well, that seems to have made no difference; it is still eating bandwidth, and it is ignoring the separate robots.txt file I put in the phpBB root.

Any help would be greatly appreciated.

motorhaven

12:46 am on Sep 7, 2006 (gmt 0)

10+ Year Member Top Contributors Of The Month



A robots.txt in a subdirectory is pointless; according to the spec, it won't be read. robots.txt must live in the URL root directory. Put your phpBB exclusion rules in your main (and, ideally, only) robots.txt file.

nigelt74

12:57 am on Sep 8, 2006 (gmt 0)

10+ Year Member



Thanks for that.

I actually appended the contents of my phpBB robots.txt file to my root robots.txt yesterday, and am hoping to see a change in traffic soon.

Cheers

nigelt74

2:02 am on Sep 8, 2006 (gmt 0)

10+ Year Member



Update:

I am still having Googlebot problems.

These are the latest stats:

Spider --Hits --Bandwidth
Googlebot --16938+25 --597.46 MB
MSNBot --122+61 --1.91 MB
Inktomi Slurp --60+53 --1.21 MB
AskJeeves --4+3 --140.58 KB (that's a record for Jeeves on my site, more than the previous two months put together :) )

I had thought the problem was the phpBB, as that was where the majority of the Googlebot hits were coming from, but now I am not so sure; in the last 4 hours Googlebot ate 50 MB of bandwidth.
Is there a way to work out which parts of the site are responsible for the bandwidth?

My alteration of the robots.txt file to exclude parts of phpBB is now being recognised, and none of the bots are trying to index any of the excluded parts.
However, session IDs are still appearing in a small portion of the Googlebot requests; will this gradually decrease over time?

When I look at my Webmaster Tools (part of the Google account thingie), I have an option to reduce the frequency of crawling carried out by Googlebot; will it affect my ranking if I do this?

jdMorgan

2:41 am on Sep 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I actually appended the contents of my phpBB robots.txt file to my root robots.txt yesterday, and am hoping to see a change in traffic soon

You can't do that, either. You must merge the per-robot records from both files into one file.

Review this doc: [robotstxt.org...]

Read it conservatively: If it doesn't say you can do something, then you cannot do it.
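
For example (the paths here are just examples), a robot obeys the first record that matches its user-agent, so appending one file to the other leaves you with two 'User-agent: *' records, and most robots will apply only the first:

User-agent: *
Disallow: /images/

User-agent: *
Disallow: /phpBB2/profile.php

Merged into a single record, both rules apply:

User-agent: *
Disallow: /images/
Disallow: /phpBB2/profile.php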

Jim

nigelt74

3:47 am on Sep 8, 2006 (gmt 0)

10+ Year Member



Sorry, I explained that badly; I have actually merged the two robots.txt files into one.

Bewenched

4:57 am on Sep 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Maybe you need to set content expiration in your HTTP headers and enable compression. It will save a lot of bandwidth.
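
In Apache, something along these lines in httpd.conf or .htaccess would do it; this assumes an Apache 2.x server with mod_expires and mod_deflate available, so adjust for your own setup:

<IfModule mod_expires.c>
# let robots and browsers cache static files instead of re-fetching them
ExpiresActive On
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType text/css "access plus 1 week"
</IfModule>

<IfModule mod_deflate.c>
# gzip-compress text responses on the fly
AddOutputFilterByType DEFLATE text/html text/plain text/css
</IfModule>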

nigelt74

2:16 am on Sep 9, 2006 (gmt 0)

10+ Year Member



How do I go about enabling compression and setting content expiration in the headers?

Does the meta tag "revisit" work?

jdMorgan

2:42 am on Sep 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Does the meta tag "revisit" work?

No. See the bolded phrase "completely worthless" in this FAQ: [code.google.com...]

That tag was invented and used by one small directory about ten years ago, and its use spread like wildfire among Webmasters who believed they could tell robots when to come back. It's like telling the tax man how much you'll pay and when... :)
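
For reference, this is the tag in question; the major search robots simply ignore it:

<meta name="revisit-after" content="7 days">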

Jim

jdancing

2:58 am on Sep 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bandwidth is cheap; just be happy Google is visiting :-)

theBear

3:38 am on Sep 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



By any chance, is Google still requesting pages with a session ID in the request? Even though you stopped phpBB from issuing them, Google will still have the old URLs in its database.
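
One way to hurry things along (only a sketch, untested; it assumes Apache mod_rewrite in an .htaccess file and that phpBB's parameter is literally "sid") is to 301-redirect Googlebot's sid requests to the same URL without the sid, so Google can collapse the duplicates:

RewriteEngine On
# only touch Googlebot, so human visitors without cookies keep their URL sessions
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
# query string = (leading pairs)(sid=hex value)(optional &)(the rest)
RewriteCond %{QUERY_STRING} ^(([^&]+&)*)sid=[0-9a-f]+&?(.*)$
# rebuild the URL from the pieces either side of the sid
RewriteRule ^(.*)$ /$1?%1%3 [R=301,L]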

nigelt74

4:19 am on Sep 9, 2006 (gmt 0)

10+ Year Member



Yes, some of the requested pages still have a session ID in them, probably around 20% of the phpBB requests made by Googlebot, but it does seem to be falling.
Any suggestions on how to eliminate these requests?

vincevincevince

4:24 am on Sep 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think you need to wait a few weeks. Google doesn't even re-read robots.txt every time it visits.

Bewenched

6:31 am on Sep 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Apparently you can contact Google and they can dial down the crawling, but I have read that this can really adversely affect rankings.

Any way to get them to pump up the volume for a site?