I have a new website, and generally speaking, I'm new to webmastering. I'm a software developer and have built many web tools; I've just never before administered a site for general consumption... so, though I've read lots, I've never personally experienced implementing SEO techniques, etc...
A Brief History
My wife has started a small business and we have built a corresponding website. We have never submitted anything directly to any SE. The only point of exposure has been through a single link from the local chamber of commerce site for our town. Though our site is still not complete, over the last 3-4 months, several search engines have found us and have started indexing us.
At first it was exciting to see the engines slowly dig deeper into our site but now I'm starting to see a trend that I would like to have more control over.
Google has now reached a frenzied rate of indexing our site every single day. (Though I have indicated that our site is not complete, really, it's 90% done; I have only yet to implement online transaction services... our products, forum, etc... are all functioning.) This means that Google is hammering about 1500 pages daily (dynamically driven product descriptions, etc...). This is leaving me with mixed feelings... I want Google to know we exist, but our content is only updated about once every week to two weeks. Likewise, I suspect that by the time we level out adding new products to our inventory, we will have approximately double our current number of unique products (we currently have about 1200). Also, a specific product will never have its description or image changed substantially.
Is there a way for me to limit Google's spidering of our site to once a week, without suffering some type of penalty?
OR...
What if I detected whether or not it was Google's spider and then served it identical pages minus the images (non-malicious cloaking)? Would Google penalize me for that?
I'm OK with Google spidering us, but either it doesn't need our images or it doesn't need to be doing it every day. I've already edited our robots.txt to exclude our images folder, and I've also completely banned Google Images from indexing us (Googlebot-Image). This appears to be making no difference. My assumption is that though Google will not crawl directly into our images folder, any pages using resources within it still get delivered.
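For reference, the relevant robots.txt entries look roughly like this (with /images/ standing in for our actual images folder name):

User-agent: Googlebot-Image
Disallow: /

User-agent: *
Disallow: /images/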
Anyone got any suggestions?
Thanks
PebĪ
If that doesn't work, you might look into mod_throttle for Apache, if you're using that webserver. The purpose of that module, afaik, is to throttle users that consume too many resources.
Good luck.
If your site is based on PHP, have a look at this Content-Negotiation class [webmasterworld.com]. It will help you implement the If-Modified-Since (IMS) and related headers that will keep bandwidth/server load within reasonable limits.
I will just stand by and wait a week or two before I try anything else.
Oh, my site is running on an IIS webserver (asp.net). Currently we're using a PHP forum but that will soon be changed over to a .NET based forum. The crazy indexing is happening on my aspx pages.
PebĪ
The crazy indexing is happening on my aspx pages.
Nothing against .ASP pages, btw (although my site is Linux, and I do not have the experience to help you directly - try a question in the MS forums on this site for help). All dynamic sites have the same problem. Comments below are based on Apache + PHP, but are the same for all dynamically-produced/database pages; it is the solutions which change, not the problem.
All webservers are set up by default to provide a Status 304 Not Modified response for a static HTML page if (and only if) the client sends an If-Modified-Since request for a previously-requested page and the file has not changed since. A 304 Response means that only a page-header is sent (no page-body), which reduces bandwidth astronomically. There are also numerous other content-negotiation Request-Headers (If-Unmodified-Since, Accept-Encoding, If-None-Match - it goes on & on), all of which exist for the same bandwidth-saving reasons.
As said above, *all* webservers transparently perform this negotiation, by default, for static files. All webservers do NOT do *any* such negotiation for dynamic pages - you are expected to place the coding into the pages yourself. Almost no-one does this. Consequently, every web-client sees each page as brand-new each time it is requested. Thus, no content-negotiation, no bandwidth-savings.
The end-result of the above is that (in the absence of any content-negotiation) every single byte of every single page on your site will be re-requested by every single search-bot every single day. Your bandwidth will go through the roof.
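To make the idea concrete, here is a bare-bones sketch in PHP of what If-Modified-Since handling looks like for a dynamic page. The Content-Negotiation class linked above does considerably more than this; last_change_time() is just a stand-in for however you determine when the page's content last changed.

<?php
// Bare-bones If-Modified-Since handling for a dynamic page.
// last_change_time() is a stand-in: return a Unix timestamp for
// whenever this page's content last changed (e.g. the product
// row's last-updated column).
$lastModified = last_change_time();

// Always tell the client when the page last changed.
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
    $since = strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']);
    if ($since !== false && $since >= $lastModified) {
        // Nothing has changed: send a 304 header only, no page body.
        header($_SERVER['SERVER_PROTOCOL'] . ' 304 Not Modified', true, 304);
        exit;
    }
}

// Otherwise fall through and build the full page as normal.
?>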
My AWStats robot.pm file currently lists 417 different bots, 62 of which were active on my site in February 2006.
Erm, perhaps you are beginning to think that content-negotiation may be a good idea?
I have some information that appears to contradict one of your points though... (I don't have my stats in front of me so I'm going from memory and they are rounded.)
Google is the only robot that is nailing me. (I only have 5 regular spiders right now. As explained above, we actually haven't even submitted to anyone yet.)
Google has consumed 7200 pages and approx 300 MB in just the last week. The second highest bot (which has been indexing us longer than Google) has only consumed 300 pages, and the average size of those pages is under half the size of the pages taken by Google.
This seems to tell me that the other bots are at least caching portions of the pages they are indexing. This is why I viewed this as a "Google" issue but at this point I'm willing to concede that it may in fact be a MS/aspx problem.
Anyway, that's enough speculation. I will definitely dig into this. Our site is still small enough and has low enough traffic that I can micro-manage it quite easily. If this really is what's happening, it would be excellent to rectify it right at these early stages, before it's a big problem.
If we could award points, here's 100 for you!
PebĪ
Google has consumed 7200 pages and approx 300 MB in just the last week. The second highest bot (which has been indexing us longer than Google) has only consumed 300 pages, and the average size of those pages is under half the size of the pages taken by Google.
It depends on which (other) bot and--surprisingly--which Google-bot.
Let's pick out 2 Request-headers which can each have a vast impact on bandwidth:
Accept-Encoding:
Depending on the web-server setup (this one is not usually set up by default although, IMHO, it should be), this can reduce bandwidth on text/html pages (not graphics) by more than 80%. It works on dynamic as well as static pages (the PHP-Class for Content-Negotiation [webmasterworld.com] will give better compression than a web-server, since it is both dynamic and load-balanced). It depends on the client sending the Request-Header and the web-server being able to respond to that header. Most browsers do send the header, but many bots do not. Where Google is concerned, the original G-bot does not send the header, whereas the Mozilla G-bot [webmasterworld.com] does send it.
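A rough PHP illustration of the principle (the linked class is smarter about this, since it balances compression against server load; ob_gzhandler below is just the stock PHP output handler):

<?php
// Only compress when the client advertises gzip support in its
// Accept-Encoding request header. ob_gzhandler performs this check
// itself, but the test is shown explicitly to make the point.
if (isset($_SERVER['HTTP_ACCEPT_ENCODING'])
        && strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== false) {
    // Compresses the output buffer and sets Content-Encoding: gzip.
    ob_start('ob_gzhandler');
} else {
    ob_start();
}

// ... build and echo the page as normal ...
?>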
If-Modified-Since:
Naturally, a browser or a bot needs to have already requested a page before it can send this Header. The key crunch with this Header comes when you consider how often a visitor may use the Back button (this Request-Header is sent every time it is used). If it is a dynamic page, and your webserver therefore cannot respond to the header, the full page will be re-sent. Where Google is concerned, at times preceding an update (ie now) it often makes page-requests without ever sending the header, deliberately (just to "make sure"?), although in normal times all bots will send the header, since it is in their interests to do so. It is this header--or, rather, the inability of a webserver to respond to this header for dynamic pages--that leads to webmasters reporting that Google is downloading their multi-thousand-page site every week, even though hardly any pages have changed.
Finally, just to try and give some inkling of how desperate Google is at this moment to get every webpage that it can, I prevent any and every visitor from getting more than 1,000 pages/day [webmasterworld.com] on my site, serving up a 503 Server Busy if they exceed this total. In the last 28 days Google has received 28,000 200s and 250,000 503s from my server!
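I won't reproduce my actual implementation here, but the principle is simple enough to sketch in PHP. The per-IP counter storage is whatever you like; requests_today() and count_request() below are just placeholders for it.

<?php
// Sketch of a per-visitor daily page cap. requests_today() and
// count_request() are placeholders for however you store the
// counts (database table, flat file, etc.).
$ip = $_SERVER['REMOTE_ADDR'];

if (requests_today($ip) >= 1000) {
    // Over the limit: send a 503 and stop, with no page body.
    header($_SERVER['SERVER_PROTOCOL'] . ' 503 Service Unavailable', true, 503);
    header('Retry-After: 3600'); // suggest the bot tries again later
    exit;
}

count_request($ip);

// Otherwise serve the page as normal.
?>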
PS
Another 100 points, please.