Forum Moderators: open
Put a new site up 30 hours ago with one PR5 link to it. Gbot crawled it and the cache showed up within 12 hours. A new fresh tag appeared 8 hours after that.
In that time it requested robots.txt 4 times and is still crawling a directory that's off limits.
Whatever this 'panic crawling' is, they'd better get their act together.
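For reference, the off-limits directory is blocked with nothing more exotic than a standard exclusion rule (the directory name here is a placeholder, not the real one):

    User-agent: Googlebot
    Disallow: /offlimits/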
Why are they hitting everybody that hard? With billions of websites in their index, one could assume they'd easily be able to spread the load over enough websites that a single site doesn't get hit so hard at any one point in time.
If they spider a site with 1 request every 2 seconds, they'd still be able to get > 2.5 million pages over 24 hours from a single site. Why request 5-6 pages per second from the same website? Are we at the point where there are more webcrawlers than webservers?
I really hope this won't become standard behaviour. Some sites have scripts that prevent people from downloading complete sites by serving 503s if too many requests come in during a short time period...
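Those scripts don't have to be fancy, either. A minimal sketch of the idea (Python purely for illustration; the window size and limit are made-up numbers):

    import time
    from collections import defaultdict, deque

    WINDOW = 10   # hypothetical: sliding window, in seconds
    LIMIT = 20    # hypothetical: max requests allowed per window

    recent = defaultdict(deque)   # client IP -> timestamps of recent hits

    def should_serve_503(ip):
        """True once a client exceeds the limit and should get a 503."""
        now = time.time()
        q = recent[ip]
        q.append(now)
        # drop timestamps that have aged out of the window
        while q and now - q[0] > WINDOW:
            q.popleft()
        return len(q) > LIMIT

A crawler doing 5-6 requests a second trips a limit like that almost immediately.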
If they spider a site with 1 request every 2 seconds, they'd still be able to get > 2.5 million pages over 24 hours from a single site.
Huh?
There are 86,400 seconds in a day. One page every two seconds would be 43,200 pages in a day, not 2.5 million.
You're a little off. :)
I think the keyword here is site and not page?
He said spider a site with a single request every 2 seconds. If 43,200 sites were spidered at an average of 57 pages per site, that would equal about 2.5 million pages.
Having said that, I myself am a little confused about the statement and its true meaning :-\
I presume an entire site cannot be indexed through a single request... :-)
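Either way, the arithmetic is easy to check (the 57 pages/site average is just the figure that makes the two numbers meet):

    SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
    requests_per_day = SECONDS_PER_DAY // 2   # one request every 2 seconds

    print(requests_per_day)        # 43200 -- pages/day from a single site
    print(requests_per_day * 57)   # 2462400 -- roughly the 2.5 million quoted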
Googlebot blasted my site yesterday after not visiting it since the 14th of Sept. We were starting to get a bit worried, but I kept coming here and reading about others getting the same treatment (no visits, or very few, then out of the blue a very deep crawl).
Just wanted to chime in and more or less introduce myself to everyone and share what Googlebot did to our site today.
Googlebot is trying to spider pages that don't exist on the domain but do exist on another domain on the same server, and thus the same IP, I think.
I can even see this happening between my sites.
Site 1 is getting requests for documents which only exist on Site 2, and Sites 1 and 2 are on completely different servers, with different ISPs, locations, and IP ranges.
What I can say is that the links are of the type: link.php?url=www.abcd.com
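If you want to double-check from the outside, a quick sketch with Python's standard library does it (hostname and path are placeholders, not my real sites):

    import http.client

    # Ask Site 1 for a path that only exists on Site 2.
    conn = http.client.HTTPConnection("www.site1.example")
    conn.request("GET", "/some-site2-only-page.html")
    resp = conn.getresponse()
    print(resp.status)   # 404 means the bot is guessing, not following real links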
Provided that the bots were inappropriately attributing these dynamic redirects as belonging to site 1 rather than site 2, where the content actually resides... the bots now have to determine *if* the page is on site 1 or not. I suspect the site 1 serp will be delisted and *hopefully* site 2 will now show the serp *and* get the appropriate p.r. transfer it is rightfully due from the incoming link from site 1.
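For anyone unfamiliar with the pattern, a link.php?url=... redirect usually amounts to no more than this (sketched as a Python CGI purely for illustration; the real script would be PHP, and the 301-vs-302 point is my own assumption about what matters for attribution):

    import os, sys
    from urllib.parse import parse_qs

    # Pull the target out of the query string, e.g. link.php?url=www.abcd.com
    query = parse_qs(os.environ.get("QUERY_STRING", ""))
    target = query.get("url", [""])[0]

    # A 301 tells the bot the content permanently lives at the target;
    # a plain 302 leaves room to credit the content to the redirecting site.
    sys.stdout.write("Status: 301 Moved Permanently\r\n")
    sys.stdout.write("Location: http://%s\r\n\r\n" % target)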
Provided that the bots were inappropriately attributing these dynamic redirects as belonging to site 1 rather than site 2, where the content actually resides...
Just to confirm - the link from Site 1 is to Site 2's root only, not to any document on Site 2, but Gbot is looking on Site 1 for documents which exist on Site 2 but are not linked directly.
If I had a link to webmasterworld, I would be getting requests for the control panel, site search, glossary, etc.
A page modified on Sept 8th, and cached daily, showed up in the index within 48 hours for new search terms from the page's new content, but was still findable for search terms no longer on the page until only 3 days ago. Even though the cache reflected the new content, the snippet still contained the old content when running a search for the old terms.
On the day it stopped being findable for the old content, Googlebot arrived one hour earlier than its daily spidering time over the previous 3 weeks or more. Additionally, a fresh date appeared on the day of the change, even though the fresh content had been online, indexed, and cached daily for 3 weeks; until then, the new content result had not included a fresh date.
Over 10k pages cached, and 320MB of bandwidth used in 4 or 5 straight days of constant crawling. My entire website uses phpBB2 as a backend, and the forums reported this as of yesterday:
Most users ever online was 238 on 27 Sep 2004 06:18 pm
So it's been going at me quite hard, it seems. Hopefully the new linking of the forums has caused the crawler to get my whole site, instead of the measly 35 pages it crawled before.
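For what it's worth, the per-page cost of that crawl works out sane (rough arithmetic on the numbers above; Python just as a calculator):

    pages = 10_000        # "over 10k pages cached"
    bandwidth_mb = 320    # reported over the 4-5 days
    print(bandwidth_mb * 1024 / pages)   # ~33 KB per page on average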
DS
Provided that the bots were inappropriately attributing these dynamic redirects as belonging to site 1 rather than site 2, where the content actually resides...
This is the crux of what is happening, I believe.
So it's been going at me quite hard, it seems. Hopefully the new linking of the forums has caused the crawler to get my whole site, instead of the measly 35 pages it crawled before.
I have the exact same scenario. I have a forum on one of my sites that has never been fully crawled. Recently every page in the forum was crawled. In the past, a few pages of the forum would actually make it into the SERPs, but only after a month or two. I check the SERPs now and it seems that about 30% of the forum pages are listed; that is way up from last month. Some of these pages are also very recent.
Stand by for the holiday cheers and jeers... if history is any indication, Google's about to shake things up in a big way ;-)
My site has also moved up in many of our targeted results, which is nice. Hopefully I can track this change tonight and determine exactly where I moved up, whether anything moved down, etc.
DS