| This 72 message thread spans 3 pages |
|How long for Google to spider 29.5 million pages?|
| 9:22 pm on Sep 10, 2007 (gmt 0)|
We launched a new site about 2 weeks ago that has about 29.5 million unique, search engine friendly, URLs.
Since we launched, Googlebot has visited us 22 times and has hit just over 2,000 files.
My question is, how long do you think it would take Google to spider the entire site and get them into their index?
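A rough back-of-the-envelope extrapolation from the figures in this post (a sketch only: it assumes the observed crawl rate stays constant, which it almost certainly won't, since Google adjusts crawl rate as it discovers links and assesses a site):

```python
# Naive extrapolation of total crawl time from the numbers quoted above.
# Assumption: the crawl rate observed in the first two weeks never changes.

pages_total = 29_500_000   # unique URLs on the site
pages_crawled = 2_000      # pages Googlebot has fetched so far
days_elapsed = 14          # roughly two weeks since launch

pages_per_day = pages_crawled / days_elapsed
days_remaining = (pages_total - pages_crawled) / pages_per_day
years_remaining = days_remaining / 365

print(f"~{pages_per_day:.0f} pages/day -> ~{years_remaining:.0f} years to finish")
# -> ~143 pages/day -> ~566 years to finish
```

At the launch rate, in other words, the crawl would take centuries; in practice the rate would have to rise by several orders of magnitude, which is exactly what the replies below say depends on trust, links, and content quality.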
| 3:18 pm on Sep 13, 2007 (gmt 0)|
Auto-generated does not necessarily mean spam or non-unique.
There is a ton of data offline that still isn't online. Imagine just collecting all openly published scientific papers, and re-publishing them online, page by page. A ton of pages? Check. Pretty damn unique content? Check.
What about Project Gutenberg? Also a ton of unique content.
Still - the local space is pretty much as hard as you can get. Yellowpages.com has roughly 300k pages indexed. Superpages.com otoh has 1.9 million pages. And it isn't always PR - Yelp has 300k pages with a PR7, whereas Yellowbot has 450k with only a PR5. How you implement it can make a huge difference too.
| 4:03 pm on Sep 13, 2007 (gmt 0)|
|Moti: "i push about 2,000 articles per month = 24,000 per year" Is that all your content, written by a real person and exclusive to you? Or is it, as one suspects, regurgitated? |
written by real people. press material, newsletters, user generated content. agency releases aren't unique by default, but they undergo a quality check and some editorial work. that's enough to constitute a worthwhile search engine result. if it wasn't, google would devalue my pages. if i had more time, i'd put out tenfold the amount. my sector is highly competitive - we're talking about a news site here. you have to deliver a serious amount of fresh useful content at a constant rate to your users to keep them on the platform.
| 6:26 pm on Sep 13, 2007 (gmt 0)|
A poster at Search Engine Roundtable by the name of Matt Cutts had this to say on September 13, 2007 at 13:45...
|29.5 million pages? Wow. It must have taken forever for them to write that many pages. :) |
| 6:27 pm on Sep 13, 2007 (gmt 0)|
Lol, this is a naive statement :)
| 6:45 pm on Sep 13, 2007 (gmt 0)|
Evidently Yahoo Local must have also written its 6 million pages in the index.
| 7:01 pm on Sep 13, 2007 (gmt 0)|
I think Tedster's post (the second in this thread) is well worth re-reading.
It all comes down to statistical probability: If you're going to dump 29.5 million pages onto the Web, the burden of proof will be on you to convince the search engines that your content is worth indexing, because the odds that most of your pages contain useful and unique content aren't very high.
| 6:54 am on Sep 14, 2007 (gmt 0)|
|29.5 million pages? Wow. It must have taken forever for them to write that many pages. |
I guess the same can be said for Google, Google News, Google Images, Google Search Results, Google Books, Cached Pages in Google Search, etc. (Just poking fun - Nobody get their Google Underoos in a bunch please).
|It all comes down to statistical probability: If you're going to dump 29.5 million pages onto the Web, the burden of proof will be on you to convince the search engines that your content is worth indexing, because the odds that most of your pages contain useful and unique content aren't very high. |
Sums it up right there.
| 1:22 pm on Sep 14, 2007 (gmt 0)|
AhmedF, the numbers you have mentioned are not correct for the regular index; more than 60% of those pages are in the supplemental index and effectively worthless.
| 4:52 pm on Sep 14, 2007 (gmt 0)|
|Lol, this is a naive statement :) |
I think Matt Cutts was being sarcastic, rather than naive. He knows what goes on and I suspect he's implying that those pages are auto-generated and probably worthless.
| 2:51 am on Sep 15, 2007 (gmt 0)|
|Someone here commented on my post with: |
"There was another post earlier in the thread, claiming that some millions of pages, although less than 29,000,000, were also original content. What is that, a joke?"
No joke (although I said it was reasonably original content). The difference is in how the data is assembled and presented along with some tools for the user which I haven't seen anywhere else. I believe this makes the "reasonably original" content into original content.
That would be me :-)
I'll try to explain it the way I see it: Even though I'm a "webmaster" with a few sites (one of which, the main site, totally rocks and brings in loads of traffic), I'm also a user of the internet. It seems whenever I'm trying to find information that's important to me, I have to wade through many pages that offer nothing, that just get in the way of the real content. This has been going on for years, of course, and for the last while Google has been complicit because of the MFA side of things.
So, when I read about 29 million pages going up for a brand new site, I see it as 29 million more pages that will come between me and what I'm trying to find. The original poster's website is not going to be Gutenberg, or anything similar - it's going to be "yellow pages". I do not believe for a moment that it will offer anything original - it's just there to scoop traffic from websites that actually deliver the real goods. The other one mentioned with however many million, same deal. With regard to the post about the 2,000 pages a month: For sure, I can see that it's totally legit. But 29,000,000, or even 6,000,000? No, man. Even if you were Hyundai and putting up a separate page for every nut and bolt in every car, it still wouldn't be that much.
| 3:01 am on Sep 15, 2007 (gmt 0)|
|...I posted a question about an idea I had for the site and it was immediately squashed by much of the same small mindedness I see here today.... |
I looked at all the threads you posted in since 2002 and saw no evidence of anyone "squashing" any idea that you had, except for the idea that you had back in 2003 that Alexa rankings really meant anything.
Which is beside the point, since we are talking apples and widgets here.
Yes, it IS possible to have 30 million pages with content. But it is highly unlikely that more than 5% of those will ever see the light of day on the search engines. And the question is, what kind of content? Something like Project Gutenberg is not the same as having 30 million auto-generated pages. Nor are scraped RSS feeds, Wikipedia content, or phone book listings.
| 9:42 pm on Sep 23, 2007 (gmt 0)|
The answer I think is NO!
With Google's new wand of power waving about, there is a trust issue now with big instant sites... over a couple thousand pages...
[edited by: Robert_Charlton at 10:38 pm (utc) on Sep. 23, 2007]
[edit reason] removed specifics [/edit]