| This 72 message thread spans 3 pages: < < 72 ( 1  3 ) > > || |
|How long for Google to spider 29.5 million pages?|
length of Google spider
We launched a new site about 2 weeks ago that has about 29.5 million unique, search engine friendly, URLs.
Since we launched, Googlebot has visited us 22 times and has hit just over 2,000 files.
My question is, how long do you think it would take Google to spider the entire site and get them into their index?
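For a rough sense of scale, the figures in the post above can be extrapolated. This is only a back-of-the-envelope sketch: it assumes Googlebot keeps fetching at the average rate seen so far (about 2,000 URLs in about 14 days), which is unlikely to hold, and it says nothing about how many fetched pages end up in the index. The function name and the rounded inputs are illustrative assumptions, not anything from the poster.

```python
def days_to_crawl(total_pages, pages_crawled, days_elapsed):
    """Days needed to fetch total_pages at the average rate observed so far."""
    pages_per_day = pages_crawled / days_elapsed
    return total_pages / pages_per_day

# Rounded figures from the post: ~2,000 files in ~2 weeks, 29.5 million URLs.
days = days_to_crawl(29_500_000, 2_000, 14)
print(f"{days:,.0f} days, or roughly {days / 365:,.0f} years")
```

At the launch-period rate the crawl alone would take centuries, so any realistic answer depends on the crawl rate rising by several orders of magnitude.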
29.5 million pages. Let's give that number some serious thought here. That is a lot of freakin' pages, and I do mean a lot! Think about the resources required to index even one half that amount. Bot activity alone would probably take your server down quickly once it became indexed. Do you have the resources available to handle that kind of activity? I'd guess you will probably need about 60-100 servers?
Google is not going to let you dump that many pages into the index, or at least I don't think so. Others may have a different opinion, but things have changed considerably in the last 12 months for large-scale sites. Just ask anyone around here who used to have 5 million pages indexed and now has fewer than a million. Those types of numbers are staggering, and I doubt very seriously that any one-, two- or even three-man operation would be able to effectively harness and maintain a site of that size.
Is your server, or are your servers, capable of handling a string of bots that are going to be latched on to your site 24/7/365? Are you prepared for all the scraping? Have you thoroughly thought out your site architecture? Does it follow what everyone else is doing? Are you using any software that others are using? Any footprints of the automation? With that many pages, yes, there are many footprints.
If you've got pages and all that is changing is a city, state, country, etc., you are most likely earmarked for the "invisible" Supplemental Index. Google is not going to waste its resources, nor are the others, on content that it already has in its index. Unless of course you've done something totally different, totally "outside the box", I think your mission is going to be an extremely challenging one.
Excellent - our backyard :)
We do a lot of 'local' stuff - we sell YP data (like Acxiom/InfoUSA/etc. do), we have YP listings, weather, etc.
We've found that Google starts to 'choke' (to use that word) at roughly 300,000 pages indexed. We have subdomains for every core function - for YP, for weather, etc. Each has roughly 300k pages indexed.
Now - the YP subdomain gets hit roughly 125,000 times a day, peaking at 200,000. We've also seen a parallel between 'time spent downloading pages' and how hard we get hit (our average is roughly 140 ms - our lowest was 51 ms, which is the day we were hit 200,000 times and had 4 gigs of data pulled).
This has been going on for six months, and yet Google has remained pretty steady at ~300,000.
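To set those crawl figures next to the original question, here is a small sketch. It assumes every hit is a fetch of a distinct URL (bots refetch pages, so this is optimistic) and that a "gig" means 10**9 bytes; both are assumptions for illustration, as is the function name.

```python
TOTAL_PAGES = 29_500_000  # size of the site in the original question

def crawl_days(total_pages, fetches_per_day):
    """Days to fetch every page once, assuming no page is fetched twice."""
    return total_pages / fetches_per_day

for label, rate in [("typical day", 125_000), ("peak day", 200_000)]:
    print(f"{label}: {crawl_days(TOTAL_PAGES, rate):.1f} days")

# Average transfer per fetch on the peak day: 4 gigs over 200,000 hits.
# "Gig" is taken as 10**9 bytes here, which is an assumption.
bytes_per_fetch = 4 * 10**9 / 200_000
print(f"about {bytes_per_fetch / 1000:.0f} KB per fetch")
```

Even at these rates, fetching 29.5 million URLs once is a matter of months. The poster's observation is that the indexed count stayed near 300,000 regardless, which suggests crawl volume was not the limiting factor.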
At the same time, one of our sites does have 450,000 pages in Google. It has a ton of deeplinks and PR7 (our other sites are PR6s).
So really - what we focus on now is getting deep links. At the same time, we have found that if pagerank is useful for anything, it is how fast and hard you get indexed.
I will say that 'local data' per se is everywhere out there. Unless you are doing something to truly bring about unique content (user reviews, pictures, etc) - your site won't get very far.
|We've found that Google starts to 'choke' at roughly 300,000 pages indexed. |
exactly my experience. on my site with >300,000 pages, number of indexed pages remains steady at this point. now if we'd achieve a higher page rank, i assume we'd climb one step further in max indexed pages.
Just my personal opinion, but I think this is insane.
30 million pages. 300 million people in the US, counting babies and a few dogs.
10 people per page. Assuming that every one of those babies and dogs has a computer, internet access, and a desire to go look for.... hmm... look for WHAT, exactly?
How can anyone launch so many pages at once? Are they really unique pages with useful information for humans?
Google - the world's information landfill!
When you talk about indexed pages, do you mean both the supplemental index and the searchable index?
Google has also had storage issues in recent years. That could explain limited indexing, esp. on mega sites...
How well indexed are similar and competing sites for your niche? How many pages?
You will never get 29.5 million pages indexed...
I have a PR5 site with millions upon millions of pages and the maximum number of pages indexed was 301,000, but Google has trimmed that to 198,000 .. It will probably continue to add and drop pages throughout the years, but I'd be very surprised to see 1,000,000 pages indexed..
The city-state thing has been beaten to death as well.. You might worry about getting -30'd or even -950'd if your pages aren't vastly different from each other.. I would actually count on it.. having too many pages on a site is detrimental I believe unless Google trusts you (which they don't)
|Just my personal opinion, but I think this is insane. |
Couldn't agree more. Twenty-nine million pages, and this is all somewhat unique, is it? There was another post earlier in the thread, claiming that some millions of pages, although less than 29,000,000, were also original content. What is that, a joke?
Sorry, but it's nothing but spam, and all that dross just makes it harder for users to find real content. Of course, that assumes that your giant load of dross will actually get listed, which is very unlikely to happen. It's the slightly sub-million ones that are the real problem.
My apologies for being harsh, but this is simply ridiculous.
|My apologies for being harsh, but this is simply ridiculous |
buy widgets in houston, texas
buy widgets in austin, texas
This strategy was genius in 1997
It screams filter me in 2007
The site could probably be replaced by 5k well researched, no-nonsense original content.
I bet you have not visited 99% of those pages yourself. If you had, you could not have posted this question any earlier than 2010.
Happy Surfing :)
It is very difficult to make use of any data that's not exclusively yours, and still generate pages that are worthwhile to the visitor - and unique enough to make it into the Google index.
I worked with a very savvy team of ten last year who hoped to succeed in this kind of effort, just in another market. They were leasing access to data that maybe ten other players are using. They found a very clever way to tweeze keyword-rich sub-categories out of the XML that their well-established competition was not using. Then they coupled their informational section, generated from that data, to a user-generated "web 2.0" area.
And after a full year, they've got 8,000 URLs indexed out of several million. So yes, I'm offering real caution about this whole area - especially in the local markets. I'm not saying it can't be done, but I am saying it's far from a no-brainer.
The play that I was asked to consult on was in a much more limited area, although potentially very lucrative. They were hoping for a "set it and forget it" type of income stream, and clearly that is not what they got.
There's a related thread with Adam Lasnik's comments about "thin affiliates". Even if you're not an affiliate, the concept of "thin pages" and "adding value" can really apply here, especially when the competition is already established and offering something quite similar.
Everybody and everybody's dog has done this YP-thing in the past and there is no money in it anymore, one reason being that the SE algos filter the staggering amount of duplicates out of the main index.
If I was G and reading this thread right now, I'd be hiring 3 extra college interns tomorrow to MANUALLY review for usefulness any domain with over 200,000 pages indexed. Couldn't take more than a week or two. And then keep one on staff permanently to examine new ones as they are detected. Wouldn't be the FIRST time they did so.
By my calculation (using average file sizes), you're asking Google to crawl and save 725 gigs of text.
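The poster does not state the average file size behind that estimate, but it can be backed out from the total. A minimal sketch, assuming "gigs" means 10**9 bytes and using a hypothetical helper name:

```python
def implied_page_size_bytes(total_bytes, total_pages):
    """Average bytes per page implied by a total size estimate."""
    return total_bytes / total_pages

# 725 gigs (taken as 725 * 10**9 bytes, an assumption) over 29.5 million pages.
size = implied_page_size_bytes(725 * 10**9, 29_500_000)
print(f"{size / 1024:.1f} KiB per page")
```

The implied figure is close to 24 KiB per page, which may be the average the poster started from.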
If these yellow pages are for users, why not create area-targeted landing pages with a user-friendly interface for people to search your database? As has been said, they are not going to find these 29 million pages from Google anyways.
Let's take a look at how many pages the top 3 search engines have of wikipedia.org
Amazon.com has 19,500,000 pages in Google.
Is your site going to be able to attract links like these sites do? Is your site unique enough for search engines to care about listings? If the answer to these questions is no, then I see the vast majority of these pages either never being crawled or ending up in the supplemental index.
I have an idea.
Most here would probably agree that it's just brilliant, including Google officials/employees (possibly) reading this.
For all the sections, sub sections, sub sub sub sections... well not 29.86 million pages, but let's say, the more important ones ( ~298,600 pages )... write up a list of the most desired keywords, phrases and their variations... you know, those you'd like to be found for. Add a short, approx 70 character description for each group of queries you'd like to compete in.
Match them up with the URLs.
Open your browser, preferably IE 6.0 or Firefox,
And then copy paste this URL into the address field:
The negativity on this board is astounding.
Here you have a new member, who has developed a new website. It may be a great site - it may be just like thousands out there you don't know.
No matter the case, he has no doubt worked hard on his site and is applying his business model and everyone here seems eager to watch him fail due to some preconceived notion that he is a spammer.
Maybe he is, maybe he isn't. He asked a fairly simple and straightforward question. Why not help him with answers instead of value judgements?
|to some preconceived notion that he is a spammer. |
29.5 mln pages on a new site is a pretty damning statistic. There can be some exceptions (say he had a census website with a page for each person in it), but how likely is that to be the case here? This is not really a question of negativity, but of common sense.
|29.5 mln pages on a new site is a pretty damning statistic |
Exactly my point. You have made a judgement about his website with no information other than the number of pages.
He said it's a YP site. Perhaps he has developed a unique way to cross reference the data, or has a brilliant new user interface which makes the data more useful.
You don't know.
Someone here commented on my post with:
|There was another post earlier in the thread, claiming that some millions of pages, although less than 29,000,000, were also original content. What is that, a joke? |
No joke (although I said it was reasonably original content). The difference is in how the data is assembled and presented along with some tools for the user which I haven't seen anywhere else. I believe this makes the "reasonably original" content into original content.
So far G seems to agree and has crawled over 120K pages on the site which is less than 1 month old. Those pages are also being indexed.
|You have made a judgement about his website with no information other than the number of pages. |
We are dealing with an extreme here. If he had said that his site had 10,000 pages, I would have no idea whether they were good pages or spam, and even 100,000 pages is not a lot: maybe he generated a page for each product in a database. However, when someone clearly generates more pages than Amazon, it raises a red flag even before looking at the actual site.
What do you think search engines will think? They ARE making judgements based on quantitative factors like a high number of pages or backlinks, so the feedback that he gets here is more or less in line with what search engines will think. Maybe his site is the next top destination on the Internet, we don't know; however, the first thought any decent search engine will have when finding a site with so many pages is that this is a spam site. Only a very high number of quality links may change that.
Yes we do know, actually.
We know what Google and the other SEs will do with that many pages. We know that it is physically impossible to have any relevant useful content on that many pages. We know that 3% or less will ever get indexed. We know that equals one page for every 10 men, women and children in the US.
Some of us have been around here for a while, and we do not have to jump off a cliff to verify that hitting the ground will hurt a lot - we assume that the data from the last 5000 people that jumped is enough to prove the theory.
[edited by: Wlauzon at 12:59 pm (utc) on Sep. 13, 2007]
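The two figures in the post above are easy to check. This sketch takes the thread's round numbers at face value (29.5 million pages, 300 million people, a 3% ceiling); the 3% is the poster's claim, not a documented Google limit, and the variable names are illustrative.

```python
SITE_PAGES = 29_500_000
US_POPULATION = 300_000_000  # round figure used earlier in the thread

# "3% or less will ever get indexed" -> upper bound on indexed pages
indexed_ceiling = SITE_PAGES * 3 // 100
people_per_page = US_POPULATION / SITE_PAGES

print(f"{indexed_ceiling:,} pages at 3%")
print(f"{people_per_page:.1f} people per page")
```

A 3% ceiling works out to 885,000 pages, which is well above the ~300,000 plateau other posters in this thread report for their own sites.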
Yes, some of us have been here a long time. I remember when I was new here working on my first website. I posted a question about an idea I had for the site and it was immediately squashed by much of the same small mindedness I see here today.
I ignored the advice from the 'experienced experts' here and pursued my idea. That little website went on to generate more than $40 million in revenue over the next 3 years.
My advice to the original poster: Follow your business model.
My advice to the experts here: Don't be so negative. If it's a spam site G will figure that out and act accordingly.
|The negativity on this board is astounding. |
it's because people only compare to their own - mainly hand-crafted and self written - websites. they don't think about under which conditions a huge number of pages could arise. every site with several hundred thousands of pages must be spam, right?
one of my sites is an event database. i push about 2,000 articles per month = 24,000 per year = 120,000 in five years. add loads of user generated content. maybe now again combine these in some senseful manner. see? that's only one example. others are imaginable.
well ok, 29.5 million for a new site is yet a different matter.. that sounds indeed spammy.
[edited by: moTi at 1:31 pm (utc) on Sep. 13, 2007]
|well ok, 29.5 million is yet a different matter.. |
Only a good 250 times more than you generate over 5 years? :)
*for a new site*
uah, you were faster than i could edit :)
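The "250 times" in the exchange above follows from the event database numbers. A quick check, using only figures quoted in the thread (the variable names are illustrative):

```python
articles_per_month = 2_000
five_year_total = articles_per_month * 12 * 5  # articles published in five years

# How many times larger the 29.5 million page site is than that five-year output.
ratio = 29_500_000 / five_year_total
print(f"{five_year_total:,} articles in five years; ratio is about {ratio:.0f}x")
```

The ratio comes out a little under 250, so the quip is a fair rounding.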
Moti: "i push about 2,000 articles per month = 24,000 per year"
Is that all your content, written by a real person and exclusive to you? Or is it, as one suspects, regurgitated?
dberube: Google doesn't give two hoots about your stupendous new website with 29.5 million pages if all your content is actually just feeds and stuff borrowed from somewhere else, which is what it sounds like you're doing. And, have you really got 29.5 million different feeds?
Google's own advice:
Avoid "doorway" pages created just for search engines, or other "cookie cutter" approaches such as affiliate programs with little or no original content.
I think the emphasis here is on original content. Dropping a few feeds into a page isn't original.
C'mon, be honest, isn't this basically what you're doing?
|they don't think about under which conditions a huge number of pages could arise. |
Actually they do; that's really the issue at hand. At almost 30 million pages, pre-launch no less, you're talking about something that had to be auto-generated. The challenge with that method is getting each page to pass the sniff test of being unique. So, that means adding more and more variables to the script, such as the title, H1, meta descriptions, and targeted keywords plopped here and there in the body, surrounded by text that doesn't read like a third grader wrote it. If you can do that, next up you're going to need some SERIOUS Page Rank just to get it crawled.
Your site may well be a great model and you're onto something great. But as for the singular question:
|My question is, how long do you think it would take Google to spider the entire site and get them into their index? |
The answer is literally forever, IMHO, if the site was auto-generated using some sort of script. It won't have enough links to keep the bot drilling into it; it won't have enough PR to keep it from drowning in supplemental hell; and it won't get by the sniff test for uniqueness. It's just not going to happen. Bad news, I know, but it's better to change how you're going to launch something that's obviously very important to you than to just plow forward with a plan that Google has gone to great lengths to prevent.
Good luck and let us know what happens.
Auto-generated does not necessarily mean spam or non-unique.
There is a ton of data offline that still isn't online. Imagine just collecting all openly published scientific papers, and re-publishing them online, page by page. A ton of pages? Check. Pretty damn unique content? Check.
What about Project Gutenberg? Also a ton of unique content.
Still - the local space is pretty much as hard as you can get. Yellowpages.com has roughly 300k. Superpages.com, otoh, has 1.9 million pages. And it isn't always PR - Yelp has 300k pages with a PR7, whereas Yellowbot has 450k with only a PR5. How you implement it can make a huge difference too.