Since we launched Googlebot has visited us 22 times and has hit just over 2,000 files.
My question is, how long do you think it would take Google to spider the entire site and get them into their index?
Google is not going to let you dump that many pages into the index, or at least I don't think so. Others may have a different opinion, but things have changed considerably in the last 12 months for large-scale sites. Just ask anyone around here who used to have 5 million pages indexed and now has less than a million. Those numbers are staggering, and I doubt very seriously that any one-, two-, or even three-man operation would be able to effectively harness and maintain a site of that size.
Is your server, or are your servers, capable of handling a string of bots that are going to be latched on to your site 24/7/365? Are you prepared for all the scraping? Have you thoroughly thought out your site architecture? Does it follow what everyone else is doing? Are you using any software that others are using? Any footprints of the automation? With that many pages, yes, there are many footprints.
If you've got pages where all that changes is a city, state, country, etc., you are most likely earmarked for the "invisible" Supplemental Index. Google is not going to waste its resources, nor are the others, on content it already has in its index. Unless, of course, you've done something totally different, totally "outside the box," I think your mission is going to be an extremely challenging one.
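As an aside, the kind of near-duplicate detection being described can be sketched with word shingling. This is a toy illustration only; Google's actual duplicate-detection algorithm is not public, and the template text below is invented:

```python
# Toy sketch: word-shingle similarity between two templated pages that
# differ only in the city name. Heavy shingle overlap is the sort of
# signal a search engine could use to flag near-duplicate pages.
# (Illustrative only -- not Google's actual, unpublished method.)

def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word tuples) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

# Invented template with a single variable slot, as described above.
template = ("Find the best plumbers in {city}. Browse our {city} "
            "directory of plumbing contractors, read reviews, and "
            "get free quotes from local plumbers today.")

page_a = template.format(city="Springfield")
page_b = template.format(city="Riverside")

sim = jaccard(shingles(page_a), shingles(page_b))
print(f"similarity: {sim:.2f}")  # substantial overlap despite different cities
```

Even on this very short snippet, nearly half the shingles are shared; on a full page where only the city changes, the overlap would be far higher.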
We do a lot of 'local' stuff - we sell YP data (like Acxiom/InfoUSA/etc. do), we have YP listings, weather, etc.
We've found that Google starts to 'choke' (to use that word) at roughly 300,000 pages indexed. We have subdomains for every core function - for YP, for weather, etc. Each has roughly 300k pages indexed.
Now - the YP subdomain gets hit roughly 125,000 times a day, peaking at 200,000. We've also seen a correlation between 'time spent downloading pages' and how hard we get hit (our average is roughly 140 ms; our lowest was 51 ms, on the day we were hit 200,000 times and had 4 GB of data pulled).
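As a sanity check, the figures quoted above can be run through some back-of-the-envelope arithmetic (using the post's numbers as given, not independently measured):

```python
# Back-of-the-envelope check on the quoted crawl figures.
fetches_per_day = 200_000   # peak Googlebot hits on the YP subdomain
data_pulled_gb = 4          # data transferred that day

avg_page_kb = data_pulled_gb * 1_000_000 / fetches_per_day
print(f"average page size: {avg_page_kb:.0f} KB")

# At the reported 51 ms average download time, a single sequential
# connection could fetch at most this many pages in a day:
ms_per_day = 24 * 60 * 60 * 1000
max_sequential_fetches = ms_per_day / 51
print(f"max fetches on one connection: {max_sequential_fetches:,.0f}")
```

So 200,000 daily fetches of ~20 KB pages fit comfortably within one connection's time budget, which is consistent with the observation that faster response times coincided with heavier crawling.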
This has been going on for six months, and yet Google has remained pretty steady at ~300,000.
At the same time, one of our sites does have 450,000 pages in Google. It has a ton of deeplinks and PR7 (our other sites are PR6s).
So really, what we focus on now is getting deep links. At the same time, we have found that if PageRank is useful for anything, it is for how fast and hard you get indexed.
I will say that 'local' data per se is everywhere out there. Unless you are doing something to create truly unique content (user reviews, pictures, etc.), your site won't get very far.
I have a PR5 site with millions upon millions of pages, and the maximum number of pages indexed was 301,000, which Google has since trimmed to 198,000. It will probably continue to add and drop pages over the years, but I'd be very surprised to see 1,000,000 pages indexed.
The city-state thing has been beaten to death as well. You might worry about getting -30'd or even -950'd if your pages aren't vastly different from each other; I would actually count on it. Having too many pages on a site is detrimental, I believe, unless Google trusts you (which they don't).
Just my personal opinion, but I think this is insane.
Couldn't agree more. Twenty-nine million pages, and this is all somewhat unique, is it? There was another post earlier in the thread, claiming that some millions of pages, although less than 29,000,000, were also original content. What is that, a joke?
Sorry, but it's nothing but spam, and all that dross just makes it harder for users to find real content. Of course, that assumes that your giant load of dross will actually get listed, which is very unlikely to happen. It's the slightly sub-million ones that are the real problem.
My apologies for being harsh, but this is simply ridiculous.
I worked with a very savvy team of ten last year who hoped to succeed in this kind of effort, just in another market. They were leasing access to data that maybe ten other players are using. They found a very clever way to tweeze keyword-rich sub-categories out of the XML that their well-established competition was not using. Then they coupled their informational section, generated from that data, to a user generated "web 2.0" area.
And after a full year, they've got 8,000 URLs indexed out of several million. So yes, I'm offering real caution about this whole area, especially in the local markets. I'm not saying it can't be done, but I am saying it's far from a no-brainer.
The play that I was asked to consult on was in a much more limited area, although potentially very lucrative. They were hoping for a "set it and forget it" type of income stream, and clearly that is not what they got.
There's a related thread with Adam Lasnik's comments about "thin affiliates". Even if you're not an affiliate, the concept of "thin pages" and "adding value" can really apply here, especially when the competition is already established and offering something quite similar.
Amazon.com have 19,500,000 pages in Google.
Is your site going to be able to attract links like these sites do? Is your site unique enough for search engines to care about listing it? If the answer to these questions is no, then I expect the vast majority of these pages will never be crawled, or will end up in the supplemental index.
Most here would probably agree that it's just brilliant, including Google officials/employees (possibly) reading this.
For all the sections, sub sections, sub sub sub sections... well not 29.86 million pages, but let's say, the more important ones ( ~298,600 pages )... write up a list of the most desired keywords, phrases and their variations... you know, those you'd like to be found for. Add a short, approx 70 character description for each group of queries you'd like to compete in.
Match them up with the URLs.
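Hypothetically, that keyword-to-URL worksheet could be kept as a simple structure like the one below. All keywords, URLs, and descriptions here are invented examples, and the ~70-character limit is the guideline from the post:

```python
# Invented sketch of the keyword -> description -> URL worksheet
# described above. Every URL, keyword, and description is made up.
pages = [
    {
        "url": "/plumbers/springfield-il/",
        "keywords": ["springfield plumbers", "plumber springfield il"],
        "description": "Compare Springfield IL plumbers: reviews, hours, free quotes.",
    },
    {
        "url": "/weather/springfield-il/",
        "keywords": ["springfield weather", "springfield il forecast"],
        "description": "Springfield IL weather: current conditions and 7-day forecast.",
    },
]

# Enforce the ~70-character guideline on each description.
for page in pages:
    n = len(page["description"])
    assert n <= 70, f'{page["url"]}: description is {n} chars, over the limit'
    print(f'{page["url"]}: {n} chars OK')
```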
Open your browser, preferably IE 6.0 or Firefox,
And then copy paste this URL into the address field:
Here you have a new member, who has developed a new website. It may be a great site - it may be just like thousands out there you don't know.
No matter the case, he has no doubt worked hard on his site and is applying his business model and everyone here seems eager to watch him fail due to some preconceived notion that he is a spammer.
Maybe he is, maybe he isn't. He asked a fairly simple and straightforward question. Why not help him with answers instead of value judgements?
to some preconceived notion that he is a spammer.
29.5 million pages on a new site is a pretty damning statistic. There can be exceptions; say he had a census website with a page for each person in it, but how likely is that here? This is not really a question of negativity, but of common sense.
29.5 million pages on a new site is a pretty damning statistic
Exactly my point. You have made a judgement about his website with no information other than the number of pages.
He said it's a YP site. Perhaps he has developed a unique way to cross reference the data, or has a brilliant new user interface which makes the data more useful.
You don't know.
Someone here commented on my post with:
There was another post earlier in the thread, claiming that some millions of pages, although less than 29,000,000, were also original content. What is that, a joke?
No joke (although I said it was reasonably original content). The difference is in how the data is assembled and presented along with some tools for the user which I haven't seen anywhere else. I believe this makes the "reasonably original" content into original content.
So far G seems to agree and has crawled over 120K pages on the site which is less than 1 month old. Those pages are also being indexed.
You have made a judgement about his website with no information other than the number of pages.
We are dealing with an extreme here. If he had said his site had 10,000 pages, I would have no idea whether they were good pages or spam; even 100,000 pages is not a lot, since maybe he generated a page for each product in a database. But when someone clearly generates more pages than Amazon, it raises a red flag even before anyone looks at the actual site.
What do you think the search engines will think? They ARE making judgements based on quantitative factors like a high number of pages or backlinks, so the feedback he gets here is more or less in line with what the search engines will conclude. Maybe his site is the next top destination on the Internet; we don't know. But the first thought any decent search engine will have on finding a site with so many pages is that it is a spam site, and only a very high number of quality links may change that.
You don't know....
Yes we do know, actually.
We know what Google and the other SEs will do with that many pages. We know that it is physically impossible to have relevant, useful content on that many pages. We know that 3% or less will ever get indexed. We know that it equals one page for every ten men, women, and children in the US.
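The arithmetic on that last claim checks out, assuming a US population of roughly 300 million at the time:

```python
# Check the "one page for every ten people in the US" claim.
pages = 29_500_000            # page count quoted in the thread
us_population = 300_000_000   # approximate 2007 US population

people_per_page = us_population / pages
print(f"one page for every {people_per_page:.1f} people")  # -> 10.2
```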
Some of us have been around here for a while, and we do not have to jump off a cliff to verify that hitting the ground will hurt a lot - we assume that the data from the last 5000 people that jumped is enough to prove the theory.
[edited by: Wlauzon at 12:59 pm (utc) on Sep. 13, 2007]
I ignored the advice from the 'experienced experts' here and pursued my idea. That little website went on to generate more than $40 million in revenue over the next 3 years.
My advice to the original poster: Follow your business model.
My advice to the experts here: Don't be so negative. If it's a spam site G will figure that out and act accordingly.
The negativity on this board is astounding.
It's because people compare everything to their own, mainly hand-crafted and self-written, websites. They don't think about the conditions under which a huge number of pages could legitimately arise. Every site with several hundred thousand pages must be spam, right?
One of my sites is an event database. I publish about 2,000 articles per month, which is 24,000 per year, or 120,000 over five years. Add loads of user-generated content, then combine these in some sensible manner. See? That's only one example; others are imaginable.
Well, OK, 29.5 million for a new site is a different matter; that does indeed sound spammy.
[edited by: moTi at 1:31 pm (utc) on Sep. 13, 2007]
Is that all your content, written by a real person and exclusive to you? Or is it, as one suspects, regurgitated?
dberube: Google doesn't give two hoots about your stupendous new website with 29.5 million pages if all your content is actually just feeds and stuff borrowed from somewhere else, which is what it sounds like you're doing. And, have you really got 29.5 million different feeds?
Google's own advice:
Avoid "doorway" pages created just for search engines, or other "cookie cutter" approaches such as affiliate programs with little or no original content.
I think the emphasis here is on original content. Dropping a few feeds into a page isn't original.
C'mon, be honest, isn't this basically what you're doing?
they don't think about the conditions under which a huge number of pages could arise.
Actually they do; that's really the issue at hand. At almost 30 million pages, pre-launch no less, you're talking about something that had to be auto-generated. The challenge with that method is getting each page to pass the sniff test of being unique. That means adding more and more variables to the script, such as the title, H1, meta description, and targeted keywords dropped here and there in the body, surrounded by text that doesn't read like a third grader wrote it. If you can do that, next up you're going to need some SERIOUS PageRank just to get it crawled.
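For what it's worth, the variable-slot templating being described looks roughly like this toy sketch. It is purely illustrative (every variant string and field name is invented), and it is the pattern the post warns against, not a recommendation:

```python
# Toy illustration of variable-slot page templating: titles rotate
# through canned variants while the body stays boilerplate. This is
# exactly why such pages struggle to look unique at scale.
import random

TITLE_VARIANTS = [
    "{kw} in {city}, {state}",
    "{city} {kw} - Listings & Reviews",
    "Top {kw} near {city}, {state}",
]

def render_page(kw, city, state, seed):
    """Assemble a templated page, picking a title variant deterministically."""
    rng = random.Random(seed)  # seeded so each page is reproducible
    title = rng.choice(TITLE_VARIANTS).format(kw=kw, city=city, state=state)
    h1 = f"{kw.title()} in {city}"
    body = (f"Looking for {kw} in {city}, {state}? Browse our directory "
            f"of local {kw} and compare options near you.")
    return {"title": title, "h1": h1, "body": body}

page = render_page("plumbers", "Springfield", "IL", seed=42)
print(page["title"])
print(page["body"])
```

Swap the city and you get thousands of pages whose only real difference is the slot values, which is precisely the "sniff test" problem described above.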
Your site may well be a great model, and you may be onto something great. But as to the singular question:
My question is, how long do you think it would take Google to spider the entire site and get them into their index?
The answer is literally forever, IMHO, if the site was auto-generated using some sort of script. It won't have enough links to keep the bot drilling into it; it won't have enough PR to keep it from drowning in supplemental hell; and it won't get past the sniff test for uniqueness. It's just not going to happen. Bad news, I know, but it's better to change how you're going to launch something that's obviously very important to you than to plow forward with a plan that Google has gone to great lengths to prevent.
Good luck and let us know what happens.