As long as it takes you to get the strong backlink profile that provides a PR 9 or so for your Home Page - and even if you get that far, all the URLs will still never make it into the index. Just a fact of life.
Food for thought:
site:amazon.com shows 18,700,000 urls indexed
site:microsoft.com shows 11,600,000 urls indexed
If you want to play in that league, you'll need a parallel level of quality and trust.
[edited by: tedster at 12:22 am (utc) on Sep. 11, 2007]
>>We launched a new site about 2 weeks ago that has about 29.5 million unique, search engine friendly, URLs
Do you mean to say unique pages or URLs? Because if you're not talking unique content I'd give you a decade - maybe longer.
|Since we launched Googlebot has visited us 22 times and has hit just over 2,000 files. |
Based on the current indexing rate of your site (143 pages per day), I'd say you have about 565.2 years in the queue. ;)
PageRank™ is the determining factor when working with that volume of pages. Your best bet is to release groups of pages for indexing as you garner more PR. Block all but the most important ones right now. Without the PR, you're going to have 29.49 million pages in the Supplemental index at the end of those 565.2 years.
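For anyone who wants to rerun a queue estimate like this with their own log figures, the arithmetic can be sketched in Python. The function name is made up for illustration, and the model assumes a constant crawl rate with no page ever re-fetched, which is optimistic for a real crawler.

```python
def years_to_crawl(total_urls, pages_per_day):
    """Estimate the years needed to fetch every URL once.

    Assumes the crawl rate never changes and nothing is re-fetched.
    """
    days_needed = total_urls / pages_per_day
    return days_needed / 365


if __name__ == "__main__":
    # Figures quoted in the thread: roughly 2,000 files in about 14 days,
    # against 29.5 million URLs.
    observed_rate = round(2000 / 14)
    print(f"{observed_rate} pages per day")
    print(f"{years_to_crawl(29_500_000, observed_rate):.1f} years in the queue")
```

Plugging in a different daily rate from your own server logs shows how sensitive the estimate is to crawl speed.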
I know that it'll never be fully indexed, I posed the question wrong I guess. I actually did the same thing, site:amazon.com to see how many were indexed.
My real question was, will the rate at which Google is crawling the site now stay consistent, or pick up once it realizes that there are a lot more pages to crawl?
The content is all unique as it pertains to my site. It's a Yellow Pages type of site, with extra content such as city news, classifieds, and weather. I hope that will be enough extra content for Google not to think it's a dupe of another site.
I've also compared it with similar sites, and the percentage of identical content is not that great. Obviously the business name, address, etc. are the same, but that's about it.
You bring up an interesting point with only showing Google the most important pages right now, what advantage does that give me? Wouldn't I want to expose as much as possible to Google?
|Wouldn't I want to expose as much as possible to Google? |
You could. I can't really give you any real world experience on launching a site with 29.5 million pages. I do know from experience that launching a new site with even 1,000 pages is a challenge to get fully indexed and ranking these days. It's not an overnight task.
29.5 million pages of unique content? That's somewhat rare these days. There is nothing really unique out there when it comes to that number of pages, especially yellow page stuff which is a dime a dozen. You've got your work cut out for you competing on that level.
Ever consider focusing on a regional market first (local) and then branching out from there?
Personally, if I had that many pages to work with, I'd be putting them into logical groups and performing A/B/C/D testing. I'd determine what my top level pages were based on taxonomy and click path. I'd then block everything but those click paths that are of the utmost value to the user. You're probably going to say they all are. I would too. But, you are facing a technological challenge that requires extreme finesse with handling how the bots are indexing your content for maximum exposure.
Let me put it to you this way, I'd guess that Amazon is seeing upwards of 5+ million Googlebot visits daily. That should give you some idea of what you'll need to get your 29.5 million pages noticed. ;)
Ok. I know the content isn't 100% unique, but it's not being duplicated by more than 1 URL on my domain.
What would you say about this? The link structure goes like domain.com/city-state/ for the city's homepage. This is done for every city in the United States.
Beyond that is city-state/list/Keyword - this is obviously the businesses in that city that fall within that specific keyword.
Then is city-state/Business_Name/ID - this is the business page.
Taking in what you are saying, do you think keeping Google away from the last two would be a good way to go? Just showing Google all the city pages and nothing beyond that?
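Any scheme that shows Google only some levels first has to tell the three URL types apart. Here is a rough Python sketch of that step. The regular expressions, the numeric ID, the function names, and the choice of a `noindex, follow` robots meta value are all assumptions for illustration, not the actual setup of the site being discussed.

```python
import re

# Hypothetical patterns for the three URL levels described in the post.
CITY_HOME = re.compile(r"^/[^/]+/?$")                # /city-state/
KEYWORD_LIST = re.compile(r"^/[^/]+/list/[^/]+/?$")  # /city-state/list/Keyword
BUSINESS_PAGE = re.compile(r"^/[^/]+/[^/]+/\d+/?$")  # /city-state/Business_Name/ID


def page_type(path):
    """Classify a URL path as 'city', 'list', 'business', or 'other'."""
    if CITY_HOME.match(path):
        return "city"
    if KEYWORD_LIST.match(path):  # checked before BUSINESS_PAGE on purpose
        return "list"
    if BUSINESS_PAGE.match(path):
        return "business"
    return "other"


def robots_meta(path, open_levels=("city",)):
    """Return a robots meta tag that keeps unreleased levels out of the index."""
    content = "index, follow" if page_type(path) in open_levels else "noindex, follow"
    return f'<meta name="robots" content="{content}">'


if __name__ == "__main__":
    for path in ("/durham-nh/", "/durham-nh/list/Gun_Stores",
                 "/durham-nh/Joes_Gun_Store/12345"):
        print(path, page_type(path), robots_meta(path))
```

Releasing another level later would only mean adding it to `open_levels`, for example `("city", "list")`.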
My strategy going into this was to hopefully get most of my traffic from specific searches for, say, "Tommy's Country Store Baton Rouge, LA" or "Joe's Gun Store Durham, NH". Not necessarily someone searching for "Durham, NH City Guide" per se.
But you might have opened my eyes to something.
Google Adds More Date Options to Advanced Search Page
Try the above new advanced search option from Google with some of those sites you'll be competing with. What you'll want to see is your site pulling Minty Fresh results.
10 Minutes Ago. :)
Okay, you've gotten a hold of some regionally specific data to populate your pages. You've created your templates and you're ready to rip. Many have been down this path before and I'm going to allow them to take this topic over. You've got a challenge. How do you take all that data and get it onto a page in a logical manner and make it look unique? Unless that data is "yours only", you can be assured that someone else is using it. What have you done to break the footprint of duplication with that data?
You're asking the million dollar question too. My gut instinct tells me not to recommend a full fledged dump into the indices. No, I'd be a bit concerned doing that. I've seen too much of this stuff end up in what used to be called the Supplemental Index (SI). It gets indexed once and wham, right into the SI not to be seen again for quite some time, if at all. When you have 29.5 million pages, an ebb and flow of a million pages coming in and out of the index is probably natural. Imagine keeping track of all that? ;)
Has Google made 2000 file requests, or has Google requested 2000 unique files?
Google will keep requesting pages that are already indexed to make sure they haven't changed. Googlebot has fetched more than 2000 pages so far this month on one of my sites with only 300 pages.
To get much more than a couple thousand pages into the index, you will need PageRank and you will need deep links and you will need exceptional internal link structure.
Just as a rule of thumb, I never expect googlebot to go more than 3 clicks from an incoming link. Googlebot may go farther than that, but I never expect it to.
So stop watching the bot and start getting deep links.
Creating a site with this number of pages is not a bad idea, but prepare for the -950 penalty and the supplemental index to hit you hard.
I'll give you a 0.10% success rate.
I think I'm going to do that.
I've placed <META NAME="ROBOTS" CONTENT="NOFOLLOW"> in the head of the city homepages... I'm going to concentrate on getting those included first.
|I think I'm going to do that. |
But wait! Don't rush into this just yet. Let's get some more feedback from those who have handled something of this level within the past 12 months. You don't just want to dive into this, really, you don't. Trust me as they say. ;)
I'll hold off! I'll wait for some others to chime in :)
I've been thinking about this. Are you a "one man" operation? Or, are there a group of you keeping an eye on things? I ask that because this is not a typical question here at WebmasterWorld and I have a strong interest in the topic.
Have you fully tested your URI architecture? Are you positive that you are not returning duplicate content at different URIs? Are you sure that your 404s are returning a 404 and not a 200? Have you run any spider routines on your site to detect any anomalies?
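The 404-versus-200 question can be checked with a few lines of Python. This is only a sketch: the function names are made up, the example.com URL is a placeholder, and it assumes the server answers HEAD requests the same way it answers GET.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx codes."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the code is what we are after.
        return error.code


def find_soft_404s(bogus_urls, get_status=fetch_status):
    """Return the URLs that should not exist but still answer 200."""
    return [url for url in bogus_urls if get_status(url) == 200]


if __name__ == "__main__":
    # Placeholder URL that should NOT exist on the site being tested.
    suspects = ["http://example.com/no-such-city-zz/"]
    for url in find_soft_404s(suspects):
        print("soft 404:", url)
```

Because `urlopen` follows redirects, a missing page that redirects to the homepage also shows up here as a 200, which is usually worth knowing about too.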
What type of server? Apache or Windows?
We're not a one-man operation, but this is my personal project, meaning I'm flying solo with this one, but if need be, there are others.
I've never, personally, done any in-depth SEO. I usually just implement good URL structure, external CSS, JS files, H1 tags, and meta tags.
With this project, because I see so much potential with it (because of the size of the site) I wanted to get into the SEO more than usual.
I haven't run any spider routines or anything like that. I've done some QA testing and all the URLs, or at least 99% of them, SHOULD work and SHOULD display the correct data.
Now with this 5 GB database I'm sure there are some duplicate listings and incomplete listings that may cause some errors. But with something this massive it's going to happen, and it seems almost impossible to detect.
Google will likely raise a flag at any new site with this many URLs. Matt Cutts has been specific on this and suggested that launching 5k pages a week is the upper limit.
If [ as others say here ] you have plenty of "trust" you might be able to move faster. "Trust" means a site with an age of greater than say 18 months approx and plenty of backlinks. The older the site the more trust.
Then you need plenty of deep link juice well dispersed across your site. Otherwise, crawled pages given very low or nil PR will drag your overall site down the "plug hole", meaning they get stored as supplemental pages. The effect is that you appear never to have been indexed, because the pages sit so low in the SERPs - maybe 950+.
Why do you need 29.5 million pages? How many sites are there out there of this size, as an indication of the need?
|How many sites are there out there of this size? |
Probably a lot.
|My question is, how long do you think it would take Google to spider the entire site and get them into their index? |
Until the Fifth of Never.
Am I right in saying Google will allocate a certain amount of crawl time for each website depending on PageRank, backlinks, etc.? Googlebot will spider your site for the allocated time. New sites occasionally get a temporary surge in rankings and crawl rate. You might be experiencing this.
If (and it's a big if) you're going to achieve this, you want Google to visit different sections of your website on each crawl. If you're using Google sitemaps I'd advise you to set the change frequency to never so Google won't re-crawl the same page. In addition I'd have a random hot section / category on the homepage each day. This might send the spider in a different direction on each crawl. It's worked for me in the past but on nothing of this magnitude.
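As a rough illustration of the sitemap suggestion, here is a Python sketch that writes sitemap XML with the change frequency set to `never`. The function names and example URLs are made up. The 50,000-URL cap per file comes from the sitemaps.org protocol, which also treats `changefreq` as a hint, so there is no guarantee a crawler will honor it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol


def build_sitemap(urls, changefreq="never"):
    """Return one sitemap document (as a string) for the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped
        ET.SubElement(entry, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Yield successive slices of urls, each small enough for one file."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


if __name__ == "__main__":
    # Hypothetical city pages; a real run would read URLs from the database.
    demo = ["http://example.com/durham-nh/", "http://example.com/baton-rouge-la/"]
    for number, part in enumerate(chunk(demo), start=1):
        print(f"sitemap-{number}.xml:", build_sitemap(part))
```

At 50,000 URLs per file, 29.5 million URLs works out to 590 sitemap files.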
I have been helping a friend with a new site that has 6.8 million URIs of reasonably original content.
The site went live 3 weeks ago.
G-Bot started grabbing about 1K pages/day after 10 days. It kept on that schedule until a few days ago when it increased to 30K pages/day.
I expect that speed will increase once again before too long. The fact is that G-Bot could crawl the entire site in less than a day if it wanted to. G-Bot is very conscious of whether or not it is going to crash a server and from what I have seen it won't take more than about 150 pages/second at peak - but right now it's throttled down to 1 page/2 seconds.
I suspect it also grabs a small set of pages and checks them for spam/duplicate content before deciding to crawl a site more aggressively.
One interesting observation as G-Bot crawls the site:
Sitemaps were submitted with all URIs as well as update frequency and priority. G-Bot has focused primarily on 2 types of pages thus far. One is the only page that is updated daily and thus has unique and fresh content. But the page it seems to be most interested in is a map page with a very low priority and update frequency of never.
Is it possible G is using these pages to harvest latitude and longitude data for the subject of the content?
G-Bot seems relatively uninterested in the pages that would be searched for most frequently.
[edited by: mbennie at 11:20 am (utc) on Sep. 11, 2007]
Seems to me that the SEs would see this as a huge red flag, simply because it's so easy to generate millions of identical pages where nothing but the town name is different.
And to be honest, I just don't see 99% of the pages ever getting any hits at all.
Matt Cutts discusses the possibility of triggering a penalty if too many pages are released at once.
That being said, I'd go for it because if you've created a quality site you're going to attract the links you need to start climbing the SERPs.
Best of luck. Remember, if everyone listened to the doomsayers...
|Remember, if everyone listened to the doomsayers... |
What doomsayers? We're talking 29.5 million pages. Don't you think that requires a bit of finesse and planning? Or, would your suggestion be to unleash all 29.5 million at once?
Me? I'd rather take the route with the least risk involved. Releasing them in bits and pieces would be my strategy. And I'd do the most important pieces first.
FYI, I've released up to 30K new pages at once, indexed in about 3 days, no penalty triggered.
|FYI, I've released up to 30K new pages at once, indexed in about 3 days, no penalty triggered. |
I'd like to focus on timeframes too. When were those 30k new pages released? And, was it an existing site or a "brand new" site such as the OPs? I'm more interested in seeing what has happened to sites of this size over the past 1-6 months. I don't care if it was a year ago or more, that doesn't count today.
|FYI, I've released up to 30K new pages at once, indexed in about 3 days, no penalty triggered. |
Not on a new domain though, right? On an established website?
Good luck with indexing as I think you're gonna need it. I have similar yellow pages style sites and what worked a couple of years ago (getting million+ pages in the index in a few months) just doesn't happen any more. I had one site trying to list about the same number of pages as yours and, even pointing PR5+ link after link at it, Google just wouldn't index it all.
Half of the issue is the number of navigation levels you would need. Either a very deep structure - index/state/county/city/businesstype/listing is 6 levels deep from the homepage - or a broad structure - index/state/city/listing - which would result in thousands of links per page.
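The depth-versus-breadth trade-off can be put into rough numbers with a short Python sketch. The function name is made up, and the model is idealized: every page carries the same number of links and only the pages at the deepest level are counted, which no real directory matches.

```python
def min_links_per_page(total_pages, clicks):
    """Smallest uniform link count so that links ** clicks >= total_pages."""
    links = 1
    while links ** clicks < total_pages:
        links += 1
    return links


if __name__ == "__main__":
    total = 29_500_000  # page count quoted in the thread
    for clicks in (3, 4, 6):
        needed = min_links_per_page(total, clicks)
        print(f"{clicks} clicks from the homepage: {needed} links on every page")
```

Under those assumptions, 29.5 million pages need about 309 links on every page to sit within 3 clicks, 74 for 4 clicks, and 18 for 6 clicks. Real hierarchies are uneven (a big city has far more listings than a small town), so the busiest pages would need many more links than these averages.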
The other half of the issue is Google's clearing up of spam in the index over the past few years. You say that your pages are unique but I'm pretty sure I'd have the same type of information as you have, with the same company contact details, optimised for the company name etc. There are sites with exactly the same info as you which have been around a lot longer.
Think of it from Google's angle: it sees a brand new site with, say, 60% similar content to other sites, 20% similar to other pages on your site (looking for info on <businesstype> in <cityname> etc) and the whole site has 29 million pages. Do you not think this is going to sound the alarm bell for Googlebot?
It took about 2 years for our site to be fully spidered, and we have about 70,000 pages. Our site is about 7 years old, but we reworked our URL structure 2 years ago. We also have a PR4-PR5 rank.
|We launched a new site about 2 weeks ago that has about 29.5 million unique, search engine friendly, URLs. |
How, if you don't mind saying, was the site created? At 30 million URLs, I'm guessing it was auto generated somehow?
>> We launched a new site about 2 weeks ago that has about 29.5 million unique, search engine friendly, URLs.
We launched a new site about 2 weeks ago that has about 29.5 million unique, hand edited, peer reviewed, search engine friendly, URLs. :)
To add to Tedster: Amazon, MSFT, etc. also have dozens of links to each page from many blogs and online forums as they discuss movies, tech issues, etc.
If you ask me: you are or will be blacklisted soon. There is something suspicious about 29.5 million pages coming online overnight.