Forum Moderators: Robert Charlton & goodroi
Since we launched, Googlebot has visited us 22 times and has hit just over 2,000 files.
My question is, how long do you think it would take Google to spider the entire site and get them into their index?
Food for thought:
site:amazon.com shows 18,700,000 urls indexed
site:microsoft.com shows 11,600,000 urls indexed
If you want to play in that league, you'll need a parallel level of quality and trust.
[edited by: tedster at 12:22 am (utc) on Sep. 11, 2007]
Since we launched, Googlebot has visited us 22 times and has hit just over 2,000 files.
Based on the current crawl rate on your site (~143 pages per day), I'd say you have about 565 years in the queue. ;)
PageRank™ is the determining factor when working with that volume of pages. Your best bet is to release groups of pages for indexing as you garner more PR. Block all but the most important ones right now. Without the PR, you're going to have 29.49 million pages in the Supplemental index at the end of those 565 years.
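A back-of-envelope sketch of that queue arithmetic, using the crawl rate quoted above (roughly 2,000 files over the first couple of weeks), so treat the numbers as rough:

```python
# Rough queue-time estimate at the observed crawl rate.
TOTAL_PAGES = 29_500_000   # pages on the new site
PAGES_PER_DAY = 143        # ~2,000 files fetched in the first two weeks

days = TOTAL_PAGES / PAGES_PER_DAY
years = days / 365.25
print(f"{days:,.0f} days, roughly {years:.0f} years")  # on the order of 565 years
```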
My real question was: will the speed/rate at which Google is crawling the site now stay consistent, or pick up once it realizes that there are a lot more pages to crawl?
The content is all unique as it pertains to my site; it's a Yellow Pages-type site with extra content such as city news, classifieds, and weather. I hope that will be enough extra content for Google not to think it's a dupe of another site.
I've also compared similar listings, and the percentage of identical content is not that great. Obviously the business name, address, etc. are the same, but that's about it.
You bring up an interesting point with only showing Google the most important pages right now, what advantage does that give me? Wouldn't I want to expose as much as possible to Google?
Wouldn't I want to expose as much as possible to Google?
You could. I can't really give you any real-world experience on launching a site with 29.5 million pages. I do know from experience that launching a new site with even 1,000 pages is a challenge to get fully indexed and ranking these days. It's not an overnight task.
29.5 million pages of unique content? That's somewhat rare these days. There is nothing really unique out there when it comes to that number of pages, especially yellow pages stuff, which is a dime a dozen. You've got your work cut out for you competing at that level.
Ever consider focusing on a regional market first (local) and then branching out from there?
Personally, if I had that many pages to work with, I'd be putting them into logical groups and performing A/B/C/D testing. I'd determine what my top level pages were based on taxonomy and click path. I'd then block everything but those click paths that are of the utmost value to the user. You're probably going to say they all are. I would too. But, you are facing a technological challenge that requires extreme finesse with handling how the bots are indexing your content for maximum exposure.
Let me put it to you this way, I'd guess that Amazon is seeing upwards of 5+ million Googlebot visits daily. That should give you some idea of what you'll need to get your 29.5 million pages noticed. ;)
What would you say about this? The link structure goes like this: domain.com/city-state/ is the city's homepage. This is done for every city in the United States.
Beyond that is city-state/list/Keyword - this is obviously the businesses in that city that fall within that specific keyword.
Then is city-state/Business_Name/ID - this is the business page.
Taking in what you're saying, do you think keeping Google away from the last two would be a good way to go at it? Just showing Google all the city pages and nothing beyond that?
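If you did go that route, a robots.txt along these lines would keep the bots on the city pages only. The wildcard pattern assumes Googlebot's extended robots.txt syntax, and the paths simply mirror the structure described above, so treat this as a sketch:

```
User-agent: Googlebot
# Block anything deeper than /city-state/
# (catches both /city-state/list/Keyword and /city-state/Business_Name/ID)
Disallow: /*/*/
```

City homepages like /baton-rouge-la/ have only two slashes and stay crawlable; anything with a third path segment is blocked.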
My strategy going into this was to hopefully get most of my traffic from specific searches for, say, "Tommy's Country Store Baton Rouge, LA" or "Joe's Gun Store Durham, NH". Not necessarily someone searching for "Durham, NH City Guide" per se.
But you might have opened my eyes to something.
Try the above new advanced search option from Google with some of those sites you'll be competing with. What you'll want to see is your site pulling Minty Fresh results.
10 Minutes Ago. :)
Okay, you've gotten a hold of some regionally specific data to populate your pages. You've created your templates and you're ready to rip. Many have been down this path before and I'm going to allow them to take this topic over. You've got a challenge. How do you take all that data and get it onto a page in a logical manner and make it look unique? Unless that data is "yours only", you can be assured that someone else is using it. What have you done to break the footprint of duplication with that data?
You're asking the million dollar question too. My gut instinct tells me not to recommend a full fledged dump into the indices. No, I'd be a bit concerned doing that. I've seen too much of this stuff end up in what used to be called the Supplemental Index (SI). It gets indexed once and wham, right into the SI not to be seen again for quite some time, if at all. When you have 29.5 million pages, an ebb and flow of a million pages coming in and out of the index is probably natural. Imagine keeping track of all that? ;)
Google will keep requesting pages that are already indexed to make sure they haven't changed. Googlebot has fetched more than 2000 pages so far this month on one of my sites with only 300 pages.
To get much more than a couple thousand pages into the index, you will need PageRank and you will need deep links and you will need exceptional internal link structure.
Just as a rule of thumb, I never expect googlebot to go more than 3 clicks from an incoming link. Googlebot may go farther than that, but I never expect it to.
So stop watching the bot and start getting deep links.
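As a rough illustration of that 3-click rule, here is a sketch (the helper is hypothetical, not from this thread) that walks an internal link graph breadth-first and flags pages more than three clicks from the pages holding your incoming links:

```python
from collections import deque

def pages_beyond_depth(link_graph, entry_points, max_depth=3):
    """link_graph: {url: [linked urls]}. Returns (pages deeper than
    max_depth clicks from any entry point, pages never reached at all)."""
    depth = {url: 0 for url in entry_points}
    queue = deque(entry_points)
    while queue:
        url = queue.popleft()
        for nxt in link_graph.get(url, []):
            if nxt not in depth:          # first visit = shortest click path
                depth[nxt] = depth[url] + 1
                queue.append(nxt)
    too_deep = [u for u, d in depth.items() if d > max_depth]
    unreachable = [u for u in link_graph if u not in depth]
    return too_deep, unreachable
```

Feed it a crawl of your own site: anything in the `too_deep` or `unreachable` buckets is a page you should not expect Googlebot to find without a deep link.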
Have you fully tested your URI architecture? Are you positive that you are not returning duplicate content at different URIs? Are you sure that your 404s are returning a 404 and not a 200? Have you run any spider routines on your site to detect any anomalies?
What type of server? Apache or Windows?
I've never, personally, done any indepth SEO. I usually just implement good URL structure, external CSS, JS files, H1 tags, and meta tags.
With this project, because I see so much potential with it (because of the size of the site) I wanted to get into the SEO more than usual.
I haven't run any spider routines or anything like that. I've done some QA testing, and all the URLs, or at least 99% of them, SHOULD work and SHOULD display the correct data.
Now, with this 5 GB database, I'm sure there are some duplicate listings and incomplete listings that may cause errors. But with something this massive, it's going to happen, and it seems almost impossible to detect.
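One cheap check worth automating is the soft-404 probe raised earlier in the thread: request a URL that cannot exist and verify the server answers 404, not 200. A minimal sketch (the function names are mine, and `fetch_status` is injectable so the logic can be tested without a live server):

```python
import random
import string
from urllib.request import urlopen
from urllib.error import HTTPError

def default_fetch_status(url: str) -> int:
    """Return the HTTP status code for url, including error codes."""
    try:
        return urlopen(url).status
    except HTTPError as e:
        return e.code

def has_soft_404(base_url: str, fetch_status=default_fetch_status) -> bool:
    """True if a random, nonexistent path returns 200 instead of 404."""
    junk = "".join(random.choices(string.ascii_lowercase, k=24))
    return fetch_status(f"{base_url}/{junk}") == 200
```

Run it against a handful of your URL patterns; a site that answers 200 for garbage paths will fill the index with duplicate error pages.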
If [ as others say here ] you have plenty of "trust", you might be able to move faster. "Trust" means a site older than roughly 18 months with plenty of backlinks; the older the site, the more trust.
Then you need plenty of deep link juice well dispersed across your site. Otherwise, crawled pages given very low or nil PR will drag your overall site down the "plug hole", meaning they get stored as supplemental pages. The effect is that you appear never to be indexed, because the pages sit so low in the SERPs, maybe position 950+.
Why do you need 29.5 million pages? How many sites are there out there of this size, as an indication of the need?
If (and it's a big if) you're going to achieve this, you want Google to visit different sections of your website on each crawl. If you're using Google Sitemaps, I'd advise you to set the change frequency to "never" so Google won't re-crawl the same page. In addition, I'd have a random hot section/category on the homepage each day. This might send the spider in a different direction on each crawl. It's worked for me in the past, but on nothing of this magnitude.
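For reference, a sitemap entry with that change frequency looks something like this (the URL and priority value are illustrative):

```
<url>
  <loc>http://www.example.com/baton-rouge-la/</loc>
  <changefreq>never</changefreq>
  <priority>0.8</priority>
</url>
```

Note that `changefreq` is a hint, not a directive; Googlebot may still revisit pages marked "never" on its own schedule.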
The site went live 3 weeks ago.
G-Bot started grabbing about 1K pages/day after 10 days. It kept on that schedule until a few days ago when it increased to 30K pages/day.
I expect that speed will increase once again before too long. The fact is that G-Bot could crawl the entire site in less than a day if it wanted to. G-Bot is very conscious of whether or not it is going to crash a server, and from what I have seen it won't take more than about 150 pages/second at peak, but right now it's throttled down to 1 page every 2 seconds.
I suspect it also grabs a small set of pages and checks them for spam/duplicate content before deciding to crawl a site more aggressively.
One interesting observation as G-Bot crawls the site:
Sitemaps were submitted with all URIs, as well as update frequency and priority. G-Bot has focused primarily on two types of pages thus far. One is the only page that is updated daily and thus has fresh, unique content. But the page it seems most interested in is a map page with a very low priority and an update frequency of never.
Is it possible G is using these pages to harvest latitude and longitude data for the subject of the content?
G-Bot seems relatively uninterested in the pages that would be searched for most frequently.
[edited by: mbennie at 11:20 am (utc) on Sep. 11, 2007]
That being said, I'd go for it because if you've created a quality site you're going to attract the links you need to start climbing the SERPs.
Best of luck. Remember, if everyone listened to the doomsayers...
Remember, if everyone listened to the doomsayers...
What doomsayers? We're talking 29.5 million pages. Don't you think that requires a bit of finesse and planning? Or, would your suggestion be to unleash all 29.5 million at once?
Me? I'd rather take the route with the least risk involved. Releasing them in bits and pieces would be my strategy, and I'd do the most important pieces first.
FYI, I've released up to 30K new pages at once, indexed in about 3 days, no penalty triggered.
I'd like to focus on timeframes too. When were those 30K new pages released? And was it an existing site or a "brand new" site such as the OP's? I'm more interested in seeing what has happened to sites of this size over the past 1-6 months. If it was a year ago or more, that doesn't count today.
Half of the issue is the number of navigation levels you would need: either a very deep structure (index/state/county/city/businesstype/listing is 6 levels deep from the homepage) or a broad structure (index/state/city/listing), which would result in thousands of links per page.
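That trade-off is just exponential fan-out: to reach ~29.5 million leaf pages, every level of depth you remove forces more links per page. A quick sketch of the arithmetic:

```python
# Links per page needed at each depth so that fanout ** depth
# covers ~29.5M leaf pages (uniform fan-out assumed for illustration).
TOTAL = 29_500_000
for depth in (3, 4, 6):
    fanout = TOTAL ** (1 / depth)
    print(f"depth {depth}: ~{fanout:,.0f} links per page")
```

At 3 levels you need roughly 300 links on every page; at 6 levels, under 20, but the listings end up 6 clicks from the homepage.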
The other half of the issue is Google's cleanup of spam in the index over the past few years. You say that your pages are unique, but I'm pretty sure I'd have the same type of information as you have, with the same company contact details, optimised for the company name, etc. There are sites with exactly the same info as you which have been around a lot longer.
Think of it from Google's angle: it sees a brand new site with, say, 60% of its content similar to other sites, 20% similar to other pages on your own site (looking for info on <businesstype> in <cityname>, etc.), and the whole site has 29 million pages. Do you not think that is going to sound the alarm bells for Googlebot?
We launched a new site about 2 weeks ago that has about 29.5 million unique, hand-edited, peer-reviewed, search-engine-friendly URLs. :)
To add to Tester: Amazon, MSFT, etc. also have dozens of links to each page from many blogs and online forums as people discuss movies, tech issues, and so on.
If you ask me, you are or will be blacklisted soon. There is something suspicious about 29.5 million pages coming online overnight.