|How does Google determine which pages to crawl?|
I recently created a lyrics database and it has about 150,000 web pages. I have submitted to Yahoo and some directories, and I estimate that my PageRank will be between 2 and 3.
How does Google determine the number of pages to crawl? Is it based on PageRank? Will I need more links coming into my site, or will Google get to all my pages eventually?
How can you say "estimate that my page rank will be between 2 and 3"? Only Google can say so.
Crawling of your pages depends on your site's navigation, architecture, and design.
> How does google determine the number of pages to crawl. Is it based on page rank?
My short answer would be yes, it is largely based on PR, but it may not be quite that simple. Please read on for the longer version:
In my experience, the initial crawl of your site will/should begin shortly after the GoogleBot detects an external inbound link pointing to your domain.
> I have submitted to yahoo and some directories
This should start getting your site crawled nicely, IMO.
> it has about 150,000 web pages.
> ...estimate that my page rank will be between 2 and 3.
I doubt if anything close to 150K pages will get crawled with a PR of 2 or 3. You might want to get it up to 4 or 5, or even higher ;-)
After this, the internal navigation structure of your site will play an important role. This issue is (much) more involved than I can cover here in brief. There have been some excellent posts and discussions here on this matter, and I'm sure someone will be kind enough to post the links shortly. (I'm not too good at 'searching' here yet :-))
In a nutshell, you might want to link to the important sections of your site from your home page and/or make these links part of your global navigation. Then link to less important (secondary) sections of your site from those pages which are directly linked from the home page, and so on. A well-structured site map section should also help maximize crawling.
Assuming you publish pages from the database using a script, it pays to keep your query strings as short as possible, i.e. with the minimum possible number of parameters. Alternatively, you can use mod_rewrite to map the query URLs to a static form. Again, excellent guidance on this is available here too.
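To illustrate the mod_rewrite idea, here's a minimal sketch assuming an Apache server with mod_rewrite enabled and a hypothetical `lyrics.php` script taking `artist` and `song` parameters (the file names and URL layout are made up for the example):

```apache
# Hypothetical example: serve static-looking URLs from a query-string script.
# /lyrics/beatles/yesterday.html -> /lyrics.php?artist=beatles&song=yesterday
RewriteEngine On
RewriteRule ^lyrics/([^/]+)/([^/]+)\.html$ /lyrics.php?artist=$1&song=$2 [L,QSA]
```

The crawler then only ever sees the clean `.html` URLs, with no query string at all.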
Finally, some strategically placed external inbound links pointing to deep / internal sections of your site should come in handy too.
Web_Savvy has summed it up quite nicely I think.
It might also be worth bearing in mind that if you're covering popular songs, the many other lyrics sites out there will already have the same content as you (you can't really go changing the words to a song, right?).
If GoogleBot consistently finds duplicate content throughout a site, I'd imagine it'd move on to fresh meat fairly quickly. So it might also be worth thinking about how you can make your site stand out from the crowd content-wise, if you've not already done so, in order to keep GoogleBot's attention.
[edited by: Bones at 11:40 am (utc) on Dec. 27, 2006]
|I doubt if anything close to 150K pages will get crawled with a PR of 2 or 3. You might want to get it up to 4 or 5, or even higher ;-) |
I'd also like to know more about that. Has this really been discussed and evaluated here? I have come across a number of threads where people complain about not getting fully indexed with their one-page-per-product shops, but I cannot remember anyone pointing towards that PR argument.
Though it seems quite likely: such a PageRank/crawl-depth relation would explain why for some time the SERPs used to be full of outdated eBay auctions, whereas many other sites with mid/low PR did not receive the attention they desired.
Is this relation confirmed, tested, proven or still a myth?
To reply to abacus: I can estimate my PageRank because I know who is linking to me and what the PR of those pages is. It will be low.
Oliver, in my experience there is a correlation between PR and pages indexed in Google. I have several PR 2 or PR 3 sites, and I don't seem to have trouble getting pages indexed on the order of several hundred. But to index over 100K, I think, as most people suggest in this forum, you need something like PR 5.
|Is this relation confirmed, tested, proven or still a myth? |
- Matt Cutts stated around the release of Big Daddy that indexing depends partly on PageRank.
- Both Matt Cutts and Adam Lasnik have said the main factor in getting into the main index (and avoiding the supplemental index) is PageRank.
Keep in mind, PageRank applies to individual pages. Just because the home page is TBPR 8 doesn't mean thousands of pages will make it into the main index automatically without a decent internal linking structure.
[edited by: Halfdeck at 5:29 pm (utc) on Dec. 28, 2006]
OK, getting into the index with uncompetitive, niche-specific product pages aimed at the long tail is one thing. Taking the 100-links-per-page rule of thumb, it should be no big problem to get almost 10K pages indexed even with a PR 3 page, provided the content is sufficiently unique and interesting.
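A quick back-of-the-envelope check of that "almost 10K" figure (a sketch only; the 100-links-per-page guideline is the old rule of thumb, not a hard limit):

```python
# Pages reachable from the home page within a given click depth,
# assuming every page carries the rule-of-thumb maximum of 100 links.
def reachable_pages(links_per_page: int, depth: int) -> int:
    """Total pages reachable in `depth` clicks or fewer, home page included."""
    return sum(links_per_page ** d for d in range(depth + 1))

# Home page + 100 section pages + 100*100 detail pages:
print(reachable_pages(100, 2))  # 10101 -- roughly the "almost 10K" above
```

So two clicks of depth already covers the whole long tail of a 10K-page catalogue, which is why a flat internal structure matters so much here.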
Another thing is to get rid of that ugly green bar in your webmaster central saying "pagerank not yet assigned."
I find Matt Cutts's blog quite hard to follow as a non-native speaker of English, so have you come across anything new on that issue?
Run some test sites if you really want to know how much "link value" it takes to get pages indexed. It's easy to test and will give you much more information than any post on a forum can. Get a cheap, never-before-used domain name, generate 20K pages of unique content (it doesn't have to be readable), and slowly add links to it from known pages (where you can also be certain not to get any traffic from, perhaps even almost-hidden links).
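A minimal sketch of the generation step described above, assuming you just want N pages of unique, not-necessarily-readable text (file names and layout are hypothetical; adapt to your own server):

```python
# Generate n_pages of unique placeholder HTML so no two pages look
# like duplicates to a crawler. Content is random lowercase "words".
import os
import random
import string

def make_test_pages(out_dir, n_pages, words_per_page=200):
    os.makedirs(out_dir, exist_ok=True)
    rng = random.Random(42)  # fixed seed so a re-run reproduces the same site
    paths = []
    for i in range(n_pages):
        words = ("".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))
                 for _ in range(words_per_page))
        body = " ".join(words)
        path = os.path.join(out_dir, f"page{i}.html")
        with open(path, "w") as f:
            f.write(f"<html><head><title>Test page {i}</title></head>"
                    f"<body><p>{body}</p></body></html>")
        paths.append(path)
    return paths
```

Upload the output, add one link at a time from an established page, and watch the server logs for GoogleBot's crawl depth.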
One thing that is going to be different however is the reaction to duplicate content. If you are going to put 150k pages of lyrics online, then I'd have to assume that most (or all) of that has been put online before. How is Google going to react to a site where 99% is duplicated? Will that change the indexing strategy? Will it change crawler priority for the site (or just some pages)? How can you work against that (other than the obvious: unique content)?
|I find Matt cutts blog quite hard to follow for a non-native-speaker of the English language, |
Oliver, that is because anyone who works for Google's PR department (whatever they say their primary job is) isn't using natural English anyway. Double-speak and misdirection are much closer to what is put out, so you aren't really missing much of any import. And there are many here (like the would-be shamans at Delphi) who are only too pleased to "interpret" the words of G reps for us "unenlightened"; many even claim to do SEO for a living :)
And your English is fine; it puts many "native English speakers" here to shame. It's the evidence of an unblinkered thinking process that counts anyway :) You're doing fine there.
Launching 150K pages in one hit is risky. Matt Cutts has said in his videos ("launching big sites") that releases need to be restricted to under 5,000 pages a week. For big sites this is a potential problem.
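To put that in perspective for this particular site, simple arithmetic assuming the 5,000-pages-per-week guideline applies as stated:

```python
# How long a staged rollout of the 150K-page lyrics site would take
# at the suggested cap of 5,000 new pages per week.
import math

total_pages = 150_000
pages_per_week = 5_000
weeks = math.ceil(total_pages / pages_per_week)
print(weeks)  # 30 -- i.e. roughly seven months to release the whole site
```

That's a long runway, so it's worth planning which sections go live first.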
If your site is original content, then a reinclusion request may help, but if it's old content or on a feed you may encounter problems.
Then there's the issue of firing deep, relevant links into your site from other sites, preferably with good PR. Good PR distribution throughout your site will certainly help.
I think Google is good at picking up "click activity" now, so if users are reaching particular pages, I suspect Google counts these as part of the "votes" that help you be scored, ranked, and crawled more frequently. There's a bit of a chicken-and-egg problem here, as you try to achieve one before the other.
I guess the key is that if you have good, unique content that users love, Google will crawl it, index it, and rank it, and you're sitting in clover [success].
Not all of us have achieved this utopia yet :)
[edited by: Whitey at 6:45 am (utc) on Dec. 31, 2006]