Forum Moderators: open
A typical visit fetches robots.txt and then one page, apparently at random. This pattern repeats, with rarely more than one page being taken. In the last few weeks it has grabbed a site map only once, and that may have been a random hit.
It rarely takes the index page, although when it has, it followed links to new pages and indexed them. Taking the index page also appears to happen at random.
Does anyone know if this is truly a random sampling, or is there some strategy I am missing? Why does it pick one page rather than another?
(Please note I do not need any advice on getting more inbound links, or being patient, etc. :) )
Harry
Are you familiar with raytracing in distributed environments? The task is to calculate all the pixels of a picture frame. Every workstation that has spare time asks the main server to be allowed to compute one line of the picture. No two workstations start at the same time, nor will two workstations ever finish their lines at the same time, except by mere coincidence.
There is not one Googlebot running through all pages one at a time; there is an ordered, probably prioritized queue of pages to visit, and every working bot grabs one from the queue and processes it, delivering the result back to some group leader. Since there are many, many Googlebots, and probably even many queues with similar content, there will be some redundancy, and pages in serial order in the queue will be processed on different days by different bots.
This is my view of it :)
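To make the queue model concrete, here is a toy sketch of the scheme described above: one prioritized URL queue that several autonomous bots could drain independently. This is purely illustrative; the class, URLs, and priority values are all invented, and nothing here claims to reflect Google's actual implementation.

```python
import heapq

class CrawlQueue:
    """Toy prioritized queue of pages to visit (invented example)."""

    def __init__(self):
        self._heap = []      # entries are (priority, insertion_order, url)
        self._counter = 0    # tie-breaker so equal priorities stay FIFO

    def add(self, url, priority):
        # Lower priority value = fetched sooner.
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def grab(self):
        """A bot takes the next page to fetch, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

queue = CrawlQueue()
queue.add("example.com/robots.txt", priority=0)
queue.add("example.com/deep/page.html", priority=5)
queue.add("example.com/index.html", priority=1)

# Any number of independent bots could call grab() on a shared queue;
# here one loop stands in for them.
order = [queue.grab() for _ in range(3)]
print(order)
# robots.txt comes out first, then the index page, then the deep page
```

With many bots pulling from such a queue (or from several similar queues), adjacent entries end up processed by different bots on different days, which would explain the seemingly random one-page-per-visit pattern.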
... there is an ordered, probably prioritized queue of pages to visit and every working bot grabs one from the queue and processes it ...
Hi Dirkz,
What you say makes sense. But is it true? Does anyone know?
Yes, still patiently building those links... And waiting for Googlebot to notice them... :(
Harry
or is there some strategy I am missing?... I do not need any advice on getting more inbound links
Sorry, but that's precisely what you need, with a twist: some good deep links. The paths a spider takes are links, and you have to build those paths into the website. To do this you have to set up some mutually beneficial link partnerships.
None of this guestbook, blog, or forum-sig stuff, but an actual partnership with a website that shares an audience but not a market. Some websites accept content contributions, and that's a good way to dictate your own link.
When the link buddy thing gets old and ineffective you have to look to other opportunities and ways of doing things. Building paths straight into your deep content is probably the best way to encourage deep spidering. I can't really suggest any other solution.
(Please note I do not need any advice on getting more inbound links, or being patient, etc. :) )
My question is purely academic. To restate it: Why does a Googlebot pick one of my pages rather than another?
It is clearly *not* following links, given the number and variety of pages it has taken. If all these pages had inbound links I would be a very happy Harry, but unfortunately they don't.
Dirkz has come up with one likely explanation, but being an optimist, I was hoping somebody would have a definitive answer.
Harry
As for an academic answer, there are only a few people who could give one (Larry Page, Sergey Brin, GoogleGuy, etc.), but none of those who know will.
From the sometimes very "drunken" behavior of Googlebot you can quite safely conclude that there are many of them, and that they are autonomous to a certain degree. There are more posts on this in the forum; you could search for plasma's last-modified experiments.
If you want a qualified answer, get yourself many sites with many pages on different servers and analyze the log files for Googlebot behavior :)
It could be that the PR flow to your pages differs because of deep linking, meaning that the pages Googlebot visits earlier are in fact linked to more often, more directly, or more prominently.
No, the way the pages are taken is purely random as far as the site is concerned. There is no correlation with inbound links, internal links, or site structure.
Personally I think it's because Google is fouling up. Earlier this year almost all my traffic came from Google, but now MSN contributes about 40% and growing. Inktomi takes all my pages frequently, but in the Google databases most of my cached pages are out of date.
GoogleGuy waffles on about the virtues of site maps, but keeping a site map up to date is of no benefit if the bot never visits it - yet site map links are the first ones the bot sees when it visits my home page.
Just my 2c's worth. Rant over. :)