Googlebot and Deep Spidering

Large site not fully indexed.

         

George

12:03 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Every day or every couple of days, Googlebot pops in, takes a few hits, and then disappears again.
The site is dynamic, but I think search engines can see it (WiseNut and Alexa have been through it).

Normally I would not be concerned about this, but it has been going on since before December now. The PR has jumped from grey, to white, to PR3 and now PR4, so there are no problems with the inbound links.
Today I put in a deep link to see if it makes a difference, and whether that page gets picked up. Does anyone have a suggestion, please?

I still largely think it is normal Googlebot behaviour, but normal behaviour for me would mean at least 200 pages spidered by now!
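To put a number on "pops in, takes a few hits", the raw access log can be tallied per day. This is a minimal sketch, assuming the server writes Apache combined log format and that Googlebot identifies itself in the user-agent string; the function name and the sample lines are made up for illustration, and a user-agent match alone can be spoofed.

```python
import re
from collections import Counter

# Date part of a combined-log-format timestamp, e.g. [27/Jan/2004:12:03:11 +0000]
DATE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def googlebot_hits_per_day(log_lines):
    """Count log lines per day that mention Googlebot (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        if "googlebot" not in line.lower():
            continue
        match = DATE_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return dict(hits)


# Made-up sample lines; the IPs come from a documentation-only range.
sample = [
    '192.0.2.10 - - [27/Jan/2004:12:03:11 +0000] "GET / HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [27/Jan/2004:12:03:15 +0000] "GET /site_map.php HTTP/1.0" 200 2048 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.77 - - [27/Jan/2004:12:05:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [28/Jan/2004:09:30:42 +0000] "GET / HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]

print(googlebot_hits_per_day(sample))
```

Comparing the requested paths day by day also shows whether the bot ever gets past the root page.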

ciml

12:39 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wouldn't expect Googlebot to index very deep if the highest PageRank on the site is 4. On the other hand, PR4 would normally be enough to get 200 pages indexed with static-looking URLs.

If the URLs are quite complex (e.g. /index.pl?1=foo&2=bar&3=yin&4=yang) then Google isn't likely to crawl very deep. If the URLs have a CGI parameter called id then I don't think they'll be followed.
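A quick way to audit a URL list against those two rules of thumb is sketched below. The function name, the returned keys, and the threshold of two parameters are assumptions made for illustration; nothing here reflects a documented Google limit.

```python
from urllib.parse import urlsplit, parse_qs


def describe_url(url, max_params=2):
    """Summarise how 'dynamic' a URL looks, based on its query string."""
    query = urlsplit(url).query
    params = parse_qs(query, keep_blank_values=True)
    return {
        "param_count": len(params),
        "has_id_param": "id" in params,
        "looks_static": len(params) == 0,
        # max_params is an arbitrary cut-off chosen for this sketch.
        "crawl_risk": len(params) > max_params or "id" in params,
    }


for url in (
    "/index.pl?1=foo&2=bar&3=yin&4=yang",
    "/product.php?id=42",
    "/widgets/blue.html",
):
    print(url, describe_url(url))
```

URLs flagged here are candidates for rewriting into static-looking paths.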

johnser

1:53 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Changing from dynamic to static URLs would be a first (obvious) guess.

After that, maybe buy some links on every page of a big PR7/8 site for a month or so for $100-200. If that doesn't work, you've got a server/code problem.

I just had 8k+ pages of a 10-day-old site crawled in the last 24 hrs. Gbot's been really busy in the last few weeks, so you should be getting some joy...

George

2:09 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



johnser
I've had that before, and this site has static pages too that are not being indexed (PHP with includes for headers etc., but simple stuff I've done before).
The other part of the jigsaw is that I discovered the domain was purchased, and might have previously expired.

Could that be an issue, do you think? My thought was that if it is showing PR on the index page, then it should be OK. Again, if Googlebot is seeing two pages, should it not continue through the others?

ThomasB

9:09 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think PR is the problem. I have a site with NO backlinks and PR 0. The site was just submitted to G using the "add url" form.

More than 20,000 of its pages are indexed by Google. I'd say PR is nice to have and helps with indexing, but it's not necessary.

ciml

9:31 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thomas, could it be that the sites you're seeing deep crawled quickly have links from quite high PR pages, but that the Toolbar doesn't reflect it yet?

Nowadays, people are seeing pages behave as if links are counted much sooner than the backlink/PR updates.

andrew_m

9:38 pm on Jan 27, 2004 (gmt 0)

10+ Year Member



To me that's not even a question any more: the number of pages Googlebot will index is proportional to PR.

Plus, if you read the original Google whitepaper, that is basically what it says: that the probability a bot is going to follow a link on a page is proportional to the page's PR.
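For reference, the random-surfer model in the PageRank paper treats a page's PR as the probability that a surfer who keeps clicking links at random (and occasionally jumps to a random page) is on that page. The sketch below is the common normalized power-iteration form with a damping factor of 0.85; the three-page graph and all names are hypothetical, and it says nothing by itself about how Googlebot actually schedules crawling.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict of page -> list of outlinked pages.

    Every link target must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Random-jump share that every page receives each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page site.
graph = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the home page ends up with the largest share because both other pages link back to it.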

Global Wayne

9:48 pm on Jan 27, 2004 (gmt 0)

10+ Year Member



Hint :) if you have a BIG site and want to be deep spidered within 48 hours (but not necessarily show up in the SERPs until the next major update), consider setting up Google as your site search tool!

/Wayne

ThomasB

11:05 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ciml, sorry for not supporting your theory, but there have been no links to the domain for at least 8 months (it was a PR 5 before, I think). So PR is definitely no factor in getting sites indexed. It definitely depends on your needs, but if it's fewer than 10k pages, just try to have a good internal linking structure and all pages should be indexed.
BTW, more than 100k Googlebot hits this month.

Just made another check and it has 3 backlinks from guestbooks, but PR is definitely 0. (white bar)

Sharper

12:00 am on Jan 28, 2004 (gmt 0)

10+ Year Member



I've got a site that has 1.6 million static .html pages linked in a pyramid structure. The home page is PR 7. It has a few dozen deeplinks from other sites, but it's mostly got tons of homepage links.

Month by month, Google went from homepage only, to ~8K pages, to ~30K pages, to ~50K pages, to ~100K pages, to ~250K pages, but has been hovering around ~300-330K pages in the Google index for the last two months. This last month Google's spidering of the site has really slowed down, and the total number of pages it's going to grab appears to have leveled off.

One of the things I did back when it was at about 30K pages in Google was to create a set of 12 site map pages linked from the bottom of the home page and then directly to pages in the "middle" of the site structure so that pages at the bottom would be closer link-distance from the home page. That seems to have helped.
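The "link-distance" effect of those site map pages can be checked with a breadth-first search from the home page. This is a minimal sketch on a made-up five-level pyramid; the page names and the `click_depths` helper are hypothetical.

```python
from collections import deque


def click_depths(links, start="home"):
    """Return the minimum number of clicks from `start` to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Toy pyramid: home -> cat -> sub -> mid -> leaf
pyramid = {
    "home": ["cat"],
    "cat": ["sub"],
    "sub": ["mid"],
    "mid": ["leaf"],
    "leaf": [],
}

# Same site with one site map page linked from home and pointing at the middle.
with_sitemap = dict(pyramid, home=["cat", "sitemap"], sitemap=["mid"])

print(click_depths(pyramid)["leaf"])
print(click_depths(with_sitemap)["leaf"])
```

Running the same search over a real crawl of the site would show which sections still sit furthest from the home page, which is one way to decide where another set of site map pages would help most.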

Anyone else been in this situation? I'd really like to break out of the plateau with this site, but all I can think to do is set up an additional group of site-mappish pages after researching what sections of the site are the ones mostly not in Google.

For new sites, I've been trying to set them up and structure them so that their natural largest size will be no more than 150-200K pages, just to avoid this sort of problem in the future, but I'd still like to figure out how to get the rest of this site indexed completely.

George

12:23 am on Jan 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Nowadays, people are seeing pages behave as if links are counted much sooner than the backlink/PR updates.

Agree. I had one site doing 3000 visitors per day on a white bar last year as an example of how PR can be delayed on the toolbar.

> Just made another check and it has 3 backlinks from guestbooks, but PR is definitely 0. (white bar)

Have you checked on AllTheWeb? Any other links there? I also see PR linked to the number of pages spidered.

> Anyone else been in this situation? I'd really like to break out of the plateau with this site

Sharper, it sounds like you have two options. One is site maps, as you say, to improve the linking structure. I find it helps to link them in a spiral structure; you can make good spider food that way. The other is to get more links :)

Global Wayne, sorry, I cannot run with that idea.
I have not seen any perceptible difference in any of the examples I have been involved with.

Interesting idea about the site maps. I have links to the /site_map.php pages at the bottom of the page too, because the menu is all in a menu.js file and I know that will not be spidered. The thing is, Googlebot has not been near them. I'm still left with either:

Dozy bot.
Penalised URL (anyone seen this?)
Server problem (1 other page was picked up, so can it be?)

None seem likely. What am I missing?
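One thing that can be ruled in or out quickly is what a crawler that does not run JavaScript actually finds in the page source. The sketch below collects plain anchor links and external script references; the page markup is a hypothetical stand-in for the real template, and the assumption that the bot ignores menu.js is the one stated above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect <a href> targets and <script src> files from static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "script" and attributes.get("src"):
            self.scripts.append(attributes["src"])


# Hypothetical page: the menu lives in menu.js, the site map is a plain link.
page = """
<html><body>
<script src="/menu.js"></script>
<p>Welcome</p>
<a href="/site_map.php">Site map</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
print("crawlable links:", collector.links)
print("script files (not followed as links):", collector.scripts)
```

If the site map link shows up in the first list for the live page source, the markup is not what is keeping the bot away.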

ciml

12:41 am on Jan 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> PR is definitely no factor to get sites indexed

Well Thomas, that comment certainly defies conventional thinking.

If you can get 10k pages indexed from a couple of guestbooks giving PR<1 then something interesting is happening.

ThomasB

7:15 am on Jan 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ciml, that's true. I've been seeing a very aggressive Googlebot for a few weeks now. Maybe they want to have by far the biggest index for their IPO?

johnser

12:43 pm on Jan 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



George - That's an interesting thought re the old domain you bought. Last year I had a simple site that was not crawled for 3 months due to Google's DNS database tardiness.

Perhaps this is the reason for your problems?
Your Gbot's behaviour sounds like what I had on my site.

Before doing anything radical there, I'd strongly recommend getting some very high PR links for 1-2 months.

Then you'll know for definite if it's a PR issue or not.
J

George

1:19 pm on Jan 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



johnser,
What did you do to get it straight again? Just wait?
It changed hands about 10 or 11 months ago, and was held by the domain company since about Dec 2000. (according to the wayback machine.)
Does this sound similar?
Your thoughts appreciated :)

johnser

1:41 pm on Jan 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could it have been part of a "bad neighbourhood" in some way even while with the registrar?

I just changed all links to point to the site I wanted crawled, and after what seemed like forever it finally was crawled, and the pages ranked well from last May until last Sunday!
:(

You could always get a brand new domain & start from scratch?

George

2:39 pm on Jan 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>>>>You could always get a brand new domain & start from scratch?

Ho Ho Ho... the chap who owns it paid money for it. Not sure he would be delighted. He also bought his own server, so he is stuck with an oddball one; it was cheaper :(.

Sharper

11:17 pm on Jan 28, 2004 (gmt 0)

10+ Year Member



George,

By spiral structure, do you mean sort-of offset partially meshed? Could you give me (or point me to) a good description of a spiral structure?

I'm not going to go back and mess that much with the above site's architecture, but I have a site still in the planning stages that has a really flat structure I could turn into a spiral if I could come up with a good rule of thumb for creating the spiral. I'm not averse to trying out a new structure for a site to see what the results are. :)

George

12:11 pm on Jan 30, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For those interested, I think this is the problem:

Googlebot still looking for the old IP Address [webmasterworld.com]
A snippet from the thread:


> Someone left me sticky mail suggesting that the original server be resurrected (with its original DNS). I took their advice, and observed the traffic. Google began spidering it like crazy within minutes, even though the server had been gone for a year. It would seem that Google thought that the old and new servers were the same; the presence of the new server kept Google coming back, but it would continue to try to hit the old, and failing that, would never add pages to its index. i.e. it would look to see if the new site was there by hitting the root index page, and then try to access pages on the old server.

Has anyone had other successful ways of getting it to kick-start? For example, putting up a robots.txt to disallow everything for a couple of weeks, and then opening it up again?
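If anyone tries the robots.txt toggle, it is worth confirming what each version of the file tells a crawler before uploading it. This is a minimal sketch using Python's standard robots.txt parser; the helper name, the two file bodies, and the example.com URL are placeholders, and it only checks the rules themselves, not whether the toggle changes Googlebot's behaviour.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Phase 1: block everything. Phase 2: open the site up again.
blocked = "User-agent: *\nDisallow: /\n"
open_again = "User-agent: *\nDisallow:\n"

test_url = "http://www.example.com/site_map.php"  # placeholder URL
print(googlebot_allowed(blocked, test_url))
print(googlebot_allowed(open_again, test_url))
```

An empty Disallow line means nothing is blocked, so the second file is equivalent to having no restrictions at all.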