When does Googlebot come, and what does it crawl?

Can someone explain Googlebot's behaviour?

silverbytes

3:43 pm on Sep 22, 2003 (gmt 0)

I get daily visits from Googlebot; most of the time it crawls only the index page and robots.txt (which allows all bots).

But I can't get my new documents crawled very often... in fact, some HTML files were uploaded a week ago, and Google doesn't seem to have accessed those resources in at least the last 2 weeks...

Will they be crawled?

Could someone explain Googlebot's behaviour? Visits, crawls, deep crawls...

Mike12345

4:54 pm on Sep 23, 2003 (gmt 0)

Can't really answer your question directly, but there are things you can do to make your site more spider-friendly:

Use absolute links rather than relative.

Get links from external sources to the aforementioned pages.

Make sure your pages are linked from more than one page on your site; a sitemap helps a bit.

Doing the above might help a bit. Sometimes it's a waiting game, but increasing your links will definitely increase your chances.

silverbytes

5:06 pm on Sep 23, 2003 (gmt 0)

Good tips, thank you!

Perhaps someone could explain the deep crawl and fresh bot behaviours...

Does Google crawl the entire site monthly?
Why does the spider fetch index.htm twice a day but leave the remaining pages alone?

dougmcc1

9:02 pm on Sep 23, 2003 (gmt 0)

Does your robots.txt file validate? Where are your documents located and how are they being linked to? How long has your site been up?

silverbytes

8:46 pm on Sep 24, 2003 (gmt 0)

Validates: yes, it does. The validator says:

This Robots.txt validates to the robots exclusion standard!

The site has been up for 8 months and listed in Google for 3 months...
Documents are in the root and 1 folder deep, linked with standard HTML,
with links using absolute paths from the index to internal pages, and vice versa.

Google recognizes some backward links to the home page, up to 7.

karmov

3:00 pm on Sep 25, 2003 (gmt 0)

Use absolute links rather than relative.

Is this true or just superstition?

Mike12345

3:02 pm on Sep 25, 2003 (gmt 0)

Is this true or just superstition?

- It doesn't help your ranking afaik, but it seems to help spiders get around your site more easily. I'm sure there's a technical explanation somewhere.....

dougmcc1

3:17 pm on Sep 25, 2003 (gmt 0)

Robots.txt

Not sure if this is a problem, but just to be safe you might want to use all lowercase letters - robots.txt instead of Robots.txt.

Also silverbytes, you don't need to use a robots.txt file to allow all bots. That's the default setting. If you're only using the robots.txt file to allow bots, then remove it.

Is this true or just superstition?

True. Maybe not necessary, but definitely recommended.

dougmcc1

3:31 pm on Sep 25, 2003 (gmt 0)

I'm sure there's a technical explanation somewhere.....

Absolute links tell the bot exactly where on your site a file is located whereas relative links kind of push it along which makes it easier for the bot to get lost.

richmc

4:07 pm on Sep 25, 2003 (gmt 0)

Absolute links tell the bot exactly where on your site a file is located whereas relative links kind of push it along which makes it easier for the bot to get lost.

I can only see this being true if:

a) The spider somehow "forgets" which page it is on.
b) The link is a 404 anyway.

I can't really see it being likely that absolute/relative URLs affect spidering at all unless the spider is buggy.

Maybe MSN's new bot? ;)
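A minimal sketch of that reasoning, in Python: a client resolves each relative link against the URL of the page it is currently on before it requests anything, so a relative link and its absolute equivalent end up as the same URL. The domain `www.example.com` and the file names are placeholders, not taken from anyone's site in this thread, and this shows how Python's standard library resolves links, not how any particular bot is implemented.

```python
from urllib.parse import urljoin

# URL of the page the crawler is currently parsing (placeholder domain).
base = "http://www.example.com/docs/index.htm"

links = [
    "page2.htm",                              # relative to the current folder
    "../about.htm",                           # relative, one folder up
    "/contact.htm",                           # root-relative
    "http://www.example.com/docs/page2.htm",  # absolute
]

# Resolve every link against the page it was found on.
resolved = [urljoin(base, href) for href in links]

for href, url in zip(links, resolved):
    print(f"{href!r:45} -> {url}")
```

The first and last entries resolve to the same URL, which is why a correctly written relative link should not make a difference to a spider that knows which page it is on.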

silverbytes

5:37 pm on Sep 25, 2003 (gmt 0)

I agree, bots should be able to find both of them. However, I changed to absolute URLs because it seems safer, or at least to rule that out as the cause of my problem.

I just wonder whether waiting a month will get my site deep crawled because Googlebot runs on some schedule, or whether I really have a problem.

Is this a myth or not: Googlebot crawls your site in depth once a month, and does a light scan daily or several times a day.

Could someone explain?

cdog863

6:20 pm on Sep 25, 2003 (gmt 0)

I'm looking to get this same question explained. I also get hit by the fresh bot daily, yet the deep bot has not hit my site yet.

It's only been up and going for about 3 weeks, but it already has about 15 pages listed in their search engine, and I have about 400 that should be spidered. Each of them has its own meta description and unique title.

Does the deep bot come on a "normal schedule", or is it different for every site?

Also, I do not have a robots.txt, but I do have the meta tag meta robots=all on all my pages. Should I change this?

p.s. Most of those pages are dynamic pages... ex:
?whatever=whatever after the page file... Will this affect the bot in any way? All of the pages link back to other pages, and every "content page" links back to the main page.

I have a jokes page with a <previous next> link for moving between the jokes... do you think this would confuse the bot?

pchristensen

7:24 pm on Sep 25, 2003 (gmt 0)

I am glad this question was asked today as this has been on my mind.

My index page is being crawled every 1-2 days. I know because every day I hard code a line that says "Site updated on mm/dd/yy". Then I simply look at Google's cache for the last bot visit. However, pages one and two layers deep have dates that go back several weeks, maybe longer.

I thought that Google did away with the deep crawl and only makes adjustments based upon periodic fresh bots. Or is it only the monthly "dance" that has disappeared, while deep crawls still occur on a weekly or monthly basis? If so, how often does Google run the deep crawl? Perhaps my deeper pages are getting updated only by the deep crawl.

Fruit and Veg

9:53 pm on Sep 25, 2003 (gmt 0)

dougmcc1 = 'If you're only using the robots.txt file to allow bots, then remove it.'

Are you absolutely sure about this?

I've always used the robots.txt file to allow everything so the bot can a) see that the file exists, and b) know for sure that it's allowed to crawl (i.e. I haven't just forgotten to put one up).

If it should be removed, then why?

silverbytes

10:15 pm on Sep 25, 2003 (gmt 0)

Well, will someone answer what we all want to know: when does Googlebot come, and what does it crawl?

I tried to post this in Google News, but it seems that wasn't successful...

C'mon, experts!

dougmcc1

10:21 pm on Sep 25, 2003 (gmt 0)

I agree, bots should be able to find both of them.

I agree as well, but you never know and it's good to be safe. And there are other bots out there besides Googlebot believe it or not :)

Also, and I was hesitant to say this before because I whole-heartedly disagree, but I heard that absolute links might be treated as external links by some bots because the bot actually calls your domain again from the outside, whereas with relative links the bot never leaves your site. And we all know external links are good. Like I said, I have no reason to believe such a statement, but what the hell, I use absolute links just in case ;) What's it hurt?

a) see that the file exists

A visible, crawlable link to the file tells the bot that the file exists.

b) it knows for sure that it's allowed to crawl

It knows it's allowed to crawl unless you explicitly tell it not to via the robots meta tag or the robots.txt file.

If it should be removed, then why?

It only poses a risk to your site if all you are using it for is the default settings. One simple, stupid mistake can keep your whole site from being crawled. Why chance it?
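To make the risk concrete, here is a small sketch using Python's standard `urllib.robotparser`. It compares an allow-all file, a file where a single stray `/` blocks everything, and no rules at all. The helper `can_crawl`, the sample rules, and the URL are illustrative assumptions, and the result reflects how Python's parser reads the rules, not a guarantee about Googlebot itself.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_lines, url, agent="Googlebot"):
    """Return True if the given robots.txt lines let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)

url = "http://www.example.com/page.htm"  # placeholder URL

allow_all = ["User-agent: *", "Disallow:"]    # empty Disallow: nothing is blocked
block_all = ["User-agent: *", "Disallow: /"]  # one extra "/" blocks the whole site
no_rules = []                                 # stands in for a site with no rules

allowed_with_allow_all = can_crawl(allow_all, url)
allowed_with_block_all = can_crawl(block_all, url)
allowed_with_no_rules = can_crawl(no_rules, url)

print(allowed_with_allow_all, allowed_with_block_all, allowed_with_no_rules)
```

The allow-all file and the empty rule set give the same answer, while the one-character slip in the second file flips it.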

wkitty42

10:59 pm on Sep 25, 2003 (gmt 0)

silverbytes,

googlebot has switched to a rolling update... it's nothing like it was a few months ago... my site gets hit every day on many various pages... i see no pattern to what the bot requests, when it requests, or what triggers it to drop by... i've been updating existing pages and they are in the index and cache within days...

Dave_Hawley

3:00 am on Sep 26, 2003 (gmt 0)

On the subject of Absolute vs Relative links, the question for me is 'why wouldn't you use Absolute links'?

Dave

kwasher

3:30 am on Sep 26, 2003 (gmt 0)

A long time ago (1996?) I read that absolute links take longer to load than relative links.

dougmcc1

3:48 am on Sep 26, 2003 (gmt 0)

why wouldn't you use Absolute links

They're more time consuming, especially for bigger sites.

BlueSky

4:07 am on Sep 26, 2003 (gmt 0)

On the subject of Absolute vs Relative links, the question for me is 'why wouldn't you use Absolute links'?

Absolute links that start with http:// use up an http socket on your server when accessed. Relative links and absolute links without http:// do not. Unless the server gets a decent volume of traffic, the difference is probably not very noticeable. There are only so many sockets on it, though. When they are all used up at the exact same time, the next requester will either get an error message (like a 404, 500, etc.) or experience slowness until one is freed up. Those who use overloaded budget hosts' servers often experience this condition, and they'll see their pages hang and/or time out.

Those who opt to use relative links really ought to use spidering software to make sure the bot will be fed the correct page and not a 404.
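A rough sketch of what such a check might look like, in Python. It pulls the links out of a page, resolves them the way a spider would, and reports any that do not match a known page. Everything here is hypothetical: the `LinkCollector` class, the `find_broken_links` helper, the sample HTML, and the set of existing URLs. A real spidering tool would request each URL and look at the HTTP status instead of comparing against a hard-coded set.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def find_broken_links(html, page_url, existing_urls):
    """Resolve each link against page_url; return those not in existing_urls."""
    collector = LinkCollector()
    collector.feed(html)
    resolved = [urljoin(page_url, href) for href in collector.hrefs]
    return [url for url in resolved if url not in existing_urls]

# Hypothetical page and site contents.
page_url = "http://www.example.com/jokes/index.htm"
html = '<a href="joke1.htm">one</a> <a href="../joke2.htm">two</a>'
existing = {
    "http://www.example.com/jokes/joke1.htm",
    "http://www.example.com/jokes/joke2.htm",
}

broken = find_broken_links(html, page_url, existing)
print(broken)
```

In this sample the second link climbs one folder too far, so it resolves to a URL that is not among the existing pages and gets reported.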

Dave_Hawley

5:01 am on Sep 26, 2003 (gmt 0)

Interesting!

A long time ago (1996?) I read that absolute links take longer to load than relative links

I guess that would be true, as more text = larger page. However, I would like to think that most (buyers) now have better connections. As my site is mainly business software and the vast majority of sales occur during US work hours, I suspect most have a high-speed connection.

They're more time consuming, especially for bigger sites.

One thing I never do anymore is type links. I always copy/paste (from a link that I know works), so this doesn't apply in my case.

BlueSky, your comments are of great interest to me. I would like to think that my host isn't a "Budget host" as it costs me plenty! How would one find out?

Dave

BlueSky

6:49 am on Sep 26, 2003 (gmt 0)

I tried posting a link to where you can look up how many sites are on your server, but it's apparently banned here. Not sure why. So, I sticky mailed it to you instead.

Some hosts will oversell their servers counting on the fact that most sites will stay very tiny and only use a small fraction of bandwidth as well as other services they purchased. These hosts will often put 400, 500, or more domains per server. I know of one that puts in excess of 2,000. What usually happens is a few sites will start to significantly grow at the same time. That's when everyone on the server will see the problem I previously described until they're moved off. Then it will stabilize until the next wave starts growing. Although the number of domains may give an indication of oversoldness, it really boils down to the amount of traffic that is hitting the server at the same time. You can have 400 low traffic sites and never run out of http sockets or have five busy sites or even one and regularly run out.

If your server has been running fine and you don't see periodic page timeouts and/or errors, then I don't think you should worry about it. When/if your site starts monopolizing the sockets, your host will come knocking then and say fix your site.

kwasher

7:00 am on Sep 26, 2003 (gmt 0)

Can you sticky that to me too? THANKS!

silverbytes

11:35 pm on Sep 26, 2003 (gmt 0)

Hi!
Would you prefer that I rename the thread to "absolute vs relative"?

silverbytes

1:34 am on Sep 28, 2003 (gmt 0)

I would be very grateful if someone could explain to me why Google crawls just the index and robots.txt and then goes away... the site is OK, the links are OK, the PR is OK, but the spider leaves without looking at my .htm docs...

Why?
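One way to confirm exactly which files the bot requests is to count its hits per path in the server's access log. The sketch below assumes an Apache-style "combined" log format and matches on the string "Googlebot" in the user-agent field. The sample lines, IP addresses, and the `googlebot_hits` helper are made up for illustration, and the pattern may need adjusting for a different log format.

```python
import re
from collections import Counter

# Matches the request path and the user-agent field of a combined-format line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_lines):
    """Count requests per path whose user-agent mentions Googlebot."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits

# Invented sample lines; a real run would read the lines from the log file.
sample = [
    '192.0.2.1 - - [28/Sep/2003:01:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 25 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [28/Sep/2003:01:00:05 +0000] "GET /index.htm HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '198.51.100.7 - - [28/Sep/2003:01:02:00 +0000] "GET /page2.htm HTTP/1.1" 200 4096 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

hits = googlebot_hits(sample)
for path, count in hits.most_common():
    print(count, path)
```

If only /robots.txt and the index page ever show up in the counts, that matches the behaviour described above and at least rules out the bot requesting the inner pages and getting errors.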

claus

3:04 am on Sep 28, 2003 (gmt 0)

Silverbytes, the same question came up a few days ago (or at least i think it's the same). You'll find some of the information in this thread:

How long before new pages show up in Google?
[webmasterworld.com...]

/claus

silverbytes

4:14 pm on Sep 29, 2003 (gmt 0)

Yes, thanks, but there is no satisfying answer in that thread either. (Getting high PR on all internal pages is not a true solution for making a bot crawl...)