Googlebot: Deepbot and Freshbot FAQ and Information

FAQ and general information regarding Google and its spiders

   
6:39 pm on Feb 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot: Deepbot and Freshbot

If you are completely new to WebmasterWorld:
Welcome to Webmaster World [webmasterworld.com]

Official page: [google.com...]
Robots.txt information regarding Googlebot: [robotstxt.org...]

1. What's the name of the Google spider?
2. What's the difference between Deepbot and Freshbot?
3. What's the User Agent for Googlebot?
4. How do you tell the difference between Deepbot and Freshbot?
5. When does Google use the different spiders?
6. How do I see if I have been spidered?
7. I haven't got access to my log file, how do I see if Googlebot spiders my pages?
8. I have changed my DNS/IP and Googlebot doesn't come anymore.
9. How do I get Freshbot to visit my site?
10. Freshbot has been to my pages, what happens then?
11. My site was down during some parts of the deep crawl, what happens now?
12. What is a spider trap and how do I prevent it?
13. Should I include Googlebot in my traffic reports?
14. How do I get Googlebot to spider my dynamic (URL) pages?
15. How do I prevent Googlebot from spidering my site/page/graphics?
16. I've been deep spidered, what now?
17. Which IP does Froogle spider from?
18. Googlebot spiders both my [domain.com...] and [domain.com,...] what should I do?
19. Does Googlebot crawl AdWords?

______________________________________________________________________________________

1. What's the name of the Google spider?
Google calls its spider "Googlebot". Whether it's male or female we don't know.

2. What's the difference between Deepbot and Freshbot?
This is very well described in:
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
This is a really good thread if you are not used to the concepts of Deep Crawl and Fresh Crawl.

3. What's the User Agent for Googlebot?
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
This appears for both Fresh Crawls and Deep Crawls.

4. How do you tell the difference between Deepbot and Freshbot?
Deepbot and Freshbot use different IPs.
Deepbot uses IPs which start with 216.*
and Freshbot uses IPs which start with 64.*
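
The prefix rule above can be sketched in a few lines of Python. This is only an illustration of the 216.* / 64.* observation in this FAQ, not an official way to identify the crawlers; the function name and the sample addresses are made up for the example.

```python
def classify_googlebot_ip(ip: str) -> str:
    """Label an IP using the prefixes quoted in this FAQ (an observation, not a guarantee)."""
    if ip.startswith("216."):
        return "deepbot"
    if ip.startswith("64."):
        return "freshbot"
    return "unknown"


# Placeholder addresses, chosen only to exercise each branch.
print(classify_googlebot_ip("216.0.0.1"))  # deepbot
print(classify_googlebot_ip("64.0.0.1"))   # freshbot
print(classify_googlebot_ip("10.0.0.1"))   # unknown
```

Keep in mind that other hosts can also have addresses starting with 216. or 64., so combine this with the User Agent check.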

5. When does Google use the different spiders?
Deepbot is sent out after each update, and it normally takes a few days before it appears.
It can continue to spider your site for many days afterwards; for most sites it visits within a 2-7 day period.
The first thing it will request is the robots.txt file. It may take days before it comes back, because
Google schedules its spidering and crawling, mostly to spread out
the heavy load which Googlebot can cause when it requests pages.

Page Rank vs. number of deep crawl listings. [webmasterworld.com]

6. How do I see if I have been spidered?
The easiest way is to search for "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" in your log file.
Since both Deepbot and Freshbot use this User Agent, you will need to look at the IP to see whether it's Deepbot or Freshbot.
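
If you prefer a script to a manual search, here is a rough Python sketch of the same idea. It assumes the client IP is the first whitespace-separated field of each line, as in the common and combined log formats; the function name and the sample log lines are invented for illustration.

```python
from collections import Counter

GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"


def count_googlebot_hits(log_lines):
    """Count requests per client IP for lines that contain the Googlebot User Agent."""
    hits = Counter()
    for line in log_lines:
        if GOOGLEBOT_UA in line:
            fields = line.split()
            if fields:
                # Assumption: the first field of the log line is the client IP.
                hits[fields[0]] += 1
    return hits


# Made-up sample lines in combined log format.
sample = [
    '216.0.0.1 - - [10/Feb/2003:18:39:00 +0000] "GET / HTTP/1.0" 200 512 "-" "' + GOOGLEBOT_UA + '"',
    '64.0.0.1 - - [10/Feb/2003:18:40:00 +0000] "GET /a.html HTTP/1.0" 200 900 "-" "' + GOOGLEBOT_UA + '"',
    '10.0.0.1 - - [10/Feb/2003:18:41:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]
for ip, count in count_googlebot_hits(sample).items():
    print(ip, count)
```

To run it against a real file, pass an open file object instead of the sample list, since iterating a file yields its lines.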

7. I haven't got access to my log file, how do I see if Googlebot spiders my pages?
Use a server side include (e.g. Apache XSSI, PHP or ASP) to embed or call a script that checks for:
"Googlebot/2.1 (+http://www.googlebot.com/bot.html)" as the USER_AGENT
or "crawl*.googlebot.com" or "crawler*.googlebot.com" as the HOST. You cannot use an image or JavaScript based tracker because Googlebot won't trigger it.
Also take a look at the different threads in the "Google News" forum; it's often mentioned there when deep spidering starts.
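
The check itself is small. Below is a hedged Python sketch of the test such a script would perform; how you obtain the user agent and the remote host name depends on your server setup and is not shown. The function name and the host names in the usage lines are placeholders.

```python
from fnmatch import fnmatchcase

GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
HOST_PATTERNS = ("crawl*.googlebot.com", "crawler*.googlebot.com")


def is_googlebot(user_agent: str, remote_host: str) -> bool:
    """Return True if the user agent or the host name matches the values listed above."""
    if user_agent == GOOGLEBOT_UA:
        return True
    host = remote_host.lower()
    # Shell-style wildcard match against the two host patterns from the FAQ.
    return any(fnmatchcase(host, pattern) for pattern in HOST_PATTERNS)


# Placeholder inputs; in a real script these come from the server environment.
print(is_googlebot(GOOGLEBOT_UA, "unknown.example.net"))   # True
print(is_googlebot("Mozilla/4.0", "crawl1.googlebot.com"))  # True
print(is_googlebot("Mozilla/4.0", "www.example.net"))       # False
```

A user agent string is easy to fake, so the host name check is the more trustworthy of the two.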

8. I have changed my DNS/IP and Googlebot doesn't come anymore.
Recent thread about this: google not liking new IPs? [webmasterworld.com]
Also from the Google Knowledge base:
Googlebot How long does Google cache IP's? [webmasterworld.com]
From it:


Personally I prefer to keep a site on both IPs for a month. This
required a helpful Web host if you don't run your own equipment. (-ciml)

I think this is a very good tip.

9. How do I get Freshbot to visit my site?
This is also covered in the excellent
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
The easiest way is to have good inbound links, and it helps if they come from pages with a higher PageRank.
Changing the content on your site also helps. I would say these are
the 2 most important factors for getting Freshbot to visit your site.

10. Freshbot has been to my pages, what happens then?
Freshbot visits for a number of different reasons.
The best way to get pages spidered by Freshbot is described in
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
Another tip was suggested by GoogleGuy:
Are you using If Modified Since? [webmasterworld.com]
If your site is completely new and gets spidered by Freshbot before it's indexed in the monthly update,
it may fall out of the index. In my experience, if you haven't changed anything on the pages
it normally drops out after around 5-7 days.
This is not a "general" rule though, it depends on a lot of other factors too.
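
On the If-Modified-Since tip: this is the standard HTTP conditional request, where the client sends the date of its stored copy and the server may answer 304 Not Modified instead of resending the page. The linked thread is not reproduced here, so the following is just a generic Python sketch of the server side decision; the function name and the dates are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_not_modified(if_modified_since, last_modified):
    """Return True when a 304 Not Modified response would be appropriate.

    if_modified_since: raw header value, or None if the header was not sent.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable header: treat it as absent and send the full page.
        return False
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    return last_modified.replace(microsecond=0) <= since


page_changed = datetime(2003, 2, 10, 18, 39, tzinfo=timezone.utc)
print(is_not_modified("Mon, 10 Feb 2003 18:39:00 GMT", page_changed))  # True
print(is_not_modified("Sun, 09 Feb 2003 18:39:00 GMT", page_changed))  # False
```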

11. My site was down during some parts of the deep crawl, what happens now?
GoogleGuy reported last year that they increased support for re-spidering of sites which
had problems during the "normal" deep spidering. Vitaplease made a very interesting comment in:
Google page dropping question [webmasterworld.com]


I think basically I'm asking... does Google only drop a page if it can't reach it?


I asked a Google rep what would happen if a site was down for a day during the deep crawl. Would Googlebot come back?
He asked what Pagerank? I said 6, he said no problem, the site should be revisited.
(that was last October during a pubconference)

12. What is a spider trap and how do I prevent it?
A spider trap is when a spider re-spiders the same page over
and over again; you can compare it to a maze (labyrinth).
The biggest problem with spider traps is often the amount of bandwidth and server load they put on the site being spidered.
It also creates a problem for Googlebot, which spiders the same page over
and over again even though the content is almost always the same.
The best known spider trap is Session IDs. A Session ID is often used to keep track
of visitors, and some sites put a unique ID in the URL:
An example is www.webmasterworld.com/page.php?id=264684413484654
(Note this URL doesn't exist).
Each user gets a unique ID and it's often requested from each page.
The problem is that when Googlebot comes to the page, it spiders the page and
then leaves. When it comes back to another page and finds a link to the same page, it has been given
a different session ID, so the link shows up as another URL. This is one of the reasons
why Googlebot is very, very careful when it spiders pages which use the query string "ID=".
I've seen and heard of many cases where the same page has been spidered over
1000 times, and sometimes it's been indexed as many times as it's been spidered. Most
search engines have very advanced duplicate filters which remove the duplicates and select one URL.
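
To make the duplicate URL problem concrete, here is a small Python sketch that removes session parameters from a URL so that two visits map to the same address. The set of parameter names is my own assumption for the example; on a real site "id" may well identify content rather than a session, so adjust it before using anything like this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names, for illustration only.
SESSION_PARAMS = {"id", "sid", "sessionid", "phpsessid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session parameters removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


first_visit = "http://www.webmasterworld.com/page.php?id=264684413484654"
second_visit = "http://www.webmasterworld.com/page.php?id=998877665544332"
# Both visits collapse to the same URL once the session ID is gone.
print(strip_session_id(first_visit) == strip_session_id(second_visit))  # True
```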

13. Should I include Googlebot in my traffic reports?
The general suggestion is NO, Googlebot is not a real human being visiting your site.
One quote which I often use is "Don't build pages for search engines, build them for users. It's
the user who will buy things off your website, not search engines, and search engines want to
generate the best results for the user, and therefore try to think as a user when they rank their results."

14. How do I get Googlebot to spider my dynamic (URL) pages?
First thing is, do you need to have a dynamic URL?
There are many things you can do to get a dynamic site spidered, and the support for spidering
dynamic URLs seems to get better each day.
Always try to avoid using Session IDs in the URL; this is the ultimate killer, as it is
the most likely thing to prevent Googlebot from spidering your dynamic URLs.
Also try to avoid the query string "ID=". Since this is the most commonly
used query string for passing Session IDs, Googlebot seems to
put a "flag" on it each time it sees it in the URL. This question has been covered quite a
few times, but the answer seems to change over time:
If I Use PHP Will Google Still Like Me? [webmasterworld.com]
Googlebot & Dynamic Pages [webmasterworld.com]
Does Google index dynamic content? [webmasterworld.com]

PageRank seems to play a major role when it comes to dynamic URLs and spidering: the more
PageRank, the greater the chance that the URLs will be spidered, and the deeper Googlebot will go.

15. How do I prevent Googlebot from spidering my site/page/graphics?
Googlebot obeys the robots.txt standard. To prevent Googlebot from spidering your site, you
can put this in your robots.txt file (it should be placed in the root directory of your web site)

User-agent: Googlebot
Disallow: /

To prevent Google from indexing your images, use:

User-agent: Googlebot-Image
Disallow: /

It's also described officially at Google.com: No Index tags [google.com]
Also try using the Search Engine World Robots.txt Validator [searchengineworld.com]
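
If you want to sanity check rules like the ones above locally, Python's standard library ships a robots.txt parser. The sketch below feeds it the two rule sets from this answer. Note that this is Python's interpretation of the standard, which is not guaranteed to match Googlebot's behaviour in every detail, and the helper name and paths are made up.

```python
from urllib.robotparser import RobotFileParser

BLOCK_SITE = "User-agent: Googlebot\nDisallow: /\n"
BLOCK_IMAGES = "User-agent: Googlebot-Image\nDisallow: /\n"


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed(BLOCK_SITE, "Googlebot", "/index.html"))               # False
print(allowed(BLOCK_IMAGES, "Googlebot-Image", "/images/logo.gif"))  # False
print(allowed(BLOCK_IMAGES, "Googlebot", "/index.html"))             # True
```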

16. I've been deep spidered, what now?
If you have been deep spidered, you will most likely appear in the
upcoming "Google Update". This happens about once a month:
Google Update Chart [webmasterworld.com].
Be advised though that your site may not appear in the next update for a lot of reasons, most
of which only Google knows and will probably not tell you.
For the PageRank to be correctly calculated from the incoming links, I advise waiting up to 2 updates.
Also, not all inbound links may show up with the LINK: command
(the rule seems to be that only pages which have a PageRank of 4 and above will show up).

17. Which IP does Froogle spider from?
From what I know, Froogle uses the 64.* IPs and the same User Agent.

18. Googlebot spiders both my [domain.com...] and [domain.com,...] what should I do?
The best thing to do is to pick one of the URLs and use a 301 redirect (permanent redirect) to it from the other.
If you don't know what it is or how to use it, try this

Google search for: 301 redirect [google.com]
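
As a rough idea of what the redirect does, here is a Python sketch of the decision a server makes for each request. It assumes the two addresses differ only in host name; www.example.com and example.com are placeholders and not the domains from the question. In practice most people configure this in the web server rather than in application code.

```python
CANONICAL_HOST = "www.example.com"  # placeholder: the one host name you decided to keep


def canonical_redirect(host: str, path: str):
    """Return (status, headers): a 301 to the canonical host, or 200 if already there."""
    if host.lower() != CANONICAL_HOST:
        # Permanent redirect that keeps the requested path.
        return 301, {"Location": "http://" + CANONICAL_HOST + path}
    return 200, {}


print(canonical_redirect("example.com", "/page.html"))
print(canonical_redirect("www.example.com", "/page.html"))
```

Because the status is 301 (permanent) and not 302 (temporary), a spider is told that the other address has moved for good.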

19. Does Googlebot crawl AdWords?
From: Google crawls URLs in adwords? [webmasterworld.com]

I was talking with Google and they do not crawl the adwords links. ~allanp73

Buying AdWords listings doesn't seem to help you get spidered.

Note: The information stated above may not be correct, and remember that things change.

Thanks to ciml and Mike_Mackin for editing
Edit: spelling

[edited by: lazerzubb at 6:49 pm (utc) on Feb. 10, 2003]

6:43 pm on Feb 10, 2003 (gmt 0)

WebmasterWorld Senior Member jeremy_goodrich is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Nice post, Laz - one for the bookmarks.
6:45 pm on Feb 10, 2003 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yeah, great stuff Lazerzubb!
Nice work ;)

Nick

[edited by: Nick_W at 7:03 pm (utc) on Feb. 10, 2003]

6:57 pm on Feb 10, 2003 (gmt 0)

10+ Year Member



Excellent thread :)

Thanks very much lazerzubb

CK

7:08 pm on Feb 10, 2003 (gmt 0)

10+ Year Member



".... its spider "Googlebot" whether it's a male or female we don't know...."

I sent two messages to Google and asked them, never got an answer :(

7:34 pm on Feb 10, 2003 (gmt 0)

10+ Year Member



Very helpful. Thanks.
7:45 pm on Feb 10, 2003 (gmt 0)



the man does it again :)

Shak

8:03 pm on Feb 10, 2003 (gmt 0)

WebmasterWorld Senior Member lorax is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Lovely post!

It should be noted that the spider trap mentioned above is an unintentional one. There are intentional spider traps which are used to trap bad bots and other agents a webmaster does not want crawling their website. You'll find both are talked about here on WebmasterWorld.

8:12 pm on Feb 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good stuff, lazerzubb. I expect we'll see it show up as a link in a lot of threads.
I didn't know about #11... glad I have a PR6.
8:17 pm on Feb 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A classic right from the start, lazerzubb. The answer to the hopes I asked you to fulfill with your post #1000, remember? Certainly worth waiting for. Now we know where to point the next newbie asking about bots. We also know where to go ourselves when in doubt. Thanks a bunch.
10:18 pm on Feb 10, 2003 (gmt 0)

10+ Year Member



If you have not seen this on Google Answers you may want to look at where they link to for information.

[answers.google.com...]

10:55 pm on Feb 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nice summary! Hopefully this will quiet down the new posts asking the same questions all over again.

Andreas

11:43 pm on Feb 10, 2003 (gmt 0)

10+ Year Member



lazerzubb said:

7. I haven't got access to my log file, how do I see if Googlebot spiders my pages.

For those without access to SSI or the background to implement it, an easy way to check for Google visits is to do a Google search for allinurl:www.yourdomain.tld . It will indicate any fresh crawls by Google on the bottom line of each result with the date of the fresh crawl. Granted this only applies to freshbot visits, but better than not knowing at all.

lazerzubb...great post!

Ted

12:05 am on Feb 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm surprised that this hasn't been mentioned before but the biggest difference between the two bots is in the MIME types they accept.

Deepbot indexes:
text/html
text/plain
application/pdf
application/x-shockwave-flash
application/vnd.ms-excel
application/rtf
application/msword
application/vnd.ms-powerpoint
application/postscript
application/x-gzip
application/octet-stream
application/*

Freshbot indexes:
text/html
text/plain
12:07 am on Feb 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Key_Master very true, i totally forgot to add that, a very good tip.
12:51 am on Feb 11, 2003 (gmt 0)

10+ Year Member



Way to go Lazerzubb,

Did you mention about the updates?

Well done....

1:31 pm on Feb 11, 2003 (gmt 0)

WebmasterWorld Senior Member ciml is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Well done lazerzubb, I'm sure that this is going to be useful to a lot of people.

yasunglass, lazerzubb wrote a Google Update FAQ [webmasterworld.com] a while ago.

11:17 am on Feb 12, 2003 (gmt 0)

WebmasterWorld Senior Member vitaplease is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Nice one laz.
3:43 am on Feb 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Maybe also mention that Google AdWords editors can trigger a Googlebot visit to your site but it doesn't go into the cache or anything. It will look like a regular Freshbot visit but it's not. There have been a bunch of those questions.
8:39 am on Feb 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



owww nice information :)

Craig

10:03 am on Feb 17, 2003 (gmt 0)

10+ Year Member



Thanks very much for excellent work, another WebmasterWorld golden nugget.
12:07 pm on Feb 17, 2003 (gmt 0)

10+ Year Member



Wow .. fantastic information resource. Googlebot in a nutshell.

Very commendable effort lazerzubb.

1:18 pm on Feb 17, 2003 (gmt 0)

10+ Year Member



Well done!

Tor

2:00 pm on Feb 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks very much for excellent work, another WebmasterWorld golden nugget.

I completely agree with this statement. You certainly know this stuff lazer. Thanks for sharing. ;)

3:02 am on Feb 18, 2003 (gmt 0)

10+ Year Member



Nicely done, lazerzubb! Thanks for the thorough FAQ.
5:44 am on Mar 26, 2003 (gmt 0)

10+ Year Member



Very nice summary :)