
Google News Archive Forum

    
Googlebot: Deepbot and Freshbot FAQ and Information
FAQ and general information regarding Google and its spiders
lazerzubb




msg:131081
 6:39 pm on Feb 10, 2003 (gmt 0)

Googlebot: Deepbot and Freshbot

If you are completely new to Webmasterworld:
Welcome to Webmaster World [webmasterworld.com]

Official page: [google.com...]
Robotstxt information regarding Googlebot: [robotstxt.org...]

1. What's the name of the Google spider?
2. What's the difference between Deepbot and Freshbot?
3. What's the User Agent for Googlebot?
4. How do you tell the difference between the deep crawl and the fresh crawl?
5. When does Google use the different spiders?
6. How do I see if I have been spidered?
7. I haven't got access to my log file, how do I see if Googlebot spiders my pages?
8. I have changed my DNS/IP and Googlebot doesn't come anymore.
9. How do I get Freshbot to visit my site?
10. Freshbot has been to my pages, what happens then?
11. My site was down during some parts of the deep crawl, what happens now?
12. What is a spider trap and how do I prevent it?
13. Should I include/exclude Googlebot in my traffic reports?
14. How do I get Googlebot to spider my dynamic (URL) pages?
15. How do I prevent Googlebot from spidering my site/page/graphics?
16. I've been deep spidered, what now?
17. Which IP does Froogle spider from?
18. Googlebot spiders both my [domain.com...] and [domain.com,...] what should I do?
19. Does Googlebot crawl AdWords?

______________________________________________________________________________________

1. What's the name of the Google spider?
Google calls its spider "Googlebot". Whether it's male or female we don't know.

2. What's the difference between Deepbot and Freshbot?
This is very well described in:
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
This is a really good thread if you are not used to the concepts of Deep Crawl and Fresh Crawl.

3. What's the User Agent for Googlebot?
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
This appears for both Fresh Crawls and Deep Crawls.

4. How do you tell the difference between the deepbot and the freshbot?
The deepbot and the freshbot use different IPs.
The deepbot uses IPs which start with 216.*
and the freshbot uses IPs which start with 64.*

5. When does Google use the different spiders?
The deepbot is sent out after each update; it normally takes a few days before it appears.
It can continue to spider your site for many days afterwards; for most sites it visits within a 2-7 day period.
The first thing it will request is the robots.txt file, and it may take days before it comes back. This is because
Google schedules its spidering and crawling, mostly to spread out
the heavy load which Googlebot can cause when it requests pages.

Page Rank vs. number of deep crawl listings. [webmasterworld.com]

6. How do I see if I have been spidered?
The easiest way is to search for "Googlebot/2.1 (+http://www.googlebot.com/bot.html)" in your log file.
Since both deepbot and freshbot use this User Agent, you will need to look at the IP to see whether it's the deepbot or the freshbot.
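
The log check from questions 4 and 6 could be scripted roughly like this. It is only a sketch: it assumes the client IP is the first whitespace-separated field on each log line (as in the common Apache log formats), and the function names and the sample log lines are made up for illustration. The 216.* and 64.* prefixes are simply the ones quoted in this FAQ.

```python
from collections import Counter

# User Agent string quoted in this FAQ.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"


def classify_ip(ip):
    """Guess the crawl type from the IP prefixes given in this FAQ."""
    if ip.startswith("216."):
        return "deepbot"
    if ip.startswith("64."):
        return "freshbot"
    return "unknown"


def count_googlebot_hits(log_lines):
    """Count Googlebot requests per crawl type.

    Assumes the client IP is the first field of every log line.
    """
    counts = Counter()
    for line in log_lines:
        if GOOGLEBOT_UA not in line:
            continue  # not a Googlebot request
        ip = line.split(maxsplit=1)[0]
        counts[classify_ip(ip)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        '216.239.46.20 - - [10/Feb/2003:18:39:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
        '64.68.82.10 - - [10/Feb/2003:19:02:11 +0000] "GET / HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
        '10.0.0.5 - - [10/Feb/2003:19:05:42 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
    ]
    print(dict(count_googlebot_hits(sample)))
```

To run it against a real log you would pass the open file object instead of the sample list.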

7. I haven't got access to my log file, how do I see if Googlebot spiders my pages?
Use a server-side include (e.g. Apache XSSI, PHP or ASP) to embed or call a script that checks for
"Googlebot/2.1 (+http://www.googlebot.com/bot.html)" as the USER_AGENT,
or "crawl*.googlebot.com" or "crawler*.googlebot.com" as the HOST. You cannot use an image or JavaScript-based tracker because Googlebot won't trigger it.
Also take a look at the different threads in the "Google News" forum; it's often mentioned there when deep spidering starts.
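
The check described in question 7 might look something like the following. This is a sketch rather than a drop-in script: the function name is invented, the hostnames in the usage lines are hypothetical, and how you obtain the user agent and remote host depends on your server setup (in a CGI-style script they would typically come from the HTTP_USER_AGENT and REMOTE_HOST environment variables, if your host provides them).

```python
from fnmatch import fnmatch

# User Agent and host patterns quoted in this FAQ.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
HOST_PATTERNS = ("crawl*.googlebot.com", "crawler*.googlebot.com")


def is_googlebot(user_agent, remote_host=""):
    """Return True if the request looks like it came from Googlebot.

    Matches either the exact User Agent string or one of the
    wildcard host patterns from the FAQ.
    """
    if user_agent == GOOGLEBOT_UA:
        return True
    host = remote_host.lower()
    return any(fnmatch(host, pattern) for pattern in HOST_PATTERNS)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(is_googlebot(GOOGLEBOT_UA))                          # User Agent match
    print(is_googlebot("Mozilla/4.0", "crawl4.googlebot.com"))  # host match
    print(is_googlebot("Mozilla/4.0", "www.example.com"))       # no match
```

A page could call this on every request and append a line to a small text file whenever it returns True, which gives you a makeshift Googlebot log.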

8. I have changed my DNS/IP and Googlebot doesn't come anymore.
Recent thread about this: google not liking new IPs? [webmasterworld.com]
Also from the Google Knowledge base:
Googlebot How long does Google cache IP's? [webmasterworld.com]
From it:

Personally I prefer to keep a site on both IPs for a month. This
required a helpful Web host if you don't run your own equipment. (-ciml)

I think this is a very good tip.

9. How do I get Freshbot to visit my site?
This is also covered in the excellent
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
The easiest way is to have good inbound links, and it will help if they have a higher PageRank.
Changing the content on your site will also help. I would say these are
the 2 most important factors for getting the freshbot to visit your site.

10. Freshbot has been to my pages, what happens then?
Freshbot visits for a number of different reasons.
The best way to get pages spidered by the Freshbot is described in
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
Another tip was suggested by GoogleGuy:
Are you using If Modified Since? [webmasterworld.com]
If your site is completely new and gets spidered by the Freshbot before it's indexed in the monthly update,
it may fall out of the index. In my experience, if you haven't changed anything on the pages,
it normally drops out after around 5-7 days.
This is not a general rule though; it depends on a lot of other factors too.
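
For anyone unsure what the If Modified Since tip in question 10 means in practice: a client can send an If-Modified-Since request header, and the server may answer 304 Not Modified instead of resending an unchanged page. The following is a minimal sketch of that decision only, not of how Googlebot or any particular server implements it; the function name and the dates are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def not_modified(if_modified_since, last_modified):
    """Decide whether a 304 Not Modified response is appropriate.

    if_modified_since: raw header value from the request, or None.
    last_modified: timezone-aware datetime of the page's last change.
    """
    if not if_modified_since:
        return False  # no header sent, so send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable header, so send the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    return last_modified.replace(microsecond=0) <= since


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    changed = datetime(2003, 2, 1, 12, 0, tzinfo=timezone.utc)
    print(not_modified("Mon, 10 Feb 2003 18:39:00 GMT", changed))
    print(not_modified("Wed, 01 Jan 2003 00:00:00 GMT", changed))
```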

11. My site was down during some parts of the deep crawl, what happens now?
GoogleGuy reported last year that they increased support for re-spidering of sites which
had problems during the "normal" deep spidering. Vitaplease made a very interesting comment in:
Google page dropping question [webmasterworld.com]

I think basically I'm asking... does Google only drop a page if it can't reach it?


I asked a Google rep what would happen if a site was down for a day during the deep crawl. Would Googlebot come back?
He asked what Pagerank? I said 6, he said no problem, the site should be revisited.
(that was last October during a pubconference)

12. What is a spider trap and how do I prevent it?
A spider trap is when a spider re-spiders the same page over
and over again; you can compare it to a maze (labyrinth).
The biggest problem with spider traps is often the amount of bandwidth and server load they put on the site being spidered.
They also create a problem for Googlebot, which spiders the same page over
and over again even though the content is almost always the same.
The best-known spider trap is the session ID. A session ID is often used to keep track
of visitors, and some sites put a unique ID in the URL.
An example is www.webmasterworld.com/page.php?id=264684413484654
(Note: this URL doesn't exist).
Each user gets a unique ID and it's often requested from each page.
The problem arises when Googlebot comes to the page, spiders it and
then leaves. When it comes back to another page and finds a link to the same page, it has been given
a different session ID, so the link shows up as another URL. This is one of the reasons
why Googlebot is very careful when it spiders pages which use the query string "ID=".
I've seen and heard of many cases where the same page has been spidered over
1000 times, and sometimes it's been indexed as many times as it's been spidered. Most
search engines have very advanced duplicate filters which remove the duplicates and select one URL.
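
To see why session IDs make one page look like many, here is a small sketch that strips session parameters from a URL so that the variants collapse back into a single address. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions for the example; on a real site "id" may well carry genuine content, in which case you would not strip it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of query parameters that carry a session ID.
SESSION_PARAMS = {"id", "sid", "sessionid"}


def strip_session_id(url):
    """Return the URL with any session parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    # Two visits to the same hypothetical page with different session IDs.
    first = "http://www.example.com/page.php?id=264684413484654"
    second = "http://www.example.com/page.php?id=998877665544332"
    print(strip_session_id(first))
    print(strip_session_id(first) == strip_session_id(second))
```

A spider sees the two original URLs as different pages; only after the session parameter is removed do they turn out to be the same one.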

13. Should I include Googlebot in my traffic reports?
The general suggestion is no; Googlebot is not a real human being visiting your site.
One quote which I often use is: "Don't build pages for search engines, build them for users. It's
the user who will buy things off your website, not search engines, and search engines want to
generate the best results for the user, and therefore try to think as a user when they rank their results."

14. How do I get Googlebot to spider my dynamic (URL) pages?
First of all, do you need to have a dynamic URL?
There are many things you can do to get a dynamic site spidered, and the support for spidering
dynamic URLs seems to get better each day.
Always try to avoid using session IDs in the URL; this is the ultimate killer when
it comes to getting Googlebot to spider your dynamic URLs.
Also try to avoid the query string "ID=". Since this is the most commonly
used query string for carrying session IDs, Googlebot seems to
raise a "flag" each time it sees it in the URL. This question has been covered quite a
few times, but the answer seems to change over time:
If I Use PHP Will Google Still Like Me? [webmasterworld.com]
Googlebot & Dynamic Pages [webmasterworld.com]
Does Google index dynamic content? [webmasterworld.com]

PageRank seems to play a major role when it comes to dynamic URLs and spidering: the more
PageRank, the better the chance that the URLs will be spidered, and the deeper Googlebot will go.

15. How do I prevent Googlebot from spidering my site/page/graphics?
Googlebot obeys the robots.txt standard. To prevent Googlebot from spidering your site, you
can put this in your robots.txt file (it should be placed in the HTTP root directory):

User-agent: Googlebot
Disallow: /

To prevent Google from indexing your images, use:

User-agent: Googlebot-Image
Disallow: /

It's also described officially at Google.com: No Index tags [google.com]
Also try using the Search Engine World Robots.txt Validator [searchengineworld.com]
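
If you would rather check the robots.txt rules from question 15 on your own machine, Python's standard library ships a parser for the format. The sketch below feeds it the two rule sets quoted above; the helper name and the example.com URLs are invented for the example, and this only shows how a standards-following parser reads the rules, not how Googlebot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets quoted in question 15.
BLOCK_GOOGLEBOT = """\
User-agent: Googlebot
Disallow: /
"""

BLOCK_IMAGES = """\
User-agent: Googlebot-Image
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/page.html"    # hypothetical URL
    image = "http://www.example.com/logo.gif"    # hypothetical URL
    print(allowed(BLOCK_GOOGLEBOT, "Googlebot", page))
    print(allowed(BLOCK_GOOGLEBOT, "SomeOtherBot", page))
    print(allowed(BLOCK_IMAGES, "Googlebot-Image", image))
    print(allowed(BLOCK_IMAGES, "Googlebot", page))
```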

16. I've been deep spidered, what now?
If you have been deep spidered, you will most likely appear in the
upcoming "Google Update". This happens about once a month:
Google Update Chart [webmasterworld.com].
Be advised though that your site may not appear in the next update for a lot of reasons, most
of which only Google knows, and will probably not tell you.
For the PageRank to be correctly calculated from the incoming links, I advise waiting up to 2 updates.
Also, not all inbound links may show up with the link: command
(the rule seems to be that only pages with a PageRank of 4 and above will show up).

17. Which IP does Froogle spider from?
From what I know, Froogle uses the 64.* IPs and the same User Agent.

18. Googlebot spiders both my [domain.com...] and [domain.com,...] what should I do?
The best thing to do is to pick one of the URLs and use a 301 redirect (permanent redirect) to it.
If you don't know what it is or how to use it, try this:

Google search for: 301 redirect [google.com]
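
As a rough illustration of the 301 idea in question 18: every request that arrives on the non-preferred hostname is answered with status 301 and a Location header pointing at the same path on the preferred one. The sketch below only shows that decision. The hostnames are hypothetical, the function name is invented, and in practice you would normally configure the redirect in your web server rather than in a script.

```python
# Hypothetical preferred hostname; pick whichever variant you want indexed.
CANONICAL_HOST = "www.example.com"


def canonical_redirect(host, path):
    """Return (status, headers) for a 301 redirect, or None if none is needed."""
    if host.lower() == CANONICAL_HOST:
        return None  # already on the preferred hostname
    location = "http://" + CANONICAL_HOST + path
    return 301, {"Location": location}


if __name__ == "__main__":
    print(canonical_redirect("example.com", "/widgets.html"))
    print(canonical_redirect("www.example.com", "/widgets.html"))
```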

19. Does Googlebot crawl AdWords?
From: Google crawls URLs in adwords? [webmasterworld.com]
I was talking with Google and they do not crawl the adwords links. ~allanp73

Buying AdWords listings doesn't seem to help you get spidered.

Note: The information stated above may not be correct, and remember that things change.

Thanks to ciml and Mike_Mackin for editing
Edit: spelling

[edited by: lazerzubb at 6:49 pm (utc) on Feb. 10, 2003]

 

jeremy goodrich




msg:131082
 6:43 pm on Feb 10, 2003 (gmt 0)

Nice post, Laz - one for the bookmarks.

Nick_W




msg:131083
 6:45 pm on Feb 10, 2003 (gmt 0)

Yeah, great stuff Lazerzubb!
Nice work ;)

Nick

[edited by: Nick_W at 7:03 pm (utc) on Feb. 10, 2003]

curlykarl




msg:131084
 6:57 pm on Feb 10, 2003 (gmt 0)

Excellent thread :)

Thanks very much lazerzubb

CK

WindSun




msg:131085
 7:08 pm on Feb 10, 2003 (gmt 0)

".... its spider "Googlebot" whether it's a male or female we don't know...."

I sent two messages to Google and asked them, never got an answer :(

atadams




msg:131086
 7:34 pm on Feb 10, 2003 (gmt 0)

Very helpful. Thanks.

Shakil




msg:131087
 7:45 pm on Feb 10, 2003 (gmt 0)

the man does it again :)

Shak

lorax




msg:131088
 8:03 pm on Feb 10, 2003 (gmt 0)

Lovely post!

It should be noted that the spider trap mentioned above is an unintentional one. There are intentional spider traps which are used to trap bad bots and other agents a webmaster does not want crawling their website. You'll find both are talked about here on WebmasterWorld.

Stefan




msg:131089
 8:12 pm on Feb 10, 2003 (gmt 0)

Good stuff, lazerzubb. I expect we'll see it show up as a link in a lot of threads.
I didn't know about #11... glad I have a PR6.

troels nybo nielsen




msg:131090
 8:17 pm on Feb 10, 2003 (gmt 0)

A classic right from the start, lazerzubb. The answer to the hopes I asked you to fulfill with your post #1000, remember? Certainly worth waiting for. Now we know where to point the next newbie asking about bots. We also know where to go ourselves when in doubt. Thanks a bunch.

wasmith




msg:131091
 10:18 pm on Feb 10, 2003 (gmt 0)

If you have not seen this on Google Answers, you may want to look at where they link to for information.

[answers.google.com...]

andreasfriedrich




msg:131092
 10:55 pm on Feb 10, 2003 (gmt 0)

Nice summary! Hopefully this will quiet down the new posts asking the same questions all over again.

Andreas

nativenewyorker




msg:131093
 11:43 pm on Feb 10, 2003 (gmt 0)

lazerzubb said:

7. I haven't got access to my log file, how do I see if Googlebot spiders my pages.

For those without access to SSI or the background to implement it, an easy way to check for Google visits is to do a Google search for allinurl:www.yourdomain.tld . It will indicate any fresh crawls by Google on the bottom line of each result with the date of the fresh crawl. Granted this only applies to freshbot visits, but better than not knowing at all.

lazerzubb...great post!

Ted

Key_Master




msg:131094
 12:05 am on Feb 11, 2003 (gmt 0)

I'm surprised that this hasn't been mentioned before but the biggest difference between the two bots is in the MIME types they accept.

Deepbot indexes:
text/html
text/plain
application/pdf
application/x-shockwave-flash
application/vnd.ms-excel
application/rtf
application/msword
application/vnd.ms-powerpoint
application/postscript
application/x-gzip
application/octet-stream
application/*

Freshbot indexes:
text/html
text/plain

lazerzubb




msg:131095
 12:07 am on Feb 11, 2003 (gmt 0)

Key_Master, very true, I totally forgot to add that. A very good tip.

yasunglass




msg:131096
 12:51 am on Feb 11, 2003 (gmt 0)

Way to go Lazerzubb,

Did you mention about the updates?

Well done....

ciml




msg:131097
 1:31 pm on Feb 11, 2003 (gmt 0)

Well done lazerzubb, I'm sure that this is going to be useful to a lot of people.

yasunglass, lazerzubb wrote a Google Update FAQ [webmasterworld.com] a while ago.

vitaplease




msg:131098
 11:17 am on Feb 12, 2003 (gmt 0)

Nice one laz.

amznVibe




msg:131099
 3:43 am on Feb 17, 2003 (gmt 0)

Maybe also mention that Google adword editors can trigger a Googlebot visit to your site but it doesn't go into the cache or anything. It will look like a regular freshbot visit but it's not. There have been a bunch of those questions.

creative craig




msg:131100
 8:39 am on Feb 17, 2003 (gmt 0)

owww nice information :)

Craig

Susanne




msg:131101
 10:03 am on Feb 17, 2003 (gmt 0)

Thanks very much for excellent work, another WebmasterWorld golden nugget.

jaski




msg:131102
 12:07 pm on Feb 17, 2003 (gmt 0)

Wow .. fantastic information resource. Googlebot in a nutshell.

Very commendable effort lazerzubb.

Torben Lundsgaard




msg:131103
 1:18 pm on Feb 17, 2003 (gmt 0)

Well done!

Tor




msg:131104
 2:00 pm on Feb 17, 2003 (gmt 0)

Thanks very much for excellent work, another WebmasterWorld golden nugget.

I completely agree with this statement. You certainly know this stuff, lazer. Thanks for sharing. ;)

WebRookie




msg:131105
 3:02 am on Feb 18, 2003 (gmt 0)

Nicely done, lazerzubb! Thanks for the thorough FAQ.

Bigwebmaster




msg:131106
 5:44 am on Mar 26, 2003 (gmt 0)

Very nice summary :)

