lazerzubb - 6:39 pm on Feb 10, 2003 (gmt 0)
If you are completely new to Webmasterworld:
Welcome to Webmaster World [webmasterworld.com]
Official page: [google.com...]
Robotstxt information regarding Googlebot: [robotstxt.org...]
1. What's the name of the Google spider?
2. What's the difference between Deepbot and Freshbot?
3. What's the User Agent for Googlebot?
4. How do you tell the difference between the deep crawl and the fresh crawl?
5. When does Google use the different spiders?
6. How do I see if I have been spidered?
7. I haven't got access to my logfile, how do I see if Googlebot spiders my pages?
8. I have changed my DNS/IP and Googlebot doesn't come anymore.
9. How do I get Freshbot to visit my site?
10. Freshbot has been to my pages, what happens then?
11. My site was down during some parts of the deep crawl, what happens now?
12. What is a spider trap and how do I prevent it?
13. Should I include/exclude Googlebot in my traffic reports?
14. How do I get Googlebot to spider my dynamic (URL) pages?
15. How do I prevent Googlebot from spidering my site/page/graphics?
16. I've been deep spidered, what now?
17. Which IP does Froogle spider from?
18. Googlebot spiders both my [domain.com...] and [domain.com,...] what should I do?
19. Does Googlebot crawl AdWords?
______________________________________________________________________________________
1. What's the name of the Google spider?
Google calls its spider "Googlebot". Whether it's male or female we don't know.
2. What's the difference between Deepbot and Freshbot?
This is very well described in:
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
This is a really good thread if you are not used to the concepts of Deep Crawl and Fresh Crawl
3. What's the User Agent for Googlebot?
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
This appears for both Fresh Crawls and Deep Crawls.
4. How do you tell the difference between the deepbot and the freshbot?
The deepbot and the freshbot use different IPs.
The deepbot uses IPs which start with 216.*
and the freshbot uses IPs which start with 64.*
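As a rough sketch of the rule above, the first octet of the requesting IP can be mapped to a crawl type. The function name classify_googlebot_ip and the sample addresses are made up for illustration, and the 216.* / 64.* prefixes are only the ones reported in this FAQ. They are not exclusive to Google and may change, so treat the result as a hint and nothing more.

```python
def classify_googlebot_ip(ip: str) -> str:
    """Guess the crawl type from the first octet of an IPv4 address.

    Assumption: 216.* (deepbot) and 64.* (freshbot) are the ranges
    reported in this FAQ, not an official or complete list.
    """
    first_octet = ip.strip().split(".")[0]
    if first_octet == "216":
        return "deepbot"
    if first_octet == "64":
        return "freshbot"
    return "unknown"


# Hypothetical addresses, only to show the three possible answers.
print(classify_googlebot_ip("216.0.0.1"))  # deepbot
print(classify_googlebot_ip("64.0.0.1"))   # freshbot
print(classify_googlebot_ip("10.0.0.1"))   # unknown
```

Only use this on requests that already carry the Googlebot User Agent, since plenty of ordinary visitors also come from 64.* or 216.* addresses.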
5. When does Google use the different spiders?
The deepbot is sent out after each update; it normally takes a few days before it appears.
It can continue to spider your site for many days afterwards; for most sites it visits within a 2-7 day period.
The first thing it will request is the robots.txt file. It may then take days before it comes back, because
Google schedules its spidering and crawling, mostly to spread out
the heavy load which Googlebot can cause when it requests pages.
Page Rank vs. number of deep crawl listings. [webmasterworld.com]
6. How do I see if I have been spidered?
The easiest way is to search your logfile for "Googlebot/2.1 (+http://www.googlebot.com/bot.html)".
Since both deepbot and freshbot use this User Agent, you will need to look at the IP to see whether it's the deepbot or the freshbot.
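If you would rather script the search, here is a minimal sketch in Python. It assumes the client IP is the first whitespace-separated field of each line, as in the common and combined Apache log formats; the function name googlebot_hits and the two sample lines are invented for the example.

```python
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"


def googlebot_hits(log_lines):
    """Return (ip, line) pairs for log lines containing the Googlebot User Agent.

    Assumption: the client IP is the first whitespace-separated field.
    """
    hits = []
    for line in log_lines:
        if GOOGLEBOT_UA in line:
            fields = line.split()
            hits.append((fields[0], line.rstrip("\n")))
    return hits


# Made-up sample lines; with a real log you would pass an open file instead.
sample = [
    '216.0.0.1 - - [10/Feb/2003:18:39:00 +0000] "GET /robots.txt HTTP/1.0" '
    '200 68 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '10.0.0.5 - - [10/Feb/2003:18:40:00 +0000] "GET / HTTP/1.0" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]
for ip, _ in googlebot_hits(sample):
    print(ip)  # 216.0.0.1
```

The IP in each pair is what you would then check against the 216.* and 64.* ranges.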
7. I haven't got access to my log file, how do I see if Googlebot spiders my pages?
Use a server side include (e.g. Apache XSSI, PHP or ASP) to embed or call a script that checks for
"Googlebot/2.1 (+http://www.googlebot.com/bot.html)" as the USER_AGENT,
or "crawl*.googlebot.com" or "crawler*.googlebot.com" as the HOST. You cannot use an image or JavaScript based tracker because Googlebot won't trigger it.
Also take a look at the different threads in the "Google News" forum; it's often mentioned there when the deep spidering starts.
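The check such a script performs could look roughly like this. It is only a sketch in Python (the answer above mentions XSSI, PHP and ASP; the logic is the same), and the function name is_googlebot and the sample host names are made up. How you obtain the User Agent and the resolved host depends on your server. Under CGI they normally arrive as the HTTP_USER_AGENT and REMOTE_HOST environment variables, and REMOTE_HOST is only filled in if the server does reverse DNS lookups.

```python
from fnmatch import fnmatch

GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
# Host patterns as given in this FAQ.
GOOGLEBOT_HOST_PATTERNS = ("crawl*.googlebot.com", "crawler*.googlebot.com")


def is_googlebot(user_agent: str, remote_host: str = "") -> bool:
    """True if the User Agent or the resolved host name matches the FAQ's patterns."""
    if GOOGLEBOT_UA in user_agent:
        return True
    host = remote_host.lower()
    return any(fnmatch(host, pattern) for pattern in GOOGLEBOT_HOST_PATTERNS)


# Hypothetical values, only to exercise both branches.
print(is_googlebot(GOOGLEBOT_UA))                           # True
print(is_googlebot("Mozilla/4.0", "crawl4.googlebot.com"))  # True
print(is_googlebot("Mozilla/4.0", "dialup.example.com"))    # False
```

Keep in mind that anyone can send a fake User Agent, so the host check is the more trustworthy of the two.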
8. I have changed my DNS/IP and Googlebot doesn't come anymore.
Recent thread about this: google not liking new IPs? [webmasterworld.com]
Also from the Google Knowledge base:
Googlebot How long does Google cache IP's? [webmasterworld.com]
From it:
|
I think this is a very good tip.
9. How do I get Freshbot to visit my site?
This is also stated in the excellent
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
The easiest way is to have good inbound links, and it helps if the linking pages have a high PageRank.
Changing the content on your site also helps. I would say these are
the 2 most important factors for getting the freshbot to visit your site.
10. Freshbot has been to my pages, what happens then?
Freshbot visits for a number of different reasons.
The best way to get pages spidered by the Freshbot is described in
Google Updates and Everflux, the Monthly Mid-Cycle Changes [webmasterworld.com]
Another tip was suggested by GoogleGuy:
Are you using If Modified Since? [webmasterworld.com]
If your site is completely new and gets spidered by the Freshbot before it's indexed in the monthly update,
it may fall out of the index again. In my experience, if you haven't changed anything on the pages,
it normally drops out after around 5-7 days.
This is not a general rule though; it depends on a lot of other factors too.
11. My site was down during some parts of the deep crawl, what happens now?
GoogleGuy reported last year that they increased support for re-spidering of sites which
had problems during the "normal" deep spidering. Vitaplease made a very interesting comment in:
Google page dropping question [webmasterworld.com]
|
12. What is a spider trap and how do I prevent it?
A spider trap is when a spider re-spiders the same page over
and over again; you can compare it to a maze (labyrinth).
The biggest problem with spider traps is often the amount of bandwidth and server load they put on the site which is being spidered.
It is also a problem for Googlebot, which spiders the same page over
and over again even though the content is almost always the same.
The best known spider trap is the Session ID. A Session ID is often used to keep track
of visitors, and some sites put a unique ID in the URL.
An example is www.webmasterworld.com/page.php?id=264684413484654
(Note this URL doesn't exist).
Each user gets a unique ID, and it's often passed along on each page.
The problem is that when Googlebot comes to the page, it spiders the page and
then leaves. When it comes back to another page and finds a link to the first page, it has been given
a different session ID, so the link shows up as another URL. This is one of the reasons
why Googlebot is very careful when it spiders pages which use the query string "ID=".
I've seen and heard of many cases where the same page has been spidered over
1000 times, and sometimes it's been indexed as many times as it's been spidered. Most
search engines have very advanced duplicate filters which remove the duplicates and select one URL.
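To make the Session ID problem concrete, here is a small Python sketch. It takes the (non-existent) example URL from above, adds an http:// prefix, and compares it with a second visit that got a different, made-up ID. The helper strip_session_id is my own illustration of why the two URLs are really one page; it says nothing about how Googlebot or its duplicate filter actually works.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url: str, param: str = "id") -> str:
    """Return the URL with the given query parameter removed (case-insensitive)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != param.lower()
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


first_visit = "http://www.webmasterworld.com/page.php?id=264684413484654"
second_visit = "http://www.webmasterworld.com/page.php?id=998877665544332"

# To a spider these are two different URLs ...
print(first_visit == second_visit)  # False
# ... although without the session ID they are the same page.
print(strip_session_id(first_visit) == strip_session_id(second_visit))  # True
```

Every new session ID creates another "new" URL like this, which is how one page ends up being requested hundreds of times.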
13. Should I include Googlebot in my traffic reports?
The general suggestion is NO; Googlebot is not a real human being visiting your site.
One quote which I often use is "Don't build pages for search engines, build them for users. It's
the user who will buy things off your website, not search engines. Search engines want to
generate the best results for the user, and therefore try to think as a user when they rank their results."
14. How do I get Googlebot to spider my dynamic (URL) pages?
The first thing to ask is: do you need to have a dynamic URL?
There are many things you can do to get a dynamic site spidered, and the support for spidering
dynamic URLs seems to get better every day.
Always try to avoid using Session IDs in the URL; this is the ultimate killer when
it comes to getting Googlebot to spider your dynamic URLs.
Also try to avoid the query string "ID=". Since this is the most commonly
used query string for passing Session IDs, Googlebot seems to
raise a "flag" each time it sees it in the URL. This question has been covered quite a
few times, but the answer seems to change over time:
If I Use PHP Will Google Still Like Me? [webmasterworld.com]
Googlebot & Dynamic Pages [webmasterworld.com]
Does Google index dynamic content? [webmasterworld.com]
PageRank seems to play a major role when it comes to dynamic URLs and spidering: the more
PageRank, the better the chance that the URLs will be spidered, and the deeper Googlebot will go.
15. How do I prevent Googlebot from spidering my site/page/graphics?
Googlebot obeys the robots.txt standard. To prevent Googlebot from spidering your site, you
can put this in your robots.txt file (it should be placed in the HTTP root directory):
User-agent: Googlebot
Disallow: /
To prevent Google from Indexing your images, use:
User-agent: Googlebot-Image
Disallow: /
It's also described officially at Google.com: No Index tags [google.com]
Also try using the Search Engine World Robots.txt Validator [searchengineworld.com]
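If you want a quick local sanity check of rules like the ones above, Python's standard library ships a robots.txt parser. This is only a sketch: the helper name allowed and the paths are made up, and Python's parser is not Googlebot's own, so a passing check here shows the rules are well formed rather than proving how Googlebot will read them.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse the given robots.txt text and ask whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


BLOCK_SITE = "User-agent: Googlebot\nDisallow: /\n"
BLOCK_IMAGES = "User-agent: Googlebot-Image\nDisallow: /\n"

print(allowed(BLOCK_SITE, "Googlebot", "/index.html"))        # False
print(allowed(BLOCK_IMAGES, "Googlebot-Image", "/logo.gif"))  # False
print(allowed(BLOCK_IMAGES, "Googlebot", "/index.html"))      # True
```

The last line shows that blocking Googlebot-Image alone leaves the normal Googlebot free to spider your pages.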
16. I've been deep spidered, what now?
If you have been deep spidered, chances are that you will appear in the
upcoming "Google Update". This happens about once a month:
Google Update Chart [webmasterworld.com].
Be advised though that your site may not appear in the next update for a lot of reasons, most
of which only Google knows and will probably not tell you.
For the PageRank to be correctly calculated from the incoming links, I advise waiting up to 2 updates.
Also, not all inbound links may show up with the link: command
(the rule seems to be that only pages which have a PageRank of 4 and above will show up).
17. Which IP does Froogle spider from?
As far as I know, Froogle uses the 64.* IPs and the same User Agent.
18. Googlebot spiders both my [domain.com...] and [domain.com,...] what should I do?
The best thing to do is to pick one of the URLs and use a 301 redirect (permanent redirect) from the other one to it.
If you don't know what it is or how to use it, try this
Google search for: 301 redirect [google.com]
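For those who want to see what a 301 looks like from the server side, here is a minimal sketch as a Python WSGI application. The host name www.example.com is a placeholder for whichever URL you picked, and redirect_app is an invented name. Most sites would do this in the web server configuration instead (for example with Apache's mod_rewrite or Redirect directive), so take it as an illustration of the response and not as the recommended setup.

```python
CANONICAL_HOST = "www.example.com"  # hypothetical; use the host name you picked


def redirect_app(environ, start_response):
    """WSGI app: answer with a 301 when the request arrives on a non-canonical host."""
    host = environ.get("HTTP_HOST", "")
    path = environ.get("PATH_INFO", "/")
    query = environ.get("QUERY_STRING", "")
    if host.lower() != CANONICAL_HOST:
        location = "http://" + CANONICAL_HOST + path
        if query:
            location += "?" + query
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"canonical host\n"]
```

To try it locally you can serve it with wsgiref.simple_server.make_server("", 8000, redirect_app).serve_forever() and request a page with a different Host header. The status line and the Location header are what tell a spider that the page has moved permanently.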
19. Does Googlebot crawl AdWords?
From: Google crawls URLs in adwords? [webmasterworld.com]
|
Note: The information stated above may not be correct, and remember that things change.
Thanks to ciml and Mike_Mackin for editing
Edit: spelling
[edited by: lazerzubb at 6:49 pm (utc) on Feb. 10, 2003]