Toolbar indexing

whatson

5:35 am on Apr 25, 2003 (gmt 0)

I've heard that the Google Toolbar can also get sites into the Google index without their being crawled, and before the update. How long does it take to get a site indexed via the Toolbar?

deejay

5:37 am on Apr 25, 2003 (gmt 0)

Hi whatson

I think someone's put you wrong there... Doesn't happen to the best of my knowledge.

If a page is added between updates I think you'll find freshbot is responsible.

werty

3:18 pm on Apr 25, 2003 (gmt 0)

I think what you may be thinking of is the little smile and frown images on the toolbar. There is an unconfirmed rumor that if you vote for a site with a smile, FreshBot may come and visit, and add you to the index.

This was just a rumor/theory though.

Yesterday I did some updates to a site I work on and voted for the pages. They are still not in the index and have no fresh tags, so I believe this rumor to be false.

willybfriendly

3:45 pm on Apr 25, 2003 (gmt 0)

I have had sites under development, with absolutely zero inbound links and no history of submission, end up in the Google index. The only explanation that I could come up with was that they had been viewed in a browser with the toolbar installed. As a result, I never view sites under development with a toolbar-equipped browser anymore. Maybe I am paranoid, but the problem has not repeated itself since I started this practice.

WBF

Spica

4:00 pm on Apr 25, 2003 (gmt 0)

I had the same experience. An unfinished site with no inbound links, which I visited with the toolbar on, had a few pages that were indexed after the last deep crawl.

whatson

2:39 am on Apr 26, 2003 (gmt 0)

Yes, I have also experienced this. I did a site and about 1 week later it was in Google, before the update, and it wasn't visited by Googlebot. So how do you explain that?

PatrickDeese

2:44 am on Apr 26, 2003 (gmt 0)

"I had the same experience. An unfinished site with no inbound links, which I visited with the toolbar on, had a few pages that were indexed after the last deep crawl."

Referral statistics pages. Perhaps your site has outgoing links, and you click one to test it. That visit shows up in the other site's referrer log. Many sites keep their stats pages in a place where they can be crawled. Google crawls the log and then follows the "backlink" to your site.

whatson

3:14 am on Apr 26, 2003 (gmt 0)

Sites with no PageRank can rank in the Google search results without having to be crawled or wait for an update, and I think they get indexed via the toolbar.

GoogleGuy

3:20 am on Apr 26, 2003 (gmt 0)

We're not doing this, or at least not currently. Anytime you're crawling 3B pages and you also have millions of toolbar users surfing around, it's inevitable that someone with a toolbar will visit a page, and then Googlebot will visit as well at around the same time.. :)

RawAlex

4:43 pm on Apr 26, 2003 (gmt 0)

"at least not currently"... heh-heh :-)

Alex

alxdean

5:37 pm on Apr 26, 2003 (gmt 0)

Just imagine the mere implications of toolbar indexing. I can already hear hundreds of cloakers moan and scream in despair!

whatson

2:27 am on Apr 27, 2003 (gmt 0)

So how else can a site get indexed before the update without being crawled?

jdMorgan

4:06 am on Apr 27, 2003 (gmt 0)

However it happens, it's easy to prevent.

If you don't want your pages indexed, put up a robots.txt file in the web root directory of your site with


User-agent: *
Disallow: /

in it. Google will obey that, and not fetch the pages.

When you're ready to go live, remove the robots.txt file, or better yet, change it to


User-agent: *
Disallow:

This disallows nothing - in other words, it allows all robots to spider the entire site. It will prevent you from getting a ton of 404 errors in your logs, since all good robots will try to fetch robots.txt before spidering a site. Even if you do not wish to block any robots from any pages, it's a really good idea to have at least this default robots.txt on any site - whether in development or live - to cut down on these 404 errors.
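
If you want to convince yourself of the difference between the two files above, you can feed them to a parser. What follows is only a minimal sketch, assuming Python's standard urllib.robotparser module. The example.com URL and the "Googlebot" user-agent string are placeholders, and the parser models a well-behaved robot in general, not what any particular search engine actually does.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt bodies from the post above.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def may_fetch(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if a compliant robot may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/dev/index.html"  # placeholder URL
    print("Disallow: /  ->", may_fetch(BLOCK_ALL, page))
    print("Disallow:    ->", may_fetch(ALLOW_ALL, page))
```

The first file should report that the page may not be fetched, and the second that it may. Note that this only answers "may the robot fetch it?", which is a separate question from whether the URL can still be listed.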

However, even with a Disallow: / directive in robots.txt, if Googlebot finds a link to a page, it may list the page by URL in the search results. No title, no description, just the URL. The page won't come up for any keyword searches, but it will come up in the "More results from <yourdomain>" listing.

If you want to stop that, you can do one of two things:
1) Put a <meta name="robots" content="noindex,nofollow"> tag on each page you don't want indexed or followed. (You can also use a variant of that tag for the other index/follow combinations.)

2) Where that would be unwieldy - say for a large site under development - a relatively simple solution is to make a special page for robots, and put the <meta name="robots" content="noindex,nofollow"> tag on it. Then transparently redirect all robot requests for all pages on your site to that special robots page. When you're ready to go live, remove the redirect, and remove the special robots page. Note for the suspicious: Yes, this is cloaking, but there is no intent to deceive visitors - since there shouldn't be any visitors yet!
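
Option 2 can be sketched roughly as follows. This is not necessarily how it would be done in practice (a rewrite rule in the web server configuration is the more likely tool). It is a stand-in written as a small Python WSGI wrapper, and it serves the special page directly rather than issuing an HTTP redirect. The ROBOT_TOKENS list, the robots_gate wrapper and the dev_site stand-in are all hypothetical names made up for illustration, and matching on the User-Agent header will only catch robots that identify themselves honestly.

```python
# Hypothetical list of user-agent substrings; extend it for the robots you care about.
ROBOT_TOKENS = ("googlebot", "slurp", "teoma")

# The special page for robots, carrying the meta tag from the post above.
ROBOTS_PAGE = (
    b"<html><head>"
    b'<meta name="robots" content="noindex,nofollow">'
    b"</head><body>Site under development.</body></html>"
)


def is_robot(user_agent: str) -> bool:
    """Crude check: does the User-Agent contain a known robot token?"""
    ua = user_agent.lower()
    return any(token in ua for token in ROBOT_TOKENS)


def robots_gate(app):
    """Wrap a WSGI app so that robots receive ROBOTS_PAGE for every URL."""
    def wrapped(environ, start_response):
        if is_robot(environ.get("HTTP_USER_AGENT", "")):
            start_response("200 OK", [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(ROBOTS_PAGE))),
            ])
            return [ROBOTS_PAGE]
        # Everyone else sees the real site.
        return app(environ, start_response)
    return wrapped


def dev_site(environ, start_response):
    """Stand-in for the real site under development."""
    body = b"<html><body>Work in progress</body></html>"
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]


# Going live means serving dev_site directly, without the wrapper.
application = robots_gate(dev_site)
```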

The meta robots tag approach is required to tell Google and Ask Jeeves/Teoma "don't mention this page at all." In my experience, most other 'bots will treat a robots.txt Disallow as a "don't mention it at all" directive, but Google and AJ/Teoma interpret the Standard for Robots Exclusion literally: as directed, they don't fetch the page, but if they find a link to it, the URL of the page will be listed in their results. There are good and bad points to either approach. But like it or not, that's how it works.
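
Since the meta tag is what carries the "don't mention this page" instruction, it can be worth checking that a page really has it before relying on it. Here is a rough sketch, assuming Python's standard html.parser module. The RobotsMetaParser class and robots_directives function are hypothetical helper names, and the sketch only reports what the markup says, not how any search engine will act on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for item in attr.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(sorted(robots_directives(sample)))
```

A page is covered by the approach described above when "noindex" is in the returned set.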

The specification: A standard for Robot Exclusion [robotstxt.org].
Validate your robots.txt file here [searchengineworld.com].

HTH,
Jim

GoogleGuy

4:56 am on Apr 27, 2003 (gmt 0)

whatson, if we see a reference to a site, we may be able to return that link even if we didn't crawl the page.

Hope that helps,
GoogleGuy