Report Crawler Bug

AthlonInside

6:57 pm on May 10, 2003 (gmt 0)

I'm not bothered by how aggressive freshbot is, but why would it crawl the same files 2-3 times a day?

Crawled 5 files yesterday: 1 file was crawled 3 times and 2 other files were crawled twice each. I would be happier if it crawled different, unique pages instead. :)

WarmGlow

7:32 pm on May 10, 2003 (gmt 0)

...2 other files were crawled twice each.

Just a guess . . .
One of the requests was for www.example.com/file_name.html and the other was for example.com/file_name.html.
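
One rough way to test this guess against your own logs is to group the requested URLs while ignoring a leading "www." on the host. The sketch below is only illustrative: `strip_www` and `group_requests` are made-up helper names, and it assumes you have already extracted full URLs (host plus path) from your logs.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def strip_www(host):
    """Return the host lowercased, without a leading 'www.' label."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def group_requests(urls):
    """Group URLs that differ only by a leading 'www.' on the host.

    Keys are (bare host, path); values are the set of host names
    under which that path was actually requested.
    """
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        key = (strip_www(parts.hostname or ""), parts.path or "/")
        groups[key].add(parts.hostname)
    return groups


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real log.
    crawled = [
        "http://www.example.com/file_name.html",
        "http://example.com/file_name.html",
        "http://www.example.com/other.html",
    ]
    for (host, path), hosts in group_requests(crawled).items():
        if len(hosts) > 1:
            print(f"{host}{path} was requested under: {sorted(hosts)}")
```

Any path reported under more than one host name is a candidate for the "crawled twice" effect described above.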

AthlonInside

7:41 pm on May 10, 2003 (gmt 0)

Stuff happens! Why can't they just assume www is the same as non-www?

[edited by: Woz at 3:39 am (utc) on May 14, 2003]
[edit reason] er, language ... [/edit]

GoogleGuy

7:43 pm on May 10, 2003 (gmt 0)

Because sometimes it isn't. :(

AthlonInside

7:48 pm on May 10, 2003 (gmt 0)

Good that GG is here.

The crawler is indeed indexing both versions of my site, www and non-www, because I saw the index file for the non-www version appearing in www3.

What should I do? Nothing?

WarmGlow

7:55 pm on May 10, 2003 (gmt 0)

Because sometimes it isn't.

...and Google then sorts it out by applying a duplicate content filter. (Just another guess.)

takagi

8:04 pm on May 10, 2003 (gmt 0)

If most links are to www.mydomain.com and only a few are to mydomain.com, locate the links to mydomain.com and point them (or have them pointed) to www.mydomain.com. Internal links are easy; external links are more of a problem. You need to ask the other webmaster to change the link, but given a good explanation, they usually will.
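
For the internal links, a small script can do the locating. This is a minimal sketch using Python's standard `html.parser`; the class and function names (`BareDomainLinkFinder`, `find_bare_links`) are invented for illustration, and it assumes the links are absolute URLs in ordinary `<a href="...">` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class BareDomainLinkFinder(HTMLParser):
    """Collect href values whose host is the bare (non-www) domain."""

    def __init__(self, bare_domain):
        super().__init__()
        self.bare_domain = bare_domain.lower()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links have no host, so they are skipped here.
                if urlsplit(value).hostname == self.bare_domain:
                    self.found.append(value)


def find_bare_links(html, bare_domain="mydomain.com"):
    """Return every link in the HTML that points at the bare domain."""
    finder = BareDomainLinkFinder(bare_domain)
    finder.feed(html)
    return finder.found
```

Run `find_bare_links` over each page of the site and change whatever it returns to the www form.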

Oaf357

8:20 pm on May 10, 2003 (gmt 0)

Freshbot likes pages that are updated often. I have pages that get added and don't get crawled until the deep crawl. Instead of waiting, I could add links to those new pages on a page that freshbot visits frequently. But freshbot might follow the link or it might not, depending on how determined it is. From what I understand, freshbot is still a work in progress (just like the new index).

abcdef

8:40 pm on May 10, 2003 (gmt 0)

Athlon,

How do you have a www. version, and non www. version of your web site? You mean you have two different web sites? One for www. and the other without it?

And, why? Because of Google?

GoogleGuy

2:51 am on May 11, 2003 (gmt 0)

Hmm. I would start with backlinks, some of which probably link to non-www instead of www. Next, check your site to make sure your own links always use the www. In an extreme case, you might consider making any non-www page do a permanent (301) redirect to the www version. Those are the things that come to mind offhand.
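
A permanent redirect of this kind is normally set up in the web server's own configuration, but the logic is simple enough to sketch. The example below uses Python's standard `http.server` purely as an illustration; `CANONICAL_HOST`, `redirect_location`, and `CanonicalHostHandler` are assumed names, and it treats every host other than the www one as needing a redirect.

```python
from http.server import BaseHTTPRequestHandler

CANONICAL_HOST = "www.example.com"  # assumed canonical host name


def redirect_location(host_header, path):
    """Return the URL to redirect to, or None if the host is already canonical."""
    # Drop any ":port" suffix and compare case-insensitively.
    host = (host_header or "").split(":")[0].lower()
    if host == CANONICAL_HOST:
        return None
    return f"http://{CANONICAL_HOST}{path}"


class CanonicalHostHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        location = redirect_location(self.headers.get("Host"), self.path)
        if location is not None:
            # 301 tells clients and crawlers that the move is permanent.
            self.send_response(301)
            self.send_header("Location", location)
            self.end_headers()
            return
        body = b"canonical host\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass `CanonicalHostHandler` to `http.server.HTTPServer` and call `serve_forever()`. The path is carried over unchanged, so example.com/file_name.html is sent to www.example.com/file_name.html.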

rfgdxm1

2:57 am on May 11, 2003 (gmt 0)

>Because sometimes it isn't. :(

Right. It is technically possible to have a completely different site at root than what is on the www subdomain.

DavidT

5:23 am on May 11, 2003 (gmt 0)

GoogleGuy, is there a policy out your way biased against using non-www addresses? The 64.68 crawlers just won't play ball with my 301 redirect to non-www. They refuse to follow it except with robots.txt requests. Seems a shame.

quotations

6:16 am on May 11, 2003 (gmt 0)

I have a similar problem.

All backlinks to my domain are either

www.mydomain.com/ or

www.mydomain.com/index.html

Both of those resolve okay.

In the appropriate category in ODP the site is listed without the trailing slash as

www.mydomain.com

As a result, the Google Directory has decided that the site is not listed in ODP.

I have requested that the url be changed to

www.mydomain.com/ in ODP but that request has been ignored.

Can anything be done on the Google side to fix this?

mcavic

8:45 pm on May 13, 2003 (gmt 0)

The 64.68 crawlers just won't play ball with my 301 redirect to non-www.

Freshbot doesn't seem to follow 301 redirects. It sees the redirect, then stops. But Deepbot does follow them for me (though I haven't tried non-www).

kpaul

9:51 pm on May 13, 2003 (gmt 0)

I'm not bothered by how aggressive freshbot is, but why would it crawl the same files 2-3 times a day?

I always thought it was checking for updates on those pages...

GoogleGuy

11:35 pm on May 13, 2003 (gmt 0)

Hmm. I'll ask about the 301's with freshbot.

DavidT

7:38 am on May 17, 2003 (gmt 0)

Any news on this one, big guy?

I'd love to show you my log files from the past few days. I have a small site by the standards here, about 600 pages. The 64.68 crawlers have requested about half of them over the last three days.

In terms of server response codes, the results have been:

200: about 15 times, all for robots.txt except once index.html, once an internal page, twice my custom error page.

302: 2, see above.

301: the remaining few hundred; I'm not counting them.

The 301s are for requests for pages with 'www.'. Other search engine robots have no trouble with it apart from periodic hiccups, and they learn to deal with it. Again, it just seems a shame, or a waste of time, if the crawler comes and seems to want to pull the pages but then won't follow a simple redirect.
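
A tally like the one above can be produced with a short script rather than by eye. This is a sketch only: it assumes the access log is in Common Log Format, it matches the crawler by the "64.68." address prefix mentioned in this thread, and `tally_status_codes` is a made-up name.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) ')


def tally_status_codes(lines, ip_prefix="64.68."):
    """Count response codes for requests from addresses starting with ip_prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(1).startswith(ip_prefix):
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines for illustration, not real crawler data.
    sample = [
        '64.68.0.1 - - [15/May/2003:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 120',
        '64.68.0.1 - - [15/May/2003:10:00:05 +0000] "GET /page.html HTTP/1.0" 301 -',
        '10.0.0.5 - - [15/May/2003:10:01:00 +0000] "GET /page.html HTTP/1.0" 200 5120',
    ]
    for status, count in sorted(tally_status_codes(sample).items()):
        print(status, count)
```

Lines that do not match the assumed format are skipped, so the counts are only as good as the log format guess.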

WarmGlow

11:44 pm on May 20, 2003 (gmt 0)

GoogleGuy on May 13, 2003:

Hmm. I'll ask about the 301's with freshbot.

Please... Any news yet? Thank you.