Deep crawl came again... Only index page and Robots.txt. Why?

This is the 2nd month in a row.

derekwit

7:22 pm on Jan 4, 2003 (gmt 0)

This is the 2nd month in a row that deep crawl has come by and only fetched my index page and robots.txt. I know my code is correct; I have validated it with an HTML validator. What could be the problem? Please help.

2003-01-04 04:38:53 216.239.46.172 - 216.119.105.20 GET /robots.txt - 200 299 395 16 HTTP/1.0 www.XXXXXX.com Googlebot/2.1+(+http://www.googlebot.com/bot.html) - -
2003-01-04 04:38:53 216.239.46.172 - 216.119.105.20 GET /index.cfm - 200 15164 385 94 HTTP/1.0 www.XXXXXX.com Googlebot/2.1+(+http://www.googlebot.com/bot.html) - -
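
To see at a glance what Googlebot fetched on a visit, log lines like the two above can be tallied by request. The following is only a rough Python sketch: the field positions (method, path, status) are assumptions read off the two sample lines, and the function name `googlebot_paths` is made up for illustration.

```python
from collections import Counter


def googlebot_paths(log_lines):
    """Count (method, path, status) for requests whose line mentions Googlebot.

    Assumed layout, taken from the two sample lines in the post:
    index 5 = method, index 6 = path, index 8 = status code.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        # Skip short or unrelated lines instead of guessing at their layout.
        if len(fields) < 9 or "Googlebot" not in line:
            continue
        counts[(fields[5], fields[6], fields[8])] += 1
    return counts


sample = [
    "2003-01-04 04:38:53 216.239.46.172 - 216.119.105.20 GET /robots.txt - 200 "
    "299 395 16 HTTP/1.0 www.XXXXXX.com "
    "Googlebot/2.1+(+http://www.googlebot.com/bot.html) - -",
    "2003-01-04 04:38:53 216.239.46.172 - 216.119.105.20 GET /index.cfm - 200 "
    "15164 385 94 HTTP/1.0 www.XXXXXX.com "
    "Googlebot/2.1+(+http://www.googlebot.com/bot.html) - -",
]

for request, hits in googlebot_paths(sample).items():
    print(request, hits)
```

Run over a whole month of logs, a tally like this would show whether the bot ever requested anything beyond the index page.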

ciml

7:48 pm on Jan 4, 2003 (gmt 0)

This can happen if you don't have enough PageRank to encourage Google to crawl you. I'm not saying that it's exactly the reason in your case, but it's an area to look at. Normally, ODP inclusion should get preferential spidering, but with the RDF dump problems I guess that new sites need to wait until Google sees the links from the ODP pages or other pages with PageRank.

Grumpus

10:13 pm on Jan 4, 2003 (gmt 0)

Why are you using CFM pages? I believe that Google (and most spiders) look at those as dynamic pages. Google is learning to crawl dynamic pages, but it's doing the popular types first - PHP, CGI, PL, ASP, etc. I'd imagine that CFM is pretty well down the list - probably even after learning to spider Flash and JavaScript menus...

I dunno, but if I were you, I'd convert those page names to HTML instead of CFM and make sure there isn't so much whitespace at the top of the source code. Those extra blank lines might be killing you off, too.

Your ROBOTS meta tag might be messing things up. I didn't go to my handy-dandy guide, but I don't think "Robots" Content="all" is valid. You'll have to check that one.

You've also got some DIV "layers" in there. Not sure how the bots like those. Someone who uses them may be able to shed some light on that one for ya.

Hope that helps and at least gets you rolling.

lazerzubb

10:16 pm on Jan 4, 2003 (gmt 0)

Grumpus, CFM isn't a bigger problem than .php and the rest from what I know. Google spiders most extensions; it's when you add something after the extension that it becomes a problem.

Like page.cfm?id=1

Example of indexed .cfm URLs in Fast [alltheweb.com]

Also, I thought it was interesting how many .gov sites had .cfm pages [alltheweb.com]

Grumpus

10:26 pm on Jan 4, 2003 (gmt 0)

Yup Lazer - just did a check on Google and there are 3 million plus index.cfm pages in there.

It's going to be one of the other things I mentioned then. :)

(The more I think about it, the more I think it's a bum robots meta tag...)

derekwit

11:06 pm on Jan 4, 2003 (gmt 0)

What should I have in my robots.txt file?

lazerzubb

11:12 pm on Jan 4, 2003 (gmt 0)

derekwit, read more about robots.txt at:
[robotstxt.org...]

The Robots.txt file for webmasterworld [webmasterworld.com]
Threads about Robots.txt
Robots.txt [webmasterworld.com]
How important is the Robots.txt file now? [webmasterworld.com]
Robots.txt Tutorial [searchengineworld.com]

derekwit

11:32 pm on Jan 4, 2003 (gmt 0)

From all of those threads it looks like the only reason to have a robots.txt file is to tell bots not to go to certain places. But it appears that bots don't even listen to them anyways. I have checked out the top 20 competitors for my keyword and can only find 1 robots.txt file. But I still do not see how that would affect how Gbot deep crawls my site. Any other ideas?

derekwit

11:49 pm on Jan 4, 2003 (gmt 0)

I have just gone back through all of my log files for the last 3 months. In October Deepcrawl visited me (3) times: 10/6/02, 10/8/02 and 10/09/02. In November Deepcrawl visited me (2) times: 11/05/02 and 11/11/02. In December Deepcrawl visited me (3) times: 12/03/02, 12/05/02 and 12/11/02. Now this year he visited me today. So he is finding my site, and I am assuming he will probably come again later on. This time was like the last 8 times: he did not deep crawl past my index page. Any advice would be great.

Thanks in advance for your help!

Grumpus

3:02 am on Jan 5, 2003 (gmt 0)

It's not your robots.txt file (I don't think - I didn't look at it) that's screwing you up. You've got a Meta Tag on every page called "robots" with a value of "all". I don't think that's right. Again - like with robots.txt, that tag really is only used to tell a robot NOT to do something. By having it there at all, you might be scaring the bot away.

Please don't make me look up the right values for you. I'm WAY too lazy for that. :)

rfgdxm1

3:23 am on Jan 5, 2003 (gmt 0)

>Normally, ODP inclusion should get preferential spidering, but with the RDF dump problems I guess that new sites need to wait until Google see the links from the ODP pages or other pages with PageRank.

No. Google crawls the ODP directly. I've seen actual evidence of that from the backlinks. In fact, look at link:www.webmasterworld.com:

dmoz.org/Computers/Internet/Web_Design_and_Development/Authoring/Webmaster_Resources/Chats_and_Forums/

Is the first listed. If Google doesn't spider a site, it won't show on the backlinks.

rfgdxm1

3:27 am on Jan 5, 2003 (gmt 0)

>Please don't make me look up the right values for you. I'm WAY too lazy for that.

Unless he wants to keep bots out of his site, his best bet is to have no robots.txt at all. That allows full spidering.
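
For anyone who wants to test what a given robots.txt lets a bot fetch, Python's standard `urllib.robotparser` can evaluate the rules offline. This is a hedged sketch, not a statement of how Googlebot itself parses the file: the helper name `googlebot_allowed`, the example URL, and the three rule sets are all assumptions for illustration, and the empty string only stands in for "no rules at all".

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rule sets for comparison.
NO_RULES = ""                                   # stand-in for an absent or empty file
ALLOW_ALL = "User-agent: *\nDisallow:\n"        # empty Disallow permits everything
BLOCK_ALL = "User-agent: *\nDisallow: /\n"      # Disallow: / blocks everything

url = "http://www.example.com/index.cfm"
for name, rules in [("no rules", NO_RULES), ("allow all", ALLOW_ALL), ("block all", BLOCK_ALL)]:
    print(name, googlebot_allowed(rules, url))
```

Under this parser the first two rule sets behave the same way, which matches the point above: an allow-everything file adds nothing over having no rules.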

derekwit

3:30 am on Jan 5, 2003 (gmt 0)

My site is in ODP and was spidered this month. So should this help me for this month's crawl?

rfgdxm1

3:32 am on Jan 5, 2003 (gmt 0)

Yes derekwit. Google will count your ODP listing. In fact, if the ODP cat you are in has a PR of 4 or more, it should show up with the link: command next month.

derekwit

3:43 am on Jan 5, 2003 (gmt 0)

It has a PR4, so that's good. Any suggestions on why Gbot won't deep crawl me?

Key_Master

4:01 am on Jan 5, 2003 (gmt 0)

HTTP/1.1 400 Bad Request
Server: Microsoft-IIS/5.0
Date: Sun, 05 Jan 2003 03:50:23 GMT
Content-Type: text/html
Content-Length: 87
(This server violates the HTTP standards by
returning content after the header in a HEAD request:)
<html><head><title>Error</title></head><body>The parameter is incorrect. </body></html>

That's the error I got. Your server isn't responding properly to a HEAD request (it's giving both the server headers and the page contents). Oddly enough, on every tool I ran, the last line of HTML was missing from your home page. The only time I've seen this happen is when an improper Content-Length was given by the server. Viewing from a normal browser doesn't cause the error to happen. Some very weird stuff is going on there.
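
A check like this can be approximated with a few lines of Python. The sketch below is an assumption about how such a test might be written, not the tool used above: `fetch_head_raw` and `body_after_headers` are hypothetical names, and the sample bytes are rebuilt from the response quoted in this post. A HEAD response should end at the blank line after the headers, so any bytes after it indicate the problem described.

```python
import socket


def fetch_head_raw(host, path="/", port=80, timeout=10):
    """Send a bare HTTP/1.0 HEAD request and return every byte the server sends back."""
    request = f"HEAD {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def body_after_headers(raw_response):
    """Return whatever follows the blank line that ends the headers (b'' if nothing)."""
    _headers, separator, body = raw_response.partition(b"\r\n\r\n")
    return body if separator else b""


# Sample rebuilt from the response quoted above.
SAMPLE = (
    b"HTTP/1.1 400 Bad Request\r\n"
    b"Server: Microsoft-IIS/5.0\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 87\r\n"
    b"\r\n"
    b"<html><head><title>Error</title></head>"
    b"<body>The parameter is incorrect. </body></html>"
)

if body_after_headers(SAMPLE):
    print("Server sent a body in reply to HEAD")
```

Against a live site, `body_after_headers(fetch_head_raw("www.example.com"))` would run the same check over the network; the host name there is a placeholder.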

"ALL" is equivalent to "follow,index" so your robots meta tag is not your problem.
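
To confirm what a page's robots meta tag actually says, the tag can be pulled out with Python's standard `html.parser`. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, and the interpretation (ALL means index and follow, NONE means noindex and nofollow, anything not forbidden is allowed) follows the usual robots meta convention rather than any documented Googlebot internals.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def may_index_and_follow(html):
    """Return (index_allowed, follow_allowed) for a page's robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    if "none" in found:
        return (False, False)
    # "all", "index" and "follow" only restate the default, so just look for the negatives.
    return ("noindex" not in found, "nofollow" not in found)


print(may_index_and_follow('<html><head><meta name="Robots" content="all"></head></html>'))
```

A page with no robots meta tag at all gets the same result as one with "all", which is why the tag is not needed unless something is being excluded.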

amznVibe

7:54 am on Jan 5, 2003 (gmt 0)

>This is the 2nd month in a row when deep crawl comes by that he only gets my index page and robots.txt.
>GET /robots.txt - 200 299 395 16 HTTP/1.0 www.XXXXXX.com Googlebot/2.1+(+http://www.googlebot.com/bot.html) - -

Are you sure this isn't an AdWords editor checking your site after an ad is placed? I've seen this single-page crawling on my sites after placing an AdWords ad... -aV-

derekwit

8:28 pm on Jan 5, 2003 (gmt 0)

The only thing is I have not run an ad with AdWords in about 4 months.

derekwit

8:29 pm on Jan 5, 2003 (gmt 0)

Should I contact my hosting company? If so, what exactly should I say? What tools did you use to get the error?