
Enticing a spider to crawl the site

Robots.txt or no Robots.txt

         

SinclairUser

7:03 pm on May 9, 2003 (gmt 0)

10+ Year Member



I use a robots.txt file which excludes a small number of sub-directories to robots. Example below:

User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /news/
Disallow: /images/
Disallow: /templates/
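Rules like the ones above can be sanity-checked offline with Python's standard `urllib.robotparser` module. This is only a sketch: `example.com` and the test URLs are placeholders, not the poster's actual site.

```python
from urllib.robotparser import RobotFileParser

# The exclusion rules from the robots.txt above (example.com is hypothetical).
rules = """User-agent: *
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /news/
Disallow: /images/
Disallow: /templates/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Excluded directories are blocked for every compliant bot...
print(rp.can_fetch("*", "http://example.com/images/logo.gif"))     # False
# ...while real content pages remain crawlable.
print(rp.can_fetch("*", "http://example.com/widgets/index.html"))  # True
```

If the parser agrees with what you intended, any remaining crawl problems are unlikely to be a robots.txt syntax issue.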

I know this file is being found (and respected by most bots).

My problem is that some bots find it and leave, some crawl a little, and some crawl more. I read some information (in this forum) from Brett Tabke on robots.txt where he states that "some bots find it and leave". Another SEO (for a leading SEO firm) stated that "to get crawled properly you must have a robots.txt" instructing the bots what to grab. Having respect for both sources of information, I am now confused. My primary aim is to get bots in and crawling as much REAL content as possible.

Without the robots.txt, they are going to waste time crawling irrelevant stuff like images - but they will come.

With the robots.txt they may leave without looking at anything.

Obviously, there will be opinions either way, but what do I do to get the bots crawling?

Opinions please...

wilderness

9:38 pm on May 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is a NO-NO at Google:

<!-- begin content for Search Engine Optimization -->

SinclairUser

10:08 pm on May 9, 2003 (gmt 0)

10+ Year Member



wilderness,

I thought that you could include a few (not spamming) comment tags in the HTML. But I take your point and will remove them if you think that's what is freaking Googlebot out when it visits.

jdMorgan

10:30 pm on May 9, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



SinclairUser,

Your robots.txt looks valid, so I doubt that it is the problem.

New sites, or sites with few incoming links may have trouble getting crawled.

Sites which return redirects or don't validate may also have problems.

Here's a checklist:

Validate robots.txt [searchengineworld.com]
Check the server headers [webmasterworld.com] for robots.txt, your home page, and a few others.
Validate the code [w3.org]
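The server-header check in the list above can also be done by hand with a HEAD request. A sketch using Python's standard library; `example.com` and the sample paths stand in for your own host and pages:

```python
import http.client

def show_headers(host, path):
    """Send a HEAD request and print the status line plus response headers."""
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("HEAD", path, headers={"User-Agent": "header-check/1.0"})
    resp = conn.getresponse()
    print(f"{path}: {resp.status} {resp.reason}")
    for name, value in resp.getheaders():
        print(f"  {name}: {value}")
    conn.close()
    return resp.status

# Check robots.txt, the home page, and a sample content page
# (replace example.com and the paths with your own).
for path in ("/robots.txt", "/"):
    show_headers("example.com", path)
```

Anything other than a plain 200 on robots.txt or the home page (unexpected redirects, 404s, error codes) is worth investigating, since bots see exactly these responses.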

HTH,
Jim

SinclairUser

10:38 pm on May 9, 2003 (gmt 0)

10+ Year Member



JD,

Thanks for the links - the XHTML and CSS should be compliant as I use the W3C HTML and CSS validators. Inktomi slurp crawls the site okay.

Chris.

wilderness

1:01 am on May 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



thought that you could include a few (not spamming) comment tags

The following is from a Google webmasters page (near the end of the page) about why "my pages aren't getting indexed":

Your page was manually removed from our index, because it did not conform with the quality standards necessary to assign accurate PageRank. We will not comment on the individual reasons a page was removed and we do not offer an exhaustive list of practices that can cause removal. However, certain actions such as cloaking, writing text that can be seen by search engines but not by users, or setting up pages/links with the sole purpose of fooling search engines may result in permanent removal from our index.
end of quote

With all that aside...
I have a dime-store bot trap on three pages of my hundreds that works off remarks :(

I was just trying to offer a pointer. I have no way of knowing for sure if that's your problem. I've also read that the SEs hold an indifference to search optimization sites.
Although I advertise a similar service on a very small scale (mywidgets), I have termed it promotion rather than optimization, even though the latter is the more accepted term.

Don

<BTW Jim, hardly any of my pages validate since I use FP in quite a unique way. That hasn't stopped many of my pages from being number one. In one category I even have the top four :) >