SEs Check robots.txt and just go away?

Forum Moderators: goodroi

webtamers

10:59 pm on Nov 4, 2000 (gmt 0)

Help, guys! I submitted my site a month ago. I just checked my logs and found 40 requests for the robots.txt file and nothing else: each visitor fetches that file, then goes away. Is this normal? Do search engines just check it, then come back later to crawl?

Here's my robots.txt file:

User-agent: *
Disallow: /hid/
Disallow: /images/
Allow: /
Disallow:

Anything wrong with this?

Note:
The "Allow" is there to invite the engines to crawl everything.
Anybody know if this works?

Thanks in advance,

Air

11:34 pm on Nov 4, 2000 (gmt 0)

The "Allow" should not be there; there is no "Allow" directive for the robots.txt file. Likely they are ignoring the "Allow" and interpreting the "/" (root) as disallow everything.

I would change it to just read:

User-agent: *
Disallow: /hid/
Disallow: /images/
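
For anyone who wants to sanity-check a file like the trimmed one above before uploading it, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name "ExampleBot" and the example.com URLs are made up for illustration, and this only shows how Python's parser reads the file; real search engine spiders may well behave differently.

```python
from urllib.robotparser import RobotFileParser

# The trimmed robots.txt suggested above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /hid/
Disallow: /images/
"""


def build_parser(text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs, used only to exercise the rules.
for url in (
    "http://example.com/index.html",
    "http://example.com/hid/secret.html",
    "http://example.com/images/logo.gif",
):
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "blocked"
    print(url, verdict)
```

Anything that does not start with /hid/ or /images/ should come back as allowed, which is the point: no "Allow" line is needed to open up the rest of the site.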

NFFC

12:05 am on Nov 5, 2000 (gmt 0)

What Air said.
You also have other stuff in there that needs to be taken out. See here [info.webcrawler.com] for more info on robots.txt and here [rietta.com] for a no-brainer solution.

Air

3:41 am on Nov 5, 2000 (gmt 0)

webtamers,

Checked out your site. I thought someone was messing with my cat! I've seen your site before, just can't remember where ....

DaveAtIFG

6:40 am on Nov 5, 2000 (gmt 0)

There's a syntax checker for robots.txt files here [tardis.ed.ac.uk] that's never let me down.

webtamers

10:16 am on Nov 5, 2000 (gmt 0)

Thanks Guys,

Still curious if that's standard procedure for the robots: to just check the robots.txt and come back later. If anybody knows, please enlighten us all!

Just FYI, there IS an "Allow". It's something new in the standards; don't know if they all read it, tho'.

Following is an explanation I found somewhere, in case
it helps someone:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Disallow: /

The following shows what robots are allowed to access:

[url.com...] No
[url.com...] No
[url.com...] Yes
[url.com...] Yes
[url.com...] Yes
[url.com...] No
[url.com...] No
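
The Yes/No pattern makes sense if the lines are read top to bottom and the first one whose path is a prefix of the requested URL decides the outcome. Below is a rough sketch of that reading, assuming first-match-wins is what the proposal intends. The function name is_allowed and the sample paths are hypothetical; they are not the URLs from the list above, which are cut off.

```python
# Rules from the example above, kept in file order.
RULES = [
    ("disallow", "/org/plans.html"),
    ("allow", "/org/"),
    ("allow", "/serv"),
    ("disallow", "/"),
]


def is_allowed(path, rules=RULES):
    """Return True if the first rule whose prefix matches the path is an Allow.

    A path that matches no rule at all is treated as allowed.
    """
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True


# Hypothetical paths, chosen only to hit each rule at least once.
for path in (
    "/",
    "/index.html",
    "/org/about.html",
    "/org/plans.html",
    "/serv",
    "/services/fast.html",
    "/orgo.gif",
):
    print(path, "Yes" if is_allowed(path) else "No")
```

Order matters under this reading: if "Disallow: /" were moved to the top it would match every path first, and the Allow lines below it would never be reached.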

A note to "AIR" - I'm so embarrassed! My site is awful.
It's 5 years old, from when I first learned HTML. The amazing part is that it works. I get a lot of replies from my "free estimate" form. Been meaning to update it, but have been too busy with my clients' sites. Like the mechanic with the broken down car...

Air

5:27 pm on Nov 5, 2000 (gmt 0)

>A note to "AIR" - I'm so embarrassed!

Don't be, it's funny and memorable, I can see why you would get people's attention with it.

Thanks for the info on "Allow". I didn't know that. Do you recall where you saw it? I'd like to read up on it.

webtamers

9:19 pm on Nov 5, 2000 (gmt 0)

To AIR,

Well, thanks I feel a little better now...

Regarding the "Allow" here is the link:

[info.webcrawler.com...]

Here's an excerpt:
"Previous of this specification didn't provide the Allow line. The
introduction of the Allow line causes robots to behave slightly
differently under either specification:

If a /robots.txt contains an Allow which overrides a later occurring
Disallow, a robot ignoring Allow lines will not retrieve those
parts. This is considered acceptable because there is no requirement
for a robot to access URLs it is allowed to retrieve, and it is safe,
in that no URLs a Web site administrator wants to Disallow are be
allowed. It is expected this may in fact encourage robots to upgrade
compliance to the specification in this memo."
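
To make the excerpt a bit more concrete, here is a small sketch comparing a robot that honours Allow lines with one that silently skips them. It assumes first-match-wins evaluation and reuses the rules from the example earlier in the thread. The helper names and test paths are invented for illustration and do not describe any real spider.

```python
# Rules in file order, as in the example quoted earlier in the thread.
RULES = [
    ("disallow", "/org/plans.html"),
    ("allow", "/org/"),
    ("allow", "/serv"),
    ("disallow", "/"),
]


def first_match(path, rules):
    """First rule whose prefix matches decides; no match means allowed."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True


def without_allow(rules):
    """What an older robot sees if it ignores every Allow line."""
    return [(kind, prefix) for kind, prefix in rules if kind == "disallow"]


# Hypothetical paths used only for the comparison.
PATHS = ["/", "/org/about.html", "/org/plans.html", "/services/fast.html"]

for path in PATHS:
    newer = first_match(path, RULES)
    older = first_match(path, without_allow(RULES))
    print(path, "newer robot:", newer, "older robot:", older)
    # The older robot never fetches something the newer one must avoid.
    assert not (older and not newer)
```

The older robot ends up fetching less, never more, which seems to be what the excerpt means by "safe". It also suggests an Allow line cannot force a crawl: at best it carves an exception out of a later Disallow.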

I'm so confused! Maybe we should start a new thread about this Allow thing. Nobody really seems to understand it. When I first ran into this, it was on a site about promotion that claimed it would force some robots to crawl your whole site.

As a matter of fact I'm going to do that now, I think it merits some discussion, don't you agree?

Air

6:14 am on Nov 6, 2000 (gmt 0)

I think I know the paper you are referring to; it was a draft spec from a few years ago. The author proposes "Allow" but it never got anywhere. So for now I would stick with "Disallow" as the only valid directive for those bots still respecting the robots.txt.

There really isn't any way that I've found to force spiders to do anything. A new thread may be a good idea; I suspect it will evolve along these lines.