Robots.txt

Does it help

         

jsnow

3:09 pm on Oct 26, 2003 (gmt 0)

10+ Year Member



Maybe a bit of a silly question but does having a robots.txt file help?

I have no areas of the site that I don't want crawled, so I never bothered to have one, although I read something that said the first thing a bot requests is this file.

onfire

9:02 pm on Oct 26, 2003 (gmt 0)

10+ Year Member



Well, I don't use one, as I have 2 sites where everything can be crawled. My sites are now some 7 months old. Once there were some decent inbound links to them and Googlebot came looking, they were crawled daily, and both are now completely indexed with no robots.txt.

It's my understanding that robots.txt is used to stop some bots crawling all or some parts of your site. As I want everything to be crawled, I have never made one, and it's not done my sites any harm at all.

I would like to hear other reasons for needing one, e.g. that it helps to get sites crawled.

Mohamed_E

9:11 pm on Oct 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are two reasons why some people put an empty robots.txt file when they do not want to exclude anything:

1. The first thing a robot requests is indeed that file, and if it does not find it you will see a 404 message in your logs.

2. Badly configured web servers sometimes send a 403 code (access forbidden) when there is no robots.txt file instead of the correct 404 (file not found). The standard "suggests" (it does not require) that a site not be visited under those conditions, so you might be accidentally keeping robots out.

About a year ago Google decided to treat 403 like 404 and spider the site anyway.

All in all I see no good reason for telling a robot that it can do what it planned to do. The robots.txt file is designed to keep robots out :)
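
The status-code handling described above can be sketched in a few lines of Python. This is only an illustration: the function name, the `treat_403_as_404` flag, and the returned labels are made up for this example, and it does not reproduce the code of Googlebot or any other real crawler.

```python
def robots_policy_for_status(status: int, treat_403_as_404: bool = False) -> str:
    """Map the HTTP status of a /robots.txt request to a crawl decision.

    Assumed convention for this sketch:
      2xx        -> fetch succeeded, parse the file and obey it
      401 / 403  -> access restricted, stay out of the whole site
      other 4xx  -> no robots.txt (e.g. 404), crawl without restrictions
      otherwise  -> undecided, try again later
    """
    if 200 <= status < 300:
        return "parse-file"
    if status in (401, 403) and not treat_403_as_404:
        # Strict reading: a forbidden robots.txt keeps the robot out.
        return "disallow-all"
    if 400 <= status < 500:
        # Missing file, or a lenient robot that treats 403 like 404.
        return "allow-all"
    return "retry-later"


print(robots_policy_for_status(404))
print(robots_policy_for_status(403))
print(robots_policy_for_status(403, treat_403_as_404=True))
```

With the flag left off, a misconfigured server that answers 403 locks the robot out; with it on, the robot behaves as Google is described as doing above.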

jsnow

9:40 pm on Oct 26, 2003 (gmt 0)

10+ Year Member



That was what I thought and thank you for confirming. I just wanted to make sure I wasn't missing something

onedumbear

12:13 am on Oct 27, 2003 (gmt 0)

10+ Year Member



The robots.txt has other handy uses beyond just keeping specific robots out.
If you have original images on your site you can keep them from being harvested.
For example, if I don't want images from a site to be collected and listed in Google Images, the robots.txt will allow me to tell Googlebot not to take my images.
It would also allow me to tell all robots that are asking for the text not to take images.
There are other things you can do with the robots.txt as well.
When a robot does ask for the robots.txt, it is usually because it intends to obey it and wants to know if you have specific directions for it.
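
As a rough sketch of the image case, the robots.txt below shuts Google's image crawler out entirely and keeps every other robot out of an image directory. The `/images/` path and the file names are assumptions for illustration only. Python's standard `urllib.robotparser` module is used here just to check how one parser reads the rules; other robots may interpret them differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "/images/" is an assumed directory name.
IMAGE_RULES = """\
# Keep Google's image crawler out of the whole site
User-agent: Googlebot-Image
Disallow: /

# All other robots may fetch pages but not the image directory
User-agent: *
Disallow: /images/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


image_parser = build_parser(IMAGE_RULES)

# The image crawler is refused everywhere.
print(image_parser.can_fetch("Googlebot-Image", "/images/photo.jpg"))
# The regular crawler may fetch pages ...
print(image_parser.can_fetch("Googlebot", "/index.html"))
# ... but not files under the image directory.
print(image_parser.can_fetch("Googlebot", "/images/photo.jpg"))
```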

humpingdan

10:43 am on Oct 27, 2003 (gmt 0)

10+ Year Member



I waited months to get my site indexed by Google. In my log files, Google had been looking for a robots.txt file and was logging the standard 404 error. While this was happening, Google seemed to be crawling my site (index.htm only), but until I put a blank robots.txt file in the root I didn't appear in any SERPs.

just my observations!

also tested with three other websites!
humpingdan-

jsnow

11:34 am on Oct 27, 2003 (gmt 0)

10+ Year Member



Dan,

Did you find this to be the case with other search engines too, or just Google? I do pretty well there but not well anywhere else.

I believe there is a single line to tell robots they can index everything. Can somebody tell me what that is, as I might as well put one up?

John

dirkz

11:35 am on Oct 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> and was logging the standard 404 error

This is perfectly valid for sites that allow all bots unlimited access. Your experience must have been coincidence. It's also very common for Google to just visit "/" and do a deep crawl only periodically (especially for lower PR or new sites).

humpingdan

12:57 pm on Oct 27, 2003 (gmt 0)

10+ Year Member



Coincidence? I don't agree.
Based on experience across a number of domains, I've noticed that I'm only getting listed once a robots.txt is in place!

jdMorgan

2:00 pm on Oct 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jsnow,

> Can somebody tell me what that is as I might as well put one up.

robots.txt to allow all robots to access all files:

# Robots exclusion file for widgets.com (comment lines start with "#" character)
User-agent: *
Disallow:


Note the blank line at the end - some robots have been reported to require it.

Ref: A Standard for Robot Exclusion [robotstxt.org]
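
For anyone who wants to confirm that this file really allows everything, here is a small check using Python's standard `urllib.robotparser` module. The robot names and paths passed to `can_fetch` are arbitrary examples, and the result only shows how this one parser reads the file.

```python
from urllib.robotparser import RobotFileParser

# The allow-all file from this post, including the trailing blank line.
ALLOW_ALL = """\
# Robots exclusion file for widgets.com (comment lines start with "#" character)
User-agent: *
Disallow:

"""

allow_all_parser = RobotFileParser()
allow_all_parser.parse(ALLOW_ALL.splitlines())


def is_allowed(agent: str, path: str) -> bool:
    """Return True if the parsed rules let `agent` fetch `path`."""
    return allow_all_parser.can_fetch(agent, path)


# An empty Disallow value excludes nothing, so every path is open.
print(is_allowed("Googlebot", "/"))
print(is_allowed("SomeOtherBot", "/any/page.html"))
```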

Jim

jsnow

2:22 pm on Oct 27, 2003 (gmt 0)

10+ Year Member



jd

Thanks for that, just uploaded and hopefully should have some impact

John

onedumbear

4:58 pm on Oct 27, 2003 (gmt 0)

10+ Year Member



> based on experience through a number of domains, I've noticed that I'm only getting listed once a robots.txt is in place!

I have launched 4 small sites in the last 4 months. The most recent was 5 days ago.
2 of the sites have a robots.txt, but were picked up and listed by Google before the robots.txt was added.
The other 2 sites have no robots.txt file and were picked up and listed just as quickly.

tschild

5:12 pm on Oct 27, 2003 (gmt 0)

10+ Year Member



One thing that I notice quite often:

Some sites are configured to not return an error page but redirect to the home page when a nonexistent page is requested (not a good idea IMO but apparently done by some webmasters in order to retain traffic from broken links).

A request to /robots.txt serves a redirect to an HTML page on these servers. What the spider's robots.txt parser makes of this unexpected input is left to the spider's implementation: it might default to 'everything allowed' or to 'everything denied', or, if badly programmed, just crash.
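
To make this concrete, the sketch below feeds a made-up home page to Python's standard `urllib.robotparser`. That particular parser ignores lines it cannot read and ends up allowing everything; this says nothing about how Googlebot or any other spider reacts. The `looks_like_html` helper is a hypothetical heuristic a cautious robot could use to notice that it was handed a web page instead of a robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Made-up page standing in for what a redirected /robots.txt request returns.
HOME_PAGE_HTML = """\
<html>
<head><title>Welcome to widgets.com</title></head>
<body>
<a href="/products.html">Products</a>
</body>
</html>
"""


def looks_like_html(body: str) -> bool:
    """Crude heuristic: does the response start like an HTML document?"""
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# A lenient parser skips every unrecognised line and finds no rules at all.
lenient_parser = RobotFileParser()
lenient_parser.parse(HOME_PAGE_HTML.splitlines())

print(looks_like_html(HOME_PAGE_HTML))
print(lenient_parser.can_fetch("AnyBot", "/private/"))
```

A stricter robot could run a check like `looks_like_html` first and then decide for itself whether to treat the site as open or closed.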

dirkz

6:11 pm on Oct 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I've noticed that I'm only getting listed once a robots.txt is in place!

Not a single domain of mine actually has a robots.txt. Some have existed for years, some are a few months old. All are indexed.

If your experience is as you said, maybe you have the problem tschild stated.