Forum Moderators: goodroi

Quick tip

Check out your robots.txt

     
5:39 pm on Jun 11, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


I've seen this hit a couple webmasters lately and wanted to mention it. If someone requests a robots.txt file, don't return some custom 404 page or strange page with HTML, graphics, etc. This can happen if your webserver is configured to return a pretty page for requests when the page doesn't exist.

Just a straight text robots file or no robots.txt at all is preferred. Just lookin' out for things that sometimes catch people.
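
A rough way to express that rule of thumb in code. This is only a sketch: `classify_robots_response` is a made-up helper name, the verdict strings are arbitrary, and the HTML check is a crude heuristic rather than anything Googlebot is known to do. It also assumes that a real 404 status counts as "no robots.txt at all", whatever the body looks like.

```python
def classify_robots_response(status, content_type, body):
    """Return a rough verdict for a /robots.txt response (hypothetical helper)."""
    # Crude heuristic: an HTML content type or an <html> tag in the body.
    looks_like_html = "html" in content_type.lower() or "<html" in body.lower()
    if status == 404:
        # Assumption: a true 404 is read as "this site has no robots.txt".
        return "ok: no robots.txt"
    if status == 200 and not looks_like_html:
        return "ok: plain-text robots.txt"
    if status == 200 and looks_like_html:
        # The case from the post: a pretty error page served as robots.txt.
        return "problem: HTML page served as robots.txt"
    return "check: unexpected status %d" % status
```

For example, `classify_robots_response(200, "text/html", "<html>...</html>")` flags the "pretty page" case, while a 200 response with `text/plain` passes.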

5:44 pm on June 11, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member ciml is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 22, 2001
posts:3805
votes: 2


Thanks for the heads up, GoogleGuy.

I've tended to assume that unless the 404 page uses the word "disallow" or "allow" it would be OK. Maybe that's not quite right; it's easy enough to upload an empty robots.txt file anyway.

People have asked about 404 redirects. I couldn't find any indication in the Robots Exclusion Protocol; would you expect to follow the redirect, or just treat it as "Not Found" and ignore it?

5:49 pm on June 11, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


What about a blank robots.txt file?
5:56 pm on June 11, 2002 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38063
votes: 13


That would be fine, KM. It's just that there is so much junk in an HTML page that even the best robots parser can get confused.

If nothing else, make sure that your 404 redirect is generating a true 404 header first (before the redirect).

Stickysauce

7:05 pm on June 11, 2002 (gmt 0)

Inactive Member
Account Expired

 
 


Hi guys, I am new to the forums but am a regular visitor.

With regard to the 404 issue, I do not understand. I use a custom 404, which can be seen in my profile. What am I doing wrong? Also, do I need to use a robots.txt file? If so, where can I get one and where do I put it? Sorry for being a pain. Thanks guys, Irv

7:21 pm on June 11, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:May 29, 2000
posts:649
votes: 0


> redirect is generating a true 404 header first (before the redirect)

Are you referring to the title of the page? I have custom 404's on all my sites. I haven't seen any problems for the most part, but most don't redirect, some do. Then again I have some sites that aren't getting into Google for some reason and I want to cover all my bases.

7:24 pm on June 11, 2002 (gmt 0)

Senior Member

joined:June 27, 2000
posts:1548
votes: 0


Welcome to the forums, StickySauce.

Brett wrote an excellent article on how to write your own robots.txt file. It is located here [searchengineworld.com]

The robots.txt file is separate from your 404 file. Robots.txt is for search spiders that crawl your site; it tells them which parts of your site may be crawled and which may not.

7:33 pm on June 11, 2002 (gmt 0)

Full Member

10+ Year Member

joined:Jan 28, 2002
posts:240
votes: 0


I think I'm in the same situation as Jill. I don't have any robots.txt files. I have a custom 404 file named 404.html. When Googlebot comes around, is it picking up that 404.html when it does not find the robots.txt? That would really stink...
7:37 pm on June 11, 2002 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38063
votes: 13


Many sites will not return an error header:

domain.com/lasjfljsldjsljfsljdf

Will just pump your error page to them with no redirect and no actual error header.

You need to generate a 404 NOT FOUND header. If you don't know what yours is returning:

- go to StickyMail (log in if you aren't logged in).
- click on "headers" on the left menu.
- put in an address to a bogus file on your site (full http url)
- see what the header response is.

It should say:

HTTP/1.1 404 Not Found
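
If you can't get at the header tool, the same check can be sketched in a few lines of Python. This is an illustration, not the tool described above: `status_line` is a made-up name, and it uses the standard `http.client` module because that module does not follow redirects, so a 302 shows up as a 302.

```python
import http.client
from urllib.parse import urlsplit


def status_line(url, timeout=10.0):
    """Request `url` once, without following redirects, and return e.g. '404 Not Found'."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # A plain GET, roughly what a crawler would send for a page.
        conn.request("GET", path)
        resp = conn.getresponse()
        return "%d %s" % (resp.status, resp.reason)
    finally:
        conn.close()
```

Call it with a bogus address on your own site, for instance `print(status_line("http://www.example.com/no-such-file"))` (placeholder domain), and compare the result with the line shown above.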

7:41 pm on June 11, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:May 29, 2000
posts:649
votes: 0


Mine is returning this:
HTTP/1.1 302 Found
but the page is 404.html. Hmmm, what am I doing wrong?
7:41 pm on June 11, 2002 (gmt 0)

Full Member

10+ Year Member

joined:Jan 28, 2002
posts:240
votes: 0


HTTP/1.1 404 Not Found
Date: Tue, 11 Jun 2002 19:40:41 GMT
Server: Apache/1.3.19 (Unix) PHP/4.1.1 PHP/3.0.18 FrontPage/4.0.4.3
Last-Modified: Fri, 17 May 2002 18:30:19 GMT
ETag: "7eae26-7b3-3ce54c3b"
Accept-Ranges: bytes
Content-Length: 1971
Connection: close
Content-Type: text/html

Thanks Brett - I feel much better now! Time for a beer.

7:46 pm on June 11, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Oct 8, 2001
posts:2882
votes: 0


Blank is fine too. Also, http://www.robotstxt.org is a good site about robots, in addition to Brett's page.
8:29 pm on June 11, 2002 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38063
votes: 13


> HTTP/1.1 302 Found

Figuring out how your server software fits together under those circumstances can be difficult. If you don't know your server software very well, this would be the opportunity to test out the tech support at your host.

9:17 pm on June 11, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:May 29, 2000
posts:649
votes: 0


Testing them as we speak Brett. ;)
9:23 pm on June 11, 2002 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38063
votes: 13


GG, what percentage of the robots.txt files that Googlebot runs into are invalid?
10:01 pm on June 11, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 3, 2002
posts:482
votes: 0


Do I need a robots.txt for each directory or subdomain?
10:02 pm on June 11, 2002 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4843
votes: 2


Hi Martin,

You only need a robots.txt in your root directory, meaning the uppermost directory (usually called www) that is visible on your site.
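
To make the "root directory only" point concrete, here is a small sketch of how a crawler might work out where to look. `robots_url` is a hypothetical helper and the example.com addresses are placeholders. One thing worth noting: crawlers generally request /robots.txt from the root of each hostname, so a subdomain is normally treated as its own site with its own file.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address that governs `page_url` (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the directory path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Pages in different directories map to the same file on the same host...
print(robots_url("http://www.example.com/shop/items/page.html"))
# ...while a subdomain maps to its own robots.txt.
print(robots_url("http://blog.example.com/2002/06/"))
```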

10:16 pm on June 11, 2002 (gmt 0)

New User

joined:June 11, 2002
posts:13
votes: 0


Speaking of stuff like this, perhaps GoogleGuy can answer this question:

On the robots.txt standards site to which GG refers, the syntax for the META shows no space between the "noindex,nofollow"

On www.google.com/remove.html the syntax shows a space after the comma when describing the same META commands.

Does it make a difference? I understand that upper/lower case doesn't matter, but what about a space as in "NOINDEX, NOFOLLOW"?

10:43 pm on June 11, 2002 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts:12169
votes: 56


Whew!

HTTP/1.1 404 Object Not Found
Server: Microsoft-IIS/5.0
Date: Tue, 11 Jun 2002 22:39:38 GMT
Content-Length: 28197
Content-Type: text/html

I took this as a little hint, and the first thing I did was check our 404 setup. It is doing what it should. Thank you, GoogleGuy, for the heads up.

10:47 pm on June 11, 2002 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts:12169
votes: 56


> Does it make a difference? I understand that upper/lower case doesn't matter, but what about a space as in "NOINDEX, NOFOLLOW"?

My understanding is that it does not matter. As long as that comma is in there separating the robots-terms, you are okay.
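
For what it's worth, this is how a tolerant parser could treat the two spellings. It is only a sketch of one reasonable approach (split on commas, trim whitespace, ignore case); `parse_robots_meta` is a made-up name and nothing here is confirmed to be what Google's parser does.

```python
def parse_robots_meta(content):
    """Split a robots META content value into a set of lower-case terms."""
    # Split on commas, trim surrounding spaces, drop empty pieces, ignore case.
    return {term.strip().lower() for term in content.split(",") if term.strip()}


# Both spellings from the question produce the same set of terms.
print(parse_robots_meta("noindex,nofollow") == parse_robots_meta("NOINDEX, NOFOLLOW"))
```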

11:02 pm on June 11, 2002 (gmt 0)

Full Member from US 

10+ Year Member

joined:July 12, 2000
posts:323
votes: 4


Am I assuming correctly that if you have a robots.txt file, a redirect on a custom 404 would not be a problem?
1:56 am on June 12, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Herb,

Yes, if you have *any* robots.txt file in your root directory (blank or not), then your custom 404 page will not be invoked when Googlebot asks for robots.txt.

If robots.txt is missing completely *and* you have a custom 404 page, then GoogleGuy says this can cause problems for Googlebot.

For those of you who have found a problem using Brett's response header checker (see first page of this thread), are using the Apache ErrorDocument directive, and are getting 302 redirects instead of the desired 404 error code: make sure that the path you specify in your ErrorDocument directive is a *local* path beginning with a slash, i.e.

ErrorDocument 404 /404file.html

If you use a remotely-hosted URL, or *any* URL starting with "http:" such as

ErrorDocument 404 [yourdomain.com...]

then Apache will do an external redirect and return a 302 code instead of a 404. Therefore Googlebot and other robots may never remove your dead pages from their indices.

See the Apache Core Features ErrorDocument directive documentation for the official word.
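
If you want to scan an .htaccess or httpd.conf for this, a rough lint along the following lines might help. It is a sketch under stated assumptions: `error_document_issue` is a hypothetical helper, it only looks at one line at a time, and it relies on the directive taking a status code followed by a target, with the local-path versus full-URL distinction described above.

```python
def error_document_issue(line):
    """Return a warning for a risky ErrorDocument line, or None if it looks fine."""
    fields = line.split(None, 2)
    # Apache directive names are case-insensitive; ignore every other line.
    if not fields or fields[0].lower() != "errordocument":
        return None
    if len(fields) < 3 or not fields[1].isdigit():
        return "missing status code or target"
    target = fields[2].strip()
    if target.lower().startswith(("http://", "https://")):
        # A full URL makes the server answer with a redirect, not the error code.
        return "full URL target: clients get a redirect instead of the error status"
    return None


for sample in ("ErrorDocument 404 /404file.html",
               "ErrorDocument 404 http://www.example.com/404file.html",
               "ErrorDocument /404file.html"):
    print(sample, "->", error_document_issue(sample))
```

The example.com address is a placeholder; substitute lines from your own configuration.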

Thanks for the server response checker, BT!

Jim

3:18 am on June 12, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 3, 2002
posts:482
votes: 0


thank you LAN...

On this site I use
ErrorDocument /

to redirect every file that doesn't exist to the root, so if Google requested the robots.txt file, it got the main page.

The same page is in a PR6 DMOZ category with domain.com in the anchor text and has lots of quality links.

It dropped from the top 10 to the top 30 in ranking on all keywords. The PageRank remains. Even if I type domain.com alone or with other keywords, the site gets a start=20 ranking, with 30 pages above me that link to mine. :(

Was this causing the drop, and how long will it last once I add a robots.txt file?

ann

3:29 am on June 12, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 25, 2002
posts:2605
votes: 0


This seems right...I hope.

HTTP/1.1 404 Not Found
Date: Wed, 12 Jun 2002 03:26:47 GMT
Server: Apache/1.3.22 (Unix) PHP/4.1.0 mod_ssl/2.8.5 OpenSSL/0.9.6b
Connection: close
Content-Type: text/html; charset=iso-8859-1

Ann

8:18 am on June 12, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 18, 2001
posts:889
votes: 0


Neat tool! Thanks, Brett
10:35 am on June 12, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 29, 2002
posts:423
votes: 0


A useful tip for those using IIS and a custom 404.asp page...

Just add

<% Response.Status = "404 Not Found" %>

to your 404.asp code. The headers must be written before any output to the client occurs, so just make sure all your HTML and Response.Write calls come after that line.

This way, if you fail to include a robots.txt in your root dir, your custom error page will correctly return a "404 Not Found" status code.
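
The same principle, status first and friendly page second, looks like this in a toy Python server. This is not IIS or ASP code, just an illustration using the standard `http.server` module; `Custom404Handler`, `make_server` and the page text are invented for the sketch, and the handler answers every request with the error page because the toy site has no real files.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CUSTOM_404 = b"<html><body><h1>Sorry, that page is not here.</h1></body></html>"


class Custom404Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Send the real 404 status and the headers before any page content.
        self.send_response(404, "Not Found")
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(CUSTOM_404)))
        self.end_headers()
        # Only now write the friendly error page.
        self.wfile.write(CUSTOM_404)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Build a local test server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), Custom404Handler)
```

Running `make_server(8000).serve_forever()` and requesting /robots.txt from it should show a 404 status line together with the custom page body.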

3:48 pm on June 13, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:May 18, 2002
posts:83
votes: 0


Hi,

I did the test on our site and got the (feared) HTTP/1.1 302 response on an Apache server. I checked a friend's site hosted at the same hosting company (not necessarily on the same server), and it shows the same code.

What should I do?
What can I do?
Write my hosting company asking what exactly?
Should I add a robots.txt file myself to the root directory of my site to neutralize this?

I am a novice regarding those more technical aspects.

Help!

TIA

3:54 pm on June 13, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 5, 2001
posts:2728
votes: 8


pvdm, if you want to set up a custom 404 page, create the page and tell your ISP (hosting company) that you want to use it as your 404 page. Ask them where it will be stored and how to change it, and they will set it up in their web server settings for you :)
8:32 am on June 18, 2002 (gmt 0)

Junior Member

10+ Year Member

joined:May 18, 2002
posts:83
votes: 0


Hi Elite,

Sorry for the late reply, but thanks for the tip!

I'll do that.

Greetings!

9:07 am on June 18, 2002 (gmt 0)

Full Member

10+ Year Member

joined:Apr 5, 2002
posts:210
votes: 0


If I want to exclude a particular bot from crawling my site would this be the correct way to do it?

User-agent: BadBot
Disallow: /

Thanks
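
One way to sanity-check a rule like that is to run it through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser`; the `allowed` helper and the example.com address are made up for illustration. It shows how this particular parser reads the file, which is not a guarantee about any given crawler, and robots.txt only helps against bots that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Report whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.example.com/page.html"
print("BadBot allowed:", allowed(ROBOTS_TXT, "BadBot", page))
print("Googlebot allowed:", allowed(ROBOTS_TXT, "Googlebot", page))
print("Blank file, BadBot allowed:", allowed("", "BadBot", page))
```

With this parser the two-line record shuts out BadBot and leaves other user agents alone, and a blank file allows everything.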
