Google News Archive Forum

Google and having *no* robots.txt file
could this be hurting your site?
PFOnline · msg:176604 · 3:51 am on Nov 7, 2002 (gmt 0)

Hi, I want ALL pages and directories on my site to be indexed by Google, so I have no robots.txt file. I didn't find one necessary, because I don't need to block any files or directories.

But my questions are:

Could having no robots.txt file hurt my site somehow?

Should I make some sort of "blank" robots.txt file, maybe to please Googlebot?

Thanks

 

WebGuerrilla · msg:176605 · 4:04 am on Nov 7, 2002 (gmt 0)

You shouldn't have a problem as long as you are returning a standard 404 when the file can't be found.

If you are on Apache and your custom 404 page is configured with a full URL instead of a local path, a bot will get a 302 instead of a 404.

You can use the header check tool [webmasterworld.com] in the WebmasterWorld control panel to check.
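If you don't have that tool handy, here is a minimal sketch of the same kind of check in Python; the hostname is a placeholder, and the key point is that the status code is read without following redirects:

# Minimal header check for /robots.txt, along the lines of the
# tool mentioned above. "example.com" is a placeholder hostname.
import http.client

def check_robots_txt(host):
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("GET", "/robots.txt")
    resp = conn.getresponse()  # http.client does not follow redirects
    # 404 = no robots.txt (fine); 301/302 = a redirect is answering instead
    print(host, resp.status, resp.reason)
    conn.close()

check_robots_txt("example.com")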

PFOnline · msg:176606 · 4:09 am on Nov 7, 2002 (gmt 0)

Hi WebGuerrilla, thanks for the response, and for clarifying that it's a bad idea to have no robots.txt file while HAVING a custom redirect page!

A while ago I made a custom redirect page with no robots.txt file, but I removed the custom redirect after only a few days, and boy am I glad I did. It would have been bad if I'd still had that combo during the last Google update.

I'll just keep going with no robots.txt file and no custom redirect page. :)

Thanks again

PFOnline · msg:176607 · 4:15 am on Nov 7, 2002 (gmt 0)

Yikes, I just did the server header check for mydomain.com/robots.txt like you suggested, and it returned this:

HTTP/1.1 301 Moved Permanently
Date: Thu, 07 Nov 2002 04:10:01 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510 mod_jk/1.1.0
Location: [mydomain.com...]
Connection: close
Content-Type: text/html; charset=iso-8859-1

Is that bad, or is that OK? It kind of scares me, because shouldn't it read "404" rather than "301"?

I do use a 301 redirect on my site, because I have another similar domain that I want to redirect to my main domain.

Does that mean Googlebot is getting a 301 for my robots.txt file rather than a 404?

PFOnline · msg:176608 · 4:21 am on Nov 7, 2002 (gmt 0)

I forgot to mention:

when you manually type in mydomain.com/robots.txt, it returns the standard 404 error, not the 301 that the WebmasterWorld header check tool returns.

WebGuerrilla · msg:176609 · 4:42 am on Nov 7, 2002 (gmt 0)


No, that would be the bad one. :)

If you are using your .htaccess to serve your custom 404, it should look like this:

ErrorDocument 404 /error.html

That will return a 404.

If it looks like this:

ErrorDocument 404 [yourdomain.com...]

it will return a 301.
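A quick way to see which of the two you're getting is to request a page that can't exist and look at the raw status code. A rough sketch in Python, with a placeholder host and path:

# Request a page that should not exist and report the status code:
# 404 means the custom error page is being served correctly, while
# 301/302 means the full-URL ErrorDocument form is redirecting.
# Host and path are placeholders.
import http.client

conn = http.client.HTTPConnection("example.com", timeout=10)
conn.request("GET", "/this-page-should-not-exist.html")
resp = conn.getresponse()
print(resp.status, resp.reason, resp.getheader("Location"))
conn.close()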

jdMorgan · msg:176610 · 4:51 am on Nov 7, 2002 (gmt 0)

PFOnline,

It's a good idea to have a robots.txt file on your site, even if it is just a blank file. This prevents Google and the other 'bots from filling your log files with 404 errors, and functions just like a "real" non-blank robots.txt file containing

User-agent: *
Disallow:

would. That is, it allows robots to fetch all linked pages, scripts, and graphics if they want to.

There is a good robots.txt tutorial [searchengineworld.com] and robots.txt checker [searchengineworld.com] over on the WebmasterWorld sister site at Search Engine World.
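For anyone who'd rather test from a script than a web form, Python's standard library can also fetch and evaluate a robots.txt; a small sketch, with a placeholder domain:

# Confirm that a permissive robots.txt really allows everything,
# using Python's standard-library parser. Placeholder domain.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the live file

# With "User-agent: *" / "Disallow:" every path should be fetchable:
print(rp.can_fetch("*", "http://example.com/any/page.html"))  # True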

Jim

PFOnline · msg:176611 · 4:53 am on Nov 7, 2002 (gmt 0)

Wait, wait, I'm confused. Let me try to clarify.

Currently I'm not using any sort of custom 404 error page, and I have no robots.txt file.

When you manually go to mydomain.com/robots.txt, it gives the "HTTP 404 Not Found" page...

The only thing that said "301" was the WebmasterWorld header check tool.

Sorry for the confusion.

Should I be OK if, when I manually type in mydomain.com/robots.txt, I get the "404 Not Found" page, even though the WebmasterWorld header check tool says "301"?

Thanks

PFOnline · msg:176612 · 4:56 am on Nov 7, 2002 (gmt 0)

Thanks, Jim.

So just put:

User-agent: *
Disallow:

in a blank Notepad file, save it as robots.txt, and upload it to the main directory?

Sorry I'm such a newbie at this! I've never messed with robots.txt.

Thanks!

WebGuerrilla · msg:176613 · 5:10 am on Nov 7, 2002 (gmt 0)


>>The only thing that said "301" was the WebmasterWorld header check tool.

If you make the request using HTTP/1.0, you get the 301.

If you make it using HTTP/1.1, you get the 404.

Brett's tool is using 1.0. (An HTTP/1.0 request doesn't have to send a Host: header, so on a name-based virtual host setup it can end up answered by the server's default host, which may redirect.)
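That difference is easy to reproduce by hand with raw requests. A rough sketch in Python, using a placeholder host:

# Send the same request as HTTP/1.0 (no Host header) and as
# HTTP/1.1 (Host header required) and compare the status lines.
# "example.com" is a placeholder.
import socket

def status_line(host, request):
    s = socket.create_connection((host, 80), timeout=10)
    s.sendall(request.encode("ascii"))
    line = s.makefile().readline().strip()  # e.g. "HTTP/1.1 404 Not Found"
    s.close()
    return line

host = "example.com"
print(status_line(host, "GET /robots.txt HTTP/1.0\r\n\r\n"))
print(status_line(host, "GET /robots.txt HTTP/1.1\r\n"
                        "Host: " + host + "\r\n"
                        "Connection: close\r\n\r\n"))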

msr986 · msg:176614 · 5:14 am on Nov 7, 2002 (gmt 0)

See this, straight from GoogleGuy's mouth:

[webmasterworld.com...]

PFOnline · msg:176615 · 5:21 am on Nov 7, 2002 (gmt 0)

Thanks WebGuerrilla and msr986. I've made a robots.txt file that looks like this:

User-agent: *
Disallow:

But ONE LAST question:

What should the robots.txt file CHMOD be? 777?

Just me being paranoid; I want to make sure I do this right!

Thanks!

jdMorgan · msg:176616 · 5:37 am on Nov 7, 2002 (gmt 0)

PFOnline,

The safest way to do this is to upload your file to your server, but name it robots.temp, or something other than robots.txt. Then use the Search Engine World tool I cited above to check it. If the tool says it is OK, then it's OK. Rename it to robots.txt on your server, and that's it.

The advantages are:
1) Search engines don't get a 404 or a 301, they get a robots.txt. This makes them happy (and generous). :)
2) Your log files will contain fewer 404 error lines, because the robots.txt file is present when requested.

Jim

coosblues · msg:176617 · 7:28 am on Nov 7, 2002 (gmt 0)

I just ran my robots.txt through the validator - are these the results I want?
HTTP status: 200 OK
Syntax check of robots.txt on [mydomain.net...] (27 bytes)
No errors detected! This robots.txt validates to the robots exclusion standard!

robots.txt source code for [mydomain.net...]:
1 User-agent: *
2 Disallow:

przero2 · msg:176618 · 6:24 pm on Nov 22, 2002 (gmt 0)

>> No, that would be the bad one.
>> If you are using your .htaccess to serve your custom 404, it should look like this:
>> ErrorDocument 404 /error.html
>> That will return a 404.
>> If it looks like this:
>> ErrorDocument 404 [yourdomain.com...]
>> it will return a 301.

Valuable tip there! Thanks, WG. - przero2

przero2 · msg:176619 · 7:21 pm on Nov 22, 2002 (gmt 0)

I am seeing the following with the header check tool on WebmasterWorld ...

[domain.com...] returns a 302 in the header

and

[domain.com...] returns a 404 header.

domain.com redirects to www.domain.com. Given that, is the above okay, and if not, what needs to be done to return a 404 for both [domain.com...] and [domain.com...]?

BTW, I am using custom .htaccess ErrorDocument directives.
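One way to see exactly what each hostname answers, and where any redirect points, is a quick loop over both hosts. A sketch with placeholder names:

# Check /robots.txt on both hostnames and show any redirect target.
# Hostnames are placeholders for the poster's real domains.
import http.client

for host in ("domain.com", "www.domain.com"):
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("GET", "/robots.txt")
    resp = conn.getresponse()
    print(host, resp.status, resp.getheader("Location"))
    conn.close()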

jady · msg:176620 · 8:11 pm on Nov 22, 2002 (gmt 0)

We tested this theory: one site with PR5 and no robots.txt, and another PR5 site with a robots.txt. Both got just about equal visits from Googlebot, even though a 404 was being returned for the robots.txt. No noticeable change in rankings - BUT - it takes 2 seconds to create one, so we keep them on all of our clients' sites.

Believe it or not, we actually keep one well-ranked site JUST for testing purposes. :)

Hoople · msg:176621 · 4:29 am on Nov 28, 2002 (gmt 0)

I added my robots.txt file when this thread started and have seen no reduction in visits from Googlebot or Slurp.

annej · msg:176622 · 5:13 am on Nov 28, 2002 (gmt 0)

I followed msr986's link to what GoogleGuy said, and I can see I need to make some changes: >>This can happen if your webserver is configured to return a pretty page for requests when the page doesn't exist.<<

Yep, I've got those "pretty pages", and Google is picking them up as regular HTML pages.

I had put <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> on the 404 page on one site, and Google still has it listed in the SERPs. Is it not picking up the meta instructions? Or perhaps I just haven't waited long enough.

I liked the idea of being able to customize the 404 page, as I had just moved some articles to new URLs for organizational reasons.

What I'm wondering is whether I should just go back to the basic, non-customized 404 page.

Anne

jdMorgan · msg:176623 · 7:14 am on Nov 28, 2002 (gmt 0)

All,

I'd like to reiterate what a few others have said, and clarify some issues in order to dispel some of the fear and doubt becoming apparent here. I also want to make a correction to some code I've posted here on WebmasterWorld in the past.

First, if you want to allow all spiders to access all of your files, you can:

  • Place no robots.txt file on your site
  • Place an empty file on your site, named robots.txt
  • Place a robots.txt file on your site with the following contents:

    User-agent: *
    Disallow:

If you want all spiders to access all of your pages, then the only advantage to having a robots.txt file is that you won't get hundreds of 404-Not Found errors on your site each day, caused by robots trying to fetch your non-existent robots.txt. Personally, I like to keep my error log file as small as possible because it makes finding real errors easier, so having a robots.txt file is a big advantage for me.

If you do put up a robots.txt file, it's a very good idea to validate it with the robots.txt validator [searchengineworld.com]. Check the instructions on that page before you upload your file, because it offers a useful option of checking your file before you name it robots.txt and risk having a robot read it while it is invalid.

Second - and this applies to Apache servers only - if you use a custom 404 error page, review the Apache Server Core ErrorDocument documentation [httpd.apache.org] carefully.

What it says is that an ErrorDocument directive to implement a custom 404 page should look like this:

ErrorDocument 404 /mycustomerrorpage.html

and not:

ErrorDocument 404 http://www.mydomain.com/mycustomerrorpage.html

If you provide a full URL, as in the second example above, Apache will return a 301-Moved Permanently server response code instead of a 404-Not Found response code, and that will cause trouble - even though the proper custom error document will be served. It is likely that you will find one of your old and now-missing pages indexed by a search engine, but listed with the title and description of your custom error page. To avoid this, use the local path only, not the full URL.

Third, again referring to Apache servers only... If you have redirects in place in your httpd.conf or .htaccess files, these redirects may be invoked when any request is made for any file, including requests from the robots.txt validator and the server header checker [webmasterworld.com]. I had a problem with the header checker awhile ago, and finally figured out that what was causing it was a mod_rewrite [httpd.apache.org] domain redirect I was using to "merge" my .com and .org TLDs. I had:

RewriteCond %{HTTP_HOST} !^www\.mydomain\.org$
RewriteCond %{HTTP_HOST} !^123\.45\.67\.89$
RewriteRule ^(.*)$ http://www.mydomain.org/$1 [R=permanent,L]

The intent was to redirect requests for any domain other than my .org domain or my IP address to my .org domain.

It turns out that this caused a problem with the server header checker. For some reason, all requests from the header checker were being redirected, even when I gave it the correct [mydomain.org...] path. After some hair-pulling, I found that the code needed a correction: remove the end-anchors from the RewriteCond patterns, leaving:

RewriteCond %{HTTP_HOST} !^www\.mydomain\.org
RewriteCond %{HTTP_HOST} !^123\.45\.67\.89
RewriteRule ^(.*)$ http://www.mydomain.org/$1 [R=permanent,L]

After this correction, the problem disappeared. The conclusion is that the server header checker is appending something to the HTTP_HOST field that causes the first example to perform a redirect. Whether this is an HTTP port number, an extra end-of-line, or something else, I don't know. I also don't know whether this modification should really have been necessary - that is, whether it's truly a needed "correction" or just a "good-idea modification". But I prefer a robust site to a fragile site any day, so there it is. I have posted the "incorrect" variant of this code on WebmasterWorld before, and if it has caused this same problem for you, I apologize... Those old posts can't be fixed now, because owner edit only works for 30 minutes. :(
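One plausible mechanism, sketched below in Python with illustrative hostnames: if the checker's request carries a Host header with an explicit port appended, an end-anchored pattern no longer matches, so the negated RewriteCond succeeds and the redirect fires.

# Demonstrate why an end-anchored HTTP_HOST pattern can misfire
# when something (such as a port number) is appended to the host.
import re

anchored = re.compile(r"^www\.mydomain\.org$")
unanchored = re.compile(r"^www\.mydomain\.org")

for host in ("www.mydomain.org", "www.mydomain.org:80"):
    print(host,
          "anchored:", bool(anchored.match(host)),
          "unanchored:", bool(unanchored.match(host)))
# The anchored pattern fails on "www.mydomain.org:80", so a negated
# RewriteCond (!^...$) would treat it as a foreign host and redirect.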

I hope this post is useful.

Jim
