Forum Moderators: goodroi


Google hits robots.txt but does not index anything?

Am I doing something wrong?

         

mbatta

12:30 am on Apr 12, 2005 (gmt 0)

10+ Year Member



My robots.txt file is pretty basic and only disallows a handful of folders (including cgi-bin). Google has requested the file twice over the past 6 months, and each time it fetched only that file and did not index any pages. Does anyone know why this might happen? MSN, Yahoo and others successfully index many pages.

Jer1024

12:56 pm on Apr 12, 2005 (gmt 0)

10+ Year Member



Can we see a link to your site?
Or an example of your robots.txt file?

mbatta

3:12 pm on Apr 13, 2005 (gmt 0)

10+ Year Member



Here it is:

User-agent: *
Disallow: /phpform/
Disallow: /cgi-script/
Disallow: /cgi-bin/
Disallow: /Library/
Disallow: /Templates/
Disallow: /Testing/
Disallow: /UserAdmin
Disallow: /Recent_Uploads
Disallow: /Submit
Disallow: /Research

my pages all sit on the root.

mbatta

10:43 pm on Apr 19, 2005 (gmt 0)

10+ Year Member



Well, G hit the robots.txt again today, but still did not index anything (or hit any other pages).

Jer1024

1:30 am on Apr 20, 2005 (gmt 0)

10+ Year Member



Do you have the following in your meta tags?

<meta name="ROBOTS" content="Index,Follow">

Reid

9:10 am on Apr 24, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How come there are no trailing slashes on the entries after /Testing/?

You say all files are in the root, but you are disallowing directories.
If you are disallowing specific files you need the file extension.

mbatta

4:49 pm on Apr 28, 2005 (gmt 0)

10+ Year Member



>>>Do you have the following in your meta tags?
<meta name="ROBOTS" content="Index,Follow">

No, what does this do?

Jer1024

5:30 pm on Apr 28, 2005 (gmt 0)

10+ Year Member



It tells the spider/robot/crawler, etc., to index the page and to follow the links on your page to other pages on your website.

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">

Would tell the bot not to index this page or follow the links.

I'm guessing in most cases the bot would index the page and follow links by default, BUT it probably doesn't hurt to have it on there anyway.

I think the earlier point about the missing trailing '/' will probably do the trick though.

Reid

2:15 am on Apr 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot (and most bots) will obey robots.txt OR META tags; I'm not sure how it works with both though, whether one overrides the other or not.
A flawless robots.txt file will work just fine.

disallow: /directory/

will disallow all files in that directory.

disallow: /example

will only disallow the index of that directory?
Does it mean disallow /example/? Or example.htm?
(depends on the bot, or whether there is a directory and a file with the same name)

disallow: /click.php

means disallow click.php?id=2, id=3, etc. (or you could use /click.* )

and while we're at it we might as well

disallow: /

means disallow everything

disallow:

means disallow nothing

jdMorgan

2:46 am on Apr 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Googlebot (and most bots) will obey robots.txt OR META tags, not sure how it works with both though, wether one over-rides the other or not.

If a page is disallowed in robots.txt, then no robots.txt-compliant robot will ever fetch that page. Therefore, the Robots meta-tag on that page is irrelevant.

<meta name="robots" content="Index,Follow"> is not needed. Index,Follow is the default if the tag is not present, and using it only wastes bandwidth and pushes your content down in your file.
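
To make the default concrete, here is a rough Python sketch of how a crawler might read the Robots meta-tag. It is only an illustration: the names RobotsMetaParser and robots_meta_policy are made up for this example, and no real robot is claimed to work this way. The point is simply that a missing tag and "Index,Follow" give the same answer.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values keep their original case.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html):
    """Return (may_index, may_follow). Both default to True when no tag is present."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow
```

Feeding it a page with no Robots meta-tag, or one with content="Index,Follow", should return (True, True) in both cases; only NOINDEX or NOFOLLOW changes the result.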

The Robots standard is based on prefix-matching. Therefore:

Disallow: /directory/

will disallow anything that starts with "/directory/" - It will disallow all files in a subdirectory named "/directory".

Disallow: /example

will disallow anything that starts with "/example" - It will disallow all files in a directory named "/example" and it will disallow any file in the root of the site whose name starts with "/example", e.g. /example.php or /example.gif

Disallow: /click.php

means disallow anything that starts with "/click.php" such as /click.php?id=2
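
The prefix-matching rule described above can be sketched in a few lines of Python. This is a simplified illustration, not any robot's actual code; the function name is_disallowed is made up, and it ignores User-agent records and URL encoding entirely.

```python
def is_disallowed(path, disallow_rules):
    """Return True if the URL path starts with any non-empty Disallow value.

    An empty Disallow value matches nothing, so it is skipped.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# The cases discussed in this thread:
checks = [
    ("/directory/page.htm", ["/directory/"]),  # inside the directory
    ("/example.php", ["/example"]),            # root file sharing the prefix
    ("/example/file.htm", ["/example"]),       # file in the directory
    ("/click.php?id=2", ["/click.php"]),       # query string after the prefix
    ("/index.htm", [""]),                      # empty Disallow blocks nothing
]
for path, rules in checks:
    print(path, rules, is_disallowed(path, rules))
```

Note that "Disallow: /example" catches /example.php as well as everything under /example/, which is exactly why the trailing slash matters.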

You can use "Disallow: /click.*" only in robots.txt records specific to Google or another search engine that supports this non-standard extension to the robots.txt "standard" (It's not really a standard, because it was never formally adopted, there is no sanctioning body for it, and compliance is purely voluntary). Do not use this wildcard construct in a catch-all record that starts with "User-agent: *" -- It is invalid for most robots.
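
A small Python sketch shows why the wildcard is risky in a catch-all record. It assumes that a wildcard-aware robot treats "*" as "any run of characters" and that every other robot does plain prefix matching, so the "*" is just a literal character to it. Both function names are made up for this example; check each search engine's own documentation for its exact rules.

```python
import re


def literal_prefix_match(path, rule):
    """Plain prefix matching: the '*' in the rule is an ordinary character."""
    return path.startswith(rule)


def wildcard_match(path, rule):
    """Assumed wildcard behaviour: '*' stands for any run of characters.

    The rule is still anchored at the start of the path.
    """
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None


rule = "/click.*"
path = "/click.php?id=2"
print(wildcard_match(path, rule))        # wildcard-aware robot
print(literal_prefix_match(path, rule))  # prefix-only robot
```

Under these assumptions the wildcard-aware robot treats /click.php?id=2 as disallowed, while the prefix-only robot finds no match and may fetch it, because no URL on the site literally starts with "/click.*".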

Jim

[edited by: jdMorgan at 3:26 am (utc) on April 29, 2005]

jdMorgan

2:55 am on Apr 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mbatta,

Your robots.txt looks valid to me. Check it here [searchengineworld.com] just to be sure.
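
If you would rather check it locally, here is a sketch using Python's standard urllib.robotparser module against the rules you posted. The page names (/index.html, /Research.html, /cgi-bin/form.pl) are made-up examples, and Python's parser is only an approximation of how Googlebot reads the file.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt posted earlier in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /phpform/
Disallow: /cgi-script/
Disallow: /cgi-bin/
Disallow: /Library/
Disallow: /Templates/
Disallow: /Testing/
Disallow: /UserAdmin
Disallow: /Recent_Uploads
Disallow: /Submit
Disallow: /Research
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Hypothetical paths; substitute real pages from your own site.
for path in ("/", "/index.html", "/cgi-bin/form.pl", "/Research.html"):
    print(path, parser.can_fetch("Googlebot", path))
```

With this parser the home page and /index.html come out as fetchable, so nothing in the file blocks ordinary root pages. A root page such as /Research.html would be blocked, though, because "Disallow: /Research" has no trailing slash.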

Make sure it is a plain-text file using LF or CR,LF line endings. Edit it in Notepad or some other simple editor, not a fancy word processing program or HTML editor. The file should contain only those lines you posted above.
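
A quick way to spot a file that has been through the wrong editor is to look at its raw bytes. The Python sketch below is a rough heuristic. The function name find_format_problems is made up, and the byte-order-mark and non-ASCII checks are my own assumptions about what a word processor might leave behind, not requirements taken from the robots.txt standard.

```python
def find_format_problems(data):
    """Return a list of human-readable problems found in raw robots.txt bytes."""
    problems = []
    # Assumption: a UTF-8 byte order mark may confuse simple parsers.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
        data = data[3:]
    # Drop valid CR,LF pairs first; any CR still present is a bare CR.
    if b"\r" in data.replace(b"\r\n", b"\n"):
        problems.append("bare CR line endings found")
    # Assumption: smart quotes and similar debris show up as non-ASCII bytes.
    if any(byte > 127 for byte in data):
        problems.append("non-ASCII bytes found")
    # HTML editors sometimes wrap the text in markup.
    if b"<html" in data.lower():
        problems.append("HTML markup found")
    return problems
```

Read the file in binary mode, for example open("robots.txt", "rb").read(), and pass the result in; an empty list means none of these particular problems were found.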

You might also want to request robots.txt and a few of your pages manually with a server headers checker [webmasterworld.com], and make sure they return a 200-OK server response code.
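
If you prefer to do the same check from a script, a minimal Python sketch might look like the following. The function names are made up, the User-Agent string is arbitrary, and note that urlopen follows redirects on its own, so a 301 or 302 will show up as the status of the final page rather than as the redirect itself.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if the server could not be reached."""
    request = Request(url, headers={"User-Agent": "status-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions but still carry a code.
        return err.code
    except URLError:
        return None


def describe_status(code):
    """Turn a status code into a short verdict for this particular check."""
    if code is None:
        return "no response"
    if code == 200:
        return "200 OK"
    return f"unexpected status {code}"
```

You would call it as describe_status(fetch_status("http://www.example.com/robots.txt")), with your own domain in place of the placeholder, and repeat it for a few ordinary pages.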

If your site is new, and you have only a few incoming links from moderate-to-high PageRank sites, it may just take a while for Googlebot to get interested in it. Incoming links and patience are required.

Jim

Reid

3:31 am on Apr 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



thanks for the clarification Jim.
sorry about the confusion mbatta.

mbatta - here's a cool tool to check out.
It gives you 'googlebot view' of your site and includes some diagnostics with headers returned and stuff. (no guarantee that it is exactly as googlebot but if this tool gets hung up, so will googlebot).
I found a lot of 'little errors' with it.

example:
a href="file.htm"
is different than
a href= "file.htm"

Just type a URL into the box and it will spider the page and the pages it links to.

google search 'poodle predictor'

it will show up in your logs as a bot 'poodle predictor'

Jer1024

12:56 pm on Apr 29, 2005 (gmt 0)

10+ Year Member



Cool, thanks for the tips, and the tool.
If Poodle is any indication, then from what I'm seeing Google ignores <meta name="robots">.

I always knew it pretty much seemed to ignore <meta name="keywords"> and <meta name="description">.

But when I ran Poodle it went ahead and spidered through pages that I had marked "NOINDEX" or "NOFOLLOW".

It may just act that way in Poodle, but I may remove the tags anyway and rewrite my robots.txt accordingly.

Thanks guys ( even if you did mean to give it to Mbatta ) It will be very helpful for me.

Reid

2:34 am on Apr 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's OK Jer, these threads are for anyone to learn and benefit from. Hey, I learned something too, from Jim's post.

mbatta

4:03 pm on May 3, 2005 (gmt 0)

10+ Year Member



Thanks for all the help everyone. FYI, I now get an "Under Construction" result when I do a URL search for my site in G. I guess this means they are working on indexing it? It's been that way for a week or so.

Jer1024

5:28 pm on May 3, 2005 (gmt 0)

10+ Year Member



That's odd...

( Unless it says "Under Construction" on your Home Page )

mbatta

7:54 pm on May 3, 2005 (gmt 0)

10+ Year Member



no, sure doesn't say that on my homepage. Here is the text:

Under Construction
...
2002 examplehost. All Rights Reserved.

I don't know what examplehost is, and my homepage in no way says any of this. I realize this is now getting off topic so I'm going to move it to the general G board.

[edited by: ThomasB at 1:28 pm (utc) on May 4, 2005]
[edit reason] examplified [/edit]

mbatta

8:08 pm on May 3, 2005 (gmt 0)

10+ Year Member



Actually, I just realized there is no general Google forum where this would fit, so since we're on the issue here, any input would be appreciated. When I type www.mysite.com into Google search, I get the following:

Under Construction
...
2002 examplehost. All Rights Reserved.

Again, my home page in no way says anything about being under construction. very strange.

[edited by: ThomasB at 1:28 pm (utc) on May 4, 2005]
[edit reason] examplified [/edit]

Reid

5:38 am on May 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google doesn't use 'under construction'.
It sounds to me like Google has a cache of your homepage from when you launched, or from before you launched, your website.
This is typical for a site Google is having trouble crawling.
Did you try Poodle Predictor on your home page?