Google hits robots.txt but does not index anything?

Forum Moderators: goodroi

Am I doing something wrong?

mbatta

12:30 am on Apr 12, 2005 (gmt 0)

My robots.txt file is pretty basic and only disallows a handful of folders (including cgi-bin). Google has visited the file on two occasions over the past 6 months, and each time it just fetched the file and did not index any pages. Does anyone know why this might happen? MSN, Yahoo, and others successfully index many pages.

Jer1024

12:56 pm on Apr 12, 2005 (gmt 0)

Can we see a link to your site?
Or an example of your robots.txt file?

mbatta

3:12 pm on Apr 13, 2005 (gmt 0)

Here it is:

User-agent: *
Disallow: /phpform/
Disallow: /cgi-script/
Disallow: /cgi-bin/
Disallow: /Library/
Disallow: /Templates/
Disallow: /Testing/
Disallow: /UserAdmin
Disallow: /Recent_Uploads
Disallow: /Submit
Disallow: /Research

My pages all sit in the root.

mbatta

10:43 pm on Apr 19, 2005 (gmt 0)

Well, G hit the robots.txt again today, but still did not index anything (or hit any other pages).

Jer1024

1:30 am on Apr 20, 2005 (gmt 0)

Do you have the following in your meta tags?

<meta name="ROBOTS" content="Index,Follow">

Reid

9:10 am on Apr 24, 2005 (gmt 0)

How come there are no trailing slashes on the entries after /Testing/?

You say all files are in the root, but you are disallowing directories.
If you are disallowing specific files, you need the file extension.

mbatta

4:49 pm on Apr 28, 2005 (gmt 0)

>>>Do you have the following in your meta tags?
<meta name="ROBOTS" content="Index,Follow">

No, what does this do?

Jer1024

5:30 pm on Apr 28, 2005 (gmt 0)

It tells the spider/robot/crawler etc. to index the page and to follow the links on your page to other pages on your website.

<meta name="ROBOTS" content="NoIndex,NoFollow">

would tell the bot not to index this page or follow the links.

I'm guessing in most cases the bot would index the page and follow links by default, BUT it probably doesn't hurt to have it on there anyway.

I think the suggestion made earlier about the missing trailing '/' will probably do the trick, though.

Reid

2:15 am on Apr 29, 2005 (gmt 0)

Googlebot (and most bots) will obey robots.txt OR META tags; I'm not sure how it works with both, though, or whether one overrides the other.
A flawless robots.txt file will work just fine.

disallow: /directory/

will disallow all files in that directory.

disallow: /example

will only disallow the index of that directory?
Does it mean disallow /example/? Or example.htm?
(It depends on the bot, or on whether there is a directory and a file with the same name.)

disallow: /click.php

means disallow click.php?id=2, id=3, etc. (or you could use /click.* )

And while we're at it, we might as well cover:

disallow: /

means disallow everything

disallow:

means disallow nothing

jdMorgan

2:46 am on Apr 29, 2005 (gmt 0)

> Googlebot (and most bots) will obey robots.txt OR META tags; I'm not sure how it works with both, though, or whether one overrides the other.

If a page is disallowed in robots.txt, then no robots.txt-compliant robot will ever fetch that page. Therefore, the Robots meta-tag on that page is irrelevant.

The Robots standard is based on prefix-matching. Therefore:

Disallow: /directory/

will disallow anything that starts with "/directory/" - It will disallow all files in a subdirectory named "/directory".

Disallow: /example

will disallow anything that starts with "/example" - It will disallow all files in a directory named "/example" and it will disallow any file in the root of the site whose name starts with "/example", e.g. /example.php or /example.gif

Disallow: /click.php

means disallow anything that starts with "/click.php" such as /click.php?id=2

You can use "Disallow: /click.*" only in robots.txt records specific to Google or another search engine that supports this non-standard extension to the robots.txt "standard" (It's not really a standard, because it was never formally adopted, there is no sanctioning body for it, and compliance is purely voluntary). Do not use this wildcard construct in a catch-all record that starts with "User-agent: *" -- It is invalid for most robots.
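The prefix-matching rule Jim describes can be sketched in a few lines of Python. This is only an illustration of the matching logic for the original (non-wildcard) rules, not how any particular crawler is implemented; the function name `is_disallowed` and the sample paths are made up for the example.

```python
def is_disallowed(path, disallow_rules):
    """Return True if `path` starts with any non-empty Disallow prefix.

    An empty rule ("Disallow:" with no value) disallows nothing, so it is skipped.
    """
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


# Hypothetical paths, checked against the rules discussed above.
print(is_disallowed("/directory/page.htm", ["/directory/"]))  # inside the directory
print(is_disallowed("/example.php", ["/example"]))            # root file sharing the prefix
print(is_disallowed("/click.php?id=2", ["/click.php"]))       # query string after the prefix
print(is_disallowed("/index.htm", [""]))                      # empty rule blocks nothing
```

Note that the sketch deliberately has no wildcard handling, which mirrors the point that "/click.*" is an extension only some search engines support.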

Jim

[edited by: jdMorgan at 3:26 am (utc) on April 29, 2005]

jdMorgan

2:55 am on Apr 29, 2005 (gmt 0)

mbatta,

Your robots.txt looks valid to me. Check it here [searchengineworld.com] just to be sure.
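As a rough local cross-check, Python's standard-library `urllib.robotparser` can parse the rules mbatta posted and report which paths a compliant robot may fetch. This only shows how that one parser reads the file, not what Googlebot actually does, and the page names below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules as posted earlier in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /phpform/
Disallow: /cgi-script/
Disallow: /cgi-bin/
Disallow: /Library/
Disallow: /Templates/
Disallow: /Testing/
Disallow: /UserAdmin
Disallow: /Recent_Uploads
Disallow: /Submit
Disallow: /Research
"""


def allowed(path, user_agent="Googlebot"):
    """Return True if the standard-library parser would let `user_agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical page names; substitute real paths from the site.
for path in ["/index.html", "/cgi-bin/form.pl", "/Research.html"]:
    print(path, "allowed" if allowed(path) else "blocked")
```

Because matching is by prefix, a root-level page such as the hypothetical /Research.html is blocked by "Disallow: /Research", while /index.html is not affected by any of the rules.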

Make sure it is a plain-text file using LF or CR,LF line endings. Edit it in Notepad or some other simple editor, not a fancy word processing program or HTML editor. The file should contain only those lines you posted above.
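The plain-text and line-ending advice can be checked mechanically. The sketch below is one way to do it in Python, assuming the file's raw bytes are available; the function name `find_format_problems` is invented for this example, and treating every non-ASCII byte as suspect is a simplifying assumption rather than part of any robots.txt rule.

```python
def find_format_problems(data):
    """Return a list of human-readable problems found in raw robots.txt bytes."""
    problems = []
    # Remove the acceptable CR,LF pairs; any CR left over is a bare CR line ending.
    if b"\r" in data.replace(b"\r\n", b"\n"):
        problems.append("bare CR line endings")
    # Word processors tend to add smart quotes or binary formatting data.
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        problems.append("non-ASCII bytes (not a plain-text file?)")
    return problems


# Example with in-memory bytes; for a real file, read it with open(path, "rb").
print(find_format_problems(b"User-agent: *\r\nDisallow: /cgi-bin/\r\n"))
```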

You might also want to request robots.txt and a few of your pages manually with a server headers checker [webmasterworld.com], and make sure they return a 200-OK server response code.

If your site is new, and you have only a few incoming links from moderate-to-high PageRank sites, it may just take a while for Googlebot to get interested in it. Incoming links and patience are required.

Jim

Reid

3:31 am on Apr 29, 2005 (gmt 0)

Thanks for the clarification, Jim.
Sorry about the confusion, mbatta.

mbatta - here's a cool tool to check out.
It gives you a 'Googlebot view' of your site and includes some diagnostics, such as the headers returned. (There's no guarantee that it behaves exactly like Googlebot, but if this tool gets hung up, so will Googlebot.)
I found a lot of 'little errors' with it.

example:
a href="file.htm"
is different than
a href= "file.htm"

Just type a URL into the box and it will spider the page and the pages it links to.

Search Google for 'poodle predictor'.

It will show up in your logs as a bot named 'poodle predictor'.

Jer1024

12:56 pm on Apr 29, 2005 (gmt 0)

Cool, thanks for the tips, and the tool.
If Poodle is any indication, then from what I'm seeing Google ignores <meta name="robots" .

I always knew it pretty much seemed to ignore <meta "keywords" and "description" .

But when I ran Poodle, it seemed to go ahead and spider through any pages that I had marked "NOINDEX" or "NOFOLLOW".

It may just act that way in Poodle, but I may remove the tags anyway and rewrite my robots.txt accordingly.

Thanks guys ( even if you did mean to give it to Mbatta ) It will be very helpful for me.

Reid

2:34 am on Apr 30, 2005 (gmt 0)

That's OK, Jer, these threads are for anyone to learn and benefit from. Hey, I learned something too, from Jim's post.

mbatta

4:03 pm on May 3, 2005 (gmt 0)

Thanks for all the help, everyone. FYI, I now get an "under construction" result when I do a URL search for my site in G. I guess this means they are working on indexing it? It's been that way for a week or so.

Jer1024

5:28 pm on May 3, 2005 (gmt 0)

That's odd...

( Unless it says "Under Construction" on your Home Page )

mbatta

7:54 pm on May 3, 2005 (gmt 0)

No, it sure doesn't say that on my homepage. Here is the text:

I don't know what examplehost is, and my homepage in no way says any of this. I realize this is now getting off topic, so I'm going to move it to the general G board.

[edited by: ThomasB at 1:28 pm (utc) on May 4, 2005]
[edit reason] examplified [/edit]

mbatta

8:08 pm on May 3, 2005 (gmt 0)

Actually, I just realized there is no general Google forum where this would fit, so since we're on the issue here, any input would be appreciated if nobody minds. When I type www.mysite.com into Google search, I get the following:

Again, my home page in no way says anything about being under construction. Very strange.

[edited by: ThomasB at 1:28 pm (utc) on May 4, 2005]
[edit reason] examplified [/edit]

Reid

5:38 am on May 4, 2005 (gmt 0)

Google doesn't use 'under construction'.
It sounds to me like Google has a cache of your homepage from when you launched, or from before you launched, your website.
This is typical for a site Google is having trouble crawling.
Did you try Poodle Predictor on your home page?