An unresolved indexing problem

I have two sites Google refuses to index

allanp73

8:18 pm on Dec 8, 2002 (gmt 0)

Hi,
For the past 6-8 months, I've been trying to get Google to index two sites. Each site has about 30-40 links pointing to it from sites indexed by Google. One of the two sites has a DMOZ link and has appeared in Google's directory for about 4 months. However, neither site has ever been indexed. I can't figure out why Google refuses to visit the sites. (I have checked my log reports: Googlebot makes 1 hit, then leaves and never comes back.) I reviewed the code 20 times and can't figure out what the problem is. I posted this problem on Webmaster World in September and was told to create a robots.txt file, but this has not fixed the problem either.

Please help!

[edited by: Marcia at 4:14 pm (utc) on Dec. 12, 2002]
[edit reason] Please see forum charter [/edit]

jimbeetle

8:54 pm on Dec 8, 2002 (gmt 0)

Hi allanp73,

I take it that the two sites are not the ones I can reach through the domain in your e-mail address, as those appear to be well-indexed with good PR.

Sticky the URLs. I'll take a shot, as probably will others.

Jim

allanp73

9:15 pm on Dec 8, 2002 (gmt 0)

You are correct. My email is my personal business site. I act as the webmaster for the two sites that I'm having problems with. I have designed dozens of sites and never had this problem before. Usually the sites I create rank well and are easy to index. I design them to be as crawler friendly as possible. I also make sure to avoid spamming techniques. Even when clients ask me to try some spammy code, I outright refuse and explain to them why in the long run (and sometimes in the short run) it is not a good thing to do.
This is why I am completely baffled that these two sites remain unindexed.

jimbeetle

10:06 pm on Dec 8, 2002 (gmt 0)

Very interesting. If you have the links, Google should be finding the sites. One of the sites is in DMOZ. On the one site where you use JavaScript navigation, you also have straight HTML links on the page for bots to follow. Everything is straightforward in keywords and description. Looks good.

Except, maybe?

The searchengineworld robots.txt validator says everything is okay and shows your robots.txt file as two lines:

1 User-agent: *
2 Disallow:

Notepad also shows it as two lines.

When I access the robots.txt files through the browser they look like this:

User-agent: * Disallow:

Noticed that Alta Vista also only indexed the home page of each and isn't going any deeper.

I can't see any reason other than the robots.txt anomaly, but it's going to take someone with more savvy on that than me to say whether that's the problem.

Do your logs show other spiders crawling deeper?

allanp73

10:37 pm on Dec 8, 2002 (gmt 0)

It seems that both are poorly indexed by other search engines as well. I checked Lycos and AlltheWeb. My weblogs are currently down, so I can't give a precise answer from them.
I hope there is someone who can figure out this mystery.

martinibuster

12:58 am on Dec 9, 2002 (gmt 0)

Whoa, jimbeetle has something there.

Take that robots.txt out of there.

If you want to be spidered, it's okay to have no robots.txt in there. Only use it when you are specifying something to hide.

The wildcard (*) you are using is probably instructing all bots not to index your sites.

jdMorgan

1:06 am on Dec 9, 2002 (gmt 0)


User-agent: *
Disallow:

Allows all robots to index all files and avoids a ton of 404 errors in your logs.


User-agent: *
Disallow: /

Disallows all robots from all files.

The existing robots.txt should not be a problem, and it validates.

Jim
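The difference between the two files above can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name allowed and the path /index.html are made up for illustration. It only shows how this one parser reads the two files, not how Googlebot or any other crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The two files quoted in the post above.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    # "/index.html" is a hypothetical path used only for this check.
    print(allowed(ALLOW_ALL, "Googlebot", "/index.html"))  # True
    print(allowed(BLOCK_ALL, "Googlebot", "/index.html"))  # False
```

An empty Disallow value is treated as "nothing is disallowed", which matches the explanation given above.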

allanp73

1:28 am on Dec 9, 2002 (gmt 0)

Hi,

I don't think it's the robots.txt file that is creating the problem. The sites were having problems even when I didn't use the file. Actually, I only started using the file after my post in September and this is what I was advised at that time. I didn't believe that it would solve the problem, but I figured it was worth trying.
Actually, the file is configured correctly to allow robots to index the sites.

martinibuster

1:29 am on Dec 9, 2002 (gmt 0)

I have my doubts. The search engine validator has been known to be inaccurate [webmasterworld.com] in the past.

Correct me if I'm wrong, but I believe that the formatting of the robots.txt is incorrect.

The robots.txt in question is written on a single line.

The specifications say that it should be written on two lines.

martinibuster

1:32 am on Dec 9, 2002 (gmt 0)

Uh-oh, you started using robots.txt in September.

I guess that would seem to rule that out. But not entirely. I just did a search on your URL on FAST and it didn't return any of your pages, only pages that link to you.

jdMorgan

1:39 am on Dec 9, 2002 (gmt 0)

If you want to rule out a formatting problem, use a unix-type end-of-line. One way to do this is to use M$-Word, and do a "save-as" with file type "ASCII" and "LF-only" end-of-lines.

I use a large and complex robots.txt, and this method works well for me.

There was a report of a French(?) search engine that followed the robot exclusion standard to the letter and required a blank line at the end of robots.txt, but Google doesn't require it, and will even accept a blank file for robots.txt if you want to try that.

Jim
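For anyone without that Word option, the same end-of-line check and conversion can be done with a few lines of script. This is a rough sketch, assuming the file is small enough to read into memory; the function names and the local file name robots.txt are illustrative, not part of any tool mentioned in the thread.

```python
from pathlib import Path


def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF), Unix (LF) and old Mac (CR) line endings."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF only."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    path = Path("robots.txt")  # hypothetical local copy of the file
    if path.exists():
        raw = path.read_bytes()
        print(count_line_endings(raw))
        path.write_bytes(to_unix_line_endings(raw))
```

The file has to be read and written in binary mode, otherwise Python may translate the line endings itself. Upload the result in binary mode as well, since an ASCII-mode FTP transfer can convert the line endings again.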

jimbeetle

2:32 am on Dec 9, 2002 (gmt 0)

Got it!

Knock the robots meta tag out:

<META NAME="ROBOTS" CONTENT="All, Index, Follow">

It's incorrect syntax. The robots meta content accepts either one or two values, not three.

"All" or "None" are simply shorthand:

All equals "index,follow"
None equals "noindex,nofollow"

Think that's it,

Jim
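The point about shorthand values can be turned into a quick check. The sketch below uses Python's standard html.parser to pull the values out of a robots meta tag and flag the case where "all" or "none" is combined with other values. The class and function names are made up for illustration, and the "one or two values" rule is the claim made in the post above; nothing in this thread establishes that a crawler actually rejects a page because of it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated values of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.extend(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> list:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def mixes_shorthand(directives: list) -> bool:
    """True if 'all' or 'none' appears together with other values."""
    return len(directives) > 1 and any(d in ("all", "none") for d in directives)


if __name__ == "__main__":
    tag = '<META NAME="ROBOTS" CONTENT="All, Index, Follow">'
    found = robots_directives(tag)
    print(found, mixes_shorthand(found))
```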

allanp73

2:42 am on Dec 9, 2002 (gmt 0)

Strangely, my Word program doesn't allow me to save in this format. I tried uploading a changed version but still get the robots.txt as a one-liner.
Can someone email me a robots.txt file?

allanp73

2:48 am on Dec 9, 2002 (gmt 0)

>>jimbeetle

I added the <META NAME="ROBOTS" CONTENT="All, Index, Follow"> tag about a month or so ago, and only tried it out of desperation. I noticed it on a site which is listed on Google with no problems. I don't think this line of code is causing the problem. I also don't believe the robots.txt file is causing the problem. There has to be something else.

Key_Master

2:50 am on Dec 9, 2002 (gmt 0)

allanp73,

Are you parsing txt files for SSI? The "one-liner" problem you reported is a common side effect of this. There is a workaround. Either way, it shouldn't matter to the spider.

allanp73

2:53 am on Dec 9, 2002 (gmt 0)

Oh I just wanted to mention that the robots.txt file doesn't even seem to be hit by the crawlers. It is like they are avoiding the site altogether.

martinibuster

3:01 am on Dec 9, 2002 (gmt 0)

I did a URL search on Google and FAST, and both showed no results for the sites themselves. FAST automatically started showing results that contained the URL within their pages. Google likewise showed the sites that contained the term "urlthatwontgetindexed" when you clicked on a link for that type of search.

Obviously, the symptom is that the search engines are pulling a blank for your url, like it doesn't exist. Exactly the type of response you would expect if you were banning the bots from your web site.

What this possibly points to is that either the robots meta is getting in the way, or the robots.txt is doing it. I would advise you to ditch the both of them, and wait two months for a new spidering and indexing cycle.

allanp73

3:11 am on Dec 9, 2002 (gmt 0)

I don't mind removing the robots.txt file. Yet why did I experience this problem before I even had a robots file? This has been going on for a long time. I only added the silly robots file because I was advised to do so.
There must be something else. I deleted the file and the meta code. Does the site seem fine now? Am I going to have to write in another three months that the sites are still not getting indexed? Where's Googleguy when you need him :)?

jdMorgan

3:37 am on Dec 9, 2002 (gmt 0)

allanp73,

You must select File->Save As->Other encoding->US ASCII + End Lines with: LF only to get this option. I'm not sure if the option is available in all versions of Word, though.

What is the history of your domain names? Is it possible that they were previously-owned and therefore could have been blacklisted for past offenses before you got them?

Jim

allanp73

3:52 am on Dec 9, 2002 (gmt 0)

I tried to save it as you suggested but I do not seem to have this option.

I wondered if a previous owner could have gotten the sites blacklisted. However, the sites don't seem to have had a previous owner.

tedster

4:31 am on Dec 9, 2002 (gmt 0)

the robots.txt file doesn't even seem to be hit by the crawlers

I assume you had no 404's before you put up the robots.txt? As a rule, Googlebot asks for and obeys robots.txt files, so the fact that it doesn't come looking for it is mighty peculiar.

Krapulator

5:52 am on Dec 9, 2002 (gmt 0)

You could try browsing your site using a text only browser like Lynx. It sounds like there is a fundamental problem somewhere, since none of the bots seem to be crawling.
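As a rough stand-in for a text-only browser, a short script can list the plain links a simple crawler could follow on a page. This is only a sketch using Python's standard html.parser; the class name, the function name and the sample markup are invented for illustration, and real crawlers apply their own rules.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping javascript: pseudo-links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.strip().lower().startswith("javascript:"):
            self.links.append(href)


def plain_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    # Hypothetical page fragment: one plain link, two script-driven ones.
    sample = (
        '<a href="/about.html">About</a>'
        '<a href="javascript:go(\'products\')">Products</a>'
        '<span onclick="go(\'contact\')">Contact</span>'
    )
    print(plain_links(sample))  # ['/about.html']
```

If the list comes back empty or much shorter than the visible navigation, the page probably depends on script for its links.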

solution

2:58 am on Dec 10, 2002 (gmt 0)

allanp73

Are there any 404s in your logs prior to you adding your robots.txt?

If so, it could mean that the SE can't understand your robots.txt

Go to [textpad.com...] and download this shareware editor. Find out what type of server the robots.txt file is on (NT, MAC, or UNIX). Open the file in TextPad and press F12 (Save As). It will give you the choice to save the file in PC, MAC, or UNIX format. Save it, upload it, and cross your fingers.

Nick.
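The log check suggested above can be scripted once the logs are available. The sketch below assumes the server writes Apache-style common or combined log lines, which is an assumption and not something stated in this thread; the function name and the sample lines (including the bot name ExampleBot) are made up purely to show the shape of the input.

```python
import re
from collections import defaultdict

# Assumed format: Apache common/combined log. Adjust the pattern for other servers.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def robots_requests(lines):
    """Map each user agent to the status codes it received for /robots.txt."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            seen[match.group("agent") or "unknown"].add(int(match.group("status")))
    return {agent: sorted(codes) for agent, codes in seen.items()}


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from any real log.
    sample = [
        '10.0.0.1 - - [08/Dec/2002:20:18:00 +0000] "GET /robots.txt HTTP/1.0" 404 209 "-" "ExampleBot/1.0"',
        '10.0.0.1 - - [08/Dec/2002:20:18:05 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "ExampleBot/1.0"',
    ]
    print(robots_requests(sample))  # {'ExampleBot/1.0': [404]}
```

A spider that never appears in this output has not asked for robots.txt at all, which is a different situation from one that asked and got a 404.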

allanp73

8:14 am on Dec 12, 2002 (gmt 0)

This is an update. I just got access to my logs.
The robots.txt does get requested, but only by AltaVista. Lycos and Google refuse to even visit the site. I desperately need help. These are two legitimate sites worthy of being listed, but Google still doesn't visit even though one has a DMOZ link and both have over 30 links.
What can I do to get Google to index these sites?
PLEASE HELP!

msr986

8:44 am on Dec 12, 2002 (gmt 0)

Although rare, maybe these sites are being hosted in a 'bad neighborhood'. The IP addresses may be blacklisted.

Key_Master

8:47 am on Dec 12, 2002 (gmt 0)

allanp73,

You might get more help if you put the domain in your profile.

jimbeetle

3:48 pm on Dec 12, 2002 (gmt 0)

Hi again Allan,

It looks like the two sites you stickied to me and all the rest of the "Networked" sites that are interlinked -- as well as the host -- are hosted on the same server. It might be possible that you finally hit "spider overload" and they refuse to index any more interlinked sites on that server.

You might take a closer look at all of that linking before any more of your sites get hurt. There are a few discussions of interlinking among sites on the same server on this board. My own observation is that corporate whales can get away with it while us small fry have to be very, very careful.

Other than that this is still very perplexing.

Jim

allanp73

7:44 pm on Dec 12, 2002 (gmt 0)

This is not the problem. These two sites were having difficulties long before they were linked to the other sites. Actually, I just added 10 new sites to the network and they were indexed fine.
Also, the sites aren't really crosslinked; they all just mention each other on their links pages. It is not straight-on crosslinking.
I just want to rule out some possibilities:
- not crosslinking
- not robots file
- not DNS
- not lack of indexed links

Could be:
- Source code problem (though I checked it)
- A previous owner of the URL could have gotten the site blacklisted (However I don't think there was a previous owner)
- Act of God

I really don't know. :(

hetzeld

4:21 am on Dec 13, 2002 (gmt 0)

Hi Allanp73,

When checking your 2 domains, it appears that neither of them has a valid DOCTYPE declaration or character encoding labeling... I think this could be it, as the W3C validator gives a fatal error on the first one and complains about the character encoding for the second.

Greetings from France, although I'm not a frenchie ;-)

Dan

hetzeld

4:31 pm on Dec 13, 2002 (gmt 0)

Hi again,

On the first web, there's a missing <tr> opening tag for the table.
It is possible that search engines are far less forgiving than NN or IE... In that case, they don't see any text :-((

Dan
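A missing <tr> of the kind described above can be spotted mechanically. The following is a minimal sketch built on Python's standard html.parser that flags any <td> or <th> appearing inside a table while no row is open. The class and function names are invented for illustration, it is no substitute for the W3C validator, and it says nothing about how tolerant any particular search engine is.

```python
from html.parser import HTMLParser


class TableRowChecker(HTMLParser):
    """Flag table cells that appear without an enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self._row_open = []  # one flag per currently open <table>
        self.problems = []   # (tag, line number) of cells found outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._row_open.append(False)
        elif not self._row_open:
            return  # not inside any table
        elif tag == "tr":
            self._row_open[-1] = True
        elif tag in ("td", "th") and not self._row_open[-1]:
            self.problems.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if not self._row_open:
            return
        if tag == "table":
            self._row_open.pop()
        elif tag == "tr":
            self._row_open[-1] = False


def cells_outside_rows(html: str) -> list:
    checker = TableRowChecker()
    checker.feed(html)
    return checker.problems


if __name__ == "__main__":
    # Hypothetical fragment with the first <tr> missing.
    broken = "<table><td>Home</td></tr><tr><td>About</td></tr></table>"
    print(cells_outside_rows(broken))
```

Each reported entry gives the tag name and the source line where it was found, which makes the missing row easier to locate in a long page.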
