Google refuses to spider site. It has been more than a year!

Google hits the index page and goes no further.

         

Crow_Song

5:34 pm on Sep 8, 2003 (gmt 0)

10+ Year Member



I am the web developer for the Faculty of Applied Science at a Canadian university. About a year ago, I redesigned one of our department's sites. It was not the smoothest transition: we moved it to a different server, I changed the structure of the directories and the names of pages, and we even changed the domain name and IP (it was using two domain names, and we dropped one). I expected Google to take a few months to re-index the site, but now a year later, I have a lot of angry profs blaming me for Google not listing their research.
I can't figure it out. I am at a loss. The code itself is not the problem...I am using a template that I also use for several other departments' sites. There is some server-side scripting and includes, but nothing weird. Google has finally assigned a PR of 6 to the homepage, and a PR of 5 to several pages one level in. But it will not spider the site, nor will it assign a PageRank of more than 0 to any other pages on the site (at least 0 is an improvement! Until a few weeks ago, there was no rank at all). I watch the logs every day, and Googlebot requests robots.txt, and then the index page...and that's it. It's been doing that for months and months, never going to another page. The robots.txt file excludes only a testing directory, and nothing more (I've even tried removing it altogether).

There are tons of links to the site, and I have painstakingly contacted the webmasters of sites that listed the old URL. They have updated their links, but it doesn't seem to matter.

Anyone have any ideas? I have tried submitting the site to DMOZ, but they have not listed it.

Thanks for any help. I am at my wit's end.
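
One way to rule out robots.txt as the cause is to feed the file to a parser and ask which paths a crawler may fetch. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `/testing/` directory name and the sample paths are assumptions made up for illustration, not the real site's files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only a testing directory,
# as described in the post; the directory name is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /testing/
"""


def allowed_paths(robots_txt, user_agent, paths):
    """Return {path: bool} saying whether user_agent may fetch each path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


# Hypothetical paths: the home page, a page one level in, a test page.
result = allowed_paths(
    ROBOTS_TXT,
    "Googlebot",
    ["/", "/research/index.asp", "/testing/draft.asp"],
)
for path, allowed in result.items():
    print(path, "allowed" if allowed else "blocked")
```

If only the testing path comes back blocked, the exclusion file is unlikely to be what stops the crawl at the index page.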

claus

3:37 pm on Sep 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> piece of software is sending this?

This is NOT a Google error, it's an error within the samspade tool that occurs when you try to GET a web page using HTTP/1.0. I've been looking through all i could find on the subject and it's not really documented. The built in spider, however, also uses HTTP/1.0 (at least i found a reference to this somewhere) and this one is able to fetch the pages, so there's no real problem here, it's a samspade bug.

/claus

Crow_Song

3:56 pm on Sep 12, 2003 (gmt 0)

10+ Year Member



Besides, I am not seeing any 404s in the logfiles. Googlebot gets a 200 on the index page. You had me thinking we were on to something here...
;)

wkitty42

4:42 pm on Sep 12, 2003 (gmt 0)

10+ Year Member



plumsauce, my apologies... i was being too terse... here are the actual attempts using the windows samspade client from samspade.org... i'm only including up to the headers...

http 0.9


09/12/03 12:33:45 Browsing http*//appsci.queensu.ca/
Fetching http*//appsci.queensu.ca/ ...
GET /

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Fri, 12 Sep 2003 16:36:51 GMT
MicrosoftOfficeWebServer: 5.0_Pub
Connection: Keep-Alive
Content-Length: 23638
Content-Type: text/html
Expires: Fri, 12 Sep 2003 16:36:51 GMT
Set-Cookie: ASPSESSIONIDSADQCQTC=JOBJNJPAFNJALKFHMKJBLAKC; path=/
Cache-control: private

http 1.0


09/12/03 12:35:44 Browsing http*//appsci.queensu.ca/
Fetching http*//appsci.queensu.ca/ ...
GET /appsci.queensu.ca/ HTTP/1.0
User-Agent: Sam Spade 1.14

HTTP/1.1 404 Object Not Found
Server: Microsoft-IIS/5.0
Date: Fri, 12 Sep 2003 16:38:50 GMT
Content-Length: 4040
Content-Type: text/html

http 1.1


09/12/03 12:36:38 Browsing http*//appsci.queensu.ca/
Fetching http*//appsci.queensu.ca/ ...
GET / HTTP/1.1
Host: appsci.queensu.ca
Connection: close
User-Agent: Sam Spade 1.14

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Fri, 12 Sep 2003 16:39:43 GMT
MicrosoftOfficeWebServer: 5.0_Pub
Connection: close
Content-Length: 23638
Content-Type: text/html
Expires: Fri, 12 Sep 2003 16:39:43 GMT
Set-Cookie: ASPSESSIONIDSADQCQTC=CPBJNJPAFFPPPPKMGFKHMNAO; path=/
Cache-control: private

it does appear to be a bug in samspade... now to try to find another safe (ie: non-rendering) browser that allows for the different http versions and client ids :(

[edit] delinked urls [/edit]
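
For comparison with the three captures above, here is a minimal Python sketch of how the request for each HTTP version would normally be built. In the HTTP/1.0 capture the hostname ended up inside the path (`GET /appsci.queensu.ca/`), which would explain the 404. The function names and the `example-client/0.1` user agent are assumptions for illustration; this is not how Sam Spade or Googlebot is implemented.

```python
import socket


def build_request(host, path="/", version="1.1", user_agent="example-client/0.1"):
    """Build the raw bytes of a GET request for HTTP 0.9, 1.0 or 1.1."""
    if version == "0.9":
        # HTTP/0.9 is a bare request line: no version token, no headers.
        return f"GET {path}\r\n".encode("ascii")
    if version not in ("1.0", "1.1"):
        raise ValueError(f"unsupported HTTP version: {version}")
    # The hostname belongs in the Host header, never in the path.
    # Host is mandatory in HTTP/1.1 and permitted in HTTP/1.0.
    headers = [f"Host: {host}"]
    if version == "1.1":
        headers.append("Connection: close")
    headers.append(f"User-Agent: {user_agent}")
    request_line = f"GET {path} HTTP/{version}"
    # The two empty strings produce the blank line that ends the headers.
    return "\r\n".join([request_line, *headers, "", ""]).encode("ascii")


def fetch_status_line(host, request, port=80, timeout=10):
    """Send raw request bytes and return the first line of the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        data = sock.recv(4096)
    return data.split(b"\r\n", 1)[0].decode("latin-1")


# Build (but do not send) an HTTP/1.0 request with a Host header.
request_10 = build_request("appsci.queensu.ca", version="1.0")
print(request_10.decode("ascii"))
```

Calling `fetch_status_line(host, request_10)` against a server would show only the status line, so nothing gets rendered.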

matuloo

4:53 pm on Sep 12, 2003 (gmt 0)

10+ Year Member



I for one believe that the problems you have are caused by the coding of your pages. It seems too complicated to me, and I wouldn't be surprised if the spider just can't figure out what to do with the pages.

I suggest you make a simple test.

Redesign the site map this way:
Get rid of all css, stylesheets, onmouseovers and all that stuff. Make a screenshot of the header of your sitemap, turn it into a .jpg or .gif and place it at the top of the page so it keeps the uniform design of the whole site.
List all the links to other pages within the site using only simple html, and use absolute links with "http://..."
No onmouseovers, no css, no other stuff, just simple html.
Get freshbot to the site and watch the logs to see what happens. I am almost sure it will follow the links, and in that case you know which path to choose with the site.

Crow_Song

5:01 pm on Sep 12, 2003 (gmt 0)

10+ Year Member



Hi matuloo

I appreciate the suggestions. What makes me feel that it is not the code, however, is:

a. I already tried your suggestion - straight html. No javascript at all, no css. No robots.txt either. Link to a site map. I left it like that for two months and watched the logs. The Googlebot came to the index page almost every day, but never went to another page.

b. The other 9 sites that are all using the same template are unaffected by this phenomenon.

Can anyone think of any server config issues that might cause this? We have tried to compare the server to the others and haven't found anything so far, but...

Cheers,
Crow
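
Watching the logs for crawler activity, as described above, can be partly automated. The following Python sketch assumes the server writes W3C extended format logs with a `#Fields:` line. The sample entries, IP addresses and field list are invented for illustration and are not taken from the real server.

```python
from collections import Counter

# Hypothetical sample in W3C extended log format; the field list and
# every entry are made up for illustration.
SAMPLE_LOG = """\
#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(User-Agent)
2003-09-12 04:10:01 192.0.2.10 GET /robots.txt 200 Googlebot/2.1+(+http://www.googlebot.com/bot.html)
2003-09-12 04:10:02 192.0.2.10 GET /index.asp 200 Googlebot/2.1+(+http://www.googlebot.com/bot.html)
2003-09-12 09:30:15 198.51.100.7 GET /research/index.asp 200 Mozilla/4.0+(compatible;+MSIE+6.0)
"""


def bot_hits(log_text, bot_token="googlebot"):
    """Count (path, status) pairs requested by agents containing bot_token."""
    fields = []
    hits = Counter()
    for line in log_text.splitlines():
        if line.startswith("#Fields:"):
            # The directive names the columns of the entries that follow.
            fields = line.split()[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split()))
        if bot_token in row.get("cs(User-Agent)", "").lower():
            hits[(row.get("cs-uri-stem"), row.get("sc-status"))] += 1
    return hits


hits = bot_hits(SAMPLE_LOG)
for (path, status), count in sorted(hits.items()):
    print(path, status, count)
```

A result that only ever contains the robots.txt and index entries would match the behaviour reported in this thread.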

Josefu

6:43 pm on Sep 12, 2003 (gmt 0)

10+ Year Member



A bit of good news concerning spidering - I got crawled for the first time today. Google's grabby groping hands were all over me... and I liked it : )

...I hope the googlebot is female...

brass monkey

7:15 pm on Sep 12, 2003 (gmt 0)

10+ Year Member



"Finally...How does one go about hiring an SEO?"

I recommend that when you talk to potential SEO companies, you keep the following in mind:

1) Make sure the people that you contact actually understand what you are trying to accomplish with your site. There is no one-size-fits-all system for search engine optimization.

2) Ask for references. Not just a list with a description of what they did for a client, but phone numbers to actually contact their clients and learn about the successes from the client first-hand. Speaking to a reference should build a lot of credibility for an SEO firm.

3) Make sure they fully explain what they are going to be doing to your site and how they are going to accomplish the goals of the site. Be wary of any SEO firm that keeps you in the dark on their practices or claims to have any proprietary knowledge. It's pretty cut and dried if you are an SEO, and there is really nothing to hide (no secrets).

Just some friendly advice...

BrAsS_mOnKeY

Patrick Taylor

10:12 pm on Sep 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Freshman: Please somebody explain the 301/302 issue to me.

I too would appreciate a pointer on this.

claus

11:54 pm on Sep 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> the 301/302 issue

Take a look at the posts in:

1) Website Technology Issues: [webmasterworld.com...]

2) Webmaster General: [webmasterworld.com...]

You'll find plenty of threads with titles like "redirect...", "301 redirect...", "302...", "301...", "htaccess redirect" and the like.

Basically it's all about letting the server tell the browser (or whatever user-agent is visiting) that a certain file is moved from one location to another. There are two ways of doing it, a 301-way and a 302-way.

If the file is moved temporarily and will come back, you should use a 302 and if the file is moved permanently and it will not come back you should use a 301. Both are usually controlled by means of a special file called ".htaccess" on the Apache *nix server. The MS IIS server has other methods for doing the same thing.

Neither of these "status codes" has any relation to the problem Crow_Song is reporting, as his server on the problem site (IIS) returns a code 200 which means "no problems, here's the file you wanted".
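
To make the 301/302 distinction concrete, here is a minimal Python sketch using the standard `http.server` module rather than .htaccess or IIS settings. The paths in `MOVES`, the port and the function names are hypothetical.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical moves: old path -> (new location, moved permanently?)
MOVES = {
    "/old-page.html": ("/new-page.html", True),       # gone for good: 301
    "/menu.html": ("/holiday-menu.html", False),      # will come back: 302
}


def redirect_status(permanent):
    """Pick 301 for a permanent move and 302 for a temporary one."""
    return HTTPStatus.MOVED_PERMANENTLY if permanent else HTTPStatus.FOUND


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        move = MOVES.get(self.path)
        if move is None:
            self.send_error(HTTPStatus.NOT_FOUND)
            return
        location, permanent = move
        self.send_response(redirect_status(permanent))
        # The Location header tells the user-agent where the file went.
        self.send_header("Location", location)
        self.end_headers()


def serve(port=8000):
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Running `serve()` and requesting `/old-page.html` should return a 301 with a `Location` header, while `/menu.html` should return a 302.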

/claus

plumsauce

4:02 am on Sep 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



can anyone confirm or deny that G consistently
sends the host header?

I know the usual answer that a dedicated ip is not
needed to be listed in G, which implies that the
host header is sent.

However, a site was recently moved to a dedicated
ip on the same physical server with no other changes.
It's down to one page in a site: -xxxx query.
G had worked its way up to 140+ pages, while fast.no
had and still has all 170+ pages. This also coincided
with the implementation of 304 if-modified processing
as per G's recent recommendation. The 304 response
was checked using the tools at squid.org.

Is it possible that G notices that a specific ip
is used for only one site and decides not to send
a host header? This site depended on the host
header to gen the right content. Otherwise, 404.

+++

Patrick Taylor

4:55 am on Sep 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



/claus: >> the 301/302 issue

Thanks claus. I'll go and read.

Regards,

Patrick

Yidaki

8:23 am on Sep 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>can anyone confirm that G consistently sends the host header?

Yes - always! Although GoogleBot makes an HTTP/1.0 [webmasterworld.com] request, it always sends the Host header. Using HTTP/1.0 instead of HTTP/1.1 doesn't mean that it's forbidden to send the Host header. It just means that only HTTP/1.1 compliant clients MUST send it.
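
The server side of this question can be sketched in a few lines of Python: a site that picks its content from the Host header has nothing to serve when the header is missing. The `SITES` table, the hostname and the function names are hypothetical, and a Host value carrying a port (such as `www.example.com:80`) is not handled in this simplified version.

```python
# Hypothetical name-based virtual hosts sharing one IP address.
SITES = {"www.example.com": "<html>example home</html>"}


def parse_host(raw_request):
    """Return the Host header value from a raw request, or None if absent."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    # Skip the request line, then look for a "Host:" header.
    for line in head.split("\r\n")[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            return value.strip().lower()
    return None


def respond(raw_request, sites=SITES):
    """Return (status, body): 200 for a known host, otherwise 404."""
    host = parse_host(raw_request)
    if host in sites:
        return 200, sites[host]
    return 404, ""


# An HTTP/1.0 request may carry a Host header even though 1.0 does not require it.
with_host = "GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n"
without_host = "GET / HTTP/1.0\r\nUser-Agent: example-client/0.1\r\n\r\n"
print(respond(with_host)[0], respond(without_host)[0])
```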

Crow_Song

1:04 pm on Sep 18, 2003 (gmt 0)

10+ Year Member



It has been suggested that I add meta tags to the home page to encourage Google's spider - does this work? I had been under the belief that Google ignores such tags...

Powdork

4:24 pm on Sep 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google does not ignore meta tags but they are of limited value with Google. They will not encourage spidering.

Crow_Song

5:13 pm on Sep 18, 2003 (gmt 0)

10+ Year Member



I'm still trying to explore reasons why the site might be ignored. Another suggestion was made to me that other servers using names above our parent domain might pose a problem. If our site is for example,
parent.domain.ca and there are other servers using
word.parent.domain.ca
then that might be a problem. Can anyone comment on this?

Also, I'm curious as to why links to the old name/old server still remain in Google's index. The old domain was retired almost a year ago, replaced by the new. The new hasn't been indexed, and links to the old are STILL kicking around. Why haven't they been removed by now? Would it be worthwhile to resurrect the old name and put redirects in place? And perhaps to monitor whether or not Google is still trying to visit the old site?

Thanks again guys, for all of the help

claus

8:11 pm on Sep 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I noticed that you had some sub-subs in the serps which seemed not to be active (don't recall if it was Google or Alltheweb that showed them).

If they're still there it's probably because they don't return a 404. It's my only guess right now.

/claus
